The GIL, threads, and performance

Why CPython has a GIL, when it blocks scaling, and when threads still work well

The GIL (Global Interpreter Lock) is a single mutex in CPython that prevents two threads from executing Python bytecode at the same time. It exists because CPython's reference-counting memory management is not thread-safe on its own. I/O-bound workloads release the GIL during blocking calls, so threads overlap effectively. CPU-bound Python code holds it, so only one thread makes progress at a time. `ProcessPoolExecutor` bypasses the GIL by using separate processes. Python 3.12 introduced a per-interpreter GIL (PEP 684), where each sub-interpreter gets its own GIL. Python 3.13 added an experimental free-threaded build (PEP 703) in which the GIL can be disabled. Python 3.14 adds `InterpreterPoolExecutor` for running tasks across sub-interpreters. <a href="/async-foundations-awaitables">Compare GIL threading with async cooperative concurrency</a>. <a href="/asyncio-task-groups">See TaskGroup for structured async concurrency</a>.

Python in Depth

An interactive engineering reference for Python internals

Quick note

CPU and I/O have different concurrency stories.

Python version

Targets Python 3.10–3.14. Python 3.9 and below are End-of-Life.

The GIL is a CPython runtime constraint, not a blanket verdict on threads. It serializes Python bytecode execution in a regular interpreter while many waits and some native work still release it.

Core answer

Use threads for waiting-heavy synchronous work and shared-memory orchestration. Use processes, interpreter isolation, native code, or carefully evaluated free-threaded builds when CPU-bound Python throughput needs multiple cores.

# [CURRENT - 3.10-3.14] Works on Python 3.10+
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from time import sleep


@dataclass(frozen=True, slots=True)
class Request:
    name: str
    wait_seconds: float


def fetch(request: Request) -> str:
    # sleep() stands in for a blocking I/O call; it releases the GIL while waiting.
    sleep(request.wait_seconds)
    return request.name


with ThreadPoolExecutor(max_workers=2) as pool:
    # Both waits overlap; map() yields results in input order.
    print(list(pool.map(fetch, [Request("a", 0.01), Request("b", 0.01)])))
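
To see the overlap rather than take it on faith, the same idea can be timed. The following is a minimal sketch, not part of the original example: the names `wait_then_echo` and `timed_overlap` are hypothetical, and `sleep` is assumed to be a fair stand-in for a blocking I/O call that releases the GIL.

```python
from concurrent.futures import ThreadPoolExecutor
from time import perf_counter, sleep


def wait_then_echo(delay: float) -> float:
    # sleep() releases the GIL while blocked, standing in for blocking I/O.
    sleep(delay)
    return delay


def timed_overlap(delays: list[float]) -> tuple[float, float]:
    """Return (sequential_seconds, threaded_seconds) for the same waits."""
    start = perf_counter()
    for delay in delays:
        wait_then_echo(delay)
    sequential = perf_counter() - start

    start = perf_counter()
    # One worker per wait, so every wait can be in flight at once.
    with ThreadPoolExecutor(max_workers=len(delays)) as pool:
        list(pool.map(wait_then_echo, delays))
    threaded = perf_counter() - start
    return sequential, threaded


if __name__ == "__main__":
    seq, thr = timed_overlap([0.05] * 4)
    print(f"sequential: {seq:.3f}s  threaded: {thr:.3f}s")
```

With four equal waits, the threaded time should land near one wait rather than the sum of all four, though exact numbers depend on the machine and scheduler.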

Why this design exists

The classic GIL made CPython's object model, reference counting, and many runtime invariants simpler to keep safe while threads share objects. That convenience costs Python-bytecode parallelism. PEP 684 and PEP 703 show the modern direction: per-interpreter parallelism and optional free-threaded builds require deliberate runtime changes.

Mechanics and CPython internals

Regular CPython threads share one interpreter GIL. A thread executing Python bytecode must own it; blocking I/O and C extensions can release it. Pure Python CPU loops mostly time-slice rather than scale across cores. Free-threaded CPython changes that implementation story, but it is a runtime and ecosystem choice, not a retroactive guarantee for default builds.

# [CURRENT - 3.10-3.14] Works on Python 3.10+
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True, slots=True)
class CpuJob:
    size: int


def crunch(job: CpuJob) -> int:
    # Pure-Python loop: the running thread holds the GIL for almost all of it.
    total = 0
    for value in range(job.size):
        total += value * value
    return total


jobs = [CpuJob(20_000), CpuJob(20_000)]
with ThreadPoolExecutor(max_workers=2) as pool:
    # Results are correct, but the two jobs time-slice on a regular build.
    print(list(pool.map(crunch, jobs)))
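
Whether the loop above time-slices or truly runs in parallel depends on the runtime it is deployed on. The sketch below reports the relevant facts at run time. `describe_runtime` is a hypothetical helper; it assumes CPython, where `sys._is_gil_enabled()` exists only on 3.13+ and every earlier version always runs with the GIL.

```python
import sys
import sysconfig


def describe_runtime() -> dict[str, object]:
    """Collect the runtime facts that decide how threads behave."""
    # sys._is_gil_enabled() is available on Python 3.13+ only.
    # Assumption: on older CPython versions the GIL is always enabled.
    is_gil_enabled = getattr(sys, "_is_gil_enabled", None)
    return {
        "implementation": sys.implementation.name,
        "version": ".".join(map(str, sys.version_info[:3])),
        # Py_GIL_DISABLED is set for free-threaded builds (PEP 703).
        "free_threaded_build": bool(sysconfig.get_config_var("Py_GIL_DISABLED")),
        "gil_enabled": True if is_gil_enabled is None else is_gil_enabled(),
        # How often a running thread is asked to give up the GIL.
        "switch_interval_seconds": sys.getswitchinterval(),
    }


if __name__ == "__main__":
    for key, value in describe_runtime().items():
        print(f"{key}: {value}")
```

A free-threaded build can still run with the GIL enabled, so the two flags are reported separately rather than inferred from each other.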

Complexity and tradeoffs

Threads avoid pickling and share memory, but shared mutable state needs synchronization. Processes buy separate runtimes and multi-core execution for CPU-bound Python at the cost of serialization, startup, and memory. A useful measurement classifies where the time goes: Python bytecode, native code that releases the GIL, or external waits.
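
One rough way to start that classification is to compare CPU time with wall-clock time for a single call. This is a sketch under stated assumptions: `cpu_fraction`, `waiting_task`, and `cpu_task` are hypothetical names, and `time.process_time` counts CPU time for the whole process, so the function being measured should be the only thing running. It separates waiting from computing; it cannot tell Python bytecode apart from native code that releases the GIL.

```python
from time import perf_counter, process_time, sleep
from typing import Callable


def cpu_fraction(work: Callable[[], object]) -> float:
    """Return CPU seconds divided by wall-clock seconds for one call."""
    wall_start, cpu_start = perf_counter(), process_time()
    work()
    wall = perf_counter() - wall_start
    cpu = process_time() - cpu_start
    return cpu / wall if wall > 0 else 0.0


def waiting_task() -> None:
    # Stand-in for blocking I/O: almost no CPU time is consumed.
    sleep(0.05)


def cpu_task() -> int:
    # Pure-Python arithmetic: CPU time tracks wall time closely.
    return sum(value * value for value in range(1_000_000))


if __name__ == "__main__":
    print(f"waiting task: {cpu_fraction(waiting_task):.2f}")
    print(f"cpu task:     {cpu_fraction(cpu_task):.2f}")
```

A fraction near zero suggests the work is mostly waiting and is a candidate for threads; a fraction near one suggests CPU-bound work, where a profiler is the next step for deciding between processes and native code.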

Idiomatic patterns and refactoring

Refactor architecture around workload shape, not GIL slogans.

# [CURRENT - 3.10-3.14] Works on Python 3.10+
from concurrent.futures import ProcessPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True, slots=True)
class Batch:
    values: tuple[int, ...]


def sum_squares(batch: Batch) -> int:
    return sum(value * value for value in batch.values)


def run_parallel() -> list[int]:
    batches = [Batch(tuple(range(10_000))), Batch(tuple(range(10_000, 20_000)))]
    # Each batch is pickled and sent to a worker process with its own GIL.
    with ProcessPoolExecutor(max_workers=2) as pool:
        return list(pool.map(sum_squares, batches))


# The guard keeps worker processes from re-running this on import (spawn start method).
if __name__ == "__main__":
    print(run_parallel())

Common mistakes and edge cases

The GIL does not make application-level shared-state protocols safe. It does not make async irrelevant. It does not predict extension-module behavior unless you know where that extension releases the lock.
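
The first point is the one most often missed: `value += 1` is a read, an add, and a write, and a thread switch can land between those steps even with the GIL. A minimal sketch of the usual remedy follows; `Counter` and `count_in_threads` are hypothetical names used only for illustration.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class Counter:
    """A shared counter whose updates are guarded by an explicit lock."""

    def __init__(self) -> None:
        self._value = 0
        self._lock = threading.Lock()

    def increment(self) -> None:
        # The read-modify-write is several bytecode steps; the lock makes
        # the whole update indivisible from other threads' point of view.
        with self._lock:
            self._value += 1

    @property
    def value(self) -> int:
        with self._lock:
            return self._value


def count_in_threads(workers: int, per_worker: int) -> int:
    counter = Counter()

    def bump() -> None:
        for _ in range(per_worker):
            counter.increment()

    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(bump) for _ in range(workers)]
        for future in futures:
            future.result()  # re-raise any exception from a worker
    return counter.value


if __name__ == "__main__":
    print(count_in_threads(workers=4, per_worker=10_000))
```

The lock protects the invariant on both regular and free-threaded builds, which is the reason to rely on it rather than on GIL timing.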

When to use / When NOT to use

Use threads for I/O waits and processes for CPU-bound Python when portable multi-core scaling is required today.

Do not rewrite code around the GIL before measuring the real bottleneck and the deployment runtime you actually use.

Further reading

  • Official docs: threading
  • Official docs: concurrent.futures
  • Official docs: free-threading HOWTO
  • PEP 684: per-interpreter GIL
  • PEP 703: making the GIL optional
  • CPython source: GIL support
Measured concurrency outcomes

This notebook separates two different questions: how CPython behaves on CPU-bound bytecode work and how it behaves when tasks mostly wait on blocking I/O. It shows which execution model wins for each case.

Winner: processes, 69.67 ms @ 4 workers (fastest CPU path)

[Chart: CPU-bound scaling. Time axis with ticks at 0.0 ms, 97 ms, and 194 ms; measured at 1, 2, and 4 workers; series: sequential, threads, processes.]

METRICS
  • Fastest CPU path: processes @ 4 workers
  • Threads vs sequential: 1.04x of sequential
  • Processes @ 4 workers: 69.67 ms
  • Largest worker set: 4 workers
NOTES

What this tests: CPU-bound integer arithmetic dispatched sequentially, across threads, and across processes. All paths do the same work; the question is whether more workers means more throughput under CPython's GIL.

Why processes won for CPU: the GIL allows only one thread at a time to execute Python bytecode per interpreter. Threads on CPU-bound work compete for the GIL, adding contention overhead without parallelism. Each process runs its own interpreter with its own GIL, so processes do not contend with one another.

The surprise: threads are 1.04x of sequential at 4 workers, not faster. Many developers expect threads to speed up CPU work automatically. The GIL prevents this: threads help for I/O, not CPU-bound Python loops.

Takeaway: for CPU-bound Python, use multiprocessing or write the hot path in a C extension that releases the GIL. Threads are for I/O overlap, not CPU parallelism in regular CPython.

TEST ENVIRONMENT
Python Version: 3.12.3
Machine: x86_64