The GIL, threads, and performance

The GIL is a constraint of the CPython runtime, not a blanket verdict on threads. In a default build it lets only one thread execute Python bytecode at a time, but blocking waits and some native code release it, so other threads can run in the meantime.

Core answer

Use threads for waiting-heavy synchronous work and shared-memory orchestration. Use processes, interpreter isolation, native code, or carefully evaluated free-threaded builds when CPU-bound Python throughput needs multiple cores.

# [CURRENT - 3.10-3.14] Works on Python 3.10+
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from time import sleep


@dataclass(frozen=True, slots=True)
class Request:
    name: str
    wait_seconds: float


def fetch(request: Request) -> str:
    # sleep() stands in for blocking I/O; it releases the GIL while waiting.
    sleep(request.wait_seconds)
    return request.name


with ThreadPoolExecutor(max_workers=2) as pool:
    # The two waits overlap, so the total is about one wait, not two.
    print(list(pool.map(fetch, [Request("a", 0.01), Request("b", 0.01)])))

Why this design exists

The classic GIL made CPython's object model, reference counting, and many runtime invariants simpler to keep safe while threads share objects. The price of that convenience is parallel execution of Python bytecode. PEP 684 (a per-interpreter GIL) and PEP 703 (an optional free-threaded build) show the modern direction, and both required deliberate runtime changes.

Mechanics and CPython internals

Regular CPython threads share one interpreter GIL. A thread executing Python bytecode must hold it; blocking I/O and C extensions can release it. Pure Python CPU loops therefore mostly time-slice on one core rather than scale across cores. Free-threaded CPython changes that implementation story, but it is a runtime and ecosystem choice, not a retroactive guarantee for default builds.

# [CURRENT - 3.10-3.14] Works on Python 3.10+
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True, slots=True)
class CpuJob:
    size: int


def crunch(job: CpuJob) -> int:
    # Pure Python arithmetic with no blocking call that would release the GIL.
    total = 0
    for value in range(job.size):
        total += value * value
    return total


jobs = [CpuJob(20_000), CpuJob(20_000)]
with ThreadPoolExecutor(max_workers=2) as pool:
    # The results are correct, but on a default build the two jobs take
    # turns instead of running on two cores.
    print(list(pool.map(crunch, jobs)))
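
Because free-threading is a property of the runtime you deploy, it can help to check it at startup rather than assume it. The following is a minimal sketch, not a definitive test: `describe_gil` is a hypothetical helper name, and it relies on `sys._is_gil_enabled()`, which exists only on Python 3.13 and later, so the code falls back to "GIL enabled" when the function is missing.

```python
import sys
import sysconfig


def describe_gil() -> str:
    """Return a short label for the GIL status of the running interpreter."""
    # Py_GIL_DISABLED is 1 on free-threaded builds and 0 or None elsewhere.
    free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
    if not free_threaded_build:
        return "default build, GIL enabled"
    # A free-threaded build can still run with the GIL switched back on,
    # so ask the interpreter when the probe is available (3.13+).
    probe = getattr(sys, "_is_gil_enabled", None)
    gil_enabled = True if probe is None else bool(probe())
    state = "re-enabled" if gil_enabled else "disabled"
    return f"free-threaded build, GIL {state}"


print(describe_gil())
```

On a default build the helper always reports the GIL as enabled, which matches the time-slicing behavior described above.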

Complexity and tradeoffs

Threads avoid pickling and share memory, but shared mutable state needs synchronization. Processes buy separate runtimes and multi-core execution for CPU-bound Python at the cost of serialization, startup, and memory. A useful measurement classifies where the time goes: Python bytecode, native code that releases the GIL, or external waits.
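
One rough way to start that classification is to compare wall-clock time with process CPU time for the same call. This is a sketch under stated assumptions: `classify`, `cpu_loop`, and `waiting` are illustrative names, and the comparison only separates waiting from computing. It cannot by itself tell Python bytecode apart from native code that releases the GIL, which needs a profiler.

```python
from collections.abc import Callable
from time import perf_counter, process_time, sleep


def classify(work: Callable[[], object]) -> tuple[float, float]:
    """Return (wall_seconds, cpu_seconds) for one call to work()."""
    wall_start, cpu_start = perf_counter(), process_time()
    work()
    return perf_counter() - wall_start, process_time() - cpu_start


def cpu_loop() -> int:
    # CPU-bound: wall time and CPU time should be close to each other.
    return sum(value * value for value in range(200_000))


def waiting() -> None:
    # Wait-bound: wall time passes while almost no CPU time is used.
    sleep(0.05)


for label, work in [("cpu_loop", cpu_loop), ("waiting", waiting)]:
    wall, cpu = classify(work)
    print(f"{label}: wall={wall:.3f}s cpu={cpu:.3f}s")
```

When CPU time is far below wall time, the work is mostly waiting and threads are a reasonable fit. When the two are close, the work is computing, and the tradeoffs above decide between processes and other options.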

Idiomatic patterns and refactoring

Refactor architecture around workload shape, not GIL slogans.

# [CURRENT - 3.10-3.14] Works on Python 3.10+
from concurrent.futures import ProcessPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True, slots=True)
class Batch:
    values: tuple[int, ...]


def sum_squares(batch: Batch) -> int:
    # Defined at module level so worker processes can import it by name.
    return sum(value * value for value in batch.values)


def run_parallel() -> list[int]:
    batches = [Batch(tuple(range(10_000))), Batch(tuple(range(10_000, 20_000)))]
    with ProcessPoolExecutor(max_workers=2) as pool:
        # Each Batch is pickled, sent to a worker, and the result is sent back.
        return list(pool.map(sum_squares, batches))


# The guard keeps workers from re-running the pool when they import this module.
if __name__ == "__main__":
    print(run_parallel())

Common mistakes and edge cases

The GIL does not make application-level shared-state protocols safe. It does not make async irrelevant. It does not predict extension-module behavior unless you know where that extension releases the lock.
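
The first point can be illustrated with a shared counter. An update such as `value += 1` is a load, an add, and a store, and a thread switch may fall between those steps, so the application still needs its own lock. The sketch below uses hypothetical names (`Counter`, `run`) and shows one common way to guard the update, not the only one.

```python
from concurrent.futures import ThreadPoolExecutor
from threading import Lock


class Counter:
    """A shared counter whose read-modify-write step is guarded by a lock."""

    def __init__(self) -> None:
        self._lock = Lock()
        self.value = 0

    def add(self, amount: int) -> None:
        # Without the lock, another thread may run between the load and
        # the store and one of the two updates would be lost.
        with self._lock:
            self.value += amount


def run(workers: int = 4, increments: int = 10_000) -> int:
    counter = Counter()

    def bump(_worker_index: int) -> None:
        for _ in range(increments):
            counter.add(1)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Consume the iterator so any worker exception is re-raised here.
        list(pool.map(bump, range(workers)))
    return counter.value


print(run())  # 40000
```

With the lock in place the total is always `workers * increments`, on default and free-threaded builds alike.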

When to use / When NOT to use

Use threads for I/O waits and processes for CPU-bound Python when portable multi-core scaling is required today.

Do not rewrite code around the GIL before measuring the real bottleneck and the deployment runtime you actually use.
