The GIL is a CPython runtime constraint, not a blanket verdict on threads. It serializes Python bytecode execution in a regular interpreter while many waits and some native work still release it.
Core answer
Use threads for waiting-heavy synchronous work and shared-memory orchestration. Use processes, interpreter isolation, native code, or carefully evaluated free-threaded builds when CPU-bound Python throughput needs multiple cores.

# [CURRENT - 3.10-3.14] Works on Python 3.10+
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from time import sleep


@dataclass(frozen=True, slots=True)
class Request:
    name: str
    wait_seconds: float


def fetch(request: Request) -> str:
    sleep(request.wait_seconds)
    return request.name


with ThreadPoolExecutor(max_workers=2) as pool:
    print(list(pool.map(fetch, [Request("a", 0.01), Request("b", 0.01)])))

Why this design exists
The classic GIL made CPython's object model, reference counting, and many runtime invariants simpler to keep safe while threads share objects. That convenience costs Python-bytecode parallelism. PEP 684 and PEP 703 show the modern direction: per-interpreter parallelism and optional free-threaded builds require deliberate runtime changes.
Mechanics and CPython internals
Regular CPython threads share one interpreter GIL. A thread executing Python bytecode must hold it; blocking I/O calls and C extensions can release it while they wait or compute. Pure Python CPU loops therefore take turns on the GIL rather than scale across cores: the interpreter asks the running thread to give up the lock after a switch interval (5 ms by default, see sys.getswitchinterval()). Free-threaded CPython changes that implementation story, but it is a runtime and ecosystem choice, not a retroactive guarantee for default builds.
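
Since the behavior described above depends on which build is running, it can help to inspect the interpreter before reasoning about threads. The following is a minimal sketch, not a definitive check. `describe_runtime` is a hypothetical helper name introduced here. It assumes that `sysconfig.get_config_var("Py_GIL_DISABLED")` and `sys._is_gil_enabled()` (available on Python 3.13+) are the relevant signals, and it assumes the GIL is on whenever the latter is missing.

```python
import sys
import sysconfig


def describe_runtime() -> dict[str, object]:
    """Report GIL-related facts about the interpreter running this code."""
    # Py_GIL_DISABLED is 1 on free-threaded builds; 0 or None elsewhere.
    free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
    # sys._is_gil_enabled() exists on 3.13+; assume a GIL when it is absent.
    probe = getattr(sys, "_is_gil_enabled", None)
    gil_enabled = probe() if probe is not None else True
    return {
        "version": f"{sys.version_info.major}.{sys.version_info.minor}",
        "free_threaded_build": free_threaded_build,
        "gil_enabled": gil_enabled,
        # Interval after which a running thread is asked to release the GIL.
        "switch_interval_seconds": sys.getswitchinterval(),
    }


if __name__ == "__main__":
    for key, value in describe_runtime().items():
        print(f"{key}: {value}")
```

A free-threaded build can still run with the GIL enabled, for example when an extension module that does not declare free-threading support is imported, which is why the sketch reports the build flag and the live state separately.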
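
# [CURRENT - 3.10-3.14] Works on Python 3.10+
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True, slots=True)
class CpuJob:
    size: int


def crunch(job: CpuJob) -> int:
    # Pure Python arithmetic: the thread holds the GIL while this loop runs.
    total = 0
    for value in range(job.size):
        total += value * value
    return total


jobs = [CpuJob(20_000), CpuJob(20_000)]

# On a GIL-enabled build the two jobs take turns on the interpreter:
# the results are correct, but the threads add no multi-core speedup.
with ThreadPoolExecutor(max_workers=2) as pool:
    print(list(pool.map(crunch, jobs)))

Complexity and tradeoffs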
Threads avoid pickling and share memory, but shared mutable state needs synchronization. Processes buy separate runtimes and multi-core execution for CPU-bound Python at the cost of serialization, startup, and memory. A useful measurement classifies where the time actually goes: Python bytecode, native code that releases the GIL, or external waits.
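
One rough way to start that classification is to compare CPU time with wall-clock time for a single call. The sketch below is an illustration under stated assumptions, not a profiler: `cpu_fraction`, `wait_heavy`, and `cpu_heavy` are hypothetical names, `sleep` stands in for a real blocking wait, and `process_time` counts CPU time for the whole process, so the number is only meaningful when nothing else in the process is busy.

```python
from collections.abc import Callable
from time import perf_counter, process_time, sleep


def cpu_fraction(task: Callable[[], object]) -> float:
    """Return CPU time divided by wall-clock time for one call of task."""
    wall_start, cpu_start = perf_counter(), process_time()
    task()
    wall = perf_counter() - wall_start
    cpu = process_time() - cpu_start
    return cpu / wall if wall > 0 else 0.0


def wait_heavy() -> None:
    sleep(0.05)  # stands in for a blocking network or disk wait


def cpu_heavy() -> int:
    # Pure Python arithmetic that keeps the interpreter busy.
    return sum(value * value for value in range(1_000_000))


if __name__ == "__main__":
    print(f"wait_heavy: {cpu_fraction(wait_heavy):.2f}")
    print(f"cpu_heavy:  {cpu_fraction(cpu_heavy):.2f}")
```

A fraction near zero suggests the call mostly waits, which is where threads tend to help. A fraction near one says the call is CPU-bound, but not whether that time is spent in Python bytecode or in native code that releases the GIL; separating those two usually takes a profiler or knowledge of the extension involved.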
Idiomatic patterns and refactoring
Refactor architecture around workload shape, not GIL slogans.

# [CURRENT - 3.10-3.14] Works on Python 3.10+
from concurrent.futures import ProcessPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True, slots=True)
class Batch:
    values: tuple[int, ...]


def sum_squares(batch: Batch) -> int:
    return sum(value * value for value in batch.values)


def run_parallel() -> list[int]:
    batches = [Batch(tuple(range(10_000))), Batch(tuple(range(10_000, 20_000)))]
    with ProcessPoolExecutor(max_workers=2) as pool:
        return list(pool.map(sum_squares, batches))


if __name__ == "__main__":
    print(run_parallel())

Common mistakes and edge cases
The GIL does not make application-level shared-state protocols safe: a thread switch can land between the steps of a check-then-act or read-modify-write sequence. It does not make async irrelevant. It does not predict extension-module behavior unless you know where that extension releases the lock.
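
The first point can be made concrete with a shared counter. The sketch below shows one conventional way to protect a read-modify-write step with `threading.Lock`; `Counter` and `count_in_threads` are hypothetical names chosen for illustration, and a lock is only one of several possible designs (a queue or per-thread partial sums would also work).

```python
from concurrent.futures import ThreadPoolExecutor
from threading import Lock


class Counter:
    """A shared counter whose read-modify-write step is guarded by a lock."""

    def __init__(self) -> None:
        self._value = 0
        self._lock = Lock()

    def increment(self) -> None:
        # `self._value += 1` is a load, an add, and a store; the lock keeps
        # another thread from interleaving between those steps.
        with self._lock:
            self._value += 1

    @property
    def value(self) -> int:
        with self._lock:
            return self._value


def count_in_threads(workers: int, increments_per_worker: int) -> int:
    counter = Counter()

    def work(_worker_index: int) -> None:
        for _ in range(increments_per_worker):
            counter.increment()

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Drain the iterator so any exception raised in a worker surfaces here.
        list(pool.map(work, range(workers)))
    return counter.value


if __name__ == "__main__":
    print(count_in_threads(workers=4, increments_per_worker=10_000))  # 40000
```

Without the lock, whether updates are lost depends on the interpreter version, the build, and timing, so an unguarded counter that happens to pass a test is not evidence that the protocol is safe.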
When to use / When NOT to use
Use threads for I/O waits and processes for CPU-bound Python when portable multi-core scaling is required today.
Do not rewrite code around the GIL before measuring the real bottleneck and the deployment runtime you actually use.