The GIL, threads, and performance

Why CPython has a GIL, when it blocks scaling, and when threads still work well

The GIL (Global Interpreter Lock) is a single mutex in CPython that prevents two threads from executing Python bytecode simultaneously. It exists because CPython's memory management is not thread-safe by default. I/O-bound workloads release the GIL during blocking calls, so threads overlap effectively. CPU-bound workloads hold it, limiting parallelism to one thread at a time. `ProcessPoolExecutor` bypasses the GIL by using separate processes. Python 3.12 introduced a per-interpreter GIL (PEP 684), where each sub-interpreter can have its own GIL. Python 3.13 added a free-threaded build mode (PEP 703) where the GIL can be disabled. Python 3.14 adds `InterpreterPoolExecutor` for running tasks across sub-interpreters. Compare GIL threading with async cooperative concurrency: /async-foundations-awaitables. See TaskGroup for structured async concurrency: /asyncio-task-groups.

The GIL, or global interpreter lock, is one of the most important runtime constraints in CPython. It affects how threads behave, why some code scales across cores and some does not, and why you have heard "Python threading is slow" without ever getting the full story.

Think of the GIL like a single-lane bridge. Only one car (thread) crosses the bridge (executes Python bytecode) at a time. If a car stops to enjoy the view (waits on I/O), it pulls onto a rest area and releases the bridge for other cars. That is why I/O-bound threads work well but CPU-bound threads queue up. Whether the bridge becomes a bottleneck depends on what you are doing while on it.

Core answer

In a regular CPython build, the GIL allows only one thread at a time to execute Python bytecode in a given interpreter.

That has three immediate consequences:

  • pure-Python CPU-bound threads usually do not scale across cores the way people first expect
  • I/O-bound threads can still help a lot because the GIL is released while waiting on I/O
  • true multi-core Python execution usually requires another model:
    • processes
    • multiple interpreters with separate GILs
    • native code that releases the GIL
    • or a free-threaded CPython build
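
# [CURRENT - 3.10-3.14] Works on Python 3.x
from concurrent.futures import ThreadPoolExecutor

def cpu_task(n):
    # Pure-Python loop: the running thread holds the GIL while it executes.
    total = 0
    for i in range(n):
        total += i
    return total

# Two threads, but in a regular build only one runs bytecode at a time.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(cpu_task, [5_000_000, 5_000_000]))
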
See What the GIL Changes

Compare CPU-bound threads, I/O-bound threads, process-based parallelism, and free-threaded CPython so the GIL story stays concrete.

Why the GIL exists

The official Python glossary describes the GIL as the mechanism used by CPython to ensure that only one thread executes Python bytecode at a time. The historical tradeoff is explicit:

  • CPython's object model and runtime become much simpler to implement correctly
  • critical built-in structures are protected from many forms of concurrent corruption
  • but Python bytecode execution loses much of the parallelism available on multi-core machines

This matters because CPython objects are deeply shared runtime structures:

  • reference counts change constantly
  • containers such as dict, list, and set have mutable internal state
  • many operations can allocate, deallocate, resize, and invoke arbitrary Python code

The GIL centralizes a large part of that safety story in the regular build.

Without it, CPython needs more fine-grained thread-safety machinery around:

  • reference counting
  • object memory management
  • container access
  • specialization caches

That is exactly why free-threaded CPython is a major interpreter project rather than a tiny switch.
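To make "reference counts change constantly" concrete, the following minimal sketch watches one object's count move when a list starts holding it. The helper name refcount_delta is hypothetical, and absolute counts reported by sys.getrefcount vary between CPython versions, so only the difference is examined.

```python
import sys

def refcount_delta():
    """Return how much an object's refcount grows when a list stores it."""
    payload = object()
    holders = []
    before = sys.getrefcount(payload)
    holders.append(payload)   # the list now owns one more reference
    after = sys.getrefcount(payload)
    return after - before

print(refcount_delta())
```

Even this trivial append touches shared bookkeeping on the object, which is the kind of update the GIL serializes in the regular build.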

CPython internals

In CPython 3.12+, the GIL implementation lives across the evaluation loop, thread-state code, and Python/ceval_gil.c. In the standard CPython build (Py_GIL_DISABLED not defined), a thread must hold the GIL while executing Python bytecode, and the runtime releases it around blocking operations.

Acquisition model. CPython uses a periodic-check approach. The GIL is not released after every bytecode instruction. Instead, the eval loop and GIL scheduler cooperate through internal eval-breaker state. The public tuning knob is sys.setswitchinterval(), which sets the ideal duration of a thread's timeslice (default: 5 milliseconds on CPython 3.12). The exact handoff timing is an implementation detail and can also be affected by blocking calls and operating-system scheduling.

Release points. The GIL is explicitly released in these situations (for example in CPython's eval, GIL, and thread support code):

  • before any blocking I/O call (file read/write, socket operations, time.sleep())
  • during computationally intensive native code in some extension modules (e.g., hashing, compression, regex matching)
  • when Py_BEGIN_ALLOW_THREADS / Py_END_ALLOW_THREADS macros are used in C extensions

This is why the right mental model is:

  • threads are still concurrent
  • but Python bytecode execution in one interpreter is serialized by the GIL in the regular build

The switch interval:

# [CURRENT - 3.10-3.14] Works on Python 3.x
import sys
print(sys.getswitchinterval())

The switch interval is a runtime tuning knob for how often Python threads are given a chance to switch. Actual scheduling depends on the operating system and on whether the current thread reaches a point where switching can happen — it is a best-effort interval, not a deterministic guarantee.
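If you want to experiment with the knob, one possible pattern is to change it temporarily and restore the previous value afterwards. This is a sketch only: the helper name with_switch_interval and the 0.001 value are illustrative assumptions, not recommendations.

```python
import sys

def with_switch_interval(seconds, func):
    """Run func() with a temporary switch interval, then restore the old one."""
    previous = sys.getswitchinterval()
    sys.setswitchinterval(seconds)
    try:
        return func()
    finally:
        sys.setswitchinterval(previous)

original = sys.getswitchinterval()
observed = with_switch_interval(0.001, sys.getswitchinterval)
print(original, observed, sys.getswitchinterval())
```

Restoring the value in a finally block keeps an experiment from leaking a process-wide setting into unrelated code.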

Per-interpreter GIL (PEP 684, Python 3.12+). Py_NewInterpreter() has existed since Python 1.5, but before 3.12 all sub-interpreters shared the same GIL. Starting in Python 3.12, a sub-interpreter can be given its own GIL by creating it with Py_NewInterpreterFromConfig() and setting the gil field of PyInterpreterConfig to PyInterpreterConfig_OWN_GIL. Interpreters created with plain Py_NewInterpreter() keep the legacy behavior and share the main interpreter's GIL. The GIL is tracked per interpreter through the PyInterpreterState struct (see Python/pystate.c), so threads running in interpreters that own their GIL do not contend for the same lock. InterpreterPoolExecutor (Python 3.14) provides a high-level API for this.

Distinguish "concurrency" (dealing with many things at once) from "parallelism" (doing many things at once). The GIL makes this distinction concrete: threads provide concurrency for I/O-bound work but not parallelism for CPU-bound Python bytecode. Use processes for CPU-bound parallelism in the regular CPython build.

GIL guarantees and limits

The GIL protects CPython runtime internals. What it guarantees:

  • individual bytecode-level operations are internally consistent
  • the interpreter itself avoids corruption from concurrent access
  • most Py_DECREF calls and allocation paths are safe without additional locking

What it does not replace:

  • x += 1 spans multiple bytecode steps and is not atomic
  • compound operations on shared data still need explicit synchronization
  • your threaded application code still needs locks for shared mutable state

Python only guarantees atomicity where it is explicitly documented. For everything else, use threading.Lock.

The GIL protects the interpreter from corruption. It does not protect your application's shared-state logic. Use explicit synchronization for shared mutable data.
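As a minimal sketch of explicit synchronization for shared mutable state, the counter below guards its read-modify-write step with threading.Lock. The names SafeCounter and run_workers are hypothetical and exist only for this illustration.

```python
import threading

class SafeCounter:
    """Counter whose increment is protected by an explicit lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self.value = 0

    def increment(self):
        # The read-modify-write happens as one unit while the lock is held.
        with self._lock:
            self.value += 1

def run_workers(counter, n_threads=4, n_increments=10_000):
    """Increment the counter from several threads and return the final value."""
    def work():
        for _ in range(n_increments):
            counter.increment()

    threads = [threading.Thread(target=work) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value

print(run_workers(SafeCounter()))
```

With the lock in place the final value equals n_threads * n_increments regardless of how the threads are scheduled; without it, that outcome is not guaranteed.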

How it affects performance

The main performance effect is simple:

  • CPU-bound pure-Python threads compete for one interpreter lock
  • I/O-bound threads often overlap effectively because the waiting thread releases the GIL

Representative local CPython 3.12.3 measurements on this machine:

Workload | Shape | Elapsed time | Main reason
--- | --- | --- | ---
CPU loop twice | serial | 1.088 s | No thread coordination cost; just one thread running Python bytecode at a time anyway
CPU loop twice | 2 threads | 1.057 s | Threads time-slice behind one GIL, so there is little or no multi-core gain
CPU loop twice | 2 processes | 0.591 s | Separate runtimes can execute on separate cores
sleep(0.25) twice | serial | 0.500 s | Waiting happens one after another
sleep(0.25) twice | 2 threads | 0.251 s | Waiting overlaps because blocked threads release the GIL

These are local measurements, not language guarantees. The point is the shape:

  • CPU-bound threads: usually no real Python-bytecode parallelism
  • waiting-heavy threads: often very effective
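The waiting-heavy rows can be approximated with a small timing harness. This is a sketch under assumptions: the helper names (wait_task, timed, run_serial, run_threaded) are hypothetical, it is not the script that produced the measurements above, and exact timings depend on the machine.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def wait_task(seconds):
    time.sleep(seconds)   # blocking wait; the GIL is released meanwhile
    return seconds

def timed(func):
    """Return the wall-clock seconds taken by func()."""
    start = time.perf_counter()
    func()
    return time.perf_counter() - start

def run_serial(delay=0.25, count=2):
    for _ in range(count):
        wait_task(delay)

def run_threaded(delay=0.25, count=2):
    with ThreadPoolExecutor(max_workers=count) as pool:
        list(pool.map(wait_task, [delay] * count))

serial_s = timed(run_serial)
threaded_s = timed(run_threaded)
print(f"serial: {serial_s:.3f} s, threaded: {threaded_s:.3f} s")
```

The threaded run should finish in roughly the time of a single wait, because both sleeps overlap instead of running back to back.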

Threads, I/O, native code, and bytecode

The glossary states two important facts:

  • the GIL is always released during I/O
  • some extension modules release it during computationally intensive native work such as compression or hashing

That means the real question is not just "am I using threads?" The real question is:

  • where is the time spent?
  • in Python bytecode?
  • in blocking waits?
  • in native code that releases the GIL?

This is where dis becomes useful. If the hot path is dominated by Python bytecode, dis can help explain the execution shape. If the hot path is mostly C-level work or I/O, bytecode may matter much less than the native runtime boundary.
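For instance, a quick way to see the execution shape of a single statement is to list its instruction names. The function bump and the module-level counter are hypothetical; exact opcode names differ between CPython versions, so the sketch prints them rather than assuming a fixed sequence.

```python
import dis

counter = 0

def bump():
    global counter
    counter += 1

# Collect the instruction names; exact opcodes differ between versions.
opnames = [ins.opname for ins in dis.get_instructions(bump)]
print(opnames)
```

The load of the global and the store back to it appear as separate instructions, which is also why counter += 1 is not atomic across threads.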

What to use instead when threads are the wrong tool

If the workload is CPU-bound Python code, your main alternatives are:

  1. ProcessPoolExecutor or multiprocessing
  2. multiple interpreters with separate GILs
  3. native extensions or vectorized libraries that release the GIL
  4. free-threaded CPython builds
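The first alternative can be sketched with ProcessPoolExecutor. The names cpu_task and run_in_processes are illustrative, and the block only defines the helpers, because process pools should be started from under a main guard in a real script.

```python
from concurrent.futures import ProcessPoolExecutor

def cpu_task(n):
    """Pure-Python CPU-bound loop."""
    total = 0
    for i in range(n):
        total += i
    return total

def run_in_processes(sizes, max_workers=2):
    """Run cpu_task for each size in separate worker processes."""
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(cpu_task, sizes))

# Call run_in_processes from under a main guard in a real script, e.g.:
#     if __name__ == "__main__":
#         print(run_in_processes([5_000_000, 5_000_000]))
```

Each worker process has its own interpreter and its own GIL, at the cost of pickling arguments and results across the process boundary.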

Python 3.14 adds InterpreterPoolExecutor, which runs tasks in separate interpreters. The docs explicitly describe its main benefit: each interpreter has its own GIL, so code in one interpreter can run on one CPU core while code in another interpreter runs unblocked on a different core.

That is an important distinction:

  • threads in one interpreter share one GIL
  • separate interpreters can each have their own GIL

The tradeoff is stronger isolation and more deliberate data movement.
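A minimal sketch of the interpreter-pool approach follows, assuming Python 3.14 or newer for the real path. The helper name powers_in_interpreters is hypothetical, the built-in pow is chosen only to keep data movement between interpreters simple, and on older versions the helper returns None instead of running anything.

```python
try:
    # Available in Python 3.14+ only.
    from concurrent.futures import InterpreterPoolExecutor
except ImportError:
    InterpreterPoolExecutor = None

def powers_in_interpreters(bases, exponents, max_workers=2):
    """Compute pow(b, e) pairs in separate interpreters, if supported."""
    if InterpreterPoolExecutor is None:
        return None   # older Python: caller should fall back to processes
    with InterpreterPoolExecutor(max_workers=max_workers) as pool:
        # A built-in callable keeps cross-interpreter transfer simple.
        return list(pool.map(pow, bases, exponents))

print(powers_in_interpreters([2, 3], [10, 2]))
```

The executor interface mirrors the thread and process pools, so the main design question is what data has to cross the interpreter boundary.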

Free-threaded CPython

As of Python 3.13, CPython supports a free-threaded build based on PEP 703. The standard docs describe it as a separate configuration where the GIL is disabled.

Important caveats:

  • this is not the default build
  • compatibility and performance tradeoffs still matter
  • some extensions may re-enable the GIL at runtime or may not support the free-threaded build yet

The free-threaded story is therefore not:

  • "Python finally has no GIL everywhere"

The real story is:

  • CPython now has an evolving opt-in build configuration where the GIL can be disabled
  • making that safe requires substantial internal runtime changes, including per-object locking, biased reference counting, and quiescent-state-based reclamation (QSBR), with supporting changes in core files such as Objects/object.c
  • deployment, extension compatibility, and single-thread costs still matter
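To check which world a given interpreter is in, one possible sketch combines the Py_GIL_DISABLED build variable with sys._is_gil_enabled(), which exists on Python 3.13 and newer. The helper name gil_status is hypothetical, and older versions are assumed to always run with the GIL.

```python
import sys
import sysconfig

def gil_status():
    """Report whether this build is free-threaded and whether the GIL is on."""
    # Py_GIL_DISABLED is 1 for free-threaded builds; unset or 0 otherwise.
    free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
    # sys._is_gil_enabled() exists on 3.13+; older versions always have the GIL.
    checker = getattr(sys, "_is_gil_enabled", None)
    gil_enabled = checker() if checker is not None else True
    return {"free_threaded_build": free_threaded_build, "gil_enabled": gil_enabled}

print(gil_status())
```

Reporting both values matters because a free-threaded build can still run with the GIL re-enabled at runtime, for example after loading an extension that does not support it.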

Version context

Current project guidance targets Python 3.10-3.14. Python 3.9 and below are End-of-Life.

Important version markers:

  • regular CPython in the supported range still has the classic GIL behavior by default
  • PEP 684 introduced the per-interpreter GIL groundwork in CPython 3.12
  • InterpreterPoolExecutor is new in Python 3.14
  • PEP 703 introduced free-threaded CPython support starting in Python 3.13

When teaching or optimizing, always say which world you are talking about:

  • default CPython build
  • per-interpreter parallelism
  • free-threaded build

Those are not the same runtime story.

Edge cases and gotchas

The GIL does not make async obsolete, and async does not remove the GIL. They solve different problems.

  • the GIL constrains threaded Python bytecode execution
  • asyncio is cooperative concurrency in one thread unless you explicitly offload work

See the related guides on async cooperative concurrency and on TaskGroup for structured async concurrency.

Another trap is assuming that "thread-safe built-ins" means business-logic safety. A dict not corrupting itself is not the same as your threaded update protocol being correct.

Finally, do not generalize from one benchmark blindly. Threads can still win when:

  • the program waits on I/O
  • the work happens in native code that releases the GIL
  • the dominant cost is not Python bytecode execution

Production usage

Use this rule set:

  • choose threads for I/O-bound work and shared-memory coordination
  • choose processes for CPU-bound Python code when you need reliable multi-core scaling
  • consider multiple interpreters when interpreter isolation is acceptable and you want true parallelism without full process separation
  • evaluate free-threaded CPython deliberately, as a deployment/runtime choice rather than a default assumption

When performance matters:

  1. classify the workload as CPU-bound Python, native-code-heavy, or waiting-heavy
  2. measure the actual bottleneck
  3. choose the concurrency model that matches the bottleneck
  4. inspect bytecode only when interpreter overhead is plausibly relevant

Further depth

  • Glossary: global interpreter lock
  • threading — Thread-based parallelism
  • sys.getswitchinterval
  • concurrent.futures
  • Python support for free threading
  • PEP 684: A Per-Interpreter GIL
  • PEP 703: Making the Global Interpreter Lock Optional in CPython
  • CPython source: Python/ceval.c
  • CPython source: Python/ceval_gil.c
  • CPython source: Python/pystate.c

Measured concurrency outcomes

This notebook separates two different questions: how CPython behaves on CPU-bound bytecode work and how it behaves when tasks mostly wait on blocking I/O. It shows which execution model wins for each case.

Winner: processes, 69.67 ms at 4 workers (fastest CPU path)

[Chart: CPU-bound scaling. Elapsed time (0 to 194 ms) at 1, 2, and 4 workers for sequential, threads, and processes.]

Metrics

  • Fastest CPU path: processes @ 4 workers
  • Threads vs sequential: 1.04x of sequential
  • Processes @ 4 workers: 69.67 ms
  • Largest worker set: 4 workers

Notes

What this tests — CPU-bound integer arithmetic dispatched sequentially, across threads, and across processes. All paths do the same work; the question is whether more workers means more throughput under CPython's GIL.

Why processes won for CPU — the GIL allows only one thread at a time to execute Python bytecode per interpreter. Threads on CPU-bound work compete for the GIL, adding contention overhead without parallelism. Processes bypass the GIL entirely, each running its own interpreter.

The surprise — threads are 1.04x of sequential at 4 workers, not faster. Many developers expect threads to speed up CPU work automatically. The GIL prevents this — threads help for I/O, not CPU-bound Python loops.

Takeaway — for CPU-bound Python, use multiprocessing or write the hot path in a C extension that releases the GIL. Threads are for I/O overlap, not CPU parallelism in regular CPython.

Test environment

  • Python version: 3.12.3
  • Machine: x86_64