Python bytecode and the dis module

Read opcode output, inspect code objects, and understand when bytecode explains speed

Two snippets can produce the same result while one emits twice the opcodes. Bytecode shows you why. CPython compiles source into a code object containing an instruction stream. `dis.dis()` disassembles that stream into opcodes like `LOAD_FAST`, `LOAD_GLOBAL`, and `BINARY_OP`. `LOAD_FAST` is a direct C array access by index into the localsplus array, the fastest kind of variable access. `LOAD_GLOBAL` looks the name up in the module's globals dict and then in builtins, which is slower. `LOAD_DEREF` reads through a closure cell. The frame object exposes `f_locals` as a mapping of local bindings and `f_lasti` as the offset of the last executed instruction, used for tracing and exception handling. PEP 659 (Python 3.11) introduced adaptive specialization, where the interpreter replaces generic opcodes with type-specific versions at runtime.

Python in Depth

An interactive engineering reference for Python internals

Python bytecode is the instruction stream CPython executes after compiling source into a code object. It is like reading the assembly output of a C compiler — you see the exact steps the interpreter takes. Bytecode explains execution shape: which names are loaded as locals versus globals, whether a comprehension gets its own opcode, how Python dispatches a match statement. But it is not a substitute for measurement, algorithm analysis, or knowledge of C-level built-ins.

Core answer

Use the dis module when you need to answer questions like:

  • which names are loaded as locals, globals, or closure cells?
  • is this loop doing repeated attribute lookup and Python-level calls?
  • is there a dedicated opcode such as LIST_APPEND or MATCH_SEQUENCE?
  • did CPython compile this into one straightforward instruction path or several dispatch steps?

# [CURRENT - 3.10-3.14] Works on Python 3.x
import dis

def total(xs):
    return [x * 2 for x in xs]

dis.dis(total)
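
The first question in the list above (locals, globals, or closure cells) can also be answered programmatically. The following is a minimal sketch, not an official recipe: `SCALE`, `make_scaler`, and `load_kinds` are hypothetical names chosen for illustration, and the prefix match on `LOAD_FAST` is an assumption made so that fused or renamed local-load opcodes on newer CPython versions are still counted.

```python
import dis

SCALE = 3  # hypothetical module-level name, read through a global lookup


def make_scaler(offset):
    def scaler(x):
        # x is a local, SCALE is a global, offset is captured from make_scaler
        return x * SCALE + offset
    return scaler


def load_kinds(func):
    """Group the names a function reads by the kind of load opcode used."""
    kinds = {"local": set(), "global": set(), "closure": set()}
    for ins in dis.get_instructions(func):
        if ins.opname.startswith("LOAD_FAST"):
            # Fused local loads on newer versions report a tuple of names.
            names = ins.argval if isinstance(ins.argval, tuple) else (ins.argval,)
            kinds["local"].update(names)
        elif ins.opname == "LOAD_GLOBAL":
            kinds["global"].add(ins.argval)
        elif ins.opname == "LOAD_DEREF":
            kinds["closure"].add(ins.argval)
    return kinds


print(load_kinds(make_scaler(1)))
```

Matching on opcode families rather than exact names keeps a helper like this usable across the versions this guide targets, at the cost of being slightly less precise.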

Keep the boundary clear:

  • bytecode is CPython interpreter work, not machine code
  • fewer or simpler bytecode steps can reduce overhead
  • but real performance depends on data structure choice, object allocation, C-level built-ins, specialization, and algorithmic complexity

What bytecode is

On CPython, source code is compiled into a code object. That code object contains:

  • constants
  • local-variable metadata
  • names
  • free-variable metadata
  • an instruction stream that CPython's bytecode interpreter executes

That is why functions expose attributes such as:

  • __code__.co_consts
  • __code__.co_varnames
  • __code__.co_names
  • __code__.co_freevars

# [CURRENT - 3.10-3.14] Works on Python 3.x
def make_adder(base):
    def add(x):
        return base + x
    return add

fn = make_adder(10)
print(fn.__code__.co_varnames)
print(fn.__code__.co_freevars)
print(fn.__code__.co_consts)

This is practical information: it tells you what the interpreter thinks your function depends on:

  • fast locals
  • global names
  • captured outer-scope names
  • embedded constants

That helps explain why nearby-looking code can execute through meaningfully different paths.
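
As an illustration of "nearby-looking code", the sketch below compares the code-object metadata of two functions that compute the same value. The function names are hypothetical, and binding `math.pi` as a default argument is used here only to show how the metadata changes, not as a recommended optimization.

```python
import math


def area_global(r):
    # 'math' is resolved as a global and 'pi' as an attribute on every call
    return math.pi * r * r


def area_local(r, pi=math.pi):
    # 'pi' is bound once as a default argument and then read as a fast local
    return pi * r * r


for fn in (area_global, area_local):
    code = fn.__code__
    print(fn.__name__, "varnames:", code.co_varnames, "names:", code.co_names)
```

The first function lists `math` and `pi` in `co_names`; the second lists nothing there and carries `pi` in `co_varnames` instead.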

Mechanism: source, code objects, and interpreter dispatch

The simplified CPython pipeline is:

  1. parse source into an abstract syntax tree
  2. compile that tree into a code object
  3. execute the code object's bytecode in the evaluation loop
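
The three steps above can be reproduced by hand with the standard library. This is a simplified sketch: the source string and the `<demo>` filename are placeholders, and the real interpreter does considerably more work around each step.

```python
import ast
import types

source = "result = 6 * 7"

# Step 1: parse source into an abstract syntax tree
tree = ast.parse(source, filename="<demo>", mode="exec")

# Step 2: compile the tree into a code object
code = compile(tree, filename="<demo>", mode="exec")

# Step 3: execute the code object's bytecode in a fresh namespace
namespace = {}
exec(code, namespace)

print(type(tree).__name__, type(code).__name__, namespace["result"])
```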

In current CPython, the instruction set is version-sensitive and implementation-specific. The language guarantee is Python behavior; the exact opcode names are not portable promises.

For example, locals and closure cells are not loaded the same way:

# [CURRENT - 3.10-3.14] Works on Python 3.x
import dis

def outer():
    base = 10
    def inner(x):
        return base + x
    return inner

fn = outer()
dis.dis(fn)

On current CPython, that closure read shows up as LOAD_DEREF, while a plain local read shows up as LOAD_FAST (or a LOAD_FAST variant, depending on the version). The difference reflects where the name is stored: base lives in a closure cell shared with outer, while x lives in the fast-locals array, so the interpreter takes a different path for each.
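
The cell behind that LOAD_DEREF is visible from Python. A short sketch, reusing the same `outer`/`inner` shape as the example above:

```python
import dis


def outer():
    base = 10

    def inner(x):
        return base + x

    return inner


fn = outer()

print(outer.__code__.co_cellvars)         # names outer stores in cells
print(fn.__code__.co_freevars)            # names inner captures from outside
print(fn.__closure__[0].cell_contents)    # value currently held in the cell

inner_ops = {ins.opname for ins in dis.get_instructions(fn)}
print("LOAD_DEREF" in inner_ops)
```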

CPython internals

The evaluation loop (Python/ceval.c). CPython's bytecode interpreter runs a generated dispatch loop inside _PyEval_EvalFrameDefault. The exact dispatch machinery is a CPython implementation detail, but conceptually the loop:

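  1. reads the next instruction through its instruction pointer (next_instr, a local variable of the loop; the current position is saved in _PyInterpreterFrame so that execution can be suspended and resumed)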
  2. decodes the opcode and its argument
  3. dispatches to the matching instruction handler
  4. advances the instruction pointer and repeats
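
The fetch, decode, dispatch, advance cycle above can be sketched as a toy stack machine. This is purely conceptual: the opcode names `PUSH`, `ADD`, `MUL`, and `RETURN` are invented for the example, and CPython's real loop is generated C code with a very different structure.

```python
def run(program):
    """Execute a list of (opname, arg) pairs on a tiny stack machine."""
    stack = []
    pc = 0  # toy instruction pointer
    while pc < len(program):
        opname, arg = program[pc]   # steps 1-2: read and decode
        pc += 1                     # step 4: advance before executing
        if opname == "PUSH":        # step 3: dispatch to a handler
            stack.append(arg)
        elif opname == "ADD":
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right)
        elif opname == "MUL":
            right = stack.pop()
            left = stack.pop()
            stack.append(left * right)
        elif opname == "RETURN":
            return stack.pop()
        else:
            raise ValueError(f"unknown opcode: {opname}")
    raise RuntimeError("program ended without RETURN")


# (2 + 3) * 4
program = [
    ("PUSH", 2), ("PUSH", 3), ("ADD", None),
    ("PUSH", 4), ("MUL", None), ("RETURN", None),
]
print(run(program))
```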

Inline caching and adaptive specialization (CPython 3.11+, PEP 659). Starting in Python 3.11, CPython added a specializing adaptive interpreter. Opcodes like LOAD_ATTR, LOAD_GLOBAL, BINARY_OP, and CALL are "adaptive": they start out as generic versions, and after enough executions (warm-up) CPython replaces them in place with specialized versions (e.g., LOAD_ATTR becomes LOAD_ATTR_SLOT for slot-based attribute access). This is why dis output requested with adaptive=True can differ between a fresh function and the same function after warm-up; the default dis view always shows the generic opcodes.

You can inspect specialization caches with dis.dis(func, show_caches=True, adaptive=True):

# [CURRENT - 3.11-3.14] Requires Python 3.11+ [PEP 659]
import dis

def total(xs):
    return sum(xs)

dis.dis(total, show_caches=True, adaptive=True)

The cache entries show up as inline metadata between opcodes. They are not real opcodes — they are data slots that the specializing interpreter uses to track type observations and specialized replacement targets.
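
To see warm-up in action, one option is to capture the adaptive view before and after calling a function many times. The sketch below assumes a CPython build with the specializing interpreter; `add_ints` and `opnames` are hypothetical helpers, the loop count is an arbitrary "large enough" guess, and the specialized names that appear (if any) depend on the exact version, so none are hard-coded.

```python
import dis
import sys


def add_ints(a, b):
    return a + b


def opnames(func, adaptive):
    """Opcode names for func; the adaptive view only exists on 3.11+."""
    if sys.version_info >= (3, 11):
        return [ins.opname for ins in dis.get_instructions(func, adaptive=adaptive)]
    return [ins.opname for ins in dis.get_instructions(func)]


before = opnames(add_ints, adaptive=True)
for _ in range(10_000):          # warm-up with a stable (int, int) call pattern
    add_ints(1, 2)
after = opnames(add_ints, adaptive=True)
generic = opnames(add_ints, adaptive=False)

print("before :", before)
print("after  :", after)
print("generic:", generic)
```

On an interpreter without specialization, all three lists are expected to be identical.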

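Opcode encoding. Since CPython 3.6, each instruction is a 2-byte code unit: 1 byte for the opcode and 1 byte for the argument. An argument that does not fit in one byte is built up by preceding the instruction with up to three EXTENDED_ARG prefixes, each contributing one more byte. Opcode values fit in the range 0–255. HAVE_ARGUMENT (90 in CPython 3.12; the value is version-specific) marks the boundary between opcodes that ignore their argument byte and those that use it; from 3.12 on, dis.hasarg is the more reliable way to ask that question.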
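
The 2-byte layout can be checked directly against `co_code`. This is a rough sketch for inspection only: on 3.11+ the raw bytes also include inline cache entries, and the generated module with 300 assignments is just a convenient, hypothetical way to push name indexes past 255 so that EXTENDED_ARG appears.

```python
import dis


def scale(x, y):
    return x * y


raw = scale.__code__.co_code
# Walk the byte string two bytes at a time: (opcode, argument byte)
units = [(raw[i], raw[i + 1]) for i in range(0, len(raw), 2)]
for op, arg in units:
    print(dis.opname[op], arg)

# More than 256 distinct names forces argument values that need a prefix.
source = "\n".join(f"v{i} = {i + 1000}" for i in range(300))
big = compile(source, "<demo>", "exec").co_code
extended = dis.opmap["EXTENDED_ARG"]
has_extended = any(big[i] == extended for i in range(0, len(big), 2))
print("EXTENDED_ARG present:", has_extended)
```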

The frame execution model. In CPython 3.11+, the interpreter executes calls with an internal _PyInterpreterFrame. That internal frame stores the code object's local slots, closure cells, and evaluation stack in a compact "locals plus" layout. A Python-level frame object (PyFrameObject, exposed as types.FrameType) is the public view used by debuggers, trace hooks, and introspection APIs.

The public frame object exposes attributes and C API accessors such as:

  • f_locals: a mapping of local variable bindings (a snapshot dict through 3.12; a write-through proxy from 3.13 on, per PEP 667)
  • f_globals: a reference to the module's global dict
  • f_builtins: a reference to the builtins dict
  • f_lasti: the last-executed instruction offset (used for tracing, exception handling, and resume)

The internal _PyInterpreterFrame and its stack layout are CPython implementation details, not stable public API.
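
The public attributes listed above can be read from a live frame. A minimal sketch, assuming CPython (where `inspect.currentframe()` returns a real frame object); `probe` is a hypothetical function written only for this demonstration.

```python
import inspect


def probe(value):
    doubled = value * 2
    frame = inspect.currentframe()  # public frame object for this call
    try:
        return {
            "function": frame.f_code.co_name,
            "locals": sorted(frame.f_locals),          # names only
            "lasti": frame.f_lasti,                    # offset of last instruction
            "same_globals": frame.f_globals is globals(),
            "has_len": "len" in frame.f_builtins,
        }
    finally:
        del frame  # drop the reference to avoid a frame/local reference cycle


info = probe(21)
print(info)
```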

This is why local variable access is fast: LOAD_FAST is a direct C array access by index, while LOAD_GLOBAL requires a dict lookup (globals, then builtins) and LOAD_DEREF requires one extra indirection through a closure cell object.

The dis module is the bridge between high-level syntax and the interpreter's execution model. Use it diagnostically — to answer a specific question about execution shape — rather than as a routine optimization tool.

How to use the dis module

The two most useful entry points are:

  • dis.dis(...) for human-readable disassembly
  • dis.get_instructions(...) for structured instruction objects

# [CURRENT - 3.10-3.14] Works on Python 3.x
import dis

def scale(x, y):
    return x * y

dis.dis(scale)
for ins in dis.get_instructions(scale):
    print(ins.opname, ins.argrepr)

dis.get_instructions is often the better choice if you want to:

  • inspect opcodes programmatically
  • filter specific instruction kinds
  • build your own reports or teaching tools

For plain debugging or education, dis.dis is usually enough.
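
As one example of a small report built on `dis.get_instructions`, the sketch below counts how often each opcode appears in a function. `opcode_histogram` is a hypothetical helper, and the exact counts will vary by CPython version.

```python
import dis
from collections import Counter


def manual(rows):
    out = []
    for row in rows:
        out.append(row * 2)
    return out


def opcode_histogram(func):
    """Count occurrences of each opcode name in func's bytecode."""
    return Counter(ins.opname for ins in dis.get_instructions(func))


hist = opcode_histogram(manual)
for name, count in hist.most_common():
    print(f"{name:<24} {count}")
```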

On current CPython, dis.dis also exposes version-sensitive options for cache/specialization visibility:

# [CURRENT - 3.11-3.14] Requires Python 3.11+ [PEP 659]
import dis

def total(xs):
    return sum(xs)

dis.dis(total)
dis.dis(total, show_caches=True, adaptive=True)

Those extra views are useful when you want to inspect modern CPython's specializing interpreter behavior. They are not portable language-level contracts.

Why bytecode can explain speed

Bytecode becomes especially informative when two snippets are semantically similar but one requires more interpreter work.

List comprehension vs manual append loop is the standard example:

# [CURRENT - 3.10-3.14] Works on Python 3.x
import dis

def manual(rows):
    out = []
    for row in rows:
        out.append(row * 2)
    return out

def comp(rows):
    return [row * 2 for row in rows]

dis.dis(manual)
dis.dis(comp)

On CPython 3.12, the manual loop repeats two steps on every iteration:

  • LOAD_ATTR for append (LOAD_METHOD on 3.10 and 3.11)
  • CALL to invoke the method (CALL_METHOD on 3.10)

while the comprehension uses a dedicated LIST_APPEND path inside its compiled loop.

That helps explain why simple comprehensions often benchmark better. The key point is not "comprehensions are magic." The key point is:

  • fewer Python-level dispatch steps
  • less repeated lookup/call overhead
  • better interpreter-level execution shape for that narrow case
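
The difference in execution shape can be confirmed without reading disassembly by eye. The sketch below assumes only that the comprehension may be compiled either inline or as a nested code object, depending on the CPython version, so it walks nested code objects too; `all_opnames` is a hypothetical helper.

```python
import dis
import types


def all_opnames(code):
    """Opcode names in a code object and in any code objects nested in it."""
    names = {ins.opname for ins in dis.get_instructions(code)}
    for const in code.co_consts:
        if isinstance(const, types.CodeType):
            names |= all_opnames(const)
    return names


def manual(rows):
    out = []
    for row in rows:
        out.append(row * 2)
    return out


def comp(rows):
    return [row * 2 for row in rows]


manual_ops = all_opnames(manual.__code__)
comp_ops = all_opnames(comp.__code__)

print("LIST_APPEND in comp:  ", "LIST_APPEND" in comp_ops)
print("LIST_APPEND in manual:", "LIST_APPEND" in manual_ops)
print("call opcodes in manual:", sorted(n for n in manual_ops if n.startswith("CALL")))
```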

The same style of reasoning helps with:

  • local vs global name access (LOAD_FAST vs LOAD_GLOBAL)
  • closure access (LOAD_DEREF)
  • dedicated matching opcodes in structural pattern matching

Why bytecode does not settle performance by itself

This is the production trap. Bytecode can explain overhead, but it does not fully explain runtime cost.

Reasons:

  • the expensive part may be C-level work inside a builtin or extension
  • allocation and object creation may dominate
  • hash-table behavior may dominate
  • branch predictability and cache locality may dominate
  • algorithmic complexity may dwarf opcode overhead
  • adaptive specialization in CPython 3.11+ can change the effective execution path after warm-up

# [CURRENT - 3.10-3.14] Works on Python 3.x
import timeit

print(timeit.timeit("[x * 2 for x in range(1000)]", number=10000))
print(timeit.timeit("""
out = []
for x in range(1000):
    out.append(x * 2)
""", number=10000))

The right workflow is:

  1. identify a hot path
  2. benchmark or profile it
  3. inspect bytecode if the question is interpreter overhead or execution shape
  4. change the code only if the result still makes sense at the API/readability level
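
One way to see why measurement comes before bytecode in this workflow: the same function, with identical bytecode, can have very different costs depending on the data structure it is handed. In the sketch below, `contains`, the container size, and the repetition count are arbitrary choices made for illustration.

```python
import dis
import timeit


def contains(container, item):
    return item in container


n = 100_000
as_list = list(range(n))
as_set = set(as_list)
missing = -1  # worst case for the list: every element is compared

# The bytecode is the same no matter which container is passed in.
ops = [ins.opname for ins in dis.get_instructions(contains)]
print(ops)

t_list = timeit.timeit(lambda: contains(as_list, missing), number=50)
t_set = timeit.timeit(lambda: contains(as_set, missing), number=50)
print(f"list: {t_list:.6f}s  set: {t_set:.6f}s")
```

The opcode listing cannot distinguish a linear scan from a hash lookup; only the benchmark does.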

"This has fewer opcodes" is not a complete performance argument. If the hot cost is inside object creation, hashing, I/O, or a C extension, bytecode counts can be almost irrelevant.

Version context

Current project guidance targets Python 3.10-3.14. Python 3.9 and below are End-of-Life.

Important version facts:

  • dis has existed for a long time, so basic disassembly examples are stable Python 3 material
  • CPython 3.11 introduced the specializing adaptive interpreter (PEP 659)
  • exact opcode names, cache layout, jump shapes, and disassembly formatting changed materially in 3.11+
  • code that inspects bytecode text output should be treated as version-sensitive tooling

This means two things:

  • teaching at the level of LOAD_FAST, LOAD_GLOBAL, LOAD_DEREF, LIST_APPEND, and CALL is still useful
  • copying exact disassembly screenshots across versions is risky

When you compare bytecode, compare it on the Python version you actually deploy.
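
A lightweight way to follow that advice is to record the interpreter identity next to any bytecode you capture. This is a sketch under the assumption that a plain dict is enough for your tooling; `bytecode_snapshot` is a hypothetical helper name.

```python
import dis
import platform
import sys


def bytecode_snapshot(func):
    """Pair a function's opcode names with the interpreter that produced them."""
    return {
        "implementation": platform.python_implementation(),
        "version": tuple(sys.version_info[:3]),
        "opnames": [ins.opname for ins in dis.get_instructions(func)],
    }


def classify(x):
    return x + 1


snapshot = bytecode_snapshot(classify)
print(snapshot)
```

Two snapshots are only comparable when their implementation and version fields match.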

Edge cases and gotchas

Do not confuse bytecode with:

  • AST structure
  • machine code
  • JIT output from another implementation

Also do not present CPython opcode behavior as a Python language guarantee. PyPy, MicroPython, and other implementations do not have to expose or optimize the same instruction stream the same way.

Another trap: dis output is observational, not normative. It tells you what this interpreter version emitted, not what future versions must emit.

# [CURRENT - 3.10-3.14] Works on Python 3.x
import dis

def classify(x):
    return x + 1

print(classify.__code__.co_varnames)
print(classify.__code__.co_names)
dis.dis(classify)

That output is great for diagnosis. It is not a contract you should build hard production logic around unless you own the version lock and the maintenance cost.

Production usage

Use bytecode inspection when:

  • a hot path seems dominated by Python-level overhead
  • you need to explain a benchmark result
  • you want to understand closure/global/local behavior
  • you are teaching or debugging interpreter-level execution shape

Do not use bytecode inspection as a reflex for every optimization question. Reach for it when the question is specifically about:

  • dispatch overhead
  • repeated method lookup
  • call boundaries
  • specialized/dedicated interpreter paths

Good production practice:

  • write the clearest correct code first
  • measure
  • inspect bytecode if the measurement suggests interpreter overhead matters
  • keep the optimized version only if the gain is real and the code stays defensible

Further depth

  • dis module documentation
  • Data model: code objects
  • Execution model
  • CPython Developer Guide
  • CPython source: Python/ceval.c
  • CPython source: Objects/frameobject.c
  • PEP 659: Specializing Adaptive Interpreter

Measured notebook

Measured name-resolution loop cost

This notebook compares three equivalent loops that differ mainly in name-resolution path: local, closure, and global. It shows which path runs fastest as the loop body scales and how much that gap matters in absolute terms.

Winner: LOAD_FAST local, 1.33x faster than LOAD_GLOBAL global at 1M iterations

[Chart: Loop cost by name-resolution path. x-axis: iterations (10k, 100k, 1M); y-axis: time (0.00 µs to 71827.1 µs); series: LOAD_FAST local, LOAD_DEREF closure, LOAD_GLOBAL global]

Metrics

  • Fastest path: LOAD_FAST local
  • Slowest path: LOAD_GLOBAL global
  • Gap @ 1M: 1.33x
  • Opcode counts: local 21 / closure 20 / global 19

Notes

What this tests: three loops that do the same work but read 'bias' through different name-resolution paths, namely a local variable (LOAD_FAST), a closure cell (LOAD_DEREF), and a module-level global (LOAD_GLOBAL).

Why LOAD_FAST won: LOAD_FAST is a simple array index into the frame's fast-locals storage (the localsplus array of the internal frame on the tested 3.12). No dict lookup, no scope traversal. LOAD_GLOBAL must search the module namespace dict and, on a miss, the builtins dict. LOAD_DEREF reads a slot the same way and then follows one extra pointer into the cell object.

The surprise: the gap is only 1.33x at 1M iterations. Bytecode differences matter, but in most production code they are dwarfed by other costs such as algorithmic choices, C-level work, allocation, and I/O.

Takeaway: local variable access is fastest, but name resolution is rarely the bottleneck. Use `dis` to diagnose execution shape, not to chase micro-optimizations prematurely.
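
The notebook's source is not reproduced here, so the following is only a sketch of how such a comparison could be set up. The loop bodies, the iteration counts, and the helper `bias_opnames` are assumptions made for illustration; absolute timings and the size of the gap will differ by machine and Python version.

```python
import dis
import timeit

bias = 1  # module-level global used by loop_global


def loop_global(n):
    total = 0
    for i in range(n):
        total += i + bias
    return total


def loop_local(n):
    bias = 1  # shadows the global with a fast local
    total = 0
    for i in range(n):
        total += i + bias
    return total


def make_loop_closure():
    bias = 1  # captured by the inner function through a closure cell

    def loop_closure(n):
        total = 0
        for i in range(n):
            total += i + bias
        return total

    return loop_closure


loop_closure = make_loop_closure()


def bias_opnames(func):
    """Names of the load opcodes that read 'bias' in func."""
    found = set()
    for ins in dis.get_instructions(func):
        if not ins.opname.startswith("LOAD"):
            continue
        names = ins.argval if isinstance(ins.argval, tuple) else (ins.argval,)
        if "bias" in names:
            found.add(ins.opname)
    return found


timings = {}
for fn in (loop_local, loop_closure, loop_global):
    timings[fn.__name__] = timeit.timeit(lambda: fn(50_000), number=5)
    print(f"{fn.__name__:<13} {sorted(bias_opnames(fn))} {timings[fn.__name__]:.4f}s")
```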

Test environment

  • Python version: 3.12.3
  • Machine: x86_64