Python bytecode and the dis module

Read opcode output, inspect code objects, and understand when bytecode explains speed

Two snippets can produce the same result while one executes twice as many instructions, and bytecode shows why. CPython compiles source into a code object that contains an instruction stream, and `dis.dis()` disassembles that stream into opcodes such as `LOAD_FAST`, `LOAD_GLOBAL`, and `BINARY_OP`. `LOAD_FAST` reads a local by index from the frame's localsplus array, the cheapest form of variable access. `LOAD_GLOBAL` looks the name up in the module globals dict and falls back to builtins, which costs more. `LOAD_DEREF` reads a value through a closure cell. The frame object exposes `f_locals` (a snapshot dict before Python 3.13, a write-through proxy from 3.13 on) and `f_lasti`, the offset of the last attempted instruction, which tracebacks and tracing tools rely on. PEP 659 (Python 3.11) introduced adaptive specialization, where the interpreter replaces generic instructions with type-specialized variants at runtime.


Quick note

Inspect first, benchmark second.

Python version

Targets Python 3.10–3.14. Python 3.9 and below are End-of-Life.

2.6 Python bytecode and the dis module


Bytecode is a diagnostic view into one Python implementation. It explains execution shape; it does not replace measurement or turn CPython internals into language guarantees.

Core answer

Use dis when the question is about compiled control flow, name lookup, closure cells, calls, comprehensions, or interpreter specialization. Pair it with profiling before treating an opcode difference as a performance conclusion.

# [CURRENT - 3.10-3.14] Works on Python 3.10+
from dataclasses import dataclass
from dis import dis

@dataclass(frozen=True, slots=True)
class LineItem:
    quantity: int
    unit_cents: int

def totals(items: list[LineItem]) -> list[int]:
    return [item.quantity * item.unit_cents for item in items]

sample = [LineItem(2, 1250), LineItem(1, 499)]
print(totals(sample))
dis(totals)

Why this design exists

Python source compiles into code objects before the evaluation loop runs it. Bytecode keeps that executable form compact and gives the interpreter a uniform internal instruction stream to optimize. That stream is fixed within one CPython release but changes between releases. Modern CPython also specializes execution at runtime; PEP 659 is the key design reference for the specializing adaptive interpreter.

The boundary is strict: bytecode and code-object layout are CPython implementation details that help you reason about execution, while source-level semantics are defined by the language reference.
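The compile-then-run split described above can be observed directly with the built-in `compile()`. This is a minimal sketch: the one-line source string, the `<sketch>` filename, and the variable names are assumptions made up for illustration, and the disassembly printed will differ between CPython versions.

```python
import dis

# Hypothetical one-line module, used only for illustration.
source = "total = price_cents * quantity"

# compile() returns a code object; nothing has been executed yet.
code = compile(source, "<sketch>", "exec")
print(type(code).__name__)   # code
print(code.co_names)         # names the instructions refer to
dis.dis(code)                # exact opcodes vary by CPython version

# Executing the code object is a separate step that needs a namespace.
namespace = {"price_cents": 1250, "quantity": 2}
exec(code, namespace)
print(namespace["total"])    # 2500
```

The code object carries the names and the instruction stream, while the values only appear once it runs against a namespace.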

Mechanics and CPython internals

dis reads a code object and renders each instruction as an offset, an opcode name, and a decoded argument. Code objects store constants, local names, free variables, flags, and instruction data. In CPython 3.11+, adaptive specialization and inline caches mean the default disassembly does not show every runtime detail that matters after warm-up; pass `adaptive=True` or `show_caches=True` to the dis functions to see more of it.
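The code-object fields listed above are ordinary attributes, so they can be inspected without `dis`. The sketch below uses a hypothetical closure, `make_discounter`, invented for this example; the attribute names (`co_varnames`, `co_freevars`, `co_consts`, `co_flags`, `co_code`) are the documented ones, but the exact contents of `co_consts` and the length of `co_code` depend on the CPython version.

```python
import inspect

def make_discounter(rate_bps: int):
    """Return a closure that applies a discount given in basis points."""
    def apply(price_cents: int) -> int:
        discount = price_cents * rate_bps // 10_000
        return price_cents - discount
    return apply

apply = make_discounter(1_500)
code = apply.__code__

print(code.co_varnames)   # ('price_cents', 'discount'): arguments first, then locals
print(code.co_freevars)   # ('rate_bps',): names read through closure cells
print(code.co_consts)     # contains 10000; other entries vary by version
print(bool(code.co_flags & inspect.CO_OPTIMIZED))  # True for function code
print(len(code.co_code))  # size of the raw instruction bytes, version-dependent
print(apply(2_000))       # 1700
```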

# [CURRENT - 3.10-3.14] Works on Python 3.10+
from dataclasses import dataclass
from dis import Bytecode

@dataclass(frozen=True, slots=True)
class Refund:
    amount_cents: int
    fee_cents: int

def net_amount(refund: Refund) -> int:
    base = refund.amount_cents
    return base - refund.fee_cents

for instruction in Bytecode(net_amount):
    print(instruction.opname, instruction.argrepr)
print(net_amount.__code__.co_varnames)
print(net_amount(Refund(4000, 125)))
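To see the gap between the static view and the warmed-up view mentioned in this section, one option is the `adaptive` keyword that `dis.get_instructions()` accepts from Python 3.11 on. The sketch below is an assumption-laden illustration: the `net` function and the warm-up count of 10,000 calls are arbitrary choices, and whether an instruction actually gets specialized (and what the specialized name is) depends on the CPython version and build, so no specific output is promised.

```python
import dis
import sys

def net(amount_cents: int, fee_cents: int) -> int:
    return amount_cents - fee_cents

def opnames(func, *, adaptive: bool) -> list[str]:
    """Return opcode names, asking for the specialized view where supported."""
    if sys.version_info >= (3, 11):
        return [ins.opname for ins in dis.get_instructions(func, adaptive=adaptive)]
    # Python 3.10 has no adaptive interpreter and no 'adaptive' keyword.
    return [ins.opname for ins in dis.get_instructions(func)]

before = opnames(net, adaptive=False)

# Warm the function up so the interpreter has a chance to specialize it.
for _ in range(10_000):
    net(4000, 125)

after = opnames(net, adaptive=True)

print(before)  # generic opcode names
print(after)   # may contain specialized names on 3.11+; not guaranteed
```

If the two lists differ, the difference is the runtime detail that a plain `dis()` call hides by default.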

Complexity and tradeoffs

Disassembly is a development-time cost only; it adds nothing at runtime. Runtime opcode count can matter in Python-level hot loops, but algorithmic complexity, allocation, C-level work, cache effects, I/O, and adaptive optimization often dominate. A list comprehension can remove repeated Python-level append dispatch; it does not rescue an O(n^2) algorithm.
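A small experiment makes the last point concrete. In the sketch below, one hypothetical function, `common`, is called with a list and then with a set as its second argument. The bytecode is literally the same in both calls, yet membership testing is linear for the list and roughly constant for the set, so the timings diverge. The input size and repeat counts are arbitrary assumptions chosen to keep the run short.

```python
import dis
import timeit

def common(values, allowed):
    # Same instructions either way; the cost of 'in' depends on the container.
    return [v for v in values if v in allowed]

values = list(range(2_000))
allowed_list = list(range(2_000))
allowed_set = set(allowed_list)

# One function, one instruction stream, regardless of the argument types.
ops = [ins.opname for ins in dis.get_instructions(common)]
print(len(ops))

list_s = min(timeit.repeat(lambda: common(values, allowed_list), number=3, repeat=3))
set_s = min(timeit.repeat(lambda: common(values, allowed_set), number=3, repeat=3))
print(f"list membership: {list_s:.4f}s  set membership: {set_s:.4f}s")
```

Nothing in the disassembly distinguishes the two runs; only measurement does.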

Idiomatic patterns and refactoring

Use disassembly to explain a refactor only after the source-level refactor is already defensible.

# [CURRENT - 3.10-3.14] Works on Python 3.10+
from dataclasses import dataclass
from dis import dis

@dataclass(frozen=True, slots=True)
class Row:
    value: int

def collect_loop(rows: list[Row]) -> list[int]:
    output: list[int] = []
    for row in rows:
        output.append(row.value * 2)
    return output

def collect_comprehension(rows: list[Row]) -> list[int]:
    return [row.value * 2 for row in rows]

sample = [Row(1), Row(2)]
print(collect_loop(sample), collect_comprehension(sample))
dis(collect_comprehension)
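One way to compare the two shapes without reading full listings is to collect opcode names and look for the instruction that differs. The sketch below is self-contained and uses plain integers instead of `Row`; the helper `all_opnames` is hypothetical. It walks nested code objects because Python 3.10 and 3.11 compile a comprehension into a separate code object stored in `co_consts`, while 3.12 and later inline it.

```python
import dis
from types import CodeType

def collect_loop(values: list[int]) -> list[int]:
    output: list[int] = []
    for value in values:
        output.append(value * 2)
    return output

def collect_comprehension(values: list[int]) -> list[int]:
    return [value * 2 for value in values]

def all_opnames(code: CodeType) -> list[str]:
    """Opcode names for a code object plus any code objects nested in its constants."""
    names = [ins.opname for ins in dis.get_instructions(code)]
    for const in code.co_consts:
        if isinstance(const, CodeType):
            names.extend(all_opnames(const))
    return names

loop_ops = all_opnames(collect_loop.__code__)
comp_ops = all_opnames(collect_comprehension.__code__)

# The comprehension appends with a dedicated opcode; the loop looks up
# the 'append' method and calls it on every iteration.
print("LIST_APPEND" in comp_ops)  # True
print("LIST_APPEND" in loop_ops)  # False
print(len(loop_ops), len(comp_ops))  # counts vary by CPython version
```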

Common mistakes and edge cases

Do not compare bytecode from different Python versions as if opcodes were a compatibility contract. Do not infer exact nanosecond costs from opcode count alone. Do not mistake dis output for proof that a global lookup, descriptor access, or call path is slow in a real workload.

When to use / When NOT to use

Use dis when it resolves a question about CPython compilation or interpreter work. Use benchmarks and profilers when the question is latency or throughput.

Do not teach bytecode as portable Python semantics, and do not contort clean code for an opcode-level micro-win before measuring the real workload.

Further reading

  • Official docs: dis
  • Official docs: code objects
  • PEP 659: specializing adaptive interpreter
  • CPython source: bytecode definitions
  • CPython source: evaluation loop
MEASURED NOTEBOOK
Measured name-resolution loop cost

This notebook compares three equivalent loops that differ mainly in name-resolution path: local, closure, and global. It shows which path runs fastest as the loop body scales and how much that gap matters in absolute terms.

Winner: LOAD_FAST local, 1.33x faster than LOAD_GLOBAL global at 1M iterations
[Chart: Loop cost by name-resolution path. Time axis from 0.00 µs to 71827.1 µs; iteration axis at 10k, 100k, and 1M. Series: LOAD_FAST local, LOAD_DEREF closure, LOAD_GLOBAL global.]
METRICS
Fastest path: LOAD_FAST local
Slowest path: LOAD_GLOBAL global
Gap @ 1M: 1.33x
Opcode counts: local 21 / closure 20 / global 19
NOTES

What this tests: three loops that do the same work but bind 'bias' in different ways, as a local variable (LOAD_FAST), a closure cell (LOAD_DEREF), and a module-level global (LOAD_GLOBAL).

Why LOAD_FAST won: LOAD_FAST indexes directly into the frame's fast-locals array (localsplus), with no dict lookup and no scope traversal. LOAD_GLOBAL resolves the name through the module globals dict and, on a miss, the builtins dict, although the specializing interpreter caches that lookup once the code is warm. LOAD_DEREF also indexes into the frame's array, then takes one extra hop through the cell object to reach the value.

The surprise: the gap is only 1.33x at 1M iterations. Bytecode differences matter, but in most production code they are dwarfed by algorithmic choices, C-level work, allocation, and I/O.

Takeaway: local variable access is fastest, but name resolution is rarely the bottleneck. The fastest variant here also has the most opcodes (21, against 19 for the global variant), so opcode count alone does not predict speed. Use `dis` to diagnose execution shape, not to chase micro-optimizations prematurely.
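The notebook's source is not reproduced on this page, so the following is a hypothetical reconstruction of the three variants, meant only to show how the name-resolution path can be confirmed with `dis`. The function names, the `BIAS` value, and the loop body are assumptions. Opcode names are matched by prefix because recent CPython versions emit variants of `LOAD_FAST` (for example combined or borrowed forms).

```python
import dis

BIAS = 3  # module-level global (assumed value)

def sum_global(n: int) -> int:
    total = 0
    for i in range(n):
        total += i + BIAS      # BIAS resolved with LOAD_GLOBAL
    return total

def sum_local(n: int) -> int:
    bias = 3
    total = 0
    for i in range(n):
        total += i + bias      # bias resolved with a LOAD_FAST variant
    return total

def make_sum_closure(bias: int):
    def sum_closure(n: int) -> int:
        total = 0
        for i in range(n):
            total += i + bias  # bias resolved with LOAD_DEREF
        return total
    return sum_closure

sum_closure = make_sum_closure(3)

def load_ops(func) -> set[str]:
    """Distinct LOAD_* opcode names used by a function."""
    return {
        ins.opname
        for ins in dis.get_instructions(func)
        if ins.opname.startswith("LOAD_")
    }

# Every variant still uses LOAD_GLOBAL once to find 'range';
# the interesting difference is how 'bias' is loaded inside the loop.
for func in (sum_local, sum_closure, sum_global):
    print(func.__name__, sorted(load_ops(func)))

print(sum_local(10), sum_closure(10), sum_global(10))  # 75 75 75
```

Timing these with `timeit` on your own interpreter is the only way to know whether the gap reported above holds for your version and hardware.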

TEST ENVIRONMENT
Python version: 3.12.3
Machine: x86_64