Python bytecode and the dis module

Python bytecode is the instruction stream CPython executes after compiling source into a code object. Reading it is like reading the assembly output of a C compiler: you see the steps the interpreter is asked to take. Bytecode explains execution shape: which names are loaded as locals versus globals, whether a comprehension uses a dedicated opcode, how Python dispatches a match statement. It is not a substitute for measurement, algorithm analysis, or knowledge of C-level built-ins.

Core answer

Use the dis module when you need to answer questions like:

which names are loaded as locals, globals, or closure cells?
is this loop doing repeated attribute lookup and Python-level calls?
is there a dedicated opcode such as LIST_APPEND or MATCH_SEQUENCE?
did CPython compile this into one straightforward instruction path or several dispatch steps?

# [CURRENT - 3.10-3.14] Works on Python 3.x
import dis
def total(xs):
    return [x * 2 for x in xs]
dis.dis(total)

Keep the boundary clear:

bytecode is CPython interpreter work, not machine code
fewer or simpler bytecode steps can reduce overhead
but real performance depends on data structure choice, object allocation, C-level built-ins, specialization, and algorithmic complexity

What bytecode is

On CPython, source code is compiled into a code object. That code object contains:

constants
local-variable metadata
names
free-variable metadata
an instruction stream that CPython's bytecode interpreter executes

That is why functions expose attributes such as:

__code__.co_consts
__code__.co_varnames
__code__.co_names
__code__.co_freevars

# [CURRENT - 3.10-3.14] Works on Python 3.x
def make_adder(base):
    def add(x):
        return base + x
    return add
fn = make_adder(10)
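print(fn.__code__.co_varnames)  # ('x',): parameters and locals of add
print(fn.__code__.co_freevars)  # ('base',): names captured from make_adder
print(fn.__code__.co_consts)    # constants embedded in add's code object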

This is practical information: it tells you what the interpreter thinks your function depends on:

fast locals
global names
captured outer-scope names
embedded constants

That helps explain why nearby-looking code can execute through meaningfully different paths.
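As a minimal sketch of that point, the two functions below look nearly identical but record their dependency differently in the code object. The names RATE, with_global, and with_local are made up for illustration.

```python
RATE = 3  # hypothetical module-level name


def with_global(x):
    # RATE is not assigned in this function, so it is compiled as a global name.
    return x * RATE


def with_local(x):
    # rate is assigned here, so it becomes a fast local.
    rate = 3
    return x * rate


for fn in (with_global, with_local):
    code = fn.__code__
    print(fn.__name__, "| locals:", code.co_varnames, "| names:", code.co_names)
```

The global version lists RATE under co_names, while the local version lists rate under co_varnames and leaves co_names empty.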

Mechanism: source, code objects, and interpreter dispatch

The simplified CPython pipeline is:

parse source into an abstract syntax tree
compile that tree into a code object
execute the code object's bytecode in the evaluation loop

In current CPython, the instruction set is version-sensitive and implementation-specific. The language guarantee is Python behavior; the exact opcode names are not portable promises.
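The three pipeline steps can be walked through by hand with standard-library calls. This is only a sketch: the source string and the "<demo>" filename are placeholders, and the disassembly printed at the end will differ between CPython versions.

```python
import ast
import dis

source = "total = sum([1, 2, 3])"

# 1. Parse source text into an abstract syntax tree.
tree = ast.parse(source, filename="<demo>", mode="exec")

# 2. Compile the tree into a code object.
code = compile(tree, filename="<demo>", mode="exec")

# 3. Execute the code object's bytecode in a fresh namespace.
namespace = {}
exec(code, namespace)

print(type(tree).__name__, type(code).__name__, namespace["total"])
dis.dis(code)  # version-specific listing of the compiled instructions
```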

For example, locals and closure cells are not loaded the same way:

# [CURRENT - 3.10-3.14] Works on Python 3.x
import dis
def outer():
    base = 10
    def inner(x):
        return base + x
    return inner
fn = outer()
dis.dis(fn)

On current CPython, that closure read shows up as LOAD_DEREF, while a plain local read shows up as LOAD_FAST or a closely related variant (the exact name varies by version). The difference reflects where the name is stored: the interpreter follows a different path because the name lives in a closure cell, not directly in the fast-locals array.

CPython internals

The evaluation loop (Python/ceval.c). CPython's bytecode interpreter runs a generated dispatch loop inside _PyEval_EvalFrameDefault. The exact dispatch machinery is a CPython implementation detail, but conceptually the loop:

reads the next instruction from the internal instruction pointer (next_instr in _PyInterpreterFrame)
decodes the opcode and its argument
dispatches to the matching instruction handler
advances the instruction pointer and repeats
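The fetch, decode, dispatch, and advance cycle can be illustrated with a toy stack machine. This is not CPython's instruction set or its real loop: the opcode names PUSH_CONST, ADD, MUL, and RETURN and the run function are invented for this sketch.

```python
def run(program, consts):
    """Execute a list of (opcode, arg) pairs on a toy stack machine."""
    stack = []
    pc = 0  # instruction pointer into program
    while True:
        opcode, arg = program[pc]  # fetch and decode the next instruction
        pc += 1                    # advance before dispatching
        if opcode == "PUSH_CONST":
            stack.append(consts[arg])
        elif opcode == "ADD":
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right)
        elif opcode == "MUL":
            right = stack.pop()
            left = stack.pop()
            stack.append(left * right)
        elif opcode == "RETURN":
            return stack.pop()
        else:
            raise ValueError(f"unknown opcode: {opcode}")


# Computes (2 + 3) * 4 on the toy machine.
program = [
    ("PUSH_CONST", 0),
    ("PUSH_CONST", 1),
    ("ADD", None),
    ("PUSH_CONST", 2),
    ("MUL", None),
    ("RETURN", None),
]
result = run(program, consts=(2, 3, 4))
print(result)  # 20
```

The real interpreter does the same kind of work in C, with a far larger instruction set and a faster dispatch mechanism than a chain of comparisons.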

Inline caching and adaptive specialization (CPython 3.11+, PEP 659). Starting in Python 3.11, CPython added a specializing adaptive interpreter. Opcodes like LOAD_ATTR, LOAD_GLOBAL, BINARY_OP, and CALL are "adaptive": they start out as generic instructions, and after enough executions (warm-up) CPython may replace them in place with specialized versions (e.g., LOAD_ATTR becomes LOAD_ATTR_SLOT for slot-based attribute access). This is why disassembly requested with adaptive=True can differ between the first run and after warm-up on CPython 3.11+, while the default view keeps showing the generic instructions.

You can inspect specialization caches with dis.dis(func, show_caches=True, adaptive=True):

# [CURRENT - 3.11-3.14] Requires Python 3.11+ [PEP 659]
import dis
def total(xs):
    return sum(xs)
dis.dis(total, show_caches=True, adaptive=True)

The cache entries show up as inline metadata between opcodes. They are not real opcodes — they are data slots that the specializing interpreter uses to track type observations and specialized replacement targets.
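One way to watch specialization happen is to capture opcode names before and after a warm-up loop. This is a hedged sketch: the add and opnames helpers are made up here, the 10,000-call warm-up is an arbitrary number rather than a documented threshold, and whether any instruction actually changes depends on the CPython version and build.

```python
import dis
import sys


def add(a, b):
    return a + b


def opnames(func, adaptive):
    # The adaptive keyword only exists on Python 3.11+; fall back otherwise.
    if sys.version_info >= (3, 11):
        return [ins.opname for ins in dis.get_instructions(func, adaptive=adaptive)]
    return [ins.opname for ins in dis.get_instructions(func)]


before = opnames(add, adaptive=True)
for _ in range(10_000):  # warm-up; the real threshold is an implementation detail
    add(1, 2)
after = opnames(add, adaptive=True)
generic = opnames(add, adaptive=False)

print("before warm-up:", before)
print("after warm-up: ", after)
print("generic view:  ", generic)
```

If the interpreter specialized the addition, the after list shows a specialized name in place of the generic binary operation, while the generic view stays unchanged.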

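Opcode encoding. In CPython 3.12, each instruction is 2 bytes: 1 byte for the opcode and 1 byte for the argument. Arguments that do not fit in one byte are built by prefixing the instruction with up to three EXTENDED_ARG instructions, each contributing one more byte, for a maximum 4-byte argument. Opcode values fit in one byte (0–255), and on 3.12 HAVE_ARGUMENT (90) marks the boundary: real opcodes below it ignore their argument byte, and those at or above it use it. These numeric values are version-specific, so read them from the dis module rather than hard-coding them.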
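The two-byte layout can be checked by walking co_code directly. This is a rough sketch for illustration: it ignores EXTENDED_ARG handling, and on 3.11+ the raw bytes also contain cache entries, so dis.get_instructions remains the better tool for real work.

```python
import dis


def scale(x, y):
    return x * y


raw = scale.__code__.co_code  # bytes laid out as opcode, arg, opcode, arg, ...

for offset in range(0, len(raw), 2):
    opcode, arg = raw[offset], raw[offset + 1]
    # dis.opname maps a numeric opcode to its name for this interpreter version.
    print(f"{offset:>3} {dis.opname[opcode]:<20} {arg}")
```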

The frame execution model. In CPython 3.11+, the interpreter executes calls with an internal _PyInterpreterFrame. That internal frame stores the code object's local slots, closure cells, and evaluation stack in a compact "locals plus" layout. A Python-level frame object (PyFrameObject, exposed as types.FrameType) is the public view used by debuggers, trace hooks, and introspection APIs.

The public frame object exposes attributes and C API accessors such as:

f_locals — a mapping view of local variable bindings
f_globals — a reference to the module's global dict
f_builtins — a reference to the builtins dict
f_lasti — the last-executed instruction offset (used for tracing, exception handling, and resume)

The internal _PyInterpreterFrame and its stack layout are CPython implementation details, not stable public API.
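The public frame attributes listed above can be read from inside a running function. This sketch uses inspect.currentframe, which can return None on implementations without frame support; the show_frame helper and its dictionary keys are made up for illustration.

```python
import inspect


def show_frame(value):
    frame = inspect.currentframe()  # public frame object for this call
    return {
        "function": frame.f_code.co_name,
        "locals": sorted(frame.f_locals),               # names bound in this frame
        "same_globals": frame.f_globals is globals(),   # the defining module's dict
        "builtins_has_len": "len" in frame.f_builtins,
        "lasti": frame.f_lasti,                         # offset of the last instruction run
    }


info = show_frame(42)
print(info)
```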

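This is why local variable access is fast: LOAD_FAST is a direct C array access by index. LOAD_GLOBAL conceptually requires a dict lookup in globals and, on a miss, in builtins (3.11+ specialization can cache the result). LOAD_DEREF reads a cell object from the frame and then follows one extra pointer to the value stored in it.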
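A single function can exercise all three access paths. The sketch below assumes made-up names (SCALE, make, inner) and matches the local load by prefix, because newer CPython versions show variants of LOAD_FAST.

```python
import dis

SCALE = 2  # hypothetical module-level name


def make(base):
    def inner(x):
        # x is a fast local, base is a closure cell, SCALE is a global.
        return x + base + SCALE
    return inner


inner = make(10)
loads = {
    ins.argval: ins.opname
    for ins in dis.get_instructions(inner)
    if ins.opname.startswith("LOAD_")
}
for name, opname in loads.items():
    print(f"{name:<6} {opname}")
```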

The dis module is the bridge between high-level syntax and the interpreter's execution model. Use it diagnostically — to answer a specific question about execution shape — rather than as a routine optimization tool.

How to use the dis module

The two most useful entry points are:

dis.dis(...) for human-readable disassembly
dis.get_instructions(...) for structured instruction objects

# [CURRENT - 3.10-3.14] Works on Python 3.x
import dis
def scale(x, y):
    return x * y
dis.dis(scale)
for ins in dis.get_instructions(scale):
    print(ins.opname, ins.argrepr)

dis.get_instructions is often the better choice if you want to:

inspect opcodes programmatically
filter specific instruction kinds
build your own reports or teaching tools

For plain debugging or education, dis.dis is usually enough.
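As an example of the programmatic route, the sketch below builds a small opcode report and filters call instructions. The describe function is a placeholder, and the call opcodes are matched by prefix because their exact names differ across versions.

```python
import dis
from collections import Counter


def describe(user):
    name = user.name.strip()
    return name.title() if name else "anonymous"


instructions = list(dis.get_instructions(describe))

# Report: how often each opcode appears in the function.
counts = Counter(ins.opname for ins in instructions)
for opname, n in counts.most_common():
    print(f"{opname:<24} {n}")

# Filter: keep only call-style instructions.
calls = [ins for ins in instructions if ins.opname.startswith("CALL")]
print("call instructions:", [(ins.offset, ins.opname) for ins in calls])
```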

On current CPython, dis.dis also exposes version-sensitive options for cache/specialization visibility:

# [CURRENT - 3.11-3.14] Requires Python 3.11+ [PEP 659]
import dis
def total(xs):
    return sum(xs)
dis.dis(total)
dis.dis(total, show_caches=True, adaptive=True)

Those extra views are useful when you want to inspect modern CPython's specializing interpreter behavior. They are not portable language-level contracts.

Why bytecode can explain speed

Bytecode becomes especially informative when two snippets are semantically similar but one requires more interpreter work.

List comprehension vs manual append loop is the standard example:

# [CURRENT - 3.10-3.14] Works on Python 3.x
import dis
def manual(rows):
    out = []
    for row in rows:
        out.append(row * 2)
    return out
def comp(rows):
    return [row * 2 for row in rows]
dis.dis(manual)
dis.dis(comp)

On current CPython 3.12, the manual loop performs repeated:

LOAD_ATTR for append
CALL to invoke the method

while the comprehension uses a dedicated LIST_APPEND path inside its compiled loop.

That helps explain why simple comprehensions often benchmark better. The key point is not "comprehensions are magic." The key point is:

fewer Python-level dispatch steps
less repeated lookup/call overhead
better interpreter-level execution shape for that narrow case

The same style of reasoning helps with:

local vs global name access (LOAD_FAST vs LOAD_GLOBAL)
closure access (LOAD_DEREF)
dedicated matching opcodes in structural pattern matching
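The LIST_APPEND observation can be checked without reading the listing by eye. This sketch uses a hypothetical all_opnames helper that also walks nested code objects, since CPython versions before 3.12 compile a comprehension into a separate code object stored in co_consts.

```python
import dis
import types


def manual(rows):
    out = []
    for row in rows:
        out.append(row * 2)
    return out


def comp(rows):
    return [row * 2 for row in rows]


def all_opnames(code):
    """Collect opcode names from a code object and any nested code objects."""
    names = [ins.opname for ins in dis.get_instructions(code)]
    for const in code.co_consts:
        if isinstance(const, types.CodeType):
            names.extend(all_opnames(const))
    return names


manual_ops = all_opnames(manual.__code__)
comp_ops = all_opnames(comp.__code__)
print("manual uses LIST_APPEND:", "LIST_APPEND" in manual_ops)
print("comp uses LIST_APPEND:  ", "LIST_APPEND" in comp_ops)
```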

Why bytecode does not settle performance by itself

This is the production trap. Bytecode can explain overhead, but it does not fully explain runtime cost.

Reasons:

the expensive part may be C-level work inside a builtin or extension
allocation and object creation may dominate
hash-table behavior may dominate
branch predictability and cache locality may dominate
algorithmic complexity may dwarf opcode overhead
adaptive specialization in CPython 3.11+ can change the effective execution path after warm-up

# [CURRENT - 3.10-3.14] Works on Python 3.x
import timeit
print(timeit.timeit("[x * 2 for x in range(1000)]", number=10000))
print(timeit.timeit("""
out = []
for x in range(1000):
    out.append(x * 2)
""", number=10000))

The right workflow is:

identify a hot path
benchmark or profile it
inspect bytecode if the question is interpreter overhead or execution shape
change the code only if the result still makes sense at the API/readability level
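To illustrate how algorithmic cost can dwarf opcode overhead, the sketch below runs one function, with one fixed bytecode sequence, against a list and a set. The contains helper and the 50,000-item size are arbitrary choices, and the absolute timings will vary by machine.

```python
import dis
import timeit


def contains(needle, haystack):
    return needle in haystack


# The bytecode is the same no matter which container is passed in.
ops = [ins.opname for ins in dis.get_instructions(contains)]
print(ops)

items = list(range(50_000))
as_list, as_set = items, set(items)

# Searching for a missing value: linear scan for the list, hash lookup for the set.
list_time = timeit.timeit(lambda: contains(-1, as_list), number=100)
set_time = timeit.timeit(lambda: contains(-1, as_set), number=100)
print(f"list: {list_time:.5f}s  set: {set_time:.5f}s")
```

The instruction stream never changes, yet the list lookup is far slower, because the cost lives in the container's C-level membership test and not in the opcodes.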

"This has fewer opcodes" is not a complete performance argument. If the hot cost is inside object creation, hashing, I/O, or a C extension, bytecode counts can be almost irrelevant.

Version context

Current project guidance targets Python 3.10-3.14. Python 3.9 and below are End-of-Life.

Important version facts:

dis has existed for a long time, so basic disassembly examples are stable Python 3 material
CPython 3.11 introduced the specializing adaptive interpreter (PEP 659)
exact opcode names, cache layout, jump shapes, and disassembly formatting changed materially in 3.11+
code that inspects bytecode text output should be treated as version-sensitive tooling

This means two things:

teaching at the level of LOAD_FAST, LOAD_GLOBAL, LOAD_DEREF, LIST_APPEND, and CALL is still useful
copying exact disassembly screenshots across versions is risky

When you compare bytecode, compare it on the Python version you actually deploy.

Edge cases and gotchas

Do not confuse bytecode with:

AST structure
machine code
JIT output from another implementation

Also do not present CPython opcode behavior as a Python language guarantee. PyPy, MicroPython, and other implementations do not have to expose or optimize the same instruction stream the same way.

Another trap: dis output is observational, not normative. It tells you what this interpreter version emitted, not what future versions must emit.

# [CURRENT - 3.10-3.14] Works on Python 3.x
import dis
def classify(x):
    return x + 1
print(classify.__code__.co_varnames)
print(classify.__code__.co_names)
dis.dis(classify)

That output is great for diagnosis. It is not a contract you should build hard production logic around unless you own the version lock and the maintenance cost.

Production usage

Use bytecode inspection when:

a hot path seems dominated by Python-level overhead
you need to explain a benchmark result
you want to understand closure/global/local behavior
you are teaching or debugging interpreter-level execution shape

Do not use bytecode inspection as a reflex for every optimization question. Reach for it when the question is specifically about:

dispatch overhead
repeated method lookup
call boundaries
specialized/dedicated interpreter paths

Good production practice:

write the clearest correct code first
measure
inspect bytecode if the measurement suggests interpreter overhead matters
keep the optimized version only if the gain is real and the code stays defensible

Further depth