By Jake Edge

May 9, 2023


PyCon

Two members of the Faster
CPython team, which was put together at Microsoft at the behest of Guido
van Rossum
to work on major performance improvements for CPython, came
to PyCon 2023 to report on what the
team has been working on—and its plans for the future. PEP 659 (“Specializing
Adaptive Interpreter”) describes the foundation of the current work, some
of which
has already been released as part of Python 3.11. Brandt Bucher, who
gave a
popular talk on structural pattern matching
at last year’s PyCon, was up first, with a talk on what “adaptive” and
“specializing” mean in the context of Python, which we cover here in part
one. Mark Shannon, whose proposed plan
for performance improvements
in 2020 was a major impetus for this work,
presented on the past, present, and future of the Python performance
enhancements,
which will be covered in part two.

Bucher started out by talking a bit about Python bytecode, but said that
developers do not need to know anything about it to get faster code:
“just upgrade to 3.11 and you’ll probably see a performance
improvement”. In order to show how the team sped up the interpreter,
though, it would help to look inside the Python code. He put up an
example Point class that has two floating-point attributes for its
position (x and y) and a single shifted() method that takes two offsets
to apply to the position of an instance and returns a new instance of the
class at the shifted position. A minimal reconstruction (the exact code
from the talk may have differed slightly) could look like this:
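
    class Point:
        def __init__(self, x, y):
            self.x = x
            self.y = y

        def shifted(self, dx, dy):
            # Return a new instance of whatever class self actually is,
            # moved by the given offsets.
            x = self.x + dx
            y = self.y + dy
            cls = type(self)
            return cls(x, y)

He focused on two lines from the method: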

    y = self.y + dy
    cls = type(self)

The first line applies the offset for the y axis, while the second
gets a reference to the Point class (in preparation for returning
a new instance by calling cls(x, y)). He used the dis module
to disassemble the method; the relevant piece of bytecode is as follows:

    # part of the output of dis.dis(Point.shifted)
    LOAD_FAST      (dy)
    LOAD_FAST      (self)
    LOAD_ATTR      (y)
    BINARY_OP      (+)
    STORE_FAST     (y)

    LOAD_GLOBAL    (type)
    LOAD_FAST      (self)
    PRECALL        (1)
    CALL           (1)
    STORE_FAST     (cls)

Some of the binary opcodes (and some operations, such as method
calls) have changed from those in Python 3.10 and earlier, so your
output may be different than what he showed. The bytecode is
a set of instructions for the Python stack-based virtual machine.
In the first part above, the dy and self values are pushed onto the
stack, then the LOAD_ATTR instruction retrieves the y attribute
from the value popped from the stack (self), then pushes it. Next,
the two values (self.y and dy) on top of the stack are added (and the
result is pushed) and that result is popped to be stored in the
variable y. In the second part, type and self are pushed, then
PRECALL and CALL are two separate steps to perform the type(self)
call; its result is popped to store into cls.
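
The disassembly is easy to reproduce; on CPython 3.11, something like
the following prints the bytecode for the method (the exact output
format and opcode names vary between versions):

    import dis

    # Disassemble the method's bytecode; the offsets and opargs that
    # dis prints are omitted from the simplified listing above.
    dis.dis(Point.shifted)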

Adaptive instructions

If that code is run more than a handful of times, Bucher said, it becomes a
candidate for optimization in 3.11. The first step is something
called “quickening”, which the PEP calls “the process of replacing slow
instructions with faster variants”. “Superinstructions” that combine
two related instructions into a single instruction are substituted into the
code. For example, the first two LOAD_FAST instructions can be
replaced with a single one:

    LOAD_FAST__LOAD_FAST   (dy, self)

It is a simple change that results in a real performance boost, he said.
The more interesting change is to replace some instructions with their
adaptive counterparts.
During the quickening process, five bytecodes are replaced with adaptive
versions:

    # part of the output of dis.dis(Point.shifted, adaptive=True)
    LOAD_FAST__LOAD_FAST   (dy, self)
    LOAD_ATTR_ADAPTIVE     (y)
    BINARY_OP_ADAPTIVE     (+)
    STORE_FAST             (y)

    LOAD_GLOBAL_ADAPTIVE   (type)
    LOAD_FAST              (self)
    PRECALL_ADAPTIVE       (1)
    CALL_ADAPTIVE          (1)
    STORE_FAST             (cls)

The adaptive versions perform the same operation as their counterparts,
except that they can specialize themselves depending on how they are being
used. For example, loading an attribute is “actually surprisingly complex”,
because the attribute could come
from a number of places. It could be a name from a module, a class
variable from a class, a
method of a class, and so on, so the standard load-attribute code needs to
be prepared for those possibilities. The simplest case is getting an
attribute from an instance dictionary, which is exactly what is being done here.


[Brandt Bucher]

The LOAD_ATTR_ADAPTIVE operation can recognize that it is in
the simple case and can ignore all of the other possibilities, so the
adaptive instruction changes to LOAD_ATTR_INSTANCE_VALUE, which is
a specialized instruction that contains only the fast-path code for this
common case.

The specialized instruction will then check to see if the
class is unchanged from the last time this attribute was accessed and if
the keys in
the object’s __dict__ are the same. Those much faster checks can
be done in lieu of two dictionary lookups (on the class dictionary for a
descriptor and on the object’s __dict__ for the name); “dictionary
lookups are fast, but they still cost something”, while the other two
checks are trivial.
If the conditions
hold true, which is the normal situation, the code can simply return the
entry from the __dict__ at the same offset that was used
previously; it does not need to do any hashing or collision resolution that
comes when doing a dictionary lookup.
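
In illustrative pseudo-Python (the real fast path is C code inside the
interpreter loop, and every name below is invented for this sketch), the
specialized instruction behaves something like:

    # Pseudo-code sketch only -- not actual CPython internals.
    def load_attr_instance_value(obj, cache):
        # Guard 1: the class must be unchanged since specialization.
        if type(obj).version_tag != cache.version_tag:
            return deoptimize()   # fall back to the generic LOAD_ATTR
        # Guard 2: the instance __dict__ must have the same key layout.
        if keys_version(obj.__dict__) != cache.keys_version:
            return deoptimize()
        # Fast path: fetch the value at the offset recorded earlier,
        # with no hashing or collision resolution.
        return value_at_offset(obj.__dict__, cache.index)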

If either of the checks fails, the code falls back to the regular load
attribute operation. Python is a dynamic language and the new interpreter
needs to respect that, but the dynamic features are not being used all of
the time. The idea is to avoid paying the cost of dynamism when it is not
being used, which is much of the time in many Python programs.

Similarly, the BINARY_OP_ADAPTIVE instruction specializes to
BINARY_OP_ADD_FLOAT because floating-point values are being used.
That operation checks that both operands are of type float and
falls back to the (also surprisingly complex) machinery for an add
operation if they are not. Otherwise, it can just add the raw floating
point values together in C.
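
The guard for the specialized add is even simpler; again in illustrative
pseudo-Python (the real version works on the raw C double values):

    # Pseudo-code sketch only -- not actual CPython internals.
    def binary_op_add_float(left, right):
        if type(left) is not float or type(right) is not float:
            return deoptimize()   # back to the generic BINARY_OP machinery
        # Fast path: add the raw floating-point values directly.
        return c_double_add(left, right)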

Normally, when a global name is being loaded, it requires two dictionary
lookups; for example, when type() is being loaded, the global
dictionary must be checked in case the function has been shadowed; if it
has not (which is likely), then the name must be looked up in the builtins
dictionary. So LOAD_GLOBAL_ADAPTIVE takes a page from the
attribute-loading specialization to check whether the global dictionary or
builtins dictionary has changed; if not, it simply grabs the value at the
same index it used before.
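
The double lookup exists because a module-level name can shadow a
builtin, as this runnable illustration shows:

    import builtins

    print(type(42))    # no global named type, so builtins supplies it: int

    def type(obj):     # a global that shadows the builtin
        return "shadowed"

    print(type(42))    # the globals dictionary wins: "shadowed"

    del type           # remove the shadow
    print(type(42) is builtins.type(42))   # the builtin again: True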

It turns out that type() is called often enough that it gets its
own specialized bytecode. It checks that the code for type() has not
changed (by way of a monkey patch or similar) and, if not, simply returns the
argument’s class. There is a call in the C API to do so and “it’s
much much cheaper than making the call”.

If the specialized instructions fail their tests enough times, they will
revert to the adaptive versions in order to change course. For example,
if the Point class starts being used with integer values,
BINARY_OP_ADD_FLOAT will revert to BINARY_OP_ADAPTIVE,
which would replace itself with BINARY_OP_ADD_INT after a few
uses. “This is what
we mean by specializing adaptive interpreter”.
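
As a hypothetical usage pattern (the exact warm-up thresholds are
implementation details of the interpreter), code that changes the types
flowing through a method will cause the same bytecode to respecialize:

    p = Point(1.0, 2.0)
    for _ in range(100):
        p = p.shifted(0.5, 0.5)   # the add specializes for floats

    q = Point(1, 2)
    for _ in range(100):
        q = q.shifted(1, 1)       # the float guards now fail; after
                                  # falling back, it respecializes for ints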

It may seem like a “bunch of small localized changes”—and it is—but they add
up to something substantial. For example, the shifted() method is
nearly twice as fast in 3.11 versus 3.10. He obviously chose
the example because it specializes well; a 2x performance increase is
probably at the upper end of what can be expected. But it does show how
replacing the generalized versions of these Python bytecode operations with
their more specialized counterparts can lead to large improvements.

Bucher said that the various pieces of information that are being
stored (e.g. the index of a dictionary entry) are actually being placed
into the bytecode itself. He called those “inline caches” and they can be
seen using the show_caches=True parameter to dis.dis().
But Python programmers should not really need to look at the caches, or
even at the adaptive instructions, because all of that should not matter.
The idea is that the interpreter will still do what it always did, but it
can now do it faster in many cases.
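
Both views are available from dis in 3.11; the adaptive output is only
interesting after the code has run enough times to specialize:

    import dis

    # Show the specialized instructions along with the CACHE entries
    # that hold the inline-cache data.
    dis.dis(Point.shifted, adaptive=True, show_caches=True)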

For those who do want to dig under the covers more, though, Bucher
recommended his specialist tool. Running
some code with specialist will generate a web page with the source
code of the program color-coded to show where the interpreter is optimizing
well—and where it is not.
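
Typical usage follows a familiar install-and-run pattern (check the
project's README for the current options):

    $ pip install specialist
    $ specialist my_script.py   # run the program, then show the report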

Coming soon

He then shifted gears to looking at things that will be coming in
CPython 3.12. To start with, the dedicated quickening step has been
eliminated and the superinstructions are simply emitted at compile time.
In addition, there will be only a single call instruction rather than the
two-step dance in 3.11.

Instead of having separate adaptive versions of
the operations, the standard bytecodes implement the adaptivity. A bytecode
only needs to be executed twice in order to become fully specialized. That
is measured on individual bytecodes, rather than a function as a whole, so
a loop will specialize immediately even if the surrounding code only gets
executed once.

The team has also added more specialized instructions.
“We have gotten better at specializing dynamic attribute accesses”, so
there are
specializations for calling an object’s __getattribute__()
and for loading properties specified using the property()
builtin. There are specializations for iterating over the four
most-common objects in a for loop: list, tuple, range, and
generator. There is also a single specialization for
yield from in
generators and await in coroutines. There are two others in the
works that may make it into CPython 3.12.

Meanwhile, the inline caches are being reduced. Having more cached
information makes the bytecode longer, which means longer jumps and more
memory use, so shrinking the amount of cached data is welcome. The team
has been able to eliminate
some cache entries or reorganize the caches to make some significant
reductions. All of that is coming in Python 3.12—and beyond.

Stay tuned for our coverage of Shannon’s talk, which came on the second day
of the conference. It will be the subject of part two, which is coming
soon to an LWN near you.

[I would like to thank LWN subscribers for supporting my travel to Salt
Lake City for PyCon.]







