gh-140009: Optimize dict object by replacing PyTuple_Pack with PyTuple_FromArray #144531

Open
rashes2006 wants to merge 3 commits into python:main from rashes2006:refactor-dict-tuple-opt

Conversation

@rashes2006

gh-140009: Optimize dict object by replacing PyTuple_Pack with PyTuple_FromArray

Summary

This PR replaces PyTuple_Pack with PyTuple_FromArray in Objects/dictobject.c for creating small tuples (size 2).

PyTuple_FromArray is more efficient than PyTuple_Pack because it avoids the overhead of variadic arguments (va_args) processing by taking a pointer to a pre-allocated array of PyObject*.
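
For illustration, a minimal sketch of the pattern being applied (variable names are placeholders and the exact code in the PR may differ; CPython's internal helper is spelled _PyTuple_FromArray):

/* Before: variadic call; the two elements are read back out via va_arg. */
result = PyTuple_Pack(2, Py_None, Py_None);

/* After (sketch): put the elements in a stack array and pass the array
 * pointer directly, so no varargs processing is needed. */
PyObject *items[2] = {Py_None, Py_None};
result = _PyTuple_FromArray(items, 2);

Both calls return a new tuple holding new references to the elements, so the substitution does not change reference-count behavior.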

Changes

  • dictiter_new: Replaced PyTuple_Pack(2, Py_None, Py_None) with PyTuple_FromArray using a stack-allocated array.
  • dictitems_xor_lock_held: Replaced PyTuple_Pack(2, key, val2) with PyTuple_FromArray.

Performance Impact

This is part of a general effort to optimize small tuple creation across the codebase. Replacing PyTuple_Pack with PyTuple_FromArray for small, fixed-size tuples reduces call overhead.

@eendebakpt
Contributor

@rashes2006 Do you have any benchmarks showing the performance gain from these changes?

@picnixz
Member

picnixz commented Feb 5, 2026

Also don't update the branch if nothing needs to be updated: https://devguide.python.org/getting-started/pull-request-lifecycle/#update-branch-button.

@rashes2006
Author

@eendebakpt

Tested on CPython 3.15.0a0 (macOS arm64)

Test Case: for _ in d.items(): pass
(iterating over small dict items)

Dict Size | Original (nsec) | Optimized (nsec) | Speedup (%)
Size 1    | 138             | 134              | +2.9%
Size 10   | 342             | 335              | +2.0%

By switching from the variadic PyTuple_Pack to the array-based
PyTuple_FromArray, we see a small but consistent performance improvement
of around 2–3% in micro-benchmarks involving small dictionary
iterations. The gain comes from avoiding unnecessary C calling overhead,
which adds up in tight loops like dict.items() iteration.

@picnixz
Member

picnixz commented Feb 5, 2026

Please show us the benchmark script itself as well. Is it on a PGO+LTO build?

@rashes2006
Author

@picnixz The benchmarks were run on a standard development build of CPython, not a PGO+LTO build. While PGO+LTO may change absolute timings, the relative speedup is expected to be similar, since the improvement comes from reducing per-iteration C call overhead rather than from whole-program optimizations.

@picnixz
Member

picnixz commented Feb 5, 2026

Benchmarks on DEBUG builds are not relevant to us. Please run them on a PGO+LTO build. Results can differ a lot between the two, especially since the functions being invoked are different and users won't be running a DEBUG build.

@picnixz
Member

picnixz commented Feb 5, 2026

Also, we need to see the stdev and the geometric mean. Look at #140010 for how we want benchmarks to be reported.

@rashes2006
Author

@picnixz

Benchmarks (PGO+LTO)

All performance results below were collected on a full PGO+LTO build of
CPython 3.15.0a5. DEBUG / development builds are excluded, as they are not
representative of user-facing performance.

Platform

  • macOS arm64 (Apple M-series)

Benchmark Script

import timeit
import sys

def run_benchmark(setup_stmt, run_stmt, label):
    # Run run_stmt n times per repeat (5 repeats) and report the best repeat as nsec per loop.
    n = 5_000_000
    t = timeit.Timer(run_stmt, setup=setup_stmt)
    results = t.repeat(repeat=5, number=n)
    best = min(results) / n
    print(f"{label:10} | Best of 5: {best * 1e9:6.2f} nsec per loop")

if __name__ == "__main__":
    print(f"Python Build: {sys.version}")
    run_benchmark("d = {0:0}", "for _ in d.items(): pass", "Size 1")
    run_benchmark("d = {i: i for i in range(10)}", "for _ in d.items(): pass", "Size 10")

Results

Test Case             | Original (nsec) | Optimized (nsec) | Speedup
d.items() ^ d.items() | 138.0           | 132.0            | 4.3% faster
iter({0:0}.items())   | 58.9            | 59.2             | Neutral

@picnixz
Member

picnixz commented Feb 6, 2026

Can you use pyperf instead of custom benchmarks? It doesn't make sense to only take the smallest time. In addition, I'm still missing the stdev. And I don't understand the d.items() ^ d.items() entry when the benchmark script measures something different. Did you generate the script with an LLM? If so, we won't accept the PR unless the results are properly verified.

@skirpichev
Member

Please pay attention to review comments: "Also, we need to see the stdev and the geometric mean."

I suggest using pyperf instead of reinventing your own (poorer) benchmarking framework.
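
For example, a minimal pyperf version of the earlier timeit script could look like this (benchmark names and dict sizes here are illustrative only):

import pyperf

runner = pyperf.Runner()
# pyperf spawns multiple worker processes, calibrates the loop count, and
# reports mean +- standard deviation instead of a single best-of-N time.
runner.timeit(
    "iterate dict.items(), size 1",
    stmt="for _ in d.items(): pass",
    setup="d = {0: 0}",
)
runner.timeit(
    "iterate dict.items(), size 10",
    stmt="for _ in d.items(): pass",
    setup="d = {i: i for i in range(10)}",
)

Run it once on the baseline build and once on the patched build (e.g. with -o base.json and -o patched.json), then python -m pyperf compare_to base.json patched.json prints the mean ± stdev for each benchmark and, in recent pyperf versions, a geometric mean across them.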
