@hildebrandmw

Optimize SIMDVector::load_simd_first for the u8, i8 and u16 data types on the x86_64::V3 architecture.

These types use the __load_first* algorithms since AVX2 does not have masked loads for 8/16-bit types. The current implementation uses a cascaded load-chain to ensure the safety contract is upheld. This results in a lot of fiddly conditional logic.

This new implementation uses at most 2 data loads (plus, in some cases, one more load of the shuffle mask from a const variable), which avoids the data-dependent load chain. It also avoids using the u128 type directly, which saves a bunch of LLVM register shenanigans.
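As a rough illustration of the "two data loads plus one shuffle-mask load" idea, here is a minimal sketch for lengths 9 through 16. It is not the code from this PR: the names `load_9_to_16`, `load_9_to_16_ssse3`, `MASKS` and `build_masks`, the table layout, and the scalar fallback are all assumptions made for the example. Only the intrinsics themselves come from `std::arch::x86_64`.

```rust
/// Hypothetical mask table: entry `n - 9` is the pshufb mask for length `n`.
/// Index bytes with the high bit set (0x80) make pshufb write a zero.
const fn build_masks() -> [[u8; 16]; 8] {
    let mut masks = [[0x80u8; 16]; 8];
    let mut k = 0;
    while k < 8 {
        let n = k + 9;
        let mut i = 0;
        while i < 16 {
            if i < 8 {
                // Low half: bytes 0..8 come straight from the first load.
                masks[k][i] = i as u8;
            } else if i < n {
                // High half: the second load starts at `n - 8`, so source
                // byte `i` sits at position `16 + i - n` of the combined register.
                masks[k][i] = (16 + i - n) as u8;
            }
            i += 1;
        }
        k += 1;
    }
    masks
}

static MASKS: [[u8; 16]; 8] = build_masks();

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "ssse3")]
unsafe fn load_9_to_16_ssse3(src: &[u8]) -> [u8; 16] {
    use std::arch::x86_64::*;
    let n = src.len();
    debug_assert!((9..=16).contains(&n));
    unsafe {
        let p = src.as_ptr();
        // Two overlapping 8-byte loads; both stay inside `src` because n >= 8.
        let lo = _mm_loadl_epi64(p as *const __m128i);
        let hi = _mm_loadl_epi64(p.add(n - 8) as *const __m128i);
        let both = _mm_unpacklo_epi64(lo, hi);
        // One extra load for the shuffle mask, then a single pshufb.
        let mask = _mm_loadu_si128(MASKS[n - 9].as_ptr() as *const __m128i);
        let v = _mm_shuffle_epi8(both, mask);
        let mut out = [0u8; 16];
        _mm_storeu_si128(out.as_mut_ptr() as *mut __m128i, v);
        out
    }
}

/// Returns the first `src.len()` bytes zero-padded to 16 bytes.
fn load_9_to_16(src: &[u8]) -> [u8; 16] {
    assert!((9..=16).contains(&src.len()));
    #[cfg(target_arch = "x86_64")]
    {
        if std::is_x86_feature_detected!("ssse3") {
            // SAFETY: SSSE3 was detected and the length was checked above.
            return unsafe { load_9_to_16_ssse3(src) };
        }
    }
    // Scalar fallback so the sketch also runs on other targets.
    let mut out = [0u8; 16];
    out[..src.len()].copy_from_slice(src);
    out
}
```

Neither load depends on the result of the other, so there is no data-dependent chain, and no byte outside `src` is ever read.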

These functions are called in the epilogue handling of many distance function implementations.

Performance results are below. This is a pretty clear win for the 8-bit case. It appears to be kind of a wash for the 16-bit case though.

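uint8 x uint8 -- squared_l2

| Dim | Before Min (ns) | After Min (ns) | Delta Min |
|-----|-----------------|----------------|-----------|
| 100 | 6.528 | 6.044 | -7.4% |
| 101 | 7.660 | 6.052 | -21.0% |
| 102 | 7.728 | 6.068 | -21.5% |
| 103 | 9.000 | 6.084 | -32.4% |
| 104 | 5.812 | 5.668 | -2.5% |
| 105 | 6.724 | 6.024 | -10.4% |
| 128 | 6.224 | 6.260 | +0.6% |
| 160 | 7.544 | 7.532 | -0.2% |

float16 x float16 -- squared_l2

| Dim | Before Min (ns) | After Min (ns) | Delta Min |
|-----|-----------------|----------------|-----------|
| 100 | 7.816 | 7.548 | -3.4% |
| 101 | 8.084 | 8.036 | -0.6% |
| 102 | 7.916 | 8.032 | +1.5% |
| 103 | 8.092 | 8.020 | -0.9% |
| 104 | 7.128 | 7.316 | +2.6% |
| 105 | 8.860 | 8.100 | -8.6% |
| 128 | 8.684 | 8.632 | -0.6% |
| 160 | 10.696 | 10.464 | -2.2% |

float16 x float16 -- inner_product

| Dim | Before Min (ns) | After Min (ns) | Delta Min |
|-----|-----------------|----------------|-----------|
| 100 | 6.756 | 6.480 | -4.1% |
| 101 | 6.988 | 7.004 | +0.2% |
| 102 | 6.804 | 6.968 | +2.4% |
| 103 | 6.988 | 7.004 | +0.2% |
| 104 | 6.140 | 6.112 | -0.5% |
| 105 | 7.492 | 6.932 | -7.5% |
| 128 | 7.556 | 7.564 | +0.1% |
| 160 | 9.408 | 9.348 | -0.6% |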


Copilot AI left a comment

Pull request overview

Optimizes partial SIMD loads on x86_64::V3 for u8/i8 and u16 element types by replacing the previous cascaded load-chain logic with overlapping-load strategies that preserve the “no out-of-bounds access” safety contract while improving throughput in distance-function epilogues.

Changes:

  • Added a new helper to efficiently load more than 8 (and up to 16) bytes using two 8-byte loads + pshufb (_mm_shuffle_epi8).
  • Reworked __load_first_of_16_bytes to use the new helper for first > 8 and overlapping GP-register reads for first <= 8.
  • Reworked __load_first_u16_of_16_bytes to use the new helper for bytes > 8 and GP-register reads for bytes <= 8, removing prior masked-load/insert logic.
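The "overlapping GP-register reads" used for 8 bytes or fewer can be modelled in portable Rust. This is a sketch under assumptions, not the code from this PR: the function name `load_first_le8` is hypothetical, and the reads go through safe slice indexing rather than raw pointer loads. The idea is to read one fixed-width word from the front of the slice and one from the back, then OR them together after shifting the back word into place. Bytes where the two reads overlap are identical, so the OR is harmless.

```rust
/// Packs up to 8 bytes into the low bytes of a u64 (little-endian order),
/// zero-filling the rest, using at most two reads per length class.
fn load_first_le8(src: &[u8]) -> u64 {
    let n = src.len();
    assert!(n <= 8);
    match n {
        0 => 0,
        1 => src[0] as u64,
        2 | 3 => {
            // Two 16-bit reads: one at the front, one ending at the last byte.
            let lo = u16::from_le_bytes([src[0], src[1]]) as u64;
            let hi = u16::from_le_bytes([src[n - 2], src[n - 1]]) as u64;
            lo | (hi << (8 * (n - 2)))
        }
        _ => {
            // Two 32-bit reads covering 4..=8 bytes, overlapping when n < 8.
            let lo = u32::from_le_bytes([src[0], src[1], src[2], src[3]]) as u64;
            let hi = u32::from_le_bytes([src[n - 4], src[n - 3], src[n - 2], src[n - 1]]) as u64;
            lo | (hi << (8 * (n - 4)))
        }
    }
}
```

For example, with `n = 3` the two 16-bit reads cover bytes 0..2 and 1..3. Byte 1 appears in both, and the shift of 8 bits lines the second read up so the shared byte lands in the same position.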

@hildebrandmw

hildebrandmw commented Feb 10, 2026

The benchmark results above can be reproduced locally by passing the following input to diskann-benchmark-simd.

JSON file
{
"search_directories": [],
"jobs": [
  {
    "type": "simd-op",
    "content": {
      "query_type": "uint8",
      "data_type": "uint8",
      "arch": "x86-64-v3",
      "runs": [
        {
          "distance": "squared_l2",
          "dim": 100,
          "num_points": 50,
          "loops_per_measurement": 5000,
          "num_measurements": 100
        },
        {
          "distance": "squared_l2",
          "dim": 101,
          "num_points": 50,
          "loops_per_measurement": 5000,
          "num_measurements": 100
        },
        {
          "distance": "squared_l2",
          "dim": 102,
          "num_points": 50,
          "loops_per_measurement": 5000,
          "num_measurements": 100
        },
        {
          "distance": "squared_l2",
          "dim": 103,
          "num_points": 50,
          "loops_per_measurement": 5000,
          "num_measurements": 100
        },
        {
          "distance": "squared_l2",
          "dim": 104,
          "num_points": 50,
          "loops_per_measurement": 5000,
          "num_measurements": 100
        },
        {
          "distance": "squared_l2",
          "dim": 105,
          "num_points": 50,
          "loops_per_measurement": 5000,
          "num_measurements": 100
        },
        {
          "distance": "squared_l2",
          "dim": 128,
          "num_points": 50,
          "loops_per_measurement": 5000,
          "num_measurements": 100
        },
        {
          "distance": "squared_l2",
          "dim": 160,
          "num_points": 50,
          "loops_per_measurement": 5000,
          "num_measurements": 100
        }
      ]
    }
  },
  {
    "type": "simd-op",
    "content": {
      "query_type": "float16",
      "data_type": "float16",
      "arch": "x86-64-v3",
      "runs": [
        {
          "distance": "squared_l2",
          "dim": 100,
          "num_points": 50,
          "loops_per_measurement": 5000,
          "num_measurements": 100
        },
        {
          "distance": "squared_l2",
          "dim": 101,
          "num_points": 50,
          "loops_per_measurement": 5000,
          "num_measurements": 100
        },
        {
          "distance": "squared_l2",
          "dim": 102,
          "num_points": 50,
          "loops_per_measurement": 5000,
          "num_measurements": 100
        },
        {
          "distance": "squared_l2",
          "dim": 103,
          "num_points": 50,
          "loops_per_measurement": 5000,
          "num_measurements": 100
        },
        {
          "distance": "squared_l2",
          "dim": 104,
          "num_points": 50,
          "loops_per_measurement": 5000,
          "num_measurements": 100
        },
        {
          "distance": "squared_l2",
          "dim": 105,
          "num_points": 50,
          "loops_per_measurement": 5000,
          "num_measurements": 100
        },
        {
          "distance": "squared_l2",
          "dim": 128,
          "num_points": 50,
          "loops_per_measurement": 5000,
          "num_measurements": 100
        },
        {
          "distance": "squared_l2",
          "dim": 160,
          "num_points": 50,
          "loops_per_measurement": 5000,
          "num_measurements": 100
        },
        {
          "distance": "inner_product",
          "dim": 100,
          "num_points": 50,
          "loops_per_measurement": 5000,
          "num_measurements": 100
        },
        {
          "distance": "inner_product",
          "dim": 101,
          "num_points": 50,
          "loops_per_measurement": 5000,
          "num_measurements": 100
        },
        {
          "distance": "inner_product",
          "dim": 102,
          "num_points": 50,
          "loops_per_measurement": 5000,
          "num_measurements": 100
        },
        {
          "distance": "inner_product",
          "dim": 103,
          "num_points": 50,
          "loops_per_measurement": 5000,
          "num_measurements": 100
        },
        {
          "distance": "inner_product",
          "dim": 104,
          "num_points": 50,
          "loops_per_measurement": 5000,
          "num_measurements": 100
        },
        {
          "distance": "inner_product",
          "dim": 105,
          "num_points": 50,
          "loops_per_measurement": 5000,
          "num_measurements": 100
        },
        {
          "distance": "inner_product",
          "dim": 128,
          "num_points": 50,
          "loops_per_measurement": 5000,
          "num_measurements": 100
        },
        {
          "distance": "inner_product",
          "dim": 160,
          "num_points": 50,
          "loops_per_measurement": 5000,
          "num_measurements": 100
        }
      ]
    }
  }
]
}

@codecov-commenter
codecov-commenter commented Feb 10, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 89.00%. Comparing base (a7aa13c) to head (3ce5d4c).

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #747      +/-   ##
==========================================
- Coverage   89.01%   89.00%   -0.01%     
==========================================
  Files         428      428              
  Lines       78294    78295       +1     
==========================================
- Hits        69691    69687       -4     
- Misses       8603     8608       +5     

| Flag | Coverage Δ |
|------|------------|
| miri | 89.00% <100.00%> (-0.01%) ⬇️ |
| unittests | 89.00% <100.00%> (-0.01%) ⬇️ |

Flags with carried forward coverage won't be shown.

| Files with missing lines | Coverage Δ |
|--------------------------|------------|
| diskann-wide/src/arch/x86_64/algorithms.rs | 100.00% <100.00%> (ø) |

... and 2 files with indirect coverage changes


@arkrishn94 arkrishn94 left a comment

LGTM, cool trick with the _load_8_to_16_bytes logic.

@hildebrandmw hildebrandmw merged commit e873811 into main Feb 10, 2026
26 checks passed
@hildebrandmw hildebrandmw deleted the mhildebr/epilogue branch February 10, 2026 19:27