[diskann-wide] Optimize `load_simd_first` for 8-bit and 16-bit element types. #747

hildebrandmw · 2026-02-10T02:54:00Z

Optimize SIMDVector::load_simd_first for u8, i8 and u16 data type on the x86_64::V3 architecture.

These types use the __load_first* algorithms since AVX2 does not have masked loads for 8/16-bit types. The current implementation uses a cascaded load-chain to ensure the safety contract is upheld. This results in a lot of fiddly conditional logic.

This new implementation uses at most 2 data loads (plus sometimes one more load from a const variable for the shuffle-mask) to avoid the data dependent chain and avoids using the u128 type directly, which saves a bunch of LLVM register shenanigans.

These functions are called in the epilogue handling of many distance function implementations.

Performance results are below. This is a pretty clear win for the 8-bit case. It appears to be kind of a wash for the 16-bit case though.

uint8 x uint8 -- squared_l2
Dim	Before Min (ns)	After Min (ns)	Delta Min
100	6.528	6.044	-7.4%
101	7.660	6.052	-21.0%
102	7.728	6.068	-21.5%
103	9.000	6.084	-32.4%
104	5.812	5.668	-2.5%
105	6.724	6.024	-10.4%
128	6.224	6.260	+0.6%
160	7.544	7.532	-0.2%

float16 x float16 -- squared_l2
Dim	Before Min (ns)	After Min (ns)	Delta Min
100	7.816	7.548	-3.4%
101	8.084	8.036	-0.6%
102	7.916	8.032	+1.5%
103	8.092	8.020	-0.9%
104	7.128	7.316	+2.6%
105	8.860	8.100	-8.6%
128	8.684	8.632	-0.6%
160	10.696	10.464	-2.2%

float16 x float16 -- inner_product
Dim	Before Min (ns)	After Min (ns)	Delta Min
100	6.756	6.480	-4.1%
101	6.988	7.004	+0.2%
102	6.804	6.968	+2.4%
103	6.988	7.004	+0.2%
104	6.140	6.112	-0.5%
105	7.492	6.932	-7.5%
128	7.556	7.564	+0.1%
160	9.408	9.348	-0.6%

Copilot

Pull request overview

Optimizes partial SIMD loads on x86_64::V3 for u8/i8 and u16 element types by replacing the previous cascaded load-chain logic with overlapping-load strategies that preserve the “no out-of-bounds access” safety contract while improving throughput in distance-function epilogues.

Changes:

Added a new helper to efficiently load (8, 16) bytes using two 8-byte loads + pshufb (_mm_shuffle_epi8).
Reworked __load_first_of_16_bytes to use the new helper for first > 8 and overlapping GP-register reads for first <= 8.
Reworked __load_first_u16_of_16_bytes to use the new helper for bytes > 8 and GP-register reads for bytes <= 8, removing prior masked-load/insert logic.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

hildebrandmw · 2026-02-10T02:59:11Z

The particular benchmark results can be run locally by using the following input to diskann-benchmark-simd.

JSON file

{
"search_directories": [],
"jobs": [
  {
    "type": "simd-op",
    "content": {
      "query_type": "uint8",
      "data_type": "uint8",
      "arch": "x86-64-v3",
      "runs": [
        {
          "distance": "squared_l2",
          "dim": 100,
          "num_points": 50,
          "loops_per_measurement": 5000,
          "num_measurements": 100
        },
        {
          "distance": "squared_l2",
          "dim": 101,
          "num_points": 50,
          "loops_per_measurement": 5000,
          "num_measurements": 100
        },
        {
          "distance": "squared_l2",
          "dim": 102,
          "num_points": 50,
          "loops_per_measurement": 5000,
          "num_measurements": 100
        },
        {
          "distance": "squared_l2",
          "dim": 103,
          "num_points": 50,
          "loops_per_measurement": 5000,
          "num_measurements": 100
        },
        {
          "distance": "squared_l2",
          "dim": 104,
          "num_points": 50,
          "loops_per_measurement": 5000,
          "num_measurements": 100
        },
        {
          "distance": "squared_l2",
          "dim": 105,
          "num_points": 50,
          "loops_per_measurement": 5000,
          "num_measurements": 100
        },
        {
          "distance": "squared_l2",
          "dim": 128,
          "num_points": 50,
          "loops_per_measurement": 5000,
          "num_measurements": 100
        },
        {
          "distance": "squared_l2",
          "dim": 160,
          "num_points": 50,
          "loops_per_measurement": 5000,
          "num_measurements": 100
        }
      ]
    }
  },
  {
    "type": "simd-op",
    "content": {
      "query_type": "float16",
      "data_type": "float16",
      "arch": "x86-64-v3",
      "runs": [
        {
          "distance": "squared_l2",
          "dim": 100,
          "num_points": 50,
          "loops_per_measurement": 5000,
          "num_measurements": 100
        },
        {
          "distance": "squared_l2",
          "dim": 101,
          "num_points": 50,
          "loops_per_measurement": 5000,
          "num_measurements": 100
        },
        {
          "distance": "squared_l2",
          "dim": 102,
          "num_points": 50,
          "loops_per_measurement": 5000,
          "num_measurements": 100
        },
        {
          "distance": "squared_l2",
          "dim": 103,
          "num_points": 50,
          "loops_per_measurement": 5000,
          "num_measurements": 100
        },
        {
          "distance": "squared_l2",
          "dim": 104,
          "num_points": 50,
          "loops_per_measurement": 5000,
          "num_measurements": 100
        },
        {
          "distance": "squared_l2",
          "dim": 105,
          "num_points": 50,
          "loops_per_measurement": 5000,
          "num_measurements": 100
        },
        {
          "distance": "squared_l2",
          "dim": 128,
          "num_points": 50,
          "loops_per_measurement": 5000,
          "num_measurements": 100
        },
        {
          "distance": "squared_l2",
          "dim": 160,
          "num_points": 50,
          "loops_per_measurement": 5000,
          "num_measurements": 100
        },
        {
          "distance": "inner_product",
          "dim": 100,
          "num_points": 50,
          "loops_per_measurement": 5000,
          "num_measurements": 100
        },
        {
          "distance": "inner_product",
          "dim": 101,
          "num_points": 50,
          "loops_per_measurement": 5000,
          "num_measurements": 100
        },
        {
          "distance": "inner_product",
          "dim": 102,
          "num_points": 50,
          "loops_per_measurement": 5000,
          "num_measurements": 100
        },
        {
          "distance": "inner_product",
          "dim": 103,
          "num_points": 50,
          "loops_per_measurement": 5000,
          "num_measurements": 100
        },
        {
          "distance": "inner_product",
          "dim": 104,
          "num_points": 50,
          "loops_per_measurement": 5000,
          "num_measurements": 100
        },
        {
          "distance": "inner_product",
          "dim": 105,
          "num_points": 50,
          "loops_per_measurement": 5000,
          "num_measurements": 100
        },
        {
          "distance": "inner_product",
          "dim": 128,
          "num_points": 50,
          "loops_per_measurement": 5000,
          "num_measurements": 100
        },
        {
          "distance": "inner_product",
          "dim": 160,
          "num_points": 50,
          "loops_per_measurement": 5000,
          "num_measurements": 100
        }
      ]
    }
  }
]
}

codecov-commenter · 2026-02-10T03:07:04Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 89.00%. Comparing base (a7aa13c) to head (3ce5d4c).

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #747      +/-   ##
==========================================
- Coverage   89.01%   89.00%   -0.01%     
==========================================
  Files         428      428              
  Lines       78294    78295       +1     
==========================================
- Hits        69691    69687       -4     
- Misses       8603     8608       +5

Flag	Coverage Δ
miri	`89.00% <100.00%> (-0.01%)`	⬇️
unittests	`89.00% <100.00%> (-0.01%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines	Coverage Δ
diskann-wide/src/arch/x86_64/algorithms.rs	`100.00% <100.00%> (ø)`

... and 2 files with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

arkrishn94

LGTM, cool trick with the _load_8_to_16_bytes logic.

diskann-wide/src/arch/x86_64/algorithms.rs

Mark Hildebrand added 3 commits February 9, 2026 16:53

Optimize load_simd_first for 8-bit and 16-bit element types.

8ef151e

Add more explanation for the new technique.

0463ed0

More doc cleanups.

3ce5d4c

hildebrandmw requested review from a team and Copilot February 10, 2026 02:54

Copilot started reviewing on behalf of hildebrandmw February 10, 2026 02:54 View session

Copilot AI reviewed Feb 10, 2026

View reviewed changes

harsha-simhadri approved these changes Feb 10, 2026

View reviewed changes

arkrishn94 approved these changes Feb 10, 2026

View reviewed changes

diskann-wide/src/arch/x86_64/algorithms.rs Show resolved Hide resolved

hildebrandmw merged commit e873811 into main Feb 10, 2026
26 checks passed

hildebrandmw deleted the mhildebr/epilogue branch February 10, 2026 19:27

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[diskann-wide] Optimize `load_simd_first` for 8-bit and 16-bit element types. #747

[diskann-wide] Optimize `load_simd_first` for 8-bit and 16-bit element types. #747

Uh oh!

hildebrandmw commented Feb 10, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

hildebrandmw commented Feb 10, 2026 •

edited

Loading

Uh oh!

codecov-commenter commented Feb 10, 2026 •

edited

Loading

Uh oh!

arkrishn94 left a comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

[diskann-wide] Optimize load_simd_first for 8-bit and 16-bit element types. #747

[diskann-wide] Optimize load_simd_first for 8-bit and 16-bit element types. #747

Uh oh!

Conversation

hildebrandmw commented Feb 10, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

hildebrandmw commented Feb 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov-commenter commented Feb 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

arkrishn94 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

[diskann-wide] Optimize `load_simd_first` for 8-bit and 16-bit element types. #747

[diskann-wide] Optimize `load_simd_first` for 8-bit and 16-bit element types. #747

hildebrandmw commented Feb 10, 2026 •

edited

Loading

codecov-commenter commented Feb 10, 2026 •

edited

Loading