Skip to content

Conversation

@suri-kumkaran
Copy link
Contributor

Summary

Introduces the rfcs/ directory with the RFC process and the first RFC proposing high-performance Chamfer distance for multi-vector representations in DiskANN.

Changes

New Files

File Description
rfcs/README.md RFC process guide: when to write one, lifecycle, submission steps, index
rfcs/0000-template.md RFC template with recommended structure
rfcs/0001-multi-vector-distance-functions.md RFC proposing Chamfer distance for ColBERT-style multi-vector search

RFC 0001 — Multi-Vector Distance Functions

Proposes a query-transposed tiling approach for Chamfer distance computation:

  • Transposes queries into a block layout, keeps documents in row-major format (no index migration)
  • Processes pairs of document vectors together, amortizing query memory loads (~50% bandwidth reduction)
  • Reuses a MaxSim scratch buffer across distance calls (zero hot-path allocation)
  • Achieves up to 2.67x speedup over SIMD baseline (avg 1.93x across 10 configurations)

Builds on existing types from diskann-quantization (Mat, MatRef, MaxSim, Chamfer) and proposes MaxSim implementing DistanceFunctionMut from diskann-vector for ecosystem integration.

Experimental benchmarks available in PR #730.

Review Focus

This PR is the RFC itself — please review the proposal and leave comments directly on the RFC file:

  • Is query-transposed tiling the right default over document-transposed approaches?
  • API shape: MaxSim implementing DistanceFunctionMut directly
  • TransposedMultiVector type design and Repr trait integration
  • Trade-offs between proposed and alternative approaches

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds an RFC framework to the DiskANN repository and introduces the first RFC describing a proposed high-performance Chamfer distance implementation for ColBERT-style multi-vector representations.

Changes:

  • Introduces rfcs/ with an RFC process guide and an index.
  • Adds an RFC template (0000-template.md) to standardize future proposals.
  • Adds RFC 0001 proposing “query-transposed tiling” for faster multi-vector Chamfer distance.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
rfcs/README.md Documents the RFC process, lifecycle, and index of RFCs.
rfcs/0000-template.md Provides a reusable RFC template and metadata format.
rfcs/0001-multi-vector-distance-functions.md Proposes the multi-vector Chamfer distance approach, API shape, and benchmarks.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@codecov-commenter
Copy link

codecov-commenter commented Feb 6, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 89.00%. Comparing base (f3fcaae) to head (b418f39).
⚠️ Report is 4 commits behind head on main.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #731      +/-   ##
==========================================
- Coverage   89.01%   89.00%   -0.01%     
==========================================
  Files         428      428              
  Lines       78294    78294              
==========================================
- Hits        69692    69686       -6     
- Misses       8602     8608       +6     
Flag Coverage Δ
miri 89.00% <ø> (?)
unittests 89.00% <ø> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.
see 7 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@suri-kumkaran
Copy link
Contributor Author

@microsoft-github-policy-service agree company="Microsoft"


## How to Submit an RFC

1. Copy [0000-template.md](0000-template.md) to `NNNN-short-title.md` (use the next available number).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we can use PR numbers instead of total ordering? I'm worried that ensuring all RFCs are sequentially ordered otherwise would be a giant pain.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there are two concerns with using PR numbers:

  1. We would need to update file names after opening PRs, which is not a very smooth process.
  2. Having a total ordering improves readability and overall cleanliness in my opinion, and we only need to ensure the ordering is correct at the time the PR is merged.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I buy the argument on a sequential ordering, though to be fair it's just a number. But, is ensuring the ordering is correct at merge time just as much (if not more) of a process headache than updating the name once with the PR number at the beginning and then never having to worry about it again?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think ensuring the ordering is correct at merge time may be more of a headache than updating the PR number at the beginning, but the benefit of maintaining sequential order makes it worthwhile.

Also, since RFCs merge will not happen very frequently, we are unlikely to need to update the sequential ordering second time before merge.

2. Fill in all sections. Remove instructional comments.
3. Open a pull request with the RFC file. The PR description should summarize the proposal.
4. Discuss in the PR. Update the RFC based on feedback.
5. Once accepted, the RFC is merged and the status is updated to **Accepted**.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Merging an RFC should be considered acceptance. This way, there's no need for additional book-keeping.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we not planning to merge rejected RFCs as well, so that we can document proposals that were discussed and ultimately rejected?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can use an RFC label to tag RFC pull requests. Then rejected RFC's will still be available as closed PRs, the final text will not bloat the repo, and the full discussion will be available. The latter is arguable more valuable than just the text.

2. Achieve 2x+ speedup over baseline SIMD through memory layout optimization
3. Maintain compatibility with DiskANN's `DistanceFunctionMut` trait
4. Provide a clean API that enables standalone distance function usage without full index integration
5. Achieve performance within 10–20% of `faer` SGEMM-based Chamfer computation, when both our implementation and `faer` are restricted to AVX2 (no AVX-512 on either side)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use AVX-512 if available? We have support, and it could likely really benefit dense operations like this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I would like to park this for now. We can pick it up in a follow up with a more thorough design for dynamic dispatching that also takes the machine architecture into account.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deferring is fine, but I don't think there need to be a super thorough design planning/review - we already have the infrastructure for dynamic dispatch and micro-architecture specific code generation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, so what's your suggestion? Should we include the brief design around dynamic dispatch in this RFC itself?

I was thinking we can work on it in separate PRs and RFCs if required in future.


### Alternative Approaches

The experimental crate explored six approaches total. The query-transposed tiling approach was selected as the proposal, but the alternatives remain available and may be better for specific workloads.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing not covered here is how this behaves under multi-threading. it's possible that single threaded tests may look fine, but if effort is not applied to restrict working set sizes to the size of the L1/L2 as much as possible, then multiple-threads could stomp on one-another's usage of L3 when running multiple computations in parallel.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a great point. While the distance kernel itself is single-threaded, I agree that minimizing the working set is crucial to prevent cache thrashing when multiple threads are running parallel searches (e.g., in a Rayon thread pool).

I feel Query-Transposed Tiling approach tackles this problem fairly well:

  1. Reduced Bandwidth Pressure: By holding a query block in vector registers (or L1) and iterating over multiple document vectors, we maximize arithmetic intensity. This reduces the frequency of fetches from L3/Memory compared to the baseline, which theoretically leaves more bandwidth available for other threads.

  2. Small Working Set: The scratch buffer (MaxSim) and the query tiles are designed to stay hot in L1/L2. The only data streaming through L3 is the document vectors themselves, which is unavoidable.

That said, theoretical scaling often differs from practice. I can add a 'Throughput Benchmark' to the RFC (running the kernel on all available cores simultaneously) to verify that we don't see sub-linear scaling due to resource contention. Does that sound like a sufficient verification?

| dim=384, doc=32, query=8 | 8,412 | 1.41x | **1.65x** | 1.30x | 1.24x |
| dim=384, doc=64, query=16 | 38,162 | 1.30x | 1.47x | **1.70x** | 1.66x |
| dim=384, doc=128, query=32 | 171,431 | 1.53x | 1.94x | 2.04x | **2.16x** |
| **Average** | — | **1.38x** | **1.79x** | **1.93x** | **1.59x** |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also see situations where there are, say, 900+ documents?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a bit far from a real use case because the Chamfer distance is very costly, but I can still benchmark it if you would like. Let me know if you want me to do that.

### Speedup vs SIMD Baseline (Median, Lower Latency = Better)

| Configuration | SIMD (µs) | transposed_simd | transposed_tiling | query_transposed_tiling | sgemm |
|--------------|-----------|-----------------|-------------------|------------------------|-------|
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small nit - can this table be adjusted to it's readable without rendering the markdown?

Copy link
Contributor Author

@suri-kumkaran suri-kumkaran Feb 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, yet to push.

@harsha-simhadri
Copy link
Contributor

Let us post the RFCs we already have that have shaped code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants