9 changes: 9 additions & 0 deletions .gitignore
@@ -4,3 +4,12 @@
*.pyo
*.pyd

# Build artifacts
*.egg-info/
dist/
build/

# Test outputs
*.trk
*.trx
*.nii.gz
122 changes: 122 additions & 0 deletions CLAUDE.md
@@ -0,0 +1,122 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

GPUStreamlines (`cuslines`) is a GPU-accelerated tractography package for diffusion MRI. It supports **three GPU backends**: NVIDIA CUDA, Apple Metal (Apple Silicon), and WebGPU (cross-platform via wgpu-py). Backend is auto-detected at import time in `cuslines/__init__.py` (priority: Metal → CUDA → WebGPU). Kernels are compiled at runtime (NVRTC for CUDA, `MTLDevice.newLibraryWithSource` for Metal, `device.create_shader_module` for WebGPU/WGSL).

## Build & Run

```bash
# Install (pick your backend)
pip install ".[cu13]" # CUDA 13
pip install ".[cu12]" # CUDA 12
pip install ".[metal]" # Apple Metal (Apple Silicon)
pip install ".[webgpu]" # WebGPU (cross-platform: NVIDIA, AMD, Intel, Apple)

# From PyPI
pip install "cuslines[cu13]"
pip install "cuslines[metal]"
pip install "cuslines[webgpu]"

# GPU run (downloads HARDI dataset if no data passed)
python run_gpu_streamlines.py --output-prefix small --nseeds 1000 --ngpus 1

# Force a specific backend
python run_gpu_streamlines.py --device=webgpu --output-prefix small --nseeds 1000

# CPU reference run (for comparison/debugging)
python run_gpu_streamlines.py --device=cpu --output-prefix small --nseeds 1000

# Docker
docker build -t gpustreamlines .
```

There is no dedicated test or lint suite. Validate by comparing CPU vs GPU outputs on the same seeds.
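
A minimal sketch of that validation step, using toy stand-in arrays (a real comparison would load two TRK/TRX outputs): since the GPU runs in float32, compare aggregate statistics rather than exact streamlines.

```python
import numpy as np

def summarize(streamlines):
    """Count and mean length (in points) for one tracking run."""
    lengths = np.array([len(s) for s in streamlines])
    return len(streamlines), lengths.mean()

def compare_runs(cpu_streamlines, gpu_streamlines,
                 count_tol=0.05, length_tol=0.10):
    """Loose agreement check: float32 GPU output will not match the
    float64 CPU path exactly, so compare aggregate statistics."""
    n_cpu, mean_cpu = summarize(cpu_streamlines)
    n_gpu, mean_gpu = summarize(gpu_streamlines)
    return (abs(n_gpu - n_cpu) / n_cpu <= count_tol
            and abs(mean_gpu - mean_cpu) / mean_cpu <= length_tol)

# Toy stand-ins for real tracking output (lists of (N, 3) point arrays)
cpu = [np.zeros((30, 3))] * 100
gpu = [np.zeros((31, 3))] * 98
assert compare_runs(cpu, gpu)
```

The tolerances here are illustrative; pick them based on the seed count and model used.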

## Architecture

**Two-layer design**: Python orchestration + GPU kernels compiled at runtime. Three parallel backend implementations share the same API surface.

```
run_gpu_streamlines.py # CLI entry: DIPY model fitting → CPU or GPU tracking
cuslines/
__init__.py # Auto-detects Metal → CUDA → WebGPU backend at import
boot_utils.py # Shared bootstrap matrix preparation (OPDT/CSA) for all backends
cuda_python/ # CUDA backend
cu_tractography.py # GPUTracker: context manager, multi-GPU allocation
cu_propagate_seeds.py # SeedBatchPropagator: chunked seed processing
cu_direction_getters.py # Direction getter ABC + Boot/Prob/PTT implementations
cutils.py # REAL_DTYPE, REAL3_DTYPE, checkCudaErrors(), ModelType enum
_globals.py # AUTO-GENERATED from globals.h (never edit manually)
cuda_c/ # CUDA kernel source
globals.h # Source-of-truth for constants (REAL_SIZE, thread config)
generate_streamlines_cuda.cu, boot.cu, ptt.cu, tracking_helpers.cu, utils.cu
cudamacro.h, cuwsort.cuh, ptt.cuh, disc.h
metal/ # Metal backend (mirrors cuda_python/)
mt_tractography.py, mt_propagate_seeds.py, mt_direction_getters.py, mutils.py
metal_shaders/ # MSL kernel source (mirrors cuda_c/)
globals.h, types.h, philox_rng.h
generate_streamlines_metal.metal, boot.metal, ptt.metal
tracking_helpers.metal, utils.metal, warp_sort.metal
webgpu/ # WebGPU backend (mirrors metal/)
wg_tractography.py, wg_propagate_seeds.py, wg_direction_getters.py, wgutils.py
benchmark.py # Cross-backend benchmark: python -m cuslines.webgpu.benchmark
wgsl_shaders/ # WGSL kernel source (mirrors metal_shaders/)
globals.wgsl, types.wgsl, philox_rng.wgsl
utils.wgsl, warp_sort.wgsl, tracking_helpers.wgsl
generate_streamlines.wgsl # Prob/PTT buffer bindings + Prob getNum/gen kernels
boot.wgsl # Boot direction getter kernels (standalone module)
disc.wgsl, ptt.wgsl # PTT support
```

**Data flow**: DIPY preprocessing → seed generation → GPUTracker context → SeedBatchPropagator chunks seeds across GPUs → kernel launch → stream results to TRK/TRX output.
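
The chunking step can be sketched as follows (illustrative only; the real `SeedBatchPropagator` also launches kernels and streams results back per chunk):

```python
import numpy as np

def chunk_seeds(seeds, ngpus, chunk_size):
    """Split a seed array across devices, then into launch-sized chunks."""
    per_gpu = np.array_split(seeds, ngpus)  # one contiguous slice per device
    return [
        [dev[i:i + chunk_size] for i in range(0, len(dev), chunk_size)]
        for dev in per_gpu
    ]

seeds = np.zeros((1000, 3))
plan = chunk_seeds(seeds, ngpus=2, chunk_size=300)
# two devices, 500 seeds each -> per-device chunks of 300 and 200 seeds
```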

**Direction getters** (subclasses of `GPUDirectionGetter`):
- `BootDirectionGetter` — bootstrap sampling from SH coefficients (OPDT/CSA models)
- `ProbDirectionGetter` — probabilistic selection from ODF/PMF (CSD model)
- `PttDirectionGetter` — Probabilistic Tracking with Turning (CSD model)

Each has `from_dipy_*()` class methods for initialization from DIPY models.
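
A minimal sketch of this class hierarchy (the kernel name and extracted attributes below are illustrative, not the actual implementation):

```python
from abc import ABC, abstractmethod

class GPUDirectionGetter(ABC):
    """Sketch of the shared interface; the real classes also carry
    model matrices and kernel launch configuration."""

    @abstractmethod
    def kernel_name(self):
        """Name of the GPU kernel this getter dispatches to."""

class ProbDirectionGetter(GPUDirectionGetter):
    def __init__(self, pmf):
        self.pmf = pmf  # ODF/PMF sampled on the sphere

    @classmethod
    def from_dipy_csd(cls, fitted_model, sphere):
        # Hypothetical extraction; the real from_dipy_*() methods differ
        return cls(pmf=fitted_model.odf(sphere))

    def kernel_name(self):
        return "genStreamlinesProb"  # illustrative kernel name
```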

## Critical Conventions

- **`_globals.py` is auto-generated** from `cuslines/cuda_c/globals.h` during `setup.py` build via `defines_to_python()`. Never edit it manually; change `globals.h` and rebuild.
- **GPU arrays must be C-contiguous** — always use `np.ascontiguousarray()` and project scalar types (`REAL_DTYPE`, `REAL_SIZE` from `cutils.py` or `mutils.py`).
- **All CUDA API calls must be wrapped** with `checkCudaErrors()`.
- **Angle units**: CLI accepts degrees, internals convert to radians before the GPU layer.
- **Multi-GPU**: CUDA uses explicit `cudaSetDevice()` calls; Metal and WebGPU are single-GPU only.
- **CPU/GPU parity**: `run_gpu_streamlines.py` maintains parallel CPU and GPU code paths — keep both in sync when changing arguments or model-selection logic.
- **Logger**: use `logging.getLogger("GPUStreamlines")`.
- **Kernel compilation**: CUDA uses `cuda.core.Program` with NVIDIA headers. Metal uses `MTLDevice.newLibraryWithSource_options_error_()` with MSL source concatenated from `metal_shaders/`. WebGPU uses `device.create_shader_module()` with WGSL source concatenated from `wgsl_shaders/`.
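
A sketch of what a `defines_to_python()`-style generator does, assuming simple `#define NAME value` lines (the real generator in `setup.py` may handle more cases):

```python
import re

def defines_to_python(header_text):
    """Turn `#define NAME value` lines into a dict of Python constants."""
    consts = {}
    for line in header_text.splitlines():
        m = re.match(r"\s*#define\s+(\w+)\s+([\w.]+)\s*$", line)
        if m:
            name, value = m.groups()
            try:
                consts[name] = int(value, 0)  # base 0 also handles hex like 0x20
            except ValueError:
                try:
                    consts[name] = float(value)
                except ValueError:
                    consts[name] = value
    return consts

header = """
#define REAL_SIZE 4
#define THREADS_PER_BLOCK 128
"""
print(defines_to_python(header))
```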

## Metal Backend Notes

- **Unified memory**: Metal buffers use `storageModeShared` — numpy arrays are directly GPU-accessible (zero memcpy per batch, vs ~6 in CUDA).
- **float3 alignment**: All buffers use `packed_float3` (12 bytes) with `load_f3()`/`store_f3()` helpers. Metal `float3` is 16 bytes in registers.
- **Page alignment**: Use `aligned_array()` from `mutils.py` for arrays passed to `newBufferWithBytesNoCopy`.
- **No double precision**: Only `REAL_SIZE=4` (float32) is ported.
- **Warp primitives**: `__shfl_sync` → `simd_shuffle`, `__ballot_sync` → `simd_ballot`. SIMD width = 32.
- **SH basis**: Always use `real_sh_descoteaux(legacy=True)` for all matrices. See `boot_utils.py`.
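
The packed layout can be illustrated with a numpy structured dtype (a layout sketch only; the actual buffers are created through the Metal API):

```python
import numpy as np

# packed_float3: three float32 with no padding -> 12-byte stride,
# matching the Metal buffer layout described above.
packed_f3 = np.dtype([("x", np.float32), ("y", np.float32), ("z", np.float32)])
assert packed_f3.itemsize == 12  # no 4-byte pad; float3 in registers is 16

points = np.zeros(100, dtype=packed_f3)  # 1200 bytes, matches GPU-side stride
```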

## WebGPU Backend Notes

- **Cross-platform**: wgpu-py maps to Metal (macOS), Vulkan (Linux/Windows), D3D12 (Windows). Install: `pip install "cuslines[webgpu]"`.
- **Explicit readbacks**: `device.queue.read_buffer()` for GPU→CPU (~3 per seed batch, matching CUDA's cudaMemcpy pattern).
- **WGSL shaders**: Concatenated in dependency order by `compile_program()`. Boot compiles standalone; Prob/PTT share `generate_streamlines.wgsl`.
- **Buffer binding**: Boot needs 17 buffers across 3 bind groups. Prob/PTT use 2 bind groups. `layout="auto"` only includes reachable bindings.
- **Subgroups required**: Device feature `"subgroup"` (singular, not `"subgroups"`). Naga does NOT support `enable subgroups;` directive.
- **WGSL constraints**: No `ptr<storage>` parameters (use module-scope accessors). `var<workgroup>` sizes must be compile-time constants. PhiloxState is pass-by-value (return result structs).
- **Boot standalone module**: `_kernel_files()` returns `[]` to avoid `params` struct redefinition.
- **Benchmark**: `python -m cuslines.webgpu.benchmark --nseeds 10000` — auto-detects all backends.
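
A sketch of the concatenation idea behind `compile_program()`, using toy in-memory sources (the real function reads the files from `wgsl_shaders/` and may differ in detail):

```python
# Dependency order matters: WGSL has no #include, so shared declarations
# must precede the kernels that use them. Names mirror wgsl_shaders/.
WGSL_ORDER = [
    "globals.wgsl", "types.wgsl", "philox_rng.wgsl",
    "utils.wgsl", "warp_sort.wgsl", "tracking_helpers.wgsl",
]

def concatenate_shaders(sources_by_name, kernel_files):
    """Join WGSL modules in dependency order into one compilable string."""
    return "\n".join(sources_by_name[n] for n in WGSL_ORDER + kernel_files)

# Toy sources; the concatenated string would then be passed to
# device.create_shader_module(code=...)
srcs = {name: f"// {name}" for name in WGSL_ORDER + ["boot.wgsl"]}
program = concatenate_shaders(srcs, ["boot.wgsl"])
```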

## Key Dependencies

- `dipy` — diffusion models, CPU direction getters, seeding, stopping criteria
- `nibabel` — NIfTI/TRK file I/O (`StatefulTractogram`)
- `trx-python` — TRX format support (memory-mapped, for large outputs)
- `cuda-python` / `cuda-core` / `cuda-cccl` — CUDA Python bindings, kernel compilation, C++ headers
- `pyobjc-framework-Metal` / `pyobjc-framework-MetalPerformanceShaders` — Metal Python bindings (macOS only)
- `wgpu` — WebGPU Python bindings (wgpu-native, cross-platform)
- `numpy` — array operations throughout
2 changes: 1 addition & 1 deletion Dockerfile
@@ -5,7 +5,7 @@ SHELL ["/bin/bash", "-c"]

ENV DEBIAN_FRONTEND=noninteractive

-RUN apt-get update && apt-get install --assume-yes curl
RUN apt-get update && apt-get install --assume-yes curl git

RUN curl -L "https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh" \
-o "/tmp/Miniconda3.sh"
24 changes: 21 additions & 3 deletions README.md
@@ -1,14 +1,26 @@
# GPUStreamlines

## Installation
-To install from pypi, simply run `pip install "cuslines[cu13]"` or `pip install "cuslines[cu12]"` depending on your CUDA version.
To install from pypi:
```
pip install "cuslines[cu13]" # CUDA 13 (NVIDIA)
pip install "cuslines[cu12]" # CUDA 12 (NVIDIA)
pip install "cuslines[metal]" # Apple Metal (Apple Silicon)
pip install "cuslines[webgpu]" # WebGPU (cross-platform: NVIDIA, AMD, Intel, Apple)
```

-To install from dev, simply run `pip install ".[cu13]"` or `pip install ".[cu12]"` in the top-level repository directory.
To install from dev:
```
pip install ".[cu13]" # CUDA 13
pip install ".[cu12]" # CUDA 12
pip install ".[metal]" # Apple Metal
pip install ".[webgpu]" # WebGPU (any GPU)
```

## Running the examples
This repository contains several example usage scripts.

-The script `run_gpu_streamlines.py` demonstrates how to run any diffusion MRI dataset on the GPU. It can also run on the CPU for reference, if the argument `--device=cpu` is used. If not data is passed, it will donaload and use the HARDI dataset.
The script `run_gpu_streamlines.py` demonstrates how to run any diffusion MRI dataset on the GPU. It can also run on the CPU for reference, if the argument `--device=cpu` is used. If no data is passed, it will download and use the HARDI dataset.

To run the baseline CPU example on a random set of 1000 seeds, this is the command and example output:
```
@@ -52,6 +64,12 @@ Note that if you experience memory errors, you can adjust the `--chunk-size` fla

To run on more seeds, we suggest setting the `--write-method trx` flag in the GPU script to not get bottlenecked by writing files.

## GPU vs CPU differences

GPU backends (CUDA, Metal, and WebGPU) operate in float32 while DIPY uses float64. This causes slightly different peak selection at fiber crossings where ODF peaks have similar magnitudes. In practice the GPU produces comparable streamline counts and commissural fiber density, with modestly longer fibers on average. See [cuslines/webgpu/README.md](cuslines/webgpu/README.md) for cross-platform benchmarks and [cuslines/metal/README.md](cuslines/metal/README.md) for Metal-specific details.

The WebGPU backend runs on any GPU (NVIDIA, AMD, Intel, Apple) via [wgpu-py](https://github.com/pygfx/wgpu-py). It is auto-detected when no vendor-specific backend is available. See `python -m cuslines.webgpu.benchmark` for a self-contained benchmark across all available backends.
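
A toy illustration of how float32 rounding can flip peak selection between near-tied values (constructed numbers, not real ODF data):

```python
import numpy as np

# Two near-tied "peak" magnitudes, constructed so that float32 rounding
# collapses the difference between them.
peaks64 = np.array([1.00000007, 1.00000008], dtype=np.float64)
peaks32 = peaks64.astype(np.float32)

print(np.argmax(peaks64))  # 1 -- float64 resolves the tie
print(np.argmax(peaks32))  # 0 -- both round to the same float32; first wins
```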

## Running on AWS with Docker
First, set up an AWS instance with GPU and ssh into it (we recommend a P3 instance with at least 1 V100 16 GB GPU and a Deep Learning AMI Ubuntu 18.04 v 33.0.). Then do the following:
1. Log in to GitHub docker registry:
72 changes: 65 additions & 7 deletions cuslines/__init__.py
@@ -1,13 +1,71 @@
-from .cuda_python import (
-    GPUTracker,
-    ProbDirectionGetter,
-    PttDirectionGetter,
-    BootDirectionGetter
-)
import platform as _platform


def _detect_backend():
    """Auto-detect the best available GPU backend."""
    system = _platform.system()
    if system == "Darwin":
        try:
            import Metal

            if Metal.MTLCreateSystemDefaultDevice() is not None:
                return "metal"
        except ImportError:
            pass
    try:
        from cuda.bindings import runtime

        count = runtime.cudaGetDeviceCount()
        if count[1] > 0:
            return "cuda"
    except (ImportError, Exception):
        pass
    try:
        import wgpu

        adapter = wgpu.gpu.request_adapter_sync()
        if adapter is not None:
            return "webgpu"
    except (ImportError, Exception):
        pass
    return None


BACKEND = _detect_backend()

if BACKEND == "metal":
    from cuslines.metal import (
        MetalGPUTracker as GPUTracker,
        MetalProbDirectionGetter as ProbDirectionGetter,
        MetalPttDirectionGetter as PttDirectionGetter,
        MetalBootDirectionGetter as BootDirectionGetter,
    )
elif BACKEND == "cuda":
    from cuslines.cuda_python import (
        GPUTracker,
        ProbDirectionGetter,
        PttDirectionGetter,
        BootDirectionGetter,
    )
elif BACKEND == "webgpu":
    from cuslines.webgpu import (
        WebGPUTracker as GPUTracker,
        WebGPUProbDirectionGetter as ProbDirectionGetter,
        WebGPUPttDirectionGetter as PttDirectionGetter,
        WebGPUBootDirectionGetter as BootDirectionGetter,
    )
else:
    raise ImportError(
        "No GPU backend available. Install either:\n"
        "  - CUDA: pip install 'cuslines[cu13]' (NVIDIA GPU)\n"
        "  - Metal: pip install 'cuslines[metal]' (Apple Silicon)\n"
        "  - WebGPU: pip install 'cuslines[webgpu]' (cross-platform)"
    )

__all__ = [
    "GPUTracker",
    "ProbDirectionGetter",
    "PttDirectionGetter",
-    "BootDirectionGetter"
    "BootDirectionGetter",
    "BACKEND",
]
72 changes: 72 additions & 0 deletions cuslines/boot_utils.py
@@ -0,0 +1,72 @@
"""Shared utilities for bootstrap direction getters (CUDA and Metal).

Extracts DIPY model matrices (H, R, delta_b, delta_q, sampling_matrix)
for OPDT and CSA models. Both backends need the same matrices — only
the GPU dispatch differs.
"""

from dipy.reconst import shm


def prepare_opdt(gtab, sphere, sh_order_max=6, full_basis=False,
                 sh_lambda=0.006, min_signal=1):
    """Build bootstrap matrices for the OPDT model.

    Returns dict with keys: model_type, min_signal, H, R, delta_b,
    delta_q, sampling_matrix, b0s_mask.
    """
    sampling_matrix, _, _ = shm.real_sh_descoteaux(
        sh_order_max, sphere.theta, sphere.phi,
        full_basis=full_basis, legacy=True,
    )
    model = shm.OpdtModel(
        gtab, sh_order_max=sh_order_max, smooth=sh_lambda,
        min_signal=min_signal,
    )
    delta_b, delta_q = model._fit_matrix

    H, R = _hat_and_lcr(gtab, model, sh_order_max)

    return dict(
        model_type="OPDT", min_signal=min_signal,
        H=H, R=R, delta_b=delta_b, delta_q=delta_q,
        sampling_matrix=sampling_matrix, b0s_mask=gtab.b0s_mask,
    )


def prepare_csa(gtab, sphere, sh_order_max=6, full_basis=False,
                sh_lambda=0.006, min_signal=1):
    """Build bootstrap matrices for the CSA model.

    Returns dict with keys: model_type, min_signal, H, R, delta_b,
    delta_q, sampling_matrix, b0s_mask.
    """
    sampling_matrix, _, _ = shm.real_sh_descoteaux(
        sh_order_max, sphere.theta, sphere.phi,
        full_basis=full_basis, legacy=True,
    )
    model = shm.CsaOdfModel(
        gtab, sh_order_max=sh_order_max, smooth=sh_lambda,
        min_signal=min_signal,
    )
    delta_b = model._fit_matrix
    delta_q = model._fit_matrix

    H, R = _hat_and_lcr(gtab, model, sh_order_max)

    return dict(
        model_type="CSA", min_signal=min_signal,
        H=H, R=R, delta_b=delta_b, delta_q=delta_q,
        sampling_matrix=sampling_matrix, b0s_mask=gtab.b0s_mask,
    )


def _hat_and_lcr(gtab, model, sh_order_max):
    """Compute hat matrix H and leveraged centered residuals matrix R."""
    dwi_mask = ~gtab.b0s_mask
    x, y, z = model.gtab.gradients[dwi_mask].T
    _, theta, phi = shm.cart2sphere(x, y, z)
    B, _, _ = shm.real_sh_descoteaux(sh_order_max, theta, phi, legacy=True)
    H = shm.hat(B)
    R = shm.lcr_matrix(H)
    return H, R