9 changes: 9 additions & 0 deletions .gitignore
@@ -4,3 +4,12 @@
*.pyo
*.pyd

# Build artifacts
*.egg-info/
dist/
build/

# Test outputs
*.trk
*.trx
*.nii.gz
122 changes: 122 additions & 0 deletions CLAUDE.md
@@ -0,0 +1,122 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

GPUStreamlines (`cuslines`) is a GPU-accelerated tractography package for diffusion MRI. It supports **three GPU backends**: NVIDIA CUDA, Apple Metal (Apple Silicon), and WebGPU (cross-platform via wgpu-py). Backend is auto-detected at import time in `cuslines/__init__.py` (priority: Metal → CUDA → WebGPU). Kernels are compiled at runtime (NVRTC for CUDA, `MTLDevice.newLibraryWithSource` for Metal, `device.create_shader_module` for WebGPU/WGSL).

## Build & Run

```bash
# Install (pick your backend)
pip install ".[cu13]" # CUDA 13
pip install ".[cu12]" # CUDA 12
pip install ".[metal]" # Apple Metal (Apple Silicon)
pip install ".[webgpu]" # WebGPU (cross-platform: NVIDIA, AMD, Intel, Apple)

# From PyPI
pip install "cuslines[cu13]"
pip install "cuslines[metal]"
pip install "cuslines[webgpu]"

# GPU run (downloads HARDI dataset if no data passed)
python run_gpu_streamlines.py --output-prefix small --nseeds 1000 --ngpus 1

# Force a specific backend
python run_gpu_streamlines.py --device=webgpu --output-prefix small --nseeds 1000

# CPU reference run (for comparison/debugging)
python run_gpu_streamlines.py --device=cpu --output-prefix small --nseeds 1000

# Docker
docker build -t gpustreamlines .
```

There is no dedicated test or lint suite. Validate by comparing CPU vs GPU outputs on the same seeds.
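
A minimal sketch of that validation step, using toy stand-in arrays (a real comparison would load two TRK/TRX outputs): since the GPU runs in float32, compare aggregate statistics rather than exact streamlines.

```python
import numpy as np

def summarize(streamlines):
    """Count and mean length (in points) for one tracking run."""
    lengths = np.array([len(s) for s in streamlines])
    return len(streamlines), lengths.mean()

def compare_runs(cpu_streamlines, gpu_streamlines,
                 count_tol=0.05, length_tol=0.10):
    """Loose agreement check: float32 GPU output will not match the
    float64 CPU path exactly, so compare aggregate statistics."""
    n_cpu, mean_cpu = summarize(cpu_streamlines)
    n_gpu, mean_gpu = summarize(gpu_streamlines)
    return (abs(n_gpu - n_cpu) / n_cpu <= count_tol
            and abs(mean_gpu - mean_cpu) / mean_cpu <= length_tol)

# Toy stand-ins for real tracking output (lists of (N, 3) point arrays)
cpu = [np.zeros((30, 3))] * 100
gpu = [np.zeros((31, 3))] * 98
assert compare_runs(cpu, gpu)
```

The tolerances here are illustrative; pick them based on the seed count and model used.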

## Architecture

**Two-layer design**: Python orchestration + GPU kernels compiled at runtime. Three parallel backend implementations share the same API surface.

```
run_gpu_streamlines.py # CLI entry: DIPY model fitting → CPU or GPU tracking
cuslines/
__init__.py # Auto-detects Metal → CUDA → WebGPU backend at import
boot_utils.py # Shared bootstrap matrix preparation (OPDT/CSA) for all backends
cuda_python/ # CUDA backend
cu_tractography.py # GPUTracker: context manager, multi-GPU allocation
cu_propagate_seeds.py # SeedBatchPropagator: chunked seed processing
cu_direction_getters.py # Direction getter ABC + Boot/Prob/PTT implementations
cutils.py # REAL_DTYPE, REAL3_DTYPE, checkCudaErrors(), ModelType enum
_globals.py # AUTO-GENERATED from globals.h (never edit manually)
cuda_c/ # CUDA kernel source
globals.h # Source-of-truth for constants (REAL_SIZE, thread config)
generate_streamlines_cuda.cu, boot.cu, ptt.cu, tracking_helpers.cu, utils.cu
cudamacro.h, cuwsort.cuh, ptt.cuh, disc.h
metal/ # Metal backend (mirrors cuda_python/)
mt_tractography.py, mt_propagate_seeds.py, mt_direction_getters.py, mutils.py
metal_shaders/ # MSL kernel source (mirrors cuda_c/)
globals.h, types.h, philox_rng.h
generate_streamlines_metal.metal, boot.metal, ptt.metal
tracking_helpers.metal, utils.metal, warp_sort.metal
webgpu/ # WebGPU backend (mirrors metal/)
wg_tractography.py, wg_propagate_seeds.py, wg_direction_getters.py, wgutils.py
benchmark.py # Cross-backend benchmark: python -m cuslines.webgpu.benchmark
wgsl_shaders/ # WGSL kernel source (mirrors metal_shaders/)
globals.wgsl, types.wgsl, philox_rng.wgsl
utils.wgsl, warp_sort.wgsl, tracking_helpers.wgsl
generate_streamlines.wgsl # Prob/PTT buffer bindings + Prob getNum/gen kernels
boot.wgsl # Boot direction getter kernels (standalone module)
disc.wgsl, ptt.wgsl # PTT support
```

**Data flow**: DIPY preprocessing → seed generation → GPUTracker context → SeedBatchPropagator chunks seeds across GPUs → kernel launch → stream results to TRK/TRX output.
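
The chunking step can be sketched as follows (illustrative only; the real `SeedBatchPropagator` also launches kernels and streams results back per chunk):

```python
import numpy as np

def chunk_seeds(seeds, ngpus, chunk_size):
    """Split a seed array across devices, then into launch-sized chunks."""
    per_gpu = np.array_split(seeds, ngpus)  # one contiguous slice per device
    return [
        [dev[i:i + chunk_size] for i in range(0, len(dev), chunk_size)]
        for dev in per_gpu
    ]

seeds = np.zeros((1000, 3))
plan = chunk_seeds(seeds, ngpus=2, chunk_size=300)
# two devices, 500 seeds each -> per-device chunks of 300 and 200 seeds
```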

**Direction getters** (subclasses of `GPUDirectionGetter`):
- `BootDirectionGetter` — bootstrap sampling from SH coefficients (OPDT/CSA models)
- `ProbDirectionGetter` — probabilistic selection from ODF/PMF (CSD model)
- `PttDirectionGetter` — Probabilistic Tracking with Turning (CSD model)

Each has `from_dipy_*()` class methods for initialization from DIPY models.
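
A minimal sketch of this class hierarchy (the kernel name and extracted attributes below are illustrative, not the actual implementation):

```python
from abc import ABC, abstractmethod

class GPUDirectionGetter(ABC):
    """Sketch of the shared interface; the real classes also carry
    model matrices and kernel launch configuration."""

    @abstractmethod
    def kernel_name(self):
        """Name of the GPU kernel this getter dispatches to."""

class ProbDirectionGetter(GPUDirectionGetter):
    def __init__(self, pmf):
        self.pmf = pmf  # ODF/PMF sampled on the sphere

    @classmethod
    def from_dipy_csd(cls, fitted_model, sphere):
        # Hypothetical extraction; the real from_dipy_*() methods differ
        return cls(pmf=fitted_model.odf(sphere))

    def kernel_name(self):
        return "genStreamlinesProb"  # illustrative kernel name
```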

## Critical Conventions

- **`_globals.py` is auto-generated** from `cuslines/cuda_c/globals.h` during `setup.py` build via `defines_to_python()`. Never edit it manually; change `globals.h` and rebuild.
- **GPU arrays must be C-contiguous** — always use `np.ascontiguousarray()` and project scalar types (`REAL_DTYPE`, `REAL_SIZE` from `cutils.py` or `mutils.py`).
- **All CUDA API calls must be wrapped** with `checkCudaErrors()`.
- **Angle units**: CLI accepts degrees, internals convert to radians before the GPU layer.
- **Multi-GPU**: CUDA uses explicit `cudaSetDevice()` calls; Metal and WebGPU are single-GPU only.
- **CPU/GPU parity**: `run_gpu_streamlines.py` maintains parallel CPU and GPU code paths — keep both in sync when changing arguments or model-selection logic.
- **Logger**: use `logging.getLogger("GPUStreamlines")`.
- **Kernel compilation**: CUDA uses `cuda.core.Program` with NVIDIA headers. Metal uses `MTLDevice.newLibraryWithSource_options_error_()` with MSL source concatenated from `metal_shaders/`. WebGPU uses `device.create_shader_module()` with WGSL source concatenated from `wgsl_shaders/`.
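
A sketch of what a `defines_to_python()`-style generator does, assuming simple `#define NAME value` lines (the real generator in `setup.py` may handle more cases):

```python
import re

def defines_to_python(header_text):
    """Turn `#define NAME value` lines into a dict of Python constants."""
    consts = {}
    for line in header_text.splitlines():
        m = re.match(r"\s*#define\s+(\w+)\s+([\w.]+)\s*$", line)
        if m:
            name, value = m.groups()
            try:
                consts[name] = int(value, 0)  # base 0 also handles hex like 0x20
            except ValueError:
                try:
                    consts[name] = float(value)
                except ValueError:
                    consts[name] = value
    return consts

header = """
#define REAL_SIZE 4
#define THREADS_PER_BLOCK 128
"""
print(defines_to_python(header))
```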

## Metal Backend Notes

- **Unified memory**: Metal buffers use `storageModeShared` — numpy arrays are directly GPU-accessible (zero memcpy per batch, vs ~6 in CUDA).
- **float3 alignment**: All buffers use `packed_float3` (12 bytes) with `load_f3()`/`store_f3()` helpers. Metal `float3` is 16 bytes in registers.
- **Page alignment**: Use `aligned_array()` from `mutils.py` for arrays passed to `newBufferWithBytesNoCopy`.
- **No double precision**: Only `REAL_SIZE=4` (float32) is ported.
- **Warp primitives**: `__shfl_sync` → `simd_shuffle`, `__ballot_sync` → `simd_ballot`. SIMD width = 32.
- **SH basis**: Always use `real_sh_descoteaux(legacy=True)` for all matrices. See `boot_utils.py`.
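
The packed layout can be illustrated with a numpy structured dtype (a layout sketch only; the actual buffers are created through the Metal API):

```python
import numpy as np

# packed_float3: three float32 with no padding -> 12-byte stride,
# matching the Metal buffer layout described above.
packed_f3 = np.dtype([("x", np.float32), ("y", np.float32), ("z", np.float32)])
assert packed_f3.itemsize == 12  # no 4-byte pad; float3 in registers is 16

points = np.zeros(100, dtype=packed_f3)  # 1200 bytes, matches GPU-side stride
```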

## WebGPU Backend Notes

- **Cross-platform**: wgpu-py maps to Metal (macOS), Vulkan (Linux/Windows), D3D12 (Windows). Install: `pip install "cuslines[webgpu]"`.
- **Explicit readbacks**: `device.queue.read_buffer()` for GPU→CPU (~3 per seed batch, matching CUDA's cudaMemcpy pattern).
- **WGSL shaders**: Concatenated in dependency order by `compile_program()`. Boot compiles standalone; Prob/PTT share `generate_streamlines.wgsl`.
- **Buffer binding**: Boot needs 17 buffers across 3 bind groups. Prob/PTT use 2 bind groups. `layout="auto"` only includes reachable bindings.
- **Subgroups required**: Device feature `"subgroup"` (singular, not `"subgroups"`). Naga does NOT support `enable subgroups;` directive.
- **WGSL constraints**: No `ptr<storage>` parameters (use module-scope accessors). `var<workgroup>` sizes must be compile-time constants. PhiloxState is pass-by-value (return result structs).
- **Boot standalone module**: `_kernel_files()` returns `[]` to avoid `params` struct redefinition.
- **Benchmark**: `python -m cuslines.webgpu.benchmark --nseeds 10000` — auto-detects all backends.
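
A sketch of the concatenation idea behind `compile_program()`, using toy in-memory sources (the real function reads the files from `wgsl_shaders/` and may differ in detail):

```python
# Dependency order matters: WGSL has no #include, so shared declarations
# must precede the kernels that use them. Names mirror wgsl_shaders/.
WGSL_ORDER = [
    "globals.wgsl", "types.wgsl", "philox_rng.wgsl",
    "utils.wgsl", "warp_sort.wgsl", "tracking_helpers.wgsl",
]

def concatenate_shaders(sources_by_name, kernel_files):
    """Join WGSL modules in dependency order into one compilable string."""
    return "\n".join(sources_by_name[n] for n in WGSL_ORDER + kernel_files)

# Toy sources; the concatenated string would then be passed to
# device.create_shader_module(code=...)
srcs = {name: f"// {name}" for name in WGSL_ORDER + ["boot.wgsl"]}
program = concatenate_shaders(srcs, ["boot.wgsl"])
```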

## Key Dependencies

- `dipy` — diffusion models, CPU direction getters, seeding, stopping criteria
- `nibabel` — NIfTI/TRK file I/O (`StatefulTractogram`)
- `trx-python` — TRX format support (memory-mapped, for large outputs)
- `cuda-python` / `cuda-core` / `cuda-cccl` — CUDA Python bindings, kernel compilation, C++ headers
- `pyobjc-framework-Metal` / `pyobjc-framework-MetalPerformanceShaders` — Metal Python bindings (macOS only)
- `wgpu` — WebGPU Python bindings (wgpu-native, cross-platform)
- `numpy` — array operations throughout
2 changes: 1 addition & 1 deletion Dockerfile
@@ -5,7 +5,7 @@ SHELL ["/bin/bash", "-c"]

ENV DEBIAN_FRONTEND=noninteractive

-RUN apt-get update && apt-get install --assume-yes curl
RUN apt-get update && apt-get install --assume-yes curl git

RUN curl -L "https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh" \
-o "/tmp/Miniconda3.sh"
24 changes: 21 additions & 3 deletions README.md
@@ -1,14 +1,26 @@
# GPUStreamlines

## Installation
-To install from pypi, simply run `pip install "cuslines[cu13]"` or `pip install "cuslines[cu12]"` depending on your CUDA version.
To install from pypi:
```
pip install "cuslines[cu13]" # CUDA 13 (NVIDIA)
pip install "cuslines[cu12]" # CUDA 12 (NVIDIA)
pip install "cuslines[metal]" # Apple Metal (Apple Silicon)
pip install "cuslines[webgpu]" # WebGPU (cross-platform: NVIDIA, AMD, Intel, Apple)
```

-To install from dev, simply run `pip install ".[cu13]"` or `pip install ".[cu12]"` in the top-level repository directory.
To install from dev:
```
pip install ".[cu13]" # CUDA 13
pip install ".[cu12]" # CUDA 12
pip install ".[metal]" # Apple Metal
pip install ".[webgpu]" # WebGPU (any GPU)
```

## Running the examples
This repository contains several example usage scripts.

-The script `run_gpu_streamlines.py` demonstrates how to run any diffusion MRI dataset on the GPU. It can also run on the CPU for reference, if the argument `--device=cpu` is used. If not data is passed, it will donaload and use the HARDI dataset.
The script `run_gpu_streamlines.py` demonstrates how to run any diffusion MRI dataset on the GPU. It can also run on the CPU for reference, if the argument `--device=cpu` is used. If no data is passed, it will download and use the HARDI dataset.

To run the baseline CPU example on a random set of 1000 seeds, this is the command and example output:
```
@@ -52,6 +64,12 @@ Note that if you experience memory errors, you can adjust the `--chunk-size` fla

To run on more seeds, we suggest setting the `--write-method trx` flag in the GPU script to not get bottlenecked by writing files.

## GPU vs CPU differences

GPU backends (CUDA, Metal, and WebGPU) operate in float32 while DIPY uses float64. This causes slightly different peak selection at fiber crossings where ODF peaks have similar magnitudes. In practice the GPU produces comparable streamline counts and commissural fiber density, with modestly longer fibers on average. See [cuslines/webgpu/README.md](cuslines/webgpu/README.md) for cross-platform benchmarks and [cuslines/metal/README.md](cuslines/metal/README.md) for Metal-specific details.

The WebGPU backend runs on any GPU (NVIDIA, AMD, Intel, Apple) via [wgpu-py](https://github.com/pygfx/wgpu-py). It is auto-detected when no vendor-specific backend is available. See `python -m cuslines.webgpu.benchmark` for a self-contained benchmark across all available backends.
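
A toy illustration of how float32 rounding can flip peak selection between near-tied values (constructed numbers, not real ODF data):

```python
import numpy as np

# Two near-tied "peak" magnitudes, constructed so that float32 rounding
# collapses the difference between them.
peaks64 = np.array([1.00000007, 1.00000008], dtype=np.float64)
peaks32 = peaks64.astype(np.float32)

print(np.argmax(peaks64))  # 1 -- float64 resolves the tie
print(np.argmax(peaks32))  # 0 -- both round to the same float32; first wins
```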

## Running on AWS with Docker
First, set up an AWS instance with GPU and ssh into it (we recommend a P3 instance with at least 1 V100 16 GB GPU and a Deep Learning AMI Ubuntu 18.04 v 33.0.). Then do the following:
1. Log in to GitHub docker registry:
72 changes: 65 additions & 7 deletions cuslines/__init__.py
@@ -1,13 +1,71 @@
-from .cuda_python import (
-    GPUTracker,
-    ProbDirectionGetter,
-    PttDirectionGetter,
-    BootDirectionGetter
-)
import platform as _platform


def _detect_backend():
    """Auto-detect the best available GPU backend."""
    system = _platform.system()
    if system == "Darwin":
        try:
            import Metal

            if Metal.MTLCreateSystemDefaultDevice() is not None:
                return "metal"
        except ImportError:
            pass
    try:
        from cuda.bindings import runtime

        count = runtime.cudaGetDeviceCount()
        if count[1] > 0:
            return "cuda"
    except (ImportError, Exception):
        pass
    try:
        import wgpu

        adapter = wgpu.gpu.request_adapter_sync()
        if adapter is not None:
            return "webgpu"
    except (ImportError, Exception):
        pass
    return None


BACKEND = _detect_backend()

if BACKEND == "metal":
    from cuslines.metal import (
        MetalGPUTracker as GPUTracker,
        MetalProbDirectionGetter as ProbDirectionGetter,
        MetalPttDirectionGetter as PttDirectionGetter,
        MetalBootDirectionGetter as BootDirectionGetter,
    )
elif BACKEND == "cuda":
    from cuslines.cuda_python import (
        GPUTracker,
        ProbDirectionGetter,
        PttDirectionGetter,
        BootDirectionGetter,
    )
elif BACKEND == "webgpu":
    from cuslines.webgpu import (
        WebGPUTracker as GPUTracker,
        WebGPUProbDirectionGetter as ProbDirectionGetter,
        WebGPUPttDirectionGetter as PttDirectionGetter,
        WebGPUBootDirectionGetter as BootDirectionGetter,
    )
else:
    raise ImportError(
        "No GPU backend available. Install either:\n"
        "  - CUDA: pip install 'cuslines[cu13]' (NVIDIA GPU)\n"
        "  - Metal: pip install 'cuslines[metal]' (Apple Silicon)\n"
        "  - WebGPU: pip install 'cuslines[webgpu]' (cross-platform)"
    )

__all__ = [
    "GPUTracker",
    "ProbDirectionGetter",
    "PttDirectionGetter",
-    "BootDirectionGetter"
    "BootDirectionGetter",
    "BACKEND",
]
72 changes: 72 additions & 0 deletions cuslines/boot_utils.py
@@ -0,0 +1,72 @@
"""Shared utilities for bootstrap direction getters (CUDA and Metal).

Extracts DIPY model matrices (H, R, delta_b, delta_q, sampling_matrix)
for OPDT and CSA models. Both backends need the same matrices — only
the GPU dispatch differs.
"""

from dipy.reconst import shm


def prepare_opdt(gtab, sphere, sh_order_max=6, full_basis=False,
                 sh_lambda=0.006, min_signal=1):
    """Build bootstrap matrices for the OPDT model.

    Returns dict with keys: model_type, min_signal, H, R, delta_b,
    delta_q, sampling_matrix, b0s_mask.
    """
    sampling_matrix, _, _ = shm.real_sh_descoteaux(
        sh_order_max, sphere.theta, sphere.phi,
        full_basis=full_basis, legacy=True,
    )
    model = shm.OpdtModel(
        gtab, sh_order_max=sh_order_max, smooth=sh_lambda,
        min_signal=min_signal,
    )
    delta_b, delta_q = model._fit_matrix

    H, R = _hat_and_lcr(gtab, model, sh_order_max)

    return dict(
        model_type="OPDT", min_signal=min_signal,
        H=H, R=R, delta_b=delta_b, delta_q=delta_q,
        sampling_matrix=sampling_matrix, b0s_mask=gtab.b0s_mask,
    )


def prepare_csa(gtab, sphere, sh_order_max=6, full_basis=False,
                sh_lambda=0.006, min_signal=1):
    """Build bootstrap matrices for the CSA model.

    Returns dict with keys: model_type, min_signal, H, R, delta_b,
    delta_q, sampling_matrix, b0s_mask.
    """
    sampling_matrix, _, _ = shm.real_sh_descoteaux(
        sh_order_max, sphere.theta, sphere.phi,
        full_basis=full_basis, legacy=True,
    )
    model = shm.CsaOdfModel(
        gtab, sh_order_max=sh_order_max, smooth=sh_lambda,
        min_signal=min_signal,
    )
    delta_b = model._fit_matrix
    delta_q = model._fit_matrix

    H, R = _hat_and_lcr(gtab, model, sh_order_max)

    return dict(
        model_type="CSA", min_signal=min_signal,
        H=H, R=R, delta_b=delta_b, delta_q=delta_q,
        sampling_matrix=sampling_matrix, b0s_mask=gtab.b0s_mask,
    )


def _hat_and_lcr(gtab, model, sh_order_max):
    """Compute hat matrix H and leveraged centered residuals matrix R."""
    dwi_mask = ~gtab.b0s_mask
    x, y, z = model.gtab.gradients[dwi_mask].T
    _, theta, phi = shm.cart2sphere(x, y, z)
    B, _, _ = shm.real_sh_descoteaux(sh_order_max, theta, phi, legacy=True)
    H = shm.hat(B)
    R = shm.lcr_matrix(H)
    return H, R