
Conversation

@d33bs (Member) commented Feb 6, 2026

Description

This PR provides tensor-friendly methods for loading images into torch or jax by using OMEArrow.to_dlpack. The goal is to provide zero-copy capabilities in tensor environments through Arrow, avoiding unnecessary conversions where possible.
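As a rough sketch of the intended workflow (hypothetical file name and illustrative parameter values; per the review notes below, the concrete surface is OMEArrow.tensor_view() plus TensorView.to_dlpack()):

```python
# Sketch only: "example.ome.parquet" is a hypothetical file and the parameter
# values are illustrative; see the walkthrough below for the actual API names.
import torch
from ome_arrow import OMEArrow

data = OMEArrow("example.ome.parquet")           # Arrow-backed ingest
view = data.tensor_view(t=0, z=0, layout="CHW")  # select a plane as CHW
tensor = torch.from_dlpack(view.to_dlpack())     # zero-copy when Arrow-backed
```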

Closes #33

What kind of change(s) are included?

  • Documentation (changes docs or other related content).
  • Bug fix (fixes an issue).
  • Enhancement (adds functionality).
  • Breaking change (these changes would cause existing functionality to not work as expected).

Checklist

Please ensure that all boxes are checked before indicating that this pull request is ready for review.

  • I have read and followed the CONTRIBUTING.md guidelines.
  • I have searched for existing content to ensure this is not a duplicate.
  • I have performed a self-review of these additions (including spelling, grammar, and related).
  • These changes pass all pre-commit checks.
  • I have added comments to my code to help provide understanding.
  • I have added a test that covers the code changes found within this PR.
  • I have deleted all non-relevant text in this pull request template.

Summary by CodeRabbit

Release Notes

  • New Features

    • Added DLPack tensor export functionality for zero-copy PyTorch and JAX integration
    • Added tensor_view() method for tensor-like access to pixel data with rich indexing support
    • Added optional dependencies for PyTorch and JAX integration
  • Documentation

    • Added comprehensive DLPack guide with usage examples and compatibility notes
    • Added examples demonstrating PyTorch and JAX workflows
  • Tests

    • Added extensive test coverage for DLPack export and tensor operations

@review-notebook-app

Check out this pull request on ReviewNB to see visual diffs and provide feedback on Jupyter Notebooks.

@coderabbitai bot commented Feb 6, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

This PR implements DLPack export and streaming tensor views for OME-Arrow, enabling zero-copy consumption of pixel data by PyTorch and JAX. A new TensorView class supports tensor-like indexing (t, z, c, ROI, tile), layout management, and iterative batch/tile-based DLPack export. Ingestion is enhanced with optional struct array returns for zero-copy access.
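For orientation, a sketch of that indexing surface (reusing `data` from the example above; values are illustrative, and t/z/c accept int | slice | Sequence[int] per the review notes below):

```python
# Hypothetical indexing sketch; exact signatures may differ.
view = data.tensor_view(t=slice(0, 4), c=[0, 2], layout="TZCHW")
print(view.shape, view.dtype)  # tensor metadata exposed on the view
x = view.to_jax()              # thin wrapper over the DLPack export
```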

Changes

  • Core Tensor Implementation (src/ome_arrow/tensor.py): New 829-line TensorView class exposing OME-Arrow pixel data as a tensor container with rich indexing (t, z, c, roi, tile), layout management (CHW, HWC, TZCHW), DLPack export (to_dlpack), framework shims (to_torch, to_jax), iterative batch/tile-based DLPack export (iter_dlpack), and materialization helpers. Handles chunked pixel data, ROI/tile cropping, dtype inference, and device placement (CPU/CUDA).
  • Core API Extension (src/ome_arrow/core.py): Adds a public tensor_view() method to OMEArrow that returns a TensorView from the cached struct array (when available) or the DataFrame. Stores the struct array internally via a _struct_array attribute populated during ingest. Validates the scene parameter (None or 0 only).
  • Public API Export (src/ome_arrow/__init__.py): Exports the TensorView class as part of the public API surface.
  • Ingest Layer Enhancement (src/ome_arrow/ingest.py): Introduces an optional return_array: bool parameter on _ome_arrow_from_table, from_ome_parquet, and from_ome_vortex, enabling a dual-return mode ((StructScalar, StructArray) tuple) for zero-copy struct array access. Implements zero-copy slice extraction with a schema-mismatch fallback to Python object normalization. Expands metadata handling with warnings for mismatched Parquet schemas.
  • Optional Dependencies (pyproject.toml): Adds three optional dependency groups: dlpack (jax ≥0.4, torch ≥2.1), dlpack-jax (jax ≥0.4), and dlpack-torch (torch ≥2.1).
  • CI Configuration (.github/workflows/run-tests.yml): Updates the "Sync dependencies" step to include both --extra viz and --extra dlpack in the uv sync command.
  • Documentation (docs/src/dlpack.md, docs/src/index.md, docs/src/python-api.md): Adds comprehensive DLPack documentation covering default layouts (2D: CHW, 5D: TZCHW), optional overrides, PyTorch/JAX usage examples, iteration semantics, ownership/lifetime guidance, mode semantics (arrow vs numpy), and optional GPU dependency instructions. Registers dlpack in the toctree and adds an automodule section for ome_arrow.tensor.
  • Example Updates (docs/src/examples/learning_to_fly_with_ome-arrow.py): Adds optional imports for jax and torch with safe fallbacks. Introduces a conditional DLPack example block demonstrating zero-copy workflows for the arrow and numpy modes when the libraries are available.
  • Test Coverage (tests/test_tensor.py): Adds 300+ lines of comprehensive tests covering tensor layout defaults/overrides, DLPack round-trips with PyTorch and JAX, device error handling, batch/tile iteration semantics, Parquet zero-copy verification, and pointer consistency checks.
  • Existing Test Updates (tests/test_core.py): Adds deprecation/schema-related warning filters and conditional warning assertions in test_ome_arrow_base_expectations and test_ome_parquet_specific_col_and_row to handle expected schema/metadata warnings during OMEArrow initialization.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant OMEArrow
    participant TensorView
    participant DLPack
    participant PyTorch
    participant JAX

    User->>OMEArrow: tensor_view(t, z, c, roi, layout)
    OMEArrow->>TensorView: __init__(struct_array/data, selections)
    TensorView-->>OMEArrow: TensorView instance
    OMEArrow-->>User: TensorView

    User->>TensorView: to_dlpack(device="cpu"|"cuda", contiguous=True|False)
    TensorView->>TensorView: _materialize() if needed
    TensorView->>DLPack: PyCapsule (zero-copy or materialized)
    DLPack-->>TensorView: DLPack capsule
    TensorView-->>User: capsule

    User->>PyTorch: from_dlpack(capsule)
    PyTorch-->>User: torch.Tensor
    Note over User,PyTorch: or

    User->>JAX: from_dlpack(capsule)
    JAX-->>User: jax.Array
sequenceDiagram
    participant User
    participant TensorView
    participant iter_dlpack
    participant DLPack
    participant Consumer

    User->>TensorView: iter_dlpack(batch_size, tiles, shuffle, seed)
    TensorView->>iter_dlpack: __init__(enumerate batches/tiles)
    iter_dlpack-->>TensorView: iterator

    TensorView-->>User: iterator

    loop For each batch/tile
        User->>iter_dlpack: next()
        iter_dlpack->>iter_dlpack: _slice_batch() or _slice_tile()
        iter_dlpack->>DLPack: to_dlpack() for subset
        DLPack-->>iter_dlpack: capsule
        iter_dlpack-->>User: capsule
        User->>Consumer: from_dlpack(capsule)
        Consumer-->>User: tensor/array
    end
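In code, the loop above might look like the following sketch (batch_size, shuffle, and seed per the walkthrough; the NumPy consumer mirrors the docs examples reviewed below):

```python
# Hypothetical sketch of the batch-iteration flow in the diagram above.
import numpy as np

view = data.tensor_view(layout="TZCHW")
for item in view.iter_dlpack(batch_size=2, shuffle=True, seed=0):
    batch = np.from_dlpack(item)  # each yielded item is consumed exactly once
    ...                           # hand `batch` to a training step
```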

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~70 minutes

Possibly related PRs

  • Add pixel chunking to specification #24: Modifies the same src/ome_arrow/ingest.py function signatures (_ome_arrow_from_table, from_ome_parquet, from_ome_vortex) and return-path logic, creating potential merge/conflict concerns and shared design impact on struct array extraction.

Poem

🐰 A tensor hops through pixel fields so bright,
DLPack capsules glowing in the night,
Zero-copy dreams from Arrow's core so deep,
To PyTorch, JAX—no extra buffers to keep! ✨
CHW layouts dance, tiles and batches flow,
One hop, one view, one magic show!

🚥 Pre-merge checks (✅ 3 passed | ❌ 2 failed)

❌ Failed checks (1 warning, 1 inconclusive)

  • Docstring Coverage (⚠️ Warning): Docstring coverage is 50.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Title check (❓ Inconclusive): The title "Provide tensor-friendly methods for torch and jax through OMEArrow.to_dlpack" clearly summarizes the main change but is technically incomplete: the actual implementation provides TensorView.to_dlpack() and OMEArrow.tensor_view(), not OMEArrow.to_dlpack() directly. Resolution: clarify whether the title should reference TensorView.to_dlpack() or OMEArrow.tensor_view() for accuracy, or confirm the title is sufficiently descriptive as written.

✅ Passed checks (3 passed)

  • Description Check (✅ Passed): Check skipped because CodeRabbit's high-level summary is enabled.
  • Linked Issues check (✅ Passed): The PR implements all core requirements from #33: TensorView class with indexing (t, z, c, roi, tile), DLPack export via to_dlpack(), shape/dtype/device/layout/strides metadata, iter_dlpack() for batching, thin to_torch() and to_jax() wrappers, comprehensive tests, and documentation with examples.
  • Out of Scope Changes check (✅ Passed): All changes align with #33 requirements: TensorView and DLPack functionality, documentation, optional dependencies (jax, torch), and test coverage. The workflow update to include the dlpack extra appropriately supports the new feature dependencies.



No actionable comments were generated in the recent review. 🎉

🧹 Recent nitpick comments
pyproject.toml (1)

33-42: Consider composing the dlpack extra from sub-extras to avoid version duplication.

The jax>=0.4 and torch>=2.1 bounds are duplicated between dlpack and the individual dlpack-jax/dlpack-torch extras. If a bound is updated in one place but not the other, they'll silently diverge. You can use self-referencing extras instead:

♻️ Suggested change
 optional-dependencies.dlpack = [
-  "jax>=0.4",
-  "torch>=2.1",
+  "ome-arrow[dlpack-jax,dlpack-torch]",
 ]
 optional-dependencies.dlpack-jax = [
   "jax>=0.4",


@d33bs marked this pull request as ready for review February 7, 2026 00:00
Copilot AI review requested due to automatic review settings February 7, 2026 00:00
@coderabbitai bot left a comment

Actionable comments posted: 6

🤖 Fix all issues with AI agents
In `@docs/src/dlpack.md`:
- Around line 92-93: The fenced code block containing the pip command (the block
with pip install "ome-arrow[dlpack]") lacks a language specifier; update that
block to use a shell language tag (e.g., change the opening ``` to ```bash or
```shell) so the snippet complies with markdownlint MD040 and renders correctly.
- Around line 46-68: The examples call np.from_dlpack(...) but never import
numpy; add "import numpy as np" at the top of each snippet (above OMEArrow
usage) so the examples using OMEArrow, tensor_view(), and view.iter_dlpack(...)
work as shown; ensure the import appears before any np.from_dlpack calls in both
the batch (iter_dlpack(batch_size=2, ...)) and tile
(iter_dlpack(tiles=(256,256), ...)) examples.

In `@src/ome_arrow/core.py`:
- Around line 509-520: The type annotations for tensor_view's t, z, and c
parameters are too broad (Iterable[int]) and must be changed to Sequence[int] to
match TensorView.__init__ and the runtime check in _normalize_index; update the
signature of tensor_view (the parameters t, z, c) from "int | slice |
Iterable[int] | None" to "int | slice | Sequence[int] | None" and add/import
typing.Sequence where needed so callers and the isinstance(index, Sequence)
check in _normalize_index behave correctly.

In `@src/ome_arrow/tensor.py`:
- Around line 482-488: _pixels_meta currently only checks _data_py and
_struct_array and raises if both are None; add handling for the
StructScalar-only case by checking for self._struct_scalar and returning the
pixels_meta from it (e.g., call
self._struct_scalar.field("pixels_meta").as_py()) so the method works when the
object was constructed from a StructScalar; keep existing behavior for _data_py
and _struct_array paths and only raise if none of the three sources (_data_py,
_struct_array, _struct_scalar) are available.
- Around line 120-131: The current shape property materializes the full array
via self.to_numpy(contiguous=False).shape which is expensive; change the shape
property (method name: shape) to compute and return the tensor shape directly
from the Tensor's selection/metadata (e.g., use whatever selection, bounds, and
layout fields on self such as self.selection, self.bounds, self.layout or
equivalent) without calling to_numpy, while leaving the strides property
unchanged (keep strides calling to_numpy for now); ensure the new shape logic
mirrors the existing layout ordering used by to_numpy so callers get the same
tuple but computed cheaply.
- Around line 67-80: In __init__, stop eagerly calling data.as_py() for
StructScalar: when isinstance(data, pa.StructScalar) set self._struct_scalar =
data and self._data_py = None (don’t materialize pixels); ensure _data_py_dict()
still calls as_py() on-demand to populate _data_py. Update _pixels_meta(),
_has_chunks(), and _chunk_grid() to handle the case where both self._data_py and
self._struct_array are None by reading necessary fields directly from
self._struct_scalar (extract struct fields from the scalar rather than relying
on _data_py), and only fall back to calling _data_py_dict() if you need the full
Python dict. Keep identifiers: __init__, _struct_scalar, _data_py,
_data_py_dict, _struct_array, _pixels_meta, _has_chunks, and _chunk_grid when
making these changes.
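For the last item above, the requested lazy-materialization pattern might look roughly like this (a sketch using the identifiers named in the instruction; the real class carries more state):

```python
import pyarrow as pa

class TensorView:
    # Sketch: defer as_py() for StructScalar inputs instead of materializing eagerly.
    def __init__(self, data):
        self._struct_array = None
        self._struct_scalar = None
        self._data_py = None
        if isinstance(data, pa.StructScalar):
            self._struct_scalar = data  # keep the Arrow scalar; no as_py() yet
        else:
            self._data_py = data

    def _data_py_dict(self):
        # Materialize to a Python dict only on demand.
        if self._data_py is None and self._struct_scalar is not None:
            self._data_py = self._struct_scalar.as_py()
        return self._data_py
```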
🧹 Nitpick comments (9)
pyproject.toml (1)

34-37: Consider splitting jax and torch into separate extras.

Bundling both frameworks into a single dlpack extra forces users who only need one (e.g., PyTorch-only workflows) to install the other. JAX and PyTorch are both large dependencies. Consider offering finer-grained extras (e.g., dlpack-torch, dlpack-jax) alongside the combined dlpack group.

docs/src/python-api.md (1)

12-21: Consider whether :private-members: is intentional for the public API page.

This mirrors the existing ome_arrow.meta block, but :private-members: will surface internal helpers (e.g., _DLPackWrapper, _normalize_device, _read_plane) in the public API docs. If these are implementation details not meant for consumers, dropping :private-members: would reduce noise.

tests/test_tensor.py (1)

60-80: Zero-copy assertion is fragile: depends on internal caching order.

Line 80 asserts pointer equality between a torch tensor (from DLPack) and a numpy array obtained from a separate to_numpy call. This works because _materialize caches the result and to_dlpack(mode="numpy") builds the capsule from the same cached array. However, if to_dlpack is called before to_numpy, the cache is populated by to_dlpack; calling to_numpy afterwards returns the same object — pointer equality holds.

But if caching behavior in _materialize changes (e.g., layout-dependent invalidation), this test would fail for non-obvious reasons. Consider adding a brief comment clarifying the dependency on the shared internal cache, so future maintainers understand the coupling.

src/ome_arrow/ingest.py (2)

109-128: Zero-copy is lost in the non-strict schema path (line 121-128).

When strict_schema=False and the column type doesn't exactly match OME_ARROW_STRUCT, the code falls back to to_pylist() (line 121) and reconstructs both the scalar and struct_array from a Python dict (line 127-128). This materializes all data into Python objects and back, defeating the zero-copy goal for tensor_view.

Since OMEArrow.__init__ calls with strict_schema=False (the default), this path will be hit whenever the on-disk schema has extra/differently-ordered fields — which is common with evolving schemas.

This isn't necessarily a bug (correctness is preserved), but it's worth documenting that zero-copy DLPack export from Parquet only works when the on-disk schema exactly matches OME_ARROW_STRUCT. Consider logging a warning or adding a note in the tensor_view docs.


130-136: Pre-existing: soft validation comparisons are no-ops.

Lines 133-134 evaluate equality but discard the results. This is pre-existing code and not introduced by this PR, but it's worth noting for a future cleanup.

    meta.get(b"ome.arrow.type", b"").decode() == str(OME_ARROW_TAG_TYPE)      # result unused
    meta.get(b"ome.arrow.version", b"").decode() == str(OME_ARROW_TAG_VERSION) # result unused
src/ome_arrow/tensor.py (4)

646-660: _batched prefetch path uses list.pop(0) — O(n) per dequeue.

When prefetch > 0, queue.pop(0) is O(n) since list elements must be shifted. For large iteration counts this degrades to O(n²). Use collections.deque instead:

Proposed fix
+from collections import deque
 ...
 def _batched(items: List[Any], size: int, *, prefetch: int) -> Iterator[List[Any]]:
     if size <= 0:
         raise ValueError("batch size must be positive")
     if prefetch <= 0:
         for i in range(0, len(items), size):
             yield items[i : i + size]
         return
 
-    queue: List[List[Any]] = []
+    queue: deque[List[Any]] = deque()
     idx = 0
     while idx < len(items) or queue:
         while idx < len(items) and len(queue) <= prefetch:
             queue.append(items[idx : idx + size])
             idx += size
-        yield queue.pop(0)
+        yield queue.popleft()

690-702: _ensure_struct_array from StructScalar calls as_py() — not zero-copy.

Line 698: pa.array([data.as_py()], type=OME_ARROW_STRUCT) deserializes and re-serializes the entire record. This means mode="arrow" DLPack export from a StructScalar input won't be zero-copy. The path works for Parquet/Vortex where _data is already a StructArray, but for other inputs (TIFF, numpy, dict), it will materialize.

This is correctly documented via the mode='arrow' requires Arrow-backed data error on line 517, but that error is only raised when _ensure_struct_array returns None (i.e., the input is not one of the known types). For StructScalar, a new array IS created — so the code silently works but without zero-copy semantics.

Consider either documenting this behavior or raising when true zero-copy can't be guaranteed.


510-534: _arrow_values silently creates a non-zero-copy array for StructScalar inputs.

When the input is a StructScalar (e.g., from TIFF/NumPy paths), _ensure_struct_array (line 515) will create a new StructArray via as_py() round-trip. The subsequent _select_plane_values call on line 534 then returns an Arrow array from this copy, not from the original data. The DLPack capsule will point to the copy's buffer.

This is functionally correct, but it may surprise users who expect mode="arrow" to mean zero-copy. Consider adding a note in the docstring for to_dlpack that zero-copy Arrow mode is only available for Parquet/Vortex-ingested data.


168-212: to_dlpack ownership semantics should be documented.

The DLPack protocol specifies that __dlpack__() should be called at most once on a capsule (ownership transfer). _DLPackWrapper stores the capsule and returns it on every __dlpack__() call, which works correctly because the consumer (torch/jax) marks the capsule as consumed. However, a second call would fail silently or raise.

Consider adding a brief docstring note on _DLPackWrapper or to_dlpack that the returned object is single-use per the DLPack protocol. The linked issue also specifically requests documenting "ownership and lifetime semantics for returned DLPack capsules."

Copilot AI left a comment

Pull request overview

This PR adds a new tensor-oriented API to OME-Arrow that enables exporting pixel data via DLPack for consumption by PyTorch and JAX (aiming for zero-copy on Arrow-backed CPU data), along with tests and documentation.

Changes:

  • Introduce TensorView with to_numpy(), to_dlpack(), to_torch(), to_jax(), and iter_dlpack() for batch/tile iteration.
  • Extend Parquet/Vortex ingest paths to optionally return a 1-row StructArray to preserve Arrow buffers for downstream zero-copy exports.
  • Add docs and tests for DLPack export and iteration; add a new dlpack extra (jax/torch) and update CI dependency sync.

Reviewed changes

Copilot reviewed 11 out of 13 changed files in this pull request and generated 5 comments.

Summary per file

  • src/ome_arrow/tensor.py: New TensorView implementation and DLPack export/iteration support.
  • src/ome_arrow/core.py: Add OMEArrow.tensor_view() and preserve an Arrow StructArray for zero-copy paths.
  • src/ome_arrow/ingest.py: Add return_array option to return (StructScalar, StructArray) for Parquet/Vortex.
  • src/ome_arrow/__init__.py: Export TensorView from the package namespace.
  • tests/test_tensor.py: New tests for layout, DLPack round-trips, and iterators.
  • docs/src/dlpack.md: New documentation page describing DLPack export and examples.
  • docs/src/index.md: Add dlpack page to the docs TOC.
  • docs/src/python-api.md: Add autodoc section for the tensor module.
  • docs/src/examples/learning_to_fly_with_ome-arrow.py: Add example usage snippets for DLPack export with torch/jax.
  • pyproject.toml: Add dlpack optional dependency extra (jax, torch).
  • .github/workflows/run-tests.yml: Install dlpack extra in CI test job.
  • uv.lock: Lockfile updates to include dlpack extra dependencies and transitive packages.


@coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@src/ome_arrow/tensor.py`:
- Around line 728-734: The _DLPackWrapper class is returning the same capsule on
every call to __dlpack__, violating the DLPack single-use rule; update the
__dlpack__ method to return the stored capsule once, then immediately invalidate
it (set self._capsule to None) and raise a clear error (e.g., RuntimeError) if
__dlpack__ is called again when _capsule is None; touch the __init__ only to
keep the _capsule and _device fields, but add the post-return invalidation and a
guard at the top of __dlpack__ that checks _capsule and raises on reuse so
consumers cannot receive the capsule twice.
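A sketch of the single-use guard described above (class body simplified; identifiers from the instruction):

```python
class _DLPackWrapper:
    # Sketch: enforce the DLPack single-use rule on the stored capsule.
    def __init__(self, capsule, device):
        self._capsule = capsule
        self._device = device

    def __dlpack__(self, stream=None):
        if self._capsule is None:
            raise RuntimeError("DLPack capsule already consumed; capsules are single-use")
        capsule, self._capsule = self._capsule, None  # hand off ownership exactly once
        return capsule

    def __dlpack_device__(self):
        return self._device
```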
🧹 Nitpick comments (2)
src/ome_arrow/tensor.py (2)

690-704: _batched prefetch is synchronous — the name is misleading.

The prefetch parameter pre-slices items into a deque but doesn't perform any actual async precomputation. The iterator still materializes each batch lazily when yielded. This is functionally correct but won't deliver the performance benefit users expect from a "prefetch" parameter (as documented in iter_dlpack: "Number of items to precompute ahead").

Consider either documenting this as a placeholder for future async prefetch, or removing the parameter until real prefetch (e.g., via concurrent.futures) is implemented.


430-441: Triple-nested loop in _build_tzchw reads planes one at a time.

For selections with many (t, z, c) combinations this is O(T×Z×C) individual plane reads, each potentially calling plane_from_chunks which iterates all chunks. This is acceptable for the initial implementation but could become a bottleneck for large 5D datasets.

@coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@src/ome_arrow/tensor.py`:
- Around line 505-516: The _has_chunks method currently treats non-null but
empty Arrow lists as present, causing inconsistency with the dict path and
downstream errors in _plane_map/_read_plane and plane_from_chunks; update
_has_chunks to check for empty list lengths on both Arrow paths: when using
self._struct_array, inspect the 'chunks' field's length (and ensure it's not
zero) rather than only testing is_null, and when using self._struct_scalar,
obtain the 'chunks' scalar, verify it's valid and also check its list length (or
empty-ness) before returning True so that an empty list yields False
consistently with the dict path.
- Around line 505-516: The _has_chunks method returns true for a StructScalar
when the scalar is valid but contains an empty list; update _has_chunks to
mirror the dict/array checks by first verifying
self._struct_scalar["chunks"].is_valid and then checking the list length (e.g.,
via len(chunks_scalar.as_py()) > 0) before returning True so that empty lists
are treated as no-chunks; this will prevent _read_plane from calling
plane_from_chunks with no actual chunks.
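A sketch of the empty-list-aware check both items ask for (attribute names from the instructions; the exact Arrow accessors in the real code may differ):

```python
class TensorView:
    # Sketch: treat a null OR empty chunks list as "no chunks" on every path.
    def _has_chunks(self) -> bool:
        if self._struct_array is not None:
            first = self._struct_array.field("chunks")[0]  # a pa.ListScalar
            return first.is_valid and len(first.as_py() or []) > 0
        if self._struct_scalar is not None:
            chunks_scalar = self._struct_scalar["chunks"]
            return chunks_scalar.is_valid and len(chunks_scalar.as_py() or []) > 0
        return bool((self._data_py or {}).get("chunks"))
```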
🧹 Nitpick comments (2)
src/ome_arrow/tensor.py (1)

708-722: _batched with prefetch > 0 doesn't provide actual prefetching.

The prefetch parameter fills a synchronous deque before yielding, but since there's no background thread or async I/O, buffering items into a deque before yielding them one at a time is functionally equivalent to simple sequential iteration — it just adds overhead. The docstring on iter_dlpack (line 288–289) does note this is not async, which is good, but the implementation could be simplified to just the prefetch <= 0 path for now.

Consider documenting this as a placeholder for future async prefetch, or removing the deque-based path until true async prefetch is implemented.

tests/test_tensor.py (1)

13-28: _Wrapper inside _from_dlpack_capsule doesn't enforce single-use semantics.

This is a test-only helper, so it's not a real concern — each capsule is consumed exactly once by np.from_dlpack. Just flagging for awareness since _DLPackWrapper in production code was specifically fixed for this.

@coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@src/ome_arrow/tensor.py`:
- Around line 674-680: The _dtype_from_meta function currently falls back
silently to np.uint16 on errors; change the try/except around np.dtype(dtype) to
emit a warning when the conversion fails (include the original dtype value and
the exception message) before returning np.dtype(np.uint16). Use the warnings
module (or processLogger.warn if project logging convention prefers) so callers
are informed that an unrecognized dtype (e.g., "floa32") caused a fallback to
uint16; reference the function name _dtype_from_meta and the dtype parameter to
locate and modify the code.
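A sketch of the warn-on-fallback behavior being requested (function name from the instruction):

```python
import warnings
import numpy as np

def _dtype_from_meta(dtype):
    # Sketch: warn instead of silently falling back to uint16.
    try:
        return np.dtype(dtype)
    except TypeError as exc:
        warnings.warn(
            f"Unrecognized dtype {dtype!r} ({exc}); falling back to uint16",
            stacklevel=2,
        )
        return np.dtype(np.uint16)
```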
🧹 Nitpick comments (4)
src/ome_arrow/tensor.py (2)

406-419: _has_chunks() called repeatedly inside the triple loop — minor performance concern.

_has_chunks() is invoked on every _read_plane call inside the (T × Z × C) loop. For non-dict paths it re-queries Arrow fields each time. Consider caching the result in a self._chunks_present flag computed once during __init__ or on first access.


356-394: Minor naming ambiguity between tile (index) and tiles (size).

The __init__ parameter tile refers to a tile index (tile_y, tile_x) into the chunk grid, while iter_dlpack's tiles parameter specifies tile size (tile_h, tile_w) in pixels. Both are tuple[int, int] with similar names but very different semantics. Consider renaming one (e.g., tile_size for iter_dlpack) to reduce confusion.

tests/test_tensor.py (2)

256-284: Test helper _selected_values_for_arrow_mode duplicates filtering logic from tensor.py.

This helper mirrors the logic in _select_plane_values and _select_chunk_values. While acceptable in tests to independently verify behavior, consider documenting its purpose more explicitly or noting it's intentionally reimplemented for test independence.


235-253: _jax_buffer_ptr is brittle across JAX versions.

The helper tries multiple JAX internal attributes (device_buffer, device_buffers, addressable_data, unsafe_buffer_pointer). These are implementation details that change across JAX versions. The overall zero-copy test already uses pytest.skip as a safety net, so this is acceptable, but worth noting for future maintenance.

@coderabbitai bot left a comment

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@pyproject.toml`:
- Around line 130-131: Replace the TOML table using a dotted key
(`[tool.pytest]` with `ini_options.pythonpath`) with the canonical pytest table
`[tool.pytest.ini_options]` and move the pythonpath setting into that table
(i.e., set `pythonpath = ["." ]`) so pytest picks up the configuration
correctly; update the section header and the key accordingly where
`ini_options.pythonpath` appears.

In `@src/ome_arrow/tensor.py`:
- Around line 71-72: Handle the zero-chunk case before calling
combine_chunks()/data[0]: inside the existing ChunkedArray branch (the code that
checks isinstance(data, pa.ChunkedArray) and uses data.chunk(0),
data.num_chunks, and data.combine_chunks()), add a guard for data.num_chunks ==
0 and produce/assign an appropriate empty StructArray (or return an empty
tensor) matching the expected schema, otherwise keep the current logic (use
chunk(0) when num_chunks == 1 and combine_chunks() when >1); also ensure any
subsequent indexing like data[0] only runs when the resulting array has length >
0.
- Around line 261-267: The code currently calls jax.dlpack.from_dlpack(...)
after ensuring JAX is installed; replace that public API call with the
NumPy-compatible jax.numpy.from_dlpack(...) to follow the recommended interface.
In the method that builds dlpack via self.to_dlpack(...), import or reference
jax.numpy (e.g., jax.numpy.from_dlpack) and call jax.numpy.from_dlpack(dlpack)
instead of jax.dlpack.from_dlpack(dlpack) so the returned array uses the public,
supported API.
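The suggested consumer call, as a short sketch (reusing `view` from the earlier examples):

```python
import jax.numpy as jnp

# Public, NumPy-compatible entry point suggested above, in place of jax.dlpack.from_dlpack.
arr = jnp.from_dlpack(view.to_dlpack())
```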
🧹 Nitpick comments (3)
src/ome_arrow/tensor.py (1)

68-72: self._data retains the raw ChunkedArray, causing repeated combine_chunks() in iteration.

Line 68 stores the original data (potentially a multi-chunk ChunkedArray) in self._data. Since _iter_batches and _iter_tiles pass self._data to each new TensorView, every batch/tile re-runs the combine_chunks() call. Consider storing the unwrapped array instead:

Suggested fix
-        self._data = data
-        self._struct_array: pa.StructArray | None = None
+        self._struct_array: pa.StructArray | None = None
         self._struct_scalar: pa.StructScalar | None = None
         if isinstance(data, pa.ChunkedArray):
             data = data.chunk(0) if data.num_chunks == 1 else data.combine_chunks()
         if isinstance(data, pa.StructArray):
             self._struct_array = data
             self._struct_scalar = data[0]
             self._data_py: dict[str, Any] | None = None
         elif isinstance(data, pa.StructScalar):
             self._struct_scalar = data
             self._data_py = None
         else:
             self._data_py = data
+        self._data = data
tests/test_tensor.py (2)

89-96: Note: jax.dlpack.from_dlpack usage mirrors the deprecation concern in tensor.py.

If the source is updated to use jax.numpy.from_dlpack, this test should be updated accordingly (line 96).


235-259: _jax_buffer_ptr — fragile but appropriately guarded.

The helper probes version-specific JAX internals with multiple fallback paths. The docstring clearly states the intent to skip rather than hard-fail. This is acceptable for best-effort zero-copy tests. Consider adding a pytest.skip call in the AssertionError raises (lines 247, 252, 259) if you'd prefer tests to skip rather than fail when JAX internals change.


Development

Successfully merging this pull request may close these issues:

  • Add DLPack export + streaming tensor views to OME-Arrow #33