Feature/app separation #13

Draft
NetZissou wants to merge 26 commits into main from feature/app-separation

Conversation

@NetZissou
Collaborator

This PR breaks our old monolithic app.py into two focused Streamlit apps under apps/:

  • Embed & Explore — import your own images from a local directory, then embed and cluster them
  • Precalculated Embeddings — jump straight into exploring precomputed data

Additionally, we (me and Claude) made these improvements:

  • Improved the clustering visualization with zoom in/out and heatmap options
  • Improved the metadata filtering interface for the precalculated embeddings app, with dynamic filtering based on the parquet schema
  • Added GPU-to-CPU fallback for clustering — on an OOM or CUDA error, clustering automatically retries on CPU with sklearn and reports what happened in the console output
  • Pulled shared code into common modules (components, services) so both apps stay DRY
  • Updated the README to clearly explain both workflows with simpler install/usage instructions

NetZissou and others added 20 commits January 26, 2026 14:13
Introduces a new `shared/` module structure to support application
separation and code reuse across multiple Streamlit apps.

## Structure

### shared/utils/
- `clustering.py`: Dimensionality reduction (PCA, t-SNE, UMAP) and
  K-means clustering with multi-backend support (sklearn, FAISS, cuML)
- `io.py`: File I/O utilities for embeddings and data persistence
- `models.py`: Shared data models and type definitions

### shared/services/
- `clustering_service.py`: High-level clustering workflow orchestration
- `embedding_service.py`: Image embedding generation using various models
- `file_service.py`: File discovery and validation services

### shared/components/
- `clustering_controls.py`: Streamlit UI controls for backend selection,
  seed configuration, and worker settings
- `summary.py`: Cluster summary statistics and representative images
- `visualization.py`: Scatter plot visualization with Altair

### shared/lib/
- `progress.py`: Progress tracking utilities for long-running operations

## Backend Support
- sklearn: Default CPU backend for all operations
- FAISS: Optional GPU/CPU accelerated K-means clustering
- cuML: Optional RAPIDS GPU acceleration for dim reduction and clustering
  with automatic fallback on unsupported architectures
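The backend dispatch described above can be sketched as follows. The names here (`pick_kmeans_backend`, `_faiss_available`, `_cuml_available`) are illustrative, not the actual API in `shared/utils/clustering.py`:

```python
def _faiss_available() -> bool:
    try:
        import faiss  # noqa: F401
        return True
    except ImportError:
        return False

def _cuml_available() -> bool:
    try:
        import cuml  # noqa: F401
        return True
    except ImportError:
        return False

def pick_kmeans_backend(requested: str = "auto") -> str:
    """Resolve a backend name, falling back to sklearn (always present)."""
    if requested == "auto":
        if _cuml_available():
            return "cuml"
        if _faiss_available():
            return "faiss"
        return "sklearn"
    # Explicit request: honour it only if the library imports, else fall back.
    checks = {"cuml": _cuml_available, "faiss": _faiss_available}
    if requested in checks and not checks[requested]():
        return "sklearn"
    return requested
```

The point of resolving to a *name* first (rather than importing eagerly) is that the sklearn path never needs the GPU libraries installed.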

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Introduces `apps/embed_explore/` as a self-contained Streamlit app for
interactive image embedding exploration and clustering.

## Application Structure

### apps/embed_explore/
- `app.py`: Main application entry point with two-column layout
  (sidebar controls + main visualization area)

### apps/embed_explore/components/
- `sidebar.py`: Complete sidebar UI with embedding and clustering
  sections, model selection, and backend configuration
- `summary.py`: Cluster statistics display and representative images
- `visualization.py`: Interactive scatter plot with image preview panel

## Features
- Directory-based image loading with supported format filtering
- Multiple embedding model support (DINOv2, OpenCLIP, etc.)
- Configurable dimensionality reduction (PCA, t-SNE, UMAP)
- K-means clustering with adjustable cluster count
- Interactive Altair scatter plot with click-to-preview
- Cluster summary statistics with representative samples

## Usage
Run as standalone app:
  streamlit run apps/embed_explore/app.py

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Updates existing components to use the new shared module and removes
legacy code that has been superseded by the app separation.

## Removed
- `app.py`: Legacy monolithic entry point (replaced by apps/)
- `components/clustering/`: Entire directory moved to shared/ and apps/
- `pages/01_Clustering.py`: Now available as standalone embed_explore app

## Updated Imports
- `components/precalculated/sidebar.py`: Uses shared.services and
  shared.components for clustering functionality
- `pages/02_Precalculated_Embeddings.py`: Uses shared.components for
  visualization and summary rendering

## pyproject.toml Changes
- Entry points updated:
  - `emb-embed-explore` → apps.embed_explore.app:main
  - `list-models` → shared.utils.models:list_available_models
- Package includes: shared/, apps/
- Dependencies:
  - streamlit>=1.50.0 (updated for new API)
  - numpy<=2.2.0 (compatibility constraint)
- Version path: shared/__init__.py

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
… code

Add new standalone app:
- apps/precalculated/ - Precalculated embeddings explorer with dynamic
  cascading filters, CUDA auto-detection, and console logging

Features:
- Dynamic filter generation based on parquet columns
- Cascading/dependent filters with AND logic
- Auto backend selection (cuML when CUDA is available)
- Console logging for clustering operations
- Image caching to prevent re-fetch on reruns
- State management for record details panel
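The cascading AND-filter behaviour can be sketched in plain Python; the record shape and column names below are invented for the example, while the real app derives both from the parquet schema:

```python
from typing import Any

def apply_filters(records: list[dict[str, Any]],
                  selections: dict[str, set[Any]]) -> list[dict[str, Any]]:
    """Keep rows matching every active filter (AND logic).

    An empty selection set for a column means "no filter on this column".
    """
    return [row for row in records
            if all(not wanted or row.get(col) in wanted
                   for col, wanted in selections.items())]

def options_for(records, selections, column):
    """Cascading behaviour: the options offered for one filter are computed
    from rows that pass all *other* filters, so each dropdown narrows the rest."""
    others = {c: v for c, v in selections.items() if c != column}
    return sorted({row.get(column) for row in apply_filters(records, others)})
```

In the app each selectbox/multiselect would feed `selections`, and its choices would come from `options_for`.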

Clean up legacy code:
- Remove pages/02_Precalculated_Embeddings.py (monolithic page)
- Remove components/ directory (old component structure)
- Remove services/ directory (old services, now in shared/)
- Remove utils/ directory (old utils, now in shared/)
- Remove list_models.py (replaced by entry point)
- Move taxonomy_tree.py to shared/utils/

Update shared module:
- Add taxonomy tree functions to shared/utils/
- Add VRAM error handling utilities to clustering.py
- Fix import paths in summary.py

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add .interactive() to scatter plots in both apps:
- Scroll wheel to zoom
- Drag to pan
- Double-click to reset

Note: Zoom state resets on app rerun (known Streamlit limitation)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add toggleable density heatmap overlay using Altair's mark_rect with 2D
binning. This helps visualize point concentration in crowded areas of
the scatter plot while keeping individual points visible.
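Independent of Altair, the 2D binning behind the `mark_rect` layer is just a grid of point counts over the projected coordinates, e.g. with numpy:

```python
import numpy as np

# Stand-in data; the real apps bin the 2-D projection of the embeddings.
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = rng.normal(size=1000)

# The heatmap layer amounts to a bins x bins grid of point counts;
# counts[i, j] is what gets mapped onto the colour scale.
bins = 40
counts, x_edges, y_edges = np.histogram2d(x, y, bins=bins)
```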

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Streamlit doesn't support selections on multi-view (layered) Altair
charts. When density heatmap is shown, disable on_select and show
a note to the user that point selection is temporarily unavailable.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Off: normal 0.7 opacity, selection enabled
- Opacity: low 0.15 opacity so overlapping points show density naturally,
  selection still works
- Heatmap: 2D binned density layer behind points (selection disabled due
  to Streamlit limitation with layered charts)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add grid resolution slider (10-80 bins) when Heatmap mode is selected
- Replace truncated metadata display with full-width dataframe table
- Show complete UUID and all field values without truncation
- Use compact column layout for density options

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Display Cluster and UUID prominently as markdown (full values, no
truncation), then show remaining metadata fields in a scrollable table.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Both apps now use shared/components/visualization.py for scatter plot
- Shared visualization has all features: zoom/pan, density modes, configurable bins
- Dynamic tooltip building works for any data columns
- Added data_version tracking for selection validation
- Moved embed_explore's render_image_preview to separate file
- App-specific visualization.py files now re-export from shared

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add shared/utils/logging_config.py for centralized logging setup
- Add logging to clustering utilities (backend selection, timing)
- Add logging to ClusteringService (workflow steps, timing)
- Add logging to EmbeddingService (model loading, generation stats)
- Add logging to FileService (file operations, timing)
- Replace print() fallback messages with proper logger.warning()
- Fix use_container_width deprecation: use width="stretch" instead

Logging now tracks:
- Which backend is selected (sklearn/cuML/FAISS)
- Operation timing for performance monitoring
- Fallback events when GPU operations fail

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add shared/utils/backend.py with centralized backend utilities:
  - check_cuda_available(): Cached CUDA detection via PyTorch/CuPy
  - resolve_backend(): Auto-resolve to cuML/FAISS/sklearn based on hardware
  - is_oom_error(), is_cuda_arch_error(), is_gpu_error(): Error classification

- Update embed_explore to use robust error handling:
  - Auto-resolve backends based on available hardware
  - Automatic fallback to sklearn on GPU errors
  - Consistent logging of backend selection

- Update precalculated to use shared backend utilities:
  - Remove duplicate check_cuda_available/resolve_backend functions
  - Replace print() with logger calls for consistency

Both apps now have identical backend selection and fallback behavior.
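The classify-then-fall-back pattern might look like this sketch; the matched substrings are guesses, not the exact patterns in `shared/utils/backend.py`:

```python
def is_oom_error(exc: Exception) -> bool:
    msg = str(exc).lower()
    return "out of memory" in msg or "cuda_error_out_of_memory" in msg

def is_cuda_arch_error(exc: Exception) -> bool:
    msg = str(exc).lower()
    return "no kernel image" in msg or "unsupported gpu architecture" in msg

def is_gpu_error(exc: Exception) -> bool:
    return (is_oom_error(exc) or is_cuda_arch_error(exc)
            or "cuda" in str(exc).lower())

def run_with_cpu_fallback(gpu_fn, cpu_fn):
    """Try the GPU path; on a recognised GPU error, retry on CPU.
    Anything else (a real bug) is re-raised unchanged."""
    try:
        return gpu_fn()
    except Exception as exc:
        if is_gpu_error(exc):
            return cpu_fn()
        raise
```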

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…logging

- Clustering summary is now computed once when clustering runs and stored
  in session state (clustering_summary, clustering_representatives)
- Summary component displays cached results instead of recomputing on
  every render (zoom, pan, point selection no longer trigger recompute)
- Added logging for image retrieval:
  - URL fetch timing and size
  - Timeout and error handling with warnings
  - Debug logging for image display
- Removed ClusteringService import from summary component (uses cache)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Visualization logging:
- Log density mode changes (Off/Opacity/Heatmap)
- Log heatmap bin changes
- Log point selection with cluster info
- Log chart render with point count and settings

Image I/O logging (fixed to work with caching):
- Separate cached fetch from logging wrapper
- Log fetch start, success (with size), and failures
- Log image open with dimensions
- Track last displayed image to avoid duplicate logs

All logs use [Visualization] and [Image] prefixes for easy filtering.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Root cause: embed_explore was using its local summary.py which called
ClusteringService.generate_clustering_summary() on every render instead
of the shared version that uses cached session state results.

Fix:
- Update embed_explore/app.py to import from shared.components.summary
- Update local summary.py to re-export from shared for backwards compat
- Add ISSUES.md to track known issues

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…clip)

Heavy libraries are now only imported when explicitly needed:
- FAISS: loaded when FAISS backend is selected or auto-resolved
- torch/open_clip: loaded when embedding generation is triggered
- cuML: loaded when cuML backend is selected

Changes:
- shared/utils/clustering.py: lazy-load sklearn, UMAP, FAISS, cuML
- shared/utils/models.py: lazy-load open_clip
- shared/services/embedding_service.py: lazy-load torch and open_clip
- shared/components/clustering_controls.py: cache backend availability check
- shared/utils/backend.py: cache FAISS and cuML availability checks

This significantly improves app startup time by avoiding unnecessary
imports during module load.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Reverting commit d34c33e as the lazy loading implementation
made startup performance worse instead of better.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Issue tracking moved to GitHub Issues:
- Slow startup: #12

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@NetZissou NetZissou self-assigned this Feb 11, 2026
@NetZissou NetZissou marked this pull request as draft February 11, 2026 13:48
NetZissou and others added 6 commits February 11, 2026 11:33
Delete stale lib/ directory (duplicated in shared/lib/), remove unused
imports (pandas from models.py, Counter from taxonomy_tree.py), remove
dead error detection functions from clustering __init__, add logs/ to
gitignore, and add print_available_models() entry point to models.py.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add ClusteringService.run_clustering_safe() to encapsulate GPU-to-CPU
fallback logic, replacing ~100 lines of duplicated error handling in
both app sidebars. Enhance logging format with funcName:lineno, add
persistent file handler (logs/emb_explorer.log), switch error handlers
to logger.exception() for tracebacks, and add data loading/filter
logging to precalculated sidebar.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Wrap scatter chart in @st.fragment so zoom/pan only reruns the chart
fragment, not the full page. Only trigger st.rerun(scope="app") when
the selected point actually changes.

Run cuML UMAP in an isolated subprocess with L2-normalized embeddings
to prevent SIGFPE crashes (NN-descent numerical instability with
large-magnitude embeddings). Falls back to sklearn UMAP automatically
if the subprocess fails.
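The isolate-and-fall-back pattern, sketched with stdlib pieces only. The real code exchanges `.npy` files and runs cuML UMAP in the child; a JSON pipe and a trivial doubling worker keep this sketch self-contained:

```python
import json
import subprocess
import sys

# Worker program run in a child process; if it crashes (e.g. SIGFPE),
# only the child dies, not the Streamlit server.
WORKER = ("import json,sys; d=json.load(sys.stdin); "
          "print(json.dumps([v*2 for v in d]))")

def reduce_in_subprocess(values):
    proc = subprocess.run(
        [sys.executable, "-c", WORKER],
        input=json.dumps(values), capture_output=True, text=True)
    if proc.returncode != 0:      # crashed child: signal or exception
        raise RuntimeError(proc.stderr)
    return json.loads(proc.stdout)

def reduce_with_fallback(values):
    try:
        return reduce_in_subprocess(values)
    except Exception:
        return [v * 2 for v in values]  # CPU (sklearn-style) fallback path
```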

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add GPU acceleration section to README explaining optional GPU support
with CUDA 12/13 install commands. Create docs/DATA_FORMAT.md documenting
expected parquet schema for precalculated app. Split pyproject.toml GPU
extras into gpu-cu12/gpu-cu13 groups and add pynvml dependency.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Apply L2 normalization to all embeddings before clustering and
dimensionality reduction via _prepare_embeddings(). This prevents
cuML UMAP SIGFPE crashes from large-magnitude vectors and is
appropriate for CLIP-family contrastive embeddings. Log input norms,
non-finite values, and embedding shapes at each pipeline step.
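A minimal sketch of the normalization step (the guards and logging in the real `_prepare_embeddings` may differ):

```python
import numpy as np

def prepare_embeddings(emb: np.ndarray) -> np.ndarray:
    """L2-normalize rows so every embedding vector has unit norm."""
    emb = np.asarray(emb, dtype=np.float32)
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    norms[norms == 0] = 1.0       # leave all-zero rows unchanged
    return emb / norms
```

Unit-norm inputs bound the magnitudes cuML UMAP sees, and for contrastive (CLIP-family) embeddings cosine and Euclidean geometry coincide after normalization.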

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Document the full embedding pipeline (preparation, KMeans, dim
reduction, visualization) with backend details and fallback chain.
Link from README.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI left a comment
Pull request overview

This PR performs a major architectural refactoring, breaking the monolithic app.py into two focused standalone Streamlit applications:

Purpose: Separate workflows for (1) embedding and exploring local images vs. (2) exploring precomputed embeddings from parquet files, with shared infrastructure to keep code DRY and maintainable.

Changes:

  • Restructured codebase into shared/ (reusable utilities, services, components) and apps/ (two standalone Streamlit apps)
  • Enhanced clustering robustness with L2 normalization, GPU-to-CPU fallback, and subprocess isolation for cuML UMAP
  • Added dynamic cascading filter generation for the precalculated embeddings app
  • Implemented centralized logging with console and file output
  • Improved visualization with zoom/pan via @st.fragment, density modes (opacity/heatmap), and optimized reruns

Reviewed changes

Copilot reviewed 48 out of 54 changed files in this pull request and generated 25 comments.

Summary per file:
- apps/embed_explore/app.py: New standalone app for embedding local images with CLIP/BioCLIP models
- apps/precalculated/app.py: New standalone app for exploring parquet files with precomputed embeddings
- apps/precalculated/components/sidebar.py: Dynamic cascading filter generation based on parquet schema
- apps/precalculated/components/data_preview.py: Metadata preview with image URL fetching and caching
- shared/utils/clustering.py: Major refactor: L2 normalization, subprocess isolation for cuML UMAP, comprehensive logging
- shared/utils/backend.py: New module for GPU/CUDA detection and backend resolution
- shared/utils/logging_config.py: Centralized logging configuration with file and console output
- shared/services/clustering_service.py: Clustering workflows with automatic GPU-to-CPU fallback
- shared/components/visualization.py: Scatter plot with @st.fragment isolation, density modes, and zoom/pan
- shared/components/summary.py: Clustering summary with cached results and taxonomy tree
- pyproject.toml: Updated dependencies, new GPU extras (cu12/cu13), entry points for both apps
- README.md: Simplified documentation focused on two distinct workflows
- docs/DATA_FORMAT.md: New documentation for expected parquet schema
- docs/BACKEND_PIPELINE.md: New documentation explaining the clustering pipeline


input_path, output_path = sys.argv[1], sys.argv[2]
n_neighbors = int(sys.argv[3])
seed = int(sys.argv[4]) if sys.argv[4] else None
Copilot AI Feb 11, 2026
The subprocess script parses seed from argv[4], but if the seed is empty string (when seed is None), int(sys.argv[4]) will raise a ValueError. The correct check should be seed = int(sys.argv[4]) if sys.argv[4] and sys.argv[4] != "" else None or similar.

dependencies = [
# Core UI and web framework
"streamlit>=1.40.0",
"streamlit>=1.50.0",
Copilot AI Feb 11, 2026

The Streamlit version requirement was changed from >=1.40.0 to >=1.50.0 without explanation. According to public release information, Streamlit 1.50.0 was released in February 2025 (very recent). This could cause installation failures on systems that don't have access to the latest version yet. Consider whether this minimum version is truly required for the new features, or if >=1.40.0 would still work.

Collaborator Author
Streamlit 1.50 is a year old. Dismiss!

st.session_state["selected_image_idx"] = new_idx
st.session_state["selection_data_version"] = st.session_state.get("data_version", None)
# Trigger full page rerun so the preview panel updates
st.rerun(scope="app")
Copilot AI Feb 11, 2026
The st.rerun(scope="app") call at line 182 may not work as expected. According to Streamlit documentation, st.rerun() does not accept a scope parameter. If this is intended to trigger a full app rerun from within a fragment, you should use st.rerun() without arguments. The presence of an invalid parameter could cause a TypeError.

Collaborator Author
The scope parameter was added in Streamlit 1.33+ (early 2024). scope="app" from inside a @st.fragment triggers a full page rerun, which is exactly the intended behavior here. Dismiss!

Comment on lines +42 to +65
def fetch_image_from_url(url: str, timeout: int = 5) -> Optional[bytes]:
"""
Fetch an image from a URL with logging.
Uses caching internally but logs the request.
"""
if not url or not isinstance(url, str):
return None

if not url.startswith(('http://', 'https://')):
logger.warning(f"[Image] Invalid URL scheme: {url[:50]}...")
return None

logger.info(f"[Image] Fetching: {url[:80]}...")
start_time = time.time()

result = _fetch_image_from_url_cached(url, timeout)

elapsed = time.time() - start_time
if result:
logger.info(f"[Image] Loaded: {len(result)/1024:.1f}KB in {elapsed:.3f}s")
else:
logger.warning(f"[Image] Failed to load: {url[:50]}...")

return result
Copilot AI Feb 11, 2026
The fetch_image_from_url function logs every image fetch at INFO level. Since the actual fetch is cached by _fetch_image_from_url_cached, subsequent calls will still log as if they're fetching, which could be misleading and create excessive log entries. Consider checking if the result came from cache before logging, or log at DEBUG level instead.

Comment on lines +314 to +317
# Use /dev/shm for fast IPC when available, else /tmp
shm_dir = "/dev/shm" if os.path.isdir("/dev/shm") else tempfile.gettempdir()
input_path = os.path.join(shm_dir, f"cuml_umap_in_{os.getpid()}.npy")
output_path = os.path.join(shm_dir, f"cuml_umap_out_{os.getpid()}.npy")
Copilot AI Feb 11, 2026
The subprocess script uses /dev/shm (shared memory) or temp directory for IPC, which is good for performance. However, the file paths use PID-based naming which could theoretically be guessed by another process. While this is a low-risk issue in single-user environments, consider using tempfile.mkstemp() or tempfile.NamedTemporaryFile() with delete=False for more secure temporary file creation, as these create files with unpredictable names and appropriate permissions.
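The reviewer's suggestion, sketched below; the helper name `make_ipc_paths` is invented for the example:

```python
import os
import tempfile

def make_ipc_paths(prefer_shm: bool = True):
    """Create two unpredictable, 0600-permission temp files for subprocess
    IPC, preferring /dev/shm when it exists (Linux shared memory)."""
    shm = "/dev/shm"
    base = shm if prefer_shm and os.path.isdir(shm) else None  # None = default tmp
    fd_in, input_path = tempfile.mkstemp(suffix=".npy", dir=base)
    fd_out, output_path = tempfile.mkstemp(suffix=".npy", dir=base)
    os.close(fd_in)   # the paths are reopened later by numpy save/load
    os.close(fd_out)
    return input_path, output_path
```

Unlike PID-derived names, `mkstemp` both randomizes the name and creates the file atomically, so another process cannot pre-create or guess it.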

_cuda_check_cache = (True, device_info)
logger.info(f"CUDA available via CuPy: {device_info}")
return _cuda_check_cache
except ImportError:
Copilot AI Feb 11, 2026
'except' clause does nothing but pass and there is no explanatory comment.

free_bytes, total_bytes = meminfo
used_bytes = total_bytes - free_bytes
return (used_bytes // (1024 * 1024), total_bytes // (1024 * 1024))
except Exception:
Copilot AI Feb 11, 2026
'except' clause does nothing but pass and there is no explanatory comment.

used = torch.cuda.memory_allocated() // (1024 * 1024)
total = torch.cuda.get_device_properties(0).total_memory // (1024 * 1024)
return (used, total)
except Exception:
Copilot AI Feb 11, 2026
'except' clause does nothing but pass and there is no explanatory comment.

for path in (input_path, output_path):
try:
os.unlink(path)
except OSError:
Copilot AI Feb 11, 2026
'except' clause does nothing but pass and there is no explanatory comment.

Comment on lines +394 to +395
except Exception:
pass
Copilot AI Feb 11, 2026
'except' clause does nothing but pass and there is no explanatory comment.

Suggested change
except Exception:
pass
except Exception as exc:
logger.exception("Failed to compute preview of filtered count", exc_info=exc)
st.warning("Unable to compute preview for the current filters. "
"You can still apply the filters to see full results.")
