Conversation
Introduces a new `shared/` module structure to support application separation and code reuse across multiple Streamlit apps.

## Structure

### shared/utils/
- `clustering.py`: Dimensionality reduction (PCA, t-SNE, UMAP) and K-means clustering with multi-backend support (sklearn, FAISS, cuML)
- `io.py`: File I/O utilities for embeddings and data persistence
- `models.py`: Shared data models and type definitions

### shared/services/
- `clustering_service.py`: High-level clustering workflow orchestration
- `embedding_service.py`: Image embedding generation using various models
- `file_service.py`: File discovery and validation services

### shared/components/
- `clustering_controls.py`: Streamlit UI controls for backend selection, seed configuration, and worker settings
- `summary.py`: Cluster summary statistics and representative images
- `visualization.py`: Scatter plot visualization with Altair

### shared/lib/
- `progress.py`: Progress tracking utilities for long-running operations

## Backend Support
- sklearn: Default CPU backend for all operations
- FAISS: Optional GPU/CPU accelerated K-means clustering
- cuML: Optional RAPIDS GPU acceleration for dim reduction and clustering, with automatic fallback on unsupported architectures

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Introduces `apps/embed_explore/` as a self-contained Streamlit app for interactive image embedding exploration and clustering.

## Application Structure

### apps/embed_explore/
- `app.py`: Main application entry point with two-column layout (sidebar controls + main visualization area)

### apps/embed_explore/components/
- `sidebar.py`: Complete sidebar UI with embedding and clustering sections, model selection, and backend configuration
- `summary.py`: Cluster statistics display and representative images
- `visualization.py`: Interactive scatter plot with image preview panel

## Features
- Directory-based image loading with supported format filtering
- Multiple embedding model support (DINOv2, OpenCLIP, etc.)
- Configurable dimensionality reduction (PCA, t-SNE, UMAP)
- K-means clustering with adjustable cluster count
- Interactive Altair scatter plot with click-to-preview
- Cluster summary statistics with representative samples

## Usage

Run as a standalone app:

    streamlit run apps/embed_explore/app.py

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Updates existing components to use the new shared module and removes legacy code that has been superseded by the app separation.

## Removed
- `app.py`: Legacy monolithic entry point (replaced by apps/)
- `components/clustering/`: Entire directory moved to shared/ and apps/
- `pages/01_Clustering.py`: Now available as the standalone embed_explore app

## Updated Imports
- `components/precalculated/sidebar.py`: Uses shared.services and shared.components for clustering functionality
- `pages/02_Precalculated_Embeddings.py`: Uses shared.components for visualization and summary rendering

## pyproject.toml Changes
- Entry points updated:
  - `emb-embed-explore` → apps.embed_explore.app:main
  - `list-models` → shared.utils.models:list_available_models
- Package includes: shared/, apps/
- Dependencies:
  - streamlit>=1.50.0 (updated for new API)
  - numpy<=2.2.0 (compatibility constraint)
- Version path: shared/__init__.py

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
… code

Add new standalone app:
- apps/precalculated/ - Precalculated embeddings explorer with dynamic cascading filters, CUDA auto-detection, and console logging

Features:
- Dynamic filter generation based on parquet columns
- Cascading/dependent filters with AND logic
- Auto backend selection (cuml when CUDA available)
- Console logging for clustering operations
- Image caching to prevent re-fetch on reruns
- State management for record details panel

Clean up legacy code:
- Remove pages/02_Precalculated_Embeddings.py (monolithic page)
- Remove components/ directory (old component structure)
- Remove services/ directory (old services, now in shared/)
- Remove utils/ directory (old utils, now in shared/)
- Remove list_models.py (replaced by entry point)
- Move taxonomy_tree.py to shared/utils/

Update shared module:
- Add taxonomy tree functions to shared/utils/
- Add VRAM error handling utilities to clustering.py
- Fix import paths in summary.py

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add .interactive() to scatter plots in both apps:
- Scroll wheel to zoom
- Drag to pan
- Double-click to reset

Note: Zoom state resets on app rerun (known Streamlit limitation)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add a toggleable density heatmap overlay using Altair's mark_rect with 2D binning. This helps visualize point concentration in crowded areas of the scatter plot while keeping individual points visible.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Streamlit doesn't support selections on multi-view (layered) Altair charts. When the density heatmap is shown, disable on_select and show a note telling the user that point selection is temporarily unavailable.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Off: normal 0.7 opacity, selection enabled
- Opacity: low 0.15 opacity so overlapping points show density naturally, selection still works
- Heatmap: 2D binned density layer behind points (selection disabled due to Streamlit limitation with layered charts)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
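The three modes above map naturally onto a small settings helper. The sketch below is hypothetical (the real logic lives in the visualization component); the opacity values and the selection restriction come from the commit message:

```python
def density_settings(mode: str) -> dict:
    """Map the density-mode radio value to chart settings (illustrative sketch)."""
    if mode == "Off":
        return {"opacity": 0.7, "selection_enabled": True, "heatmap": False}
    if mode == "Opacity":
        # Low opacity lets overlapping points show density naturally
        return {"opacity": 0.15, "selection_enabled": True, "heatmap": False}
    if mode == "Heatmap":
        # Streamlit cannot return selections from layered Altair charts
        return {"opacity": 0.7, "selection_enabled": False, "heatmap": True}
    raise ValueError(f"unknown density mode: {mode}")
```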
- Add grid resolution slider (10-80 bins) when Heatmap mode is selected
- Replace truncated metadata display with full-width dataframe table
- Show complete UUID and all field values without truncation
- Use compact column layout for density options

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Display Cluster and UUID prominently as markdown (full values, no truncation), then show the remaining metadata fields in a scrollable table.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Both apps now use shared/components/visualization.py for the scatter plot
- Shared visualization has all features: zoom/pan, density modes, configurable bins
- Dynamic tooltip building works for any data columns
- Added data_version tracking for selection validation
- Moved embed_explore's render_image_preview to a separate file
- App-specific visualization.py files now re-export from shared

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add shared/utils/logging_config.py for centralized logging setup
- Add logging to clustering utilities (backend selection, timing)
- Add logging to ClusteringService (workflow steps, timing)
- Add logging to EmbeddingService (model loading, generation stats)
- Add logging to FileService (file operations, timing)
- Replace print() fallback messages with proper logger.warning()
- Fix use_container_width deprecation: use width="stretch" instead

Logging now tracks:
- Which backend is selected (sklearn/cuML/FAISS)
- Operation timing for performance monitoring
- Fallback events when GPU operations fail

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add shared/utils/backend.py with centralized backend utilities:
  - check_cuda_available(): Cached CUDA detection via PyTorch/CuPy
  - resolve_backend(): Auto-resolve to cuML/FAISS/sklearn based on hardware
  - is_oom_error(), is_cuda_arch_error(), is_gpu_error(): Error classification
- Update embed_explore to use robust error handling:
  - Auto-resolve backends based on available hardware
  - Automatic fallback to sklearn on GPU errors
  - Consistent logging of backend selection
- Update precalculated to use shared backend utilities:
  - Remove duplicate check_cuda_available/resolve_backend functions
  - Replace print() with logger calls for consistency

Both apps now have identical backend selection and fallback behavior.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
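A minimal sketch of how the resolution and classification helpers could fit together. The function names come from the commit message; the bodies below are illustrative assumptions, not the actual shared/utils/backend.py code:

```python
def resolve_backend(requested: str, cuda_available: bool, faiss_available: bool) -> str:
    """Resolve 'auto' to the best available backend, preferring GPU (sketch)."""
    if requested != "auto":
        return requested          # explicit user choice wins
    if cuda_available:
        return "cuml"             # RAPIDS GPU path
    if faiss_available:
        return "faiss"            # accelerated CPU/GPU K-means
    return "sklearn"              # always-available default


def is_oom_error(exc: Exception) -> bool:
    """Heuristic classification of GPU out-of-memory errors by message (sketch)."""
    msg = str(exc).lower()
    return "out of memory" in msg or "cuda_error_out_of_memory" in msg
```

Because both apps call the same helpers, the backend selection and fallback behavior stays identical, as the commit notes.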
…logging

- Clustering summary is now computed once when clustering runs and stored in session state (clustering_summary, clustering_representatives)
- Summary component displays cached results instead of recomputing on every render (zoom, pan, point selection no longer trigger recompute)
- Added logging for image retrieval:
  - URL fetch timing and size
  - Timeout and error handling with warnings
  - Debug logging for image display
- Removed ClusteringService import from summary component (uses cache)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Visualization logging:
- Log density mode changes (Off/Opacity/Heatmap)
- Log heatmap bin changes
- Log point selection with cluster info
- Log chart render with point count and settings

Image I/O logging (fixed to work with caching):
- Separate cached fetch from logging wrapper
- Log fetch start, success (with size), and failures
- Log image open with dimensions
- Track last displayed image to avoid duplicate logs

All logs use [Visualization] and [Image] prefixes for easy filtering.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Root cause: embed_explore was using its local summary.py, which called ClusteringService.generate_clustering_summary() on every render instead of the shared version that uses cached session state results.

Fix:
- Update embed_explore/app.py to import from shared.components.summary
- Update local summary.py to re-export from shared for backwards compat
- Add ISSUES.md to track known issues

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…clip)

Heavy libraries are now only imported when explicitly needed:
- FAISS: loaded when the FAISS backend is selected or auto-resolved
- torch/open_clip: loaded when embedding generation is triggered
- cuML: loaded when the cuML backend is selected

Changes:
- shared/utils/clustering.py: lazy-load sklearn, UMAP, FAISS, cuML
- shared/utils/models.py: lazy-load open_clip
- shared/services/embedding_service.py: lazy-load torch and open_clip
- shared/components/clustering_controls.py: cache backend availability check
- shared/utils/backend.py: cache FAISS and cuML availability checks

This significantly improves app startup time by avoiding unnecessary imports during module load.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Reverting commit d34c33e, as the lazy loading implementation made startup performance worse instead of better.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Issue tracking moved to GitHub Issues:
- Slow startup: #12

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Delete the stale lib/ directory (duplicated in shared/lib/), remove unused imports (pandas from models.py, Counter from taxonomy_tree.py), remove dead error-detection functions from the clustering __init__, add logs/ to .gitignore, and add a print_available_models() entry point to models.py.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add ClusteringService.run_clustering_safe() to encapsulate GPU-to-CPU fallback logic, replacing ~100 lines of duplicated error handling in both app sidebars.

Enhance the logging format with funcName:lineno, add a persistent file handler (logs/emb_explorer.log), switch error handlers to logger.exception() for tracebacks, and add data loading/filter logging to the precalculated sidebar.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Wrap the scatter chart in @st.fragment so zoom/pan reruns only the chart fragment, not the full page; trigger st.rerun(scope="app") only when the selected point actually changes.

Run cuML UMAP in an isolated subprocess with L2-normalized embeddings to prevent SIGFPE crashes (NN-descent numerical instability with large-magnitude embeddings). Falls back to sklearn UMAP automatically if the subprocess fails.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add a GPU acceleration section to the README explaining optional GPU support, with CUDA 12/13 install commands. Create docs/DATA_FORMAT.md documenting the expected parquet schema for the precalculated app. Split the pyproject.toml GPU extras into gpu-cu12/gpu-cu13 groups and add the pynvml dependency.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Apply L2 normalization to all embeddings before clustering and dimensionality reduction via _prepare_embeddings(). This prevents cuML UMAP SIGFPE crashes from large-magnitude vectors and is appropriate for CLIP-family contrastive embeddings. Log input norms, non-finite values, and embedding shapes at each pipeline step.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Document the full embedding pipeline (preparation, KMeans, dim reduction, visualization) with backend details and the fallback chain. Link from the README.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pull request overview
This PR performs a major architectural refactoring, breaking the monolithic app.py into two focused standalone Streamlit applications:
Purpose: Separate workflows for (1) embedding and exploring local images vs. (2) exploring precomputed embeddings from parquet files, with shared infrastructure to keep code DRY and maintainable.
Changes:
- Restructured codebase into `shared/` (reusable utilities, services, components) and `apps/` (two standalone Streamlit apps)
- Enhanced clustering robustness with L2 normalization, GPU-to-CPU fallback, and subprocess isolation for cuML UMAP
- Added dynamic cascading filter generation for the precalculated embeddings app
- Implemented centralized logging with console and file output
- Improved visualization with zoom/pan via `@st.fragment`, density modes (opacity/heatmap), and optimized reruns
Reviewed changes
Copilot reviewed 48 out of 54 changed files in this pull request and generated 25 comments.
Show a summary per file
| File | Description |
|---|---|
| `apps/embed_explore/app.py` | New standalone app for embedding local images with CLIP/BioCLIP models |
| `apps/precalculated/app.py` | New standalone app for exploring parquet files with precomputed embeddings |
| `apps/precalculated/components/sidebar.py` | Dynamic cascading filter generation based on parquet schema |
| `apps/precalculated/components/data_preview.py` | Metadata preview with image URL fetching and caching |
| `shared/utils/clustering.py` | Major refactor: L2 normalization, subprocess isolation for cuML UMAP, comprehensive logging |
| `shared/utils/backend.py` | New module for GPU/CUDA detection and backend resolution |
| `shared/utils/logging_config.py` | Centralized logging configuration with file and console output |
| `shared/services/clustering_service.py` | Clustering workflows with automatic GPU-to-CPU fallback |
| `shared/components/visualization.py` | Scatter plot with `@st.fragment` isolation, density modes, and zoom/pan |
| `shared/components/summary.py` | Clustering summary with cached results and taxonomy tree |
| `pyproject.toml` | Updated dependencies, new GPU extras (cu12/cu13), entry points for both apps |
| `README.md` | Simplified documentation focused on two distinct workflows |
| `docs/DATA_FORMAT.md` | New documentation for expected parquet schema |
| `docs/BACKEND_PIPELINE.md` | New documentation explaining the clustering pipeline |
```python
input_path, output_path = sys.argv[1], sys.argv[2]
n_neighbors = int(sys.argv[3])
seed = int(sys.argv[4]) if sys.argv[4] else None
```
The subprocess script parses seed from argv[4], but if the seed is empty string (when seed is None), int(sys.argv[4]) will raise a ValueError. The correct check should be seed = int(sys.argv[4]) if sys.argv[4] and sys.argv[4] != "" else None or similar.
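One nuance worth checking here: the empty string is falsy in Python, so the original conditional already short-circuits to `None` before `int()` runs. A standalone repro with a simulated argv (hypothetical helper name, mirroring the snippet under review):

```python
def parse_seed(argv):
    # Mirrors the conditional under review: the falsy check short-circuits
    # before int() is called, so an empty string yields None, not ValueError.
    return int(argv[4]) if argv[4] else None

fake_argv = ["script.py", "in.npy", "out.npy", "15", ""]
print(parse_seed(fake_argv))               # None
print(parse_seed(fake_argv[:4] + ["42"]))  # 42
```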
```diff
 dependencies = [
     # Core UI and web framework
-    "streamlit>=1.40.0",
+    "streamlit>=1.50.0",
```
The Streamlit version requirement was changed from >=1.40.0 to >=1.50.0 without explanation. According to public release information, Streamlit 1.50.0 was released in February 2025 (very recent). This could cause installation failures on systems that don't have access to the latest version yet. Consider whether this minimum version is truly required for the new features, or if >=1.40.0 would still work.
Streamlit 1.50 is a year old. Dismiss!
```python
st.session_state["selected_image_idx"] = new_idx
st.session_state["selection_data_version"] = st.session_state.get("data_version", None)
# Trigger full page rerun so the preview panel updates
st.rerun(scope="app")
```
The st.rerun(scope="app") call at line 182 may not work as expected. According to Streamlit documentation, st.rerun() does not accept a scope parameter. If this is intended to trigger a full app rerun from within a fragment, you should use st.rerun() without arguments. The presence of an invalid parameter could cause a TypeError.
The scope parameter was added in Streamlit 1.33+ (early 2024). scope="app" from inside a @st.fragment triggers a full page rerun, which is exactly the intended behavior here. Dismiss!
```python
def fetch_image_from_url(url: str, timeout: int = 5) -> Optional[bytes]:
    """
    Fetch an image from a URL with logging.

    Uses caching internally but logs the request.
    """
    if not url or not isinstance(url, str):
        return None

    if not url.startswith(('http://', 'https://')):
        logger.warning(f"[Image] Invalid URL scheme: {url[:50]}...")
        return None

    logger.info(f"[Image] Fetching: {url[:80]}...")
    start_time = time.time()

    result = _fetch_image_from_url_cached(url, timeout)

    elapsed = time.time() - start_time
    if result:
        logger.info(f"[Image] Loaded: {len(result)/1024:.1f}KB in {elapsed:.3f}s")
    else:
        logger.warning(f"[Image] Failed to load: {url[:50]}...")

    return result
```
The fetch_image_from_url function logs every image fetch at INFO level. Since the actual fetch is cached by _fetch_image_from_url_cached, subsequent calls will still log as if they're fetching, which could be misleading and create excessive log entries. Consider checking if the result came from cache before logging, or log at DEBUG level instead.
```python
# Use /dev/shm for fast IPC when available, else /tmp
shm_dir = "/dev/shm" if os.path.isdir("/dev/shm") else tempfile.gettempdir()
input_path = os.path.join(shm_dir, f"cuml_umap_in_{os.getpid()}.npy")
output_path = os.path.join(shm_dir, f"cuml_umap_out_{os.getpid()}.npy")
```
The subprocess script uses /dev/shm (shared memory) or temp directory for IPC, which is good for performance. However, the file paths use PID-based naming which could theoretically be guessed by another process. While this is a low-risk issue in single-user environments, consider using tempfile.mkstemp() or tempfile.NamedTemporaryFile() with delete=False for more secure temporary file creation, as these create files with unpredictable names and appropriate permissions.
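A sketch of the reviewer's `tempfile.mkstemp` suggestion that keeps the `/dev/shm` preference (hypothetical helper name; not the project's code):

```python
import os
import tempfile

def make_ipc_paths():
    """Create unpredictable, owner-only temp files for subprocess IPC,
    preferring /dev/shm when the directory exists (else the default tmp dir)."""
    shm_dir = "/dev/shm" if os.path.isdir("/dev/shm") else None
    fd_in, input_path = tempfile.mkstemp(suffix=".npy", prefix="cuml_umap_in_", dir=shm_dir)
    fd_out, output_path = tempfile.mkstemp(suffix=".npy", prefix="cuml_umap_out_", dir=shm_dir)
    # mkstemp returns open descriptors; close them since numpy will reopen by path
    os.close(fd_in)
    os.close(fd_out)
    return input_path, output_path
```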
```python
        _cuda_check_cache = (True, device_info)
        logger.info(f"CUDA available via CuPy: {device_info}")
        return _cuda_check_cache
    except ImportError:
```
'except' clause does nothing but pass and there is no explanatory comment.
```python
        free_bytes, total_bytes = meminfo
        used_bytes = total_bytes - free_bytes
        return (used_bytes // (1024 * 1024), total_bytes // (1024 * 1024))
    except Exception:
```
'except' clause does nothing but pass and there is no explanatory comment.
```python
        used = torch.cuda.memory_allocated() // (1024 * 1024)
        total = torch.cuda.get_device_properties(0).total_memory // (1024 * 1024)
        return (used, total)
    except Exception:
```
'except' clause does nothing but pass and there is no explanatory comment.
```python
    for path in (input_path, output_path):
        try:
            os.unlink(path)
        except OSError:
```
'except' clause does nothing but pass and there is no explanatory comment.
```python
    except Exception:
        pass
```
'except' clause does nothing but pass and there is no explanatory comment.
```diff
-        except Exception:
-            pass
+        except Exception as exc:
+            logger.exception("Failed to compute preview of filtered count", exc_info=exc)
+            st.warning("Unable to compute preview for the current filters. "
+                       "You can still apply the filters to see full results.")
```
This PR breaks our old monolithic `app.py` into two focused Streamlit apps under `apps/`.

Additionally, we (who's we? me and claude lol) made these improvements:

- On GPU errors we fall back to `sklearn` and let you know what happened (in the console output)