⚠️ PROTOTYPE - NOT FOR PRODUCTION USE ⚠️

This project is an experimental prototype and was developed using AI-assisted "vibe coding".
- The methods and algorithms have NOT been validated for correctness or biological accuracy
- Results should NOT be used for clinical, research, or production purposes
- This is a proof-of-concept exploring energy-based alignment ideas
- Expect bugs, incomplete features, and potentially incorrect outputs
- If you need a production aligner, use established tools like minimap2, BWA-MEM, or LAST
Use at your own risk. Contributions and feedback welcome!
This repository was produced by iteratively prompting AI models to scaffold and refine a prototype codebase. It has not been independently audited or scientifically validated. Expect rough edges, inconsistent style, and spots where automated generation may have introduced subtle mistakes. Please review carefully before relying on any results.
AI-prone hotspots to review:
- catalign.align.CatalignAligner._banded_align: uses a simple diagonal band estimated from sequence lengths; alignments far off-diagonal or with large indels may be clipped or mis-scored.
- catalign.chain.chain_anchors: heuristic DP with a fixed look-back window and no strand/orientation handling; inversions or distant anchors can be missed, and chaining is O(n²) within the window.
- catalign.quality.evaluate_quality: gap costs are simplified (gap_open applied per event, no gap extension), energy totals may not match the DP scoring, and coverage is based on set cardinality rather than contiguous spans.
- catalign.sketch.minimizer_sketch: hashes are not canonicalized for reverse complements and use a basic rolling hash; collisions or strand issues can lead to missing/extra anchors.
Where sequences find each other like cats find the warmest spot.
Catalign is a bioinformatics sequence alignment tool that mimics natural molecular forces for aligning DNA sequences. Instead of arbitrary scoring heuristics, Catalign models alignment as an energy minimisation problem — long-range attraction finds approximate matching regions, short-range forces refine base-level alignment, and the final result is the most energetically stable configuration.
The fastest way to see Catalign in action with real genomic data:
# Install with visualization support
pip install "catalign[viz]"
# Run the mitochondrial genome demo
python examples/mito_demo.py

This downloads human mitochondrial sequences from NCBI and generates:
- Interactive dot plots showing sequence similarity
- Energy and identity heatmaps
- CIGAR string visualizations
- Quality assessment reports
See examples/README.md for more demo options.
Current sequence aligners (minimap2, BWA-MEM, LAST, …) are engineering marvels, but their scoring models diverge from the biology they serve:
| Problem with current aligners | Catalign's natural approach |
|---|---|
| Seed-and-extend is rigid — seeds are either exact or ignored | Long-range attraction via minimizer sketches provides a smooth, distance-weighted signal that degrades gracefully |
| Gap penalties (affine, convex) are mathematical conveniences, not biology | Energy-based gaps model strand separation: opening a gap costs real energy, extending it costs less — just like pulling apart a DNA duplex |
| Poor handling of structural variants — large indels fall outside band, SVs break aligners | Multi-scale chaining naturally handles large rearrangements: anchor chains can span structural variants |
| Repetitive regions cause spurious mappings and MQ0 scores | Energy minimisation across the full alignment landscape finds the globally optimal placement, not just the first adequate seed chain |
| No multi-scale quality assessment — MAPQ is a single number that hides local problems | Four-level quality evaluation (base → block → region → genome) exposes exactly where and why an alignment is uncertain |
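To make the gap row of the table concrete, here is a minimal sketch of an affine gap energy summed over a CIGAR string. It is not Catalign's scoring code: the function name `cigar_gap_energy` is hypothetical, the default energies are borrowed from the `EnergyModel` example in this README, and it assumes the convention that the first gapped base pays the opening energy and each further base pays only the extension energy.

```python
import re

def cigar_gap_energy(cigar: str, gap_open: float = 5.0, gap_extend: float = 1.0) -> float:
    """Sum an affine gap energy over every insertion/deletion run in a CIGAR string.

    Assumed convention (hypothetical, not taken from Catalign): the first base
    of a gap costs gap_open, each additional base costs gap_extend.
    """
    total = 0.0
    for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        if op in "ID":
            total += gap_open + gap_extend * (int(length) - 1)
    return total

# One 3-base insertion (5 + 2*1) plus one 1-base deletion (5).
print(cigar_gap_energy("10M3I20M1D5M"))  # 12.0

# A single 4-base gap is cheaper than four separate 1-base gaps.
print(cigar_gap_energy("10M4I10M"))              # 8.0
print(cigar_gap_energy("5M1I5M1I5M1I5M1I5M"))    # 20.0
```

Under this convention one long gap is favoured over several scattered short ones, which is the behaviour the table describes as opening being expensive and extension being cheap.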
┌──────────────────────────────────────────────────┐
│                 Catalign Pipeline                │
│                                                  │
│  ┌────────────┐   Long-range    ┌────────────┐   │
│  │ Minimizer  │   attraction    │   Anchor   │   │
│  │ Sketching  │ ──────────────► │  Chaining  │   │
│  └────────────┘  (k-mer match)  └─────┬──────┘   │
│                                       │          │
│                                       ▼          │
│                                 ┌────────────┐   │
│                 Short-range     │   Banded   │   │
│                 forces          │  DP Align  │   │
│                                 └─────┬──────┘   │
│                                       │          │
│                                       ▼          │
│                                 ┌────────────┐   │
│                 Energy          │ Multi-Scale│   │
│                 minimisation    │  Quality   │   │
│                                 └────────────┘   │
└──────────────────────────────────────────────────┘
- Sketching (catalign.sketch): Extract minimizer profiles from query and target. These compressed representations enable efficient long-range comparison.
- Anchor finding & chaining (catalign.chain): Shared minimizers become anchors; dynamic programming chains compatible anchors into co-linear groups.
- Banded alignment (catalign.align): Between anchors, a banded Smith-Waterman-style DP fills in base-level alignment using the energy model.
- Quality evaluation (catalign.quality): The finished alignment is assessed at four scales: base, block, region, and genome.
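The first two pipeline stages can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the code in catalign.sketch or catalign.chain: the names `canonical`, `sketch_minimizers`, `shared_anchors`, and `chain_colinear` are hypothetical, `zlib.crc32` stands in for whatever rolling hash the package uses, k-mers are canonicalized (which the hotspot list above says the prototype does not do), and the tiny `k` and `w` are chosen only so the demo works on a short string.

```python
import zlib

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    return min(kmer, kmer.translate(_COMPLEMENT)[::-1])

def sketch_minimizers(seq: str, k: int = 5, w: int = 4) -> list[tuple[int, str]]:
    """For each window of w consecutive k-mers keep the one with the smallest hash."""
    kmers = [canonical(seq[i:i + k]) for i in range(len(seq) - k + 1)]
    hashes = [zlib.crc32(km.encode()) for km in kmers]
    picked = set()
    for start in range(len(kmers) - w + 1):
        # Ties are broken by position so the choice is deterministic.
        picked.add(min(range(start, start + w), key=lambda i: (hashes[i], i)))
    return [(i, kmers[i]) for i in sorted(picked)]

def shared_anchors(query: str, target: str, k: int = 5, w: int = 4) -> list[tuple[int, int]]:
    """Return sorted (query_pos, target_pos) pairs whose minimizer k-mers are equal."""
    index: dict[str, list[int]] = {}
    for pos, km in sketch_minimizers(target, k, w):
        index.setdefault(km, []).append(pos)
    return sorted(
        (qpos, tpos)
        for qpos, km in sketch_minimizers(query, k, w)
        for tpos in index.get(km, [])
    )

def chain_colinear(anchors: list[tuple[int, int]], lookback: int = 50) -> list[tuple[int, int]]:
    """Longest chain increasing in both coordinates, checking only `lookback` predecessors."""
    if not anchors:
        return []
    score = [1] * len(anchors)
    prev = [-1] * len(anchors)
    for i, (q, t) in enumerate(anchors):
        for j in range(max(0, i - lookback), i):
            qj, tj = anchors[j]
            if qj < q and tj < t and score[j] + 1 > score[i]:
                score[i], prev[i] = score[j] + 1, j
    i = max(range(len(anchors)), key=score.__getitem__)
    chain = []
    while i != -1:  # walk the back-pointers from the best end anchor
        chain.append(anchors[i])
        i = prev[i]
    return chain[::-1]

target = "ACGTTGCAAGGCTTAACCGGATCGATTACGGCATTAGCCA"
query = target[6:30]  # a slice of the target, so anchors should share one diagonal
print(chain_colinear(shared_anchors(query, target)))
```

The chain scores every anchor as 1 and ignores strand, gap distance, and overlap, so it only shows the shape of the computation; a real chainer would weight anchors and penalise large jumps between diagonals.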
pip install catalign

For development with all extras:

git clone https://github.com/your-org/catalign.git
cd catalign
pip install -e ".[dev]"

Optional dependency groups:
- test: pytest, coverage
- viz: plotly, streamlit (for interactive visualization)
- viewer: pysam (for BAM/CRAM support)
- bench: pytest-benchmark, memory-profiler
- all: all of the above
Catalign includes caliview, a high-performance genome alignment viewer inspired by IGV but optimized for multi-scale alignment visualization.
- Multi-scale visualization: View alignments at base, block, region, and chromosome scales
- Custom .cali format: Pre-computed multi-scale metrics for instant browsing
- BAM/CRAM support: Import standard alignment formats
- Static binary: Easy deployment with no dependencies (Rust-based)
# Build the viewer (requires Rust)
cd caliview
cargo build --release
# Or install Python tools
pip install "catalign[viewer]"

# Convert BAM to CALI format
python -c "from catalign.viewer import bam_to_cali; bam_to_cali('input.bam', 'output.cali')"
# View with caliview (when compiled)
caliview view output.cali

from catalign.viewer import CaliFile, CaliWriter, MetricsTiler
# Read a CALI file
cali = CaliFile("alignment.cali")
tiles = cali.get_tiles("chr1", 0, 1_000_000, tile_size=10_000)
# Generate metrics from alignment
tiler = MetricsTiler(chromosome_length=100_000, tile_sizes=[1000, 10000])
# ... add positions ...
all_tiles = tiler.get_all_tiles()

# Align two FASTA files
catalign align query.fa target.fa
# PAF-style output
catalign align query.fa target.fa --output paf
# Custom energy parameters
catalign align query.fa target.fa --match-energy -2.5 --gap-open 6.0
# Quality evaluation
catalign quality query.fa target.fa

from catalign import CatalignAligner, EnergyModel, evaluate_quality
# Configure energy model
em = EnergyModel(match_energy=-2.0, mismatch_energy=3.0,
                 gap_open_energy=5.0, gap_extend_energy=1.0)
# Align
aligner = CatalignAligner(energy_model=em, k=15, w=50)
aln = aligner.align(query_seq, target_seq)
print(f"CIGAR: {aln.cigar}")
print(f"Energy: {aln.energy_score}")
# Multi-scale quality
qual = evaluate_quality(aln, query_seq, target_seq)
print(f"Identity: {qual.overall_identity:.2%}")
print(f"Quality score: {qual.quality_score:.1f}/100")

Catalign evaluates alignment quality at four levels:

- Base level: Per-position assessment of whether each position is a match, mismatch, insertion, or deletion. Each base carries an energy value reflecting confidence.
- Block level: Contiguous aligned segments are grouped into blocks. Each block reports identity percentage, length, and cumulative energy, revealing whether a region is solidly matched or weakly aligned.
- Region level: Blocks are aggregated into larger regions, reporting structural concordance, query/target coverage, and the number of aligned blocks. This is where structural variants become visible.
- Genome level: A QualityReport aggregates all alignments for a genome-wide view: mean identity, mean quality score, total energy, and overall coverage.
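As a rough illustration of how block-level numbers roll up into a coarser summary, the sketch below splits a CIGAR string into gap-free blocks and reports identity per block and overall. It is not the logic of evaluate_quality: the `Block` class and `blocks_from_cigar` are hypothetical names, energies are left out, and it assumes an extended CIGAR that uses `=` and `X` rather than `M`, which may not match what Catalign emits.

```python
import re
from dataclasses import dataclass

@dataclass
class Block:
    """A gap-free stretch of the alignment (hypothetical structure for illustration)."""
    matches: int = 0
    mismatches: int = 0

    @property
    def length(self) -> int:
        return self.matches + self.mismatches

    @property
    def identity(self) -> float:
        return self.matches / self.length if self.length else 0.0

def blocks_from_cigar(cigar: str) -> list[Block]:
    """Split an extended CIGAR (=, X, I, D) into blocks; every I or D run ends a block."""
    blocks: list[Block] = []
    current = Block()
    for length, op in re.findall(r"(\d+)([=XID])", cigar):
        n = int(length)
        if op == "=":
            current.matches += n
        elif op == "X":
            current.mismatches += n
        else:  # insertion or deletion closes the current block
            if current.length:
                blocks.append(current)
            current = Block()
    if current.length:
        blocks.append(current)
    return blocks

blocks = blocks_from_cigar("8=2X3I10=1D5=")
for b in blocks:
    print(f"length={b.length} identity={b.identity:.0%}")

# Coarser summary: pool all blocks, as a region or genome level might.
overall = sum(b.matches for b in blocks) / sum(b.length for b in blocks)
print(f"overall identity: {overall:.2%}")  # 92.00%
```

The per-block view shows that the first block carries both mismatches, while the pooled figure alone would hide where they fall.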
pip install -e ".[test]"
pytest tests/ -v

Catalign includes a comprehensive synthetic test data generator for validation:

python scripts/generate_test_data.py

This creates test sequences in tests/resources/ with:
- Identical sequences — baseline for 100% identity alignment
- SNP mutations — single and multiple nucleotide polymorphisms
- Indels — insertions and deletions of various sizes
- Structural variants — inversions, tandem duplications, large deletions
- Repetitive sequences — tandem repeats and interspersed elements
- Benchmark sequences — larger sequences for performance testing
Each test case includes a ground truth JSON file with expected alignment properties.
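To show what a synthetic case with ground truth can look like, here is a minimal sketch that plants SNPs in a random sequence and records where they are. It does not reproduce scripts/generate_test_data.py: the function `make_snp_case` and the keys of the truth dictionary are hypothetical, and the real JSON schema may differ.

```python
import json
import random

def make_snp_case(length: int = 200, n_snps: int = 5, seed: int = 0):
    """Build a (query, target, truth) triple where query is target with n_snps substitutions."""
    rng = random.Random(seed)  # seeded so the test case is reproducible
    target = [rng.choice("ACGT") for _ in range(length)]
    query = list(target)
    positions = sorted(rng.sample(range(length), n_snps))
    for pos in positions:
        # Pick any base other than the original so the position really differs.
        query[pos] = rng.choice([b for b in "ACGT" if b != target[pos]])
    truth = {
        "type": "snp",                              # hypothetical field names
        "positions": positions,
        "expected_identity": 1 - n_snps / length,
    }
    return "".join(query), "".join(target), truth

query, target, truth = make_snp_case()
print(json.dumps(truth, indent=2))
```

Because the substitutions are recorded at generation time, an aligner's reported mismatches and identity can be checked against `truth` directly rather than against another aligner's output.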
# Run benchmark suite against ground truth
catalign benchmark --resources-dir tests/resources
# Generate metrics report
catalign metrics query.fa target.fa --json

The test suite includes synthetic sequences covering:
- Identical sequence alignment
- Single-base mismatches
- Small insertions and deletions
- Repetitive sequences
- Longer sequences with shared cores
For production-scale testing, we recommend:
- T2T CHM13 v2.0 — the first complete human genome assembly from the Telomere-to-Telomere Consortium. Download from NCBI.
- HPRC assemblies — haplotype-resolved assemblies (hap1 vs hap2) from the Human Pangenome Reference Consortium, ideal for testing structural variant handling and haplotype-aware alignment. Available at https://humanpangenome.org.
Example with real data:
# Align hap1 vs hap2 for a single chromosome
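catalign align hap1_chr1.fa hap2_chr1.fa --kmer-size 19 --window-size 100 --output paf

- Minimizer sketching reduces the sequence representation by up to ~50× compared to all k-mers, enabling fast long-range comparison. The exact factor depends on the window size w: standard sliding-window minimizers have an expected density of about 2/(w+1).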
- Anchor chaining uses a look-back window in the DP to keep chaining O(n·w) instead of O(n²).
- Banded alignment constrains the DP matrix to a diagonal band, reducing base-level alignment from O(nm) to O(n·bandwidth).
- NumPy arrays are used for base encoding and energy matrices for vectorised computation.
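The banding idea in the list above can be sketched as a small pure-Python DP that only visits cells within a fixed distance of the main diagonal. This is not CatalignAligner._banded_align: the function `banded_energy` is hypothetical, it computes a global minimum energy rather than a local alignment, it returns only the energy (no CIGAR), and it uses a flat per-base gap energy instead of the affine open/extend model to keep the recurrence short.

```python
import math

def banded_energy(query: str, target: str, band: int = 8,
                  match: float = -2.0, mismatch: float = 3.0, gap: float = 5.0) -> float:
    """Minimum global alignment energy using only cells with |i - j| <= band."""
    n, m = len(query), len(target)
    if abs(n - m) > band:
        return math.inf  # the end cell (n, m) lies outside the band
    # Row 0: aligning an empty query prefix against j target bases costs j gaps.
    prev = {j: j * gap for j in range(0, min(m, band) + 1)}
    for i in range(1, n + 1):
        cur = {}
        for j in range(max(0, i - band), min(m, i + band) + 1):
            best = math.inf
            if j == 0:
                best = i * gap  # i query bases against nothing
            else:
                if j - 1 in prev:  # diagonal move: match or mismatch
                    sub = match if query[i - 1] == target[j - 1] else mismatch
                    best = min(best, prev[j - 1] + sub)
                if j - 1 in cur:   # horizontal move: gap in the query
                    best = min(best, cur[j - 1] + gap)
            if j in prev:          # vertical move: gap in the target
                best = min(best, prev[j] + gap)
            cur[j] = best
        prev = cur
    return prev[m]

print(banded_energy("ACGTACGT", "ACGTACGT"))      # -16.0, eight matches
print(banded_energy("ACGTACGT", "ACGAACGT"))      # -11.0, seven matches and one mismatch
print(banded_energy("A" * 20, "A" * 5, band=8))   # inf, a 15-base length difference exceeds the band
```

Each row touches at most 2·band + 1 cells, which is where the O(n·bandwidth) cost comes from. The last call also shows the trade-off noted in the hotspot list: an indel larger than the band cannot be represented at all.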
For whole-genome alignment, future versions will add:
- Tiled/blocked processing for memory efficiency
- Multiprocessing for independent chromosome arms
- Optional GPU acceleration via CuPy

Roadmap:
- Visualisation: quality heat-maps, dot-plots, energy landscape plots, and interactive dashboard
- GPU acceleration: CuPy backend for banded DP on large sequences
- Population-scale alignment: align many samples against a pangenome graph
- RNA-seq support: splice-aware energy model
- Structural variant calling: leverage anchor chain breaks as SV evidence

Catalign includes a comprehensive visualization suite for alignment analysis. Launch the interactive dashboard:
catalign viz
# Or specify port
catalign viz --port 8080

from catalign import CatalignAligner, evaluate_quality
from catalign.viz import (
create_dotplot,
create_energy_heatmap,
create_identity_heatmap,
visualize_cigar,
create_alignment_view,
)
# Align sequences
aligner = CatalignAligner()
aln = aligner.align(query_seq, target_seq)
# Dot plot
dp = create_dotplot(query_seq, target_seq, k=11)
fig = dp.to_figure()
fig.write_html("dotplot.html")
# Energy landscape
energy_fig = create_energy_heatmap(aln, query_seq, target_seq)
# CIGAR visualization
cigar_viz = visualize_cigar(aln.cigar)
cigar_viz.to_figure().show()
# Text alignment view
view = create_alignment_view(aln, query_seq, target_seq)
print(view.to_text())

| Visualization | Description |
|---|---|
| Dot Plot | K-mer match visualization showing sequence similarity patterns |
| Energy Heatmap | Sliding window energy across the alignment |
| Identity Heatmap | Local identity percentage along the alignment |
| CIGAR View | Colored block representation of alignment operations |
| Quality Heatmap | Multi-track view of operation types and energies |
| Text Alignment | Traditional side-by-side alignment with match indicators |
MIT — see LICENSE.