FastFilter2 is a high-performance, production-ready Python tool for filtering paired-end FASTQ files. Designed for bioinformatics pipelines, it provides flexible, reliable, and fast filtering of sequencing data with built-in support for multi-threading, compression, and detailed summaries.
This tool is ideal for pre-processing RNA-seq, DNA-seq, or other high-throughput sequencing datasets prior to alignment, assembly, or downstream analysis.
- Biological quality filters:
  - Minimum read length
  - Maximum allowed ambiguous bases (Ns)
  - Homopolymer detection
  - Minimum average Phred quality score
- Paired-end safe: Ensures that reads are filtered in pairs, maintaining synchronization between R1 and R2.
- High-performance I/O: Writes uncompressed FASTQ first for speed, then compresses output automatically with `pigz` using multiple threads.
- Batch processing: Efficient batch writing to reduce I/O overhead.
- Progress tracking: Real-time progress bars via `tqdm` for monitoring large datasets.
- Summary output: Generates CSV reports with total reads, passing reads, and pass rates.
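The quality filters listed above can be illustrated with a minimal sketch. It is not the tool's actual implementation: the function names, the `max_n` parameter, the Phred+33 encoding, and the rule that both mates must pass are assumptions made for illustration. The default thresholds mirror the documented command-line defaults.

```python
import re


def mean_phred(quality: str, offset: int = 33) -> float:
    """Arithmetic mean Phred score of a quality string (Phred+33 assumed)."""
    if not quality:
        return 0.0
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def longest_homopolymer(seq: str) -> int:
    """Length of the longest run of one repeated base (case-insensitive)."""
    runs = re.finditer(r"(.)\1*", seq.upper())
    return max((len(m.group(0)) for m in runs), default=0)


def read_passes(seq: str, quality: str, minlen: int = 25, max_n: int = 0,
                max_homopolymer: int = 25, min_score: float = 30) -> bool:
    """Apply the four filters to one read. `max_n` is a hypothetical knob."""
    if len(seq) < minlen:
        return False
    # Count ambiguous bases written as N or '.'
    if sum(seq.upper().count(c) for c in "N.") > max_n:
        return False
    if longest_homopolymer(seq) > max_homopolymer:
        return False
    return mean_phred(quality) >= min_score


def pair_passes(r1, r2, **thresholds) -> bool:
    """Keep a pair only if both mates pass, so R1 and R2 stay synchronized.

    r1 and r2 are (sequence, quality) tuples.
    """
    return read_passes(*r1, **thresholds) and read_passes(*r2, **thresholds)
```

Evaluating the two mates together is what keeps the output files in step: a pair is either written to both outputs or to neither.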
Clone the repository and install dependencies:
git clone https://github.com/GamaPintoLab/fastfilter2.git
cd fastfilter2
pip install -r requirements.txt
Dependencies:
- Python 3.9 or higher
- Biopython
- tqdm
- pigz (for multi-threaded compression)
Run the tool from the command line:
fastfilter2 -i /path/to/input_fastq_dir -o /path/to/output_dir -j 4
- `-i, --seq-dir`: Input directory containing paired-end FASTQ files (required)
- `-o, --output-dir`: Directory to write filtered outputs (defaults to `<input_dir>/fastfilter`)
- `-j, --cpus`: Number of threads for parallel processing and compression (default: 1)
- `-l, --minlen`: Minimum sequence length (default: 25)
- `-p, --homopolymerlen`: Maximum allowed homopolymer length (default: 25)
- `-s, --min-score`: Minimum average Phred quality score (default: 30)
- `--dryrun`: Run without writing outputs (for testing)
fastfilter2 -i samples/fastq -o results/filtered -j 8 -l 50 -s 20 --dryrun
This example processes paired-end FASTQ files in samples/fastq using 8 CPU threads, filters out reads shorter than 50 bases or with an average quality below 20, and performs a dry run without writing files.
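For readers who want to see how the documented options fit together, here is a sketch of an equivalent `argparse` parser. The flags and defaults are taken from the option list above. The function names, the argument types, and the way the output directory default is resolved are assumptions, not the tool's actual code.

```python
import argparse
import os


def build_parser() -> argparse.ArgumentParser:
    """Parser mirroring the documented fastfilter2 options (illustrative)."""
    parser = argparse.ArgumentParser(
        prog="fastfilter2",
        description="Filter paired-end FASTQ files.",
    )
    parser.add_argument("-i", "--seq-dir", required=True,
                        help="input directory with paired-end FASTQ files")
    parser.add_argument("-o", "--output-dir", default=None,
                        help="output directory (default: <input_dir>/fastfilter)")
    parser.add_argument("-j", "--cpus", type=int, default=1,
                        help="threads for processing and compression")
    parser.add_argument("-l", "--minlen", type=int, default=25,
                        help="minimum sequence length")
    parser.add_argument("-p", "--homopolymerlen", type=int, default=25,
                        help="maximum allowed homopolymer length")
    parser.add_argument("-s", "--min-score", type=float, default=30,
                        help="minimum average Phred quality score")
    parser.add_argument("--dryrun", action="store_true",
                        help="run without writing outputs")
    return parser


def parse_args(argv=None) -> argparse.Namespace:
    """Parse arguments and fill in the documented output directory default."""
    args = build_parser().parse_args(argv)
    if args.output_dir is None:
        args.output_dir = os.path.join(args.seq_dir, "fastfilter")
    return args
```

Calling `parse_args(["-i", "samples/fastq", "-j", "8", "--dryrun"])` leaves the unspecified thresholds at their defaults and resolves the output directory relative to the input directory.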
- Input parsing: Reads paired-end FASTQ files and validates file pairs.
- Filtering: Applies multiple biological filters to each read:
  - Removes reads with ambiguous bases (N or .)
  - Filters out reads with homopolymers above a given threshold
  - Filters based on minimum length and average Phred score
- Batch writing: Writes passing reads in batches to reduce I/O overhead.
- Compression: Automatically compresses output FASTQ files with `pigz` for speed and storage efficiency.
- Reporting: Produces a summary CSV with per-file statistics including total reads, passing reads, and pass rates.
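The read, filter, and batch-write steps above can be sketched as a single streaming loop. This is an illustration under stated assumptions: it parses FASTQ with plain Python rather than Biopython to stay self-contained, and `read_fastq`, `filter_pairs`, the `keep` predicate, and `batch_size` are hypothetical names, not the tool's API.

```python
from itertools import islice
from typing import Callable, Iterator, TextIO, Tuple

Record = Tuple[str, ...]  # (header, sequence, separator, quality)


def read_fastq(handle: TextIO) -> Iterator[Record]:
    """Yield 4-line FASTQ records from an open text handle."""
    while True:
        lines = [line.rstrip("\n") for line in islice(handle, 4)]
        if len(lines) < 4:
            return  # end of input; a truncated trailing record is ignored
        yield tuple(lines)


def filter_pairs(r1_in: TextIO, r2_in: TextIO, r1_out: TextIO, r2_out: TextIO,
                 keep: Callable[[Record], bool],
                 batch_size: int = 10000) -> Tuple[int, int]:
    """Write pairs where both mates satisfy `keep`; return (total, good)."""
    total = good = 0
    batch1, batch2 = [], []
    for rec1, rec2 in zip(read_fastq(r1_in), read_fastq(r2_in)):
        total += 1
        if keep(rec1) and keep(rec2):
            good += 1
            batch1.append("\n".join(rec1) + "\n")
            batch2.append("\n".join(rec2) + "\n")
        # Flush both buffers together so R1 and R2 never drift apart.
        if len(batch1) >= batch_size:
            r1_out.writelines(batch1)
            r2_out.writelines(batch2)
            batch1.clear()
            batch2.clear()
    # Write whatever remains in the final, partially filled batch.
    r1_out.writelines(batch1)
    r2_out.writelines(batch2)
    return total, good
```

Buffering passing records and writing them in groups trades a bounded amount of memory for fewer write calls, which is the idea behind the batch-writing step. The returned counts are the inputs a per-file summary needs.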
Filtered paired-end files are named:
<sample_name>_R1_FILTERED.fastq.gz
<sample_name>_R2_FILTERED.fastq.gz
Summary CSV:
fastfilter_summary.csv containing:
- file: sample name
- total_reads: number of reads in input
- good_reads: reads passing filters
- pass_rate_pct: percentage of reads passing filters
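As an illustration of how the summary columns relate to each other, the sketch below builds one row per sample and writes it with Python's `csv` module. The column names come from the list above. The helper names, the rounding to two decimals, and the handling of empty inputs are assumptions.

```python
import csv
from typing import Dict, Iterable, TextIO

FIELDS = ["file", "total_reads", "good_reads", "pass_rate_pct"]


def summary_row(sample: str, total_reads: int, good_reads: int) -> Dict[str, object]:
    """Build one summary row; the pass rate is good/total as a percentage."""
    # Avoid division by zero for empty input files (assumed behavior).
    rate = 100.0 * good_reads / total_reads if total_reads else 0.0
    return {
        "file": sample,
        "total_reads": total_reads,
        "good_reads": good_reads,
        "pass_rate_pct": round(rate, 2),
    }


def write_summary(rows: Iterable[Dict[str, object]], handle: TextIO) -> None:
    """Write a header line followed by one line per sample."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
```

For example, a sample with 200 input reads of which 150 pass would be reported with a `pass_rate_pct` of 75.0.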
- Multi-threaded filtering and compression using `multiprocessing` and `pigz`.
- Efficient memory usage via batch processing of reads.
- Suitable for very large FASTQ datasets (tens to hundreds of millions of reads).
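The multi-threaded compression step can be approximated by calling `pigz` through `subprocess`. This is a sketch, not the tool's code: `pigz_command` and `compress` are hypothetical helpers. The `-p` option sets the number of compression threads, and by default `pigz` replaces the input file with a `.gz` file of the same name.

```python
import shutil
import subprocess
from typing import List


def pigz_command(path: str, threads: int = 1) -> List[str]:
    """Build the pigz invocation for compressing `path` with `threads` threads."""
    if threads < 1:
        raise ValueError("threads must be at least 1")
    return ["pigz", "-p", str(threads), path]


def compress(path: str, threads: int = 1) -> str:
    """Compress `path` in place with pigz and return the resulting .gz path."""
    if shutil.which("pigz") is None:
        raise RuntimeError("pigz was not found on PATH")
    # check=True raises CalledProcessError if pigz exits with an error.
    subprocess.run(pigz_command(path, threads), check=True)
    return path + ".gz"
```

Passing the command as a list avoids shell quoting problems with unusual file names, and keeping command construction separate from execution makes it easy to inspect what would run.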
This project is licensed under the MIT License. See the LICENSE file for details.
- Built on top of the original FastFilter concept
- Biopython community for sequence handling tools
- `tqdm` for progress visualization
- `pigz` for high-speed parallel compression
Author: Lucas Monteiro
PI: Margarida Gama-Carvalho
Lab: RNA Systems Biology Lab, BioISI, University of Lisbon