🩺 Multi-method ML diagnostic & prognostic classifier using ensemble learning (RF, SVM, XGBoost, KNN, MLP) with permutation testing, feature selection, and profile ranking

JMarzec/multi_ML_classifier

Multi-Method ML Diagnostic and Prognostic Classifier Dashboard

An interactive React dashboard for visualizing results from multi-method machine learning classification pipelines. Designed for biomedical research, clinical diagnostics, and prognostic prediction workflows.

🎯 Features

Machine Learning Analysis

  • Multiple ML Methods: Random Forest, SVM, XGBoost, KNN, MLP with soft/hard voting ensembles
  • Feature Selection: Forward, backward, and stepwise selection methods
  • Permutation Testing: Statistical validation against random chance
  • Cross-Validation: Repeated k-fold CV with comprehensive metrics
  • Class-Specific Risk Scores: Per-sample risk scores for each class (0-100 scale)
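The difference between the soft and hard voting ensembles listed above can be sketched as follows. This is a minimal illustration, not the pipeline's own code: the function names (`hardVote`, `softVote`) and the example probabilities are hypothetical, and the R scripts may weight or combine models differently.

```typescript
// Per-model class probabilities for one sample, e.g. [P(class 0), P(class 1)].
type ClassProbabilities = number[];

// Index of the largest value; ties resolve to the lowest index.
function argmax(values: number[]): number {
  let best = 0;
  for (let i = 1; i < values.length; i++) {
    if (values[i] > values[best]) best = i;
  }
  return best;
}

// Hard voting: each model casts one vote for its most probable class.
function hardVote(modelProbs: ClassProbabilities[]): number {
  const votes: number[] = modelProbs[0].map(() => 0);
  for (const probs of modelProbs) votes[argmax(probs)] += 1;
  return argmax(votes);
}

// Soft voting: average the probabilities across models, then pick the top class.
function softVote(modelProbs: ClassProbabilities[]): number {
  const mean: number[] = modelProbs[0].map(() => 0);
  for (const probs of modelProbs) {
    for (let c = 0; c < mean.length; c++) mean[c] += probs[c] / modelProbs.length;
  }
  return argmax(mean);
}

// Hypothetical outputs from three models for a single sample.
const sample: ClassProbabilities[] = [
  [0.45, 0.55],
  [0.40, 0.60],
  [0.95, 0.05],
];
const hard = hardVote(sample); // two of three models favour class 1
const soft = softVote(sample); // mean P(class 0) is 0.60, so class 0 wins
```

The two rules can disagree, as in this example: one very confident model outweighs two marginal ones under soft voting but not under hard voting.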

Visualization Dashboard

  • Model Performance Comparison: Side-by-side comparison of all ML methods
  • ROC Curves: Interactive multi-model ROC curves with threshold analysis
  • Confusion Matrices: Visual matrices with derived metrics (Accuracy, Precision, Sensitivity, Specificity, F1, NPV)
  • Feature Importance: Ranked feature visualization with stability analysis
  • Expression Box Plots: True box plots showing median, IQR, whiskers, and mean for top features across classes
  • Dimensionality Reduction: PCA, t-SNE, and UMAP clustering visualizations
  • Permutation Distributions: Actual vs. permuted performance comparison
  • Calibration Curves: Model probability calibration assessment
  • Profile Rankings: Sample-level prediction confidence rankings with risk scores
  • Feature Selection Visualization: Per-method feature selection comparison
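The derived metrics shown with the confusion matrices follow the standard binary-classification definitions. The sketch below assumes class 1 is the positive class; the `ConfusionCounts` and `DerivedMetrics` types and the `deriveMetrics` function are hypothetical names and are not taken from the dashboard source.

```typescript
// Counts from a binary confusion matrix (assumption: class 1 is "positive").
interface ConfusionCounts {
  tp: number; // true positives
  fp: number; // false positives
  tn: number; // true negatives
  fn: number; // false negatives
}

interface DerivedMetrics {
  accuracy: number;
  precision: number;
  sensitivity: number;
  specificity: number;
  f1: number;
  npv: number;
}

// Division that yields NaN when the denominator is zero (metric undefined).
function ratio(numerator: number, denominator: number): number {
  return denominator === 0 ? NaN : numerator / denominator;
}

function deriveMetrics({ tp, fp, tn, fn }: ConfusionCounts): DerivedMetrics {
  const precision = ratio(tp, tp + fp);   // TP / (TP + FP)
  const sensitivity = ratio(tp, tp + fn); // TP / (TP + FN)
  return {
    accuracy: ratio(tp + tn, tp + fp + tn + fn),
    precision,
    sensitivity,
    specificity: ratio(tn, tn + fp),      // TN / (TN + FP)
    f1: ratio(2 * precision * sensitivity, precision + sensitivity),
    npv: ratio(tn, tn + fn),              // TN / (TN + FN)
  };
}

// Hypothetical counts, for illustration only.
const metrics = deriveMetrics({ tp: 40, fp: 10, tn: 45, fn: 5 });
// metrics.accuracy is 0.85 and metrics.npv is 0.9 for these counts.
```

Returning NaN for an undefined metric (for example, precision when no sample is predicted positive) is one possible design choice; a dashboard could equally display such cells as "n/a".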

Export & Reporting

  • HTML/PDF Reports: Comprehensive export of analysis results
  • Model Export: R code snippets for external predictions
  • Batch Processing: Compare multiple datasets

🚀 Quick Start

Dashboard Setup

# Clone the repository
git clone <YOUR_GIT_URL>

# Navigate to the project directory
cd <YOUR_PROJECT_NAME>

# Install dependencies
npm install

# Start the development server
npm run dev

R Script Setup

Two R scripts are available:

  1. Cross-Validation Mode (scripts/multi_ML_classifier.R): Full pipeline with k-fold CV
  2. Full Training Mode (scripts/multi_ML_classifier_full_training.R): Train on 100% of data

Running the Analysis

# Create conda environment
conda env create -f public/ml_classifier_env.yml
conda activate ml_classifier

# Run CV-based analysis
Rscript scripts/multi_ML_classifier.R \
  --expr expression_matrix.txt \
  --annot sample_annotation.txt \
  --target diagnosis \
  --outdir results

# Or run full training (no CV)
Rscript scripts/multi_ML_classifier_full_training.R \
  --expr expression_matrix.txt \
  --annot sample_annotation.txt \
  --target diagnosis \
  --outdir results

Load the generated ml_results.json into the dashboard.

📊 Input Data Format

Expression Matrix (tab-delimited .txt or .tsv)

  • Rows: Features (genes, proteins, metabolites, etc.)
  • Columns: Samples (column names are sample IDs)
Gene    Sample1  Sample2  Sample3
BRCA1   5.2      4.8      6.1
TP53    3.1      3.5      2.9
...
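
To make the layout concrete, here is a rough TypeScript sketch that parses a tab-delimited expression matrix of this shape and checks that every row has one value per sample. It is not part of the pipeline (the R scripts read the file themselves); `parseExpressionMatrix` and the `ExpressionMatrix` type are hypothetical names for a pre-flight check.

```typescript
// Parsed expression matrix: sample IDs from the header, one numeric row per feature.
interface ExpressionMatrix {
  sampleIds: string[];
  features: string[];
  values: number[][]; // values[featureIndex][sampleIndex]
}

function parseExpressionMatrix(text: string): ExpressionMatrix {
  const lines = text.split(/\r?\n/).filter((line) => line.trim() !== "");
  if (lines.length < 2) {
    throw new Error("Expected a header row and at least one feature row");
  }

  // The first header cell labels the feature column; the rest are sample IDs.
  const sampleIds = lines[0].split("\t").slice(1);
  const features: string[] = [];
  const values: number[][] = [];

  for (const line of lines.slice(1)) {
    const cells = line.split("\t");
    if (cells.length !== sampleIds.length + 1) {
      throw new Error(
        `Row "${cells[0]}" has ${cells.length - 1} values, expected ${sampleIds.length}`
      );
    }
    features.push(cells[0]);
    values.push(cells.slice(1).map(Number));
  }
  return { sampleIds, features, values };
}

// The example rows from above, written with explicit tab separators.
const matrix = parseExpressionMatrix(
  "Gene\tSample1\tSample2\tSample3\nBRCA1\t5.2\t4.8\t6.1\nTP53\t3.1\t3.5\t2.9\n"
);
// matrix.sampleIds is ["Sample1", "Sample2", "Sample3"]; matrix.values[1][2] is 2.9.
```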

Sample Annotation (tab-delimited .txt or .tsv)

  • Must contain sample IDs matching expression matrix columns
  • Must contain target variable column for classification
SampleID  diagnosis  age  stage
Sample1   1          45   II
Sample2   0          52   I
Sample3   1          38   III
...
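
The two requirements above (matching sample IDs and a target column) can be checked before running the analysis. The sketch below is an illustration only: `checkAnnotation` is a hypothetical helper, and it assumes the sample IDs sit in the first annotation column, as in the example.

```typescript
interface AnnotationCheck {
  missingFromAnnotation: string[]; // matrix columns with no annotation row
  missingFromMatrix: string[];     // annotation rows with no matrix column
  targetPresent: boolean;          // whether the target column exists
}

function checkAnnotation(
  annotationText: string,
  matrixSampleIds: string[],
  targetColumn: string
): AnnotationCheck {
  const lines = annotationText.split(/\r?\n/).filter((line) => line.trim() !== "");
  if (lines.length === 0) throw new Error("Annotation file is empty");

  const header = lines[0].split("\t");
  // Assumption: sample IDs are in the first annotation column.
  const annotationIds = lines.slice(1).map((line) => line.split("\t")[0]);

  return {
    missingFromAnnotation: matrixSampleIds.filter((id) => annotationIds.indexOf(id) === -1),
    missingFromMatrix: annotationIds.filter((id) => matrixSampleIds.indexOf(id) === -1),
    targetPresent: header.indexOf(targetColumn) !== -1,
  };
}

// Annotation covering two of three matrix samples; "diagnosis" matches --target.
const check = checkAnnotation(
  "SampleID\tdiagnosis\tage\tstage\nSample1\t1\t45\tII\nSample2\t0\t52\tI\n",
  ["Sample1", "Sample2", "Sample3"],
  "diagnosis"
);
// check.missingFromAnnotation is ["Sample3"]; check.targetPresent is true.
```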

🔬 ML Method Guidance

Method         Best For                              Dataset Size (samples)
Random Forest  Robust baseline, feature importance   Any
SVM            High-dimensional data, clear margins  Small-Medium
XGBoost        Large datasets, complex patterns      Large (>100)
KNN            Simple patterns, interpretable        Small-Medium
MLP            Complex non-linear relationships      Large (>100)

Note: XGBoost and MLP require larger datasets for optimal performance. Random Forest, SVM, and KNN are more robust on small biological datasets (<50 samples).

📈 Risk Scoring

The pipeline calculates class-specific risk scores for each sample:

  • Risk Score Class 0 (Negative): Probability × 100 for the negative class
  • Risk Score Class 1 (Positive): Probability × 100 for the positive class

Each score ranges from 0 to 100 and is the model's predicted probability for that class expressed as a percentage, so a higher score means the model is more confident that the sample belongs to that class.
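
As a minimal sketch of that scaling, the snippet below turns a sample's class probabilities into risk scores. The function name `toRiskScores` and the input values are hypothetical; the actual scores are computed by the R scripts.

```typescript
// Scale each class probability (0 to 1) to a 0-100 risk score.
// Index 0 is the negative class and index 1 the positive class.
function toRiskScores(classProbabilities: number[]): number[] {
  return classProbabilities.map((p) => {
    // Rejects out-of-range values and NaN.
    if (!(p >= 0 && p <= 1)) {
      throw new RangeError("Probabilities must be between 0 and 1");
    }
    return p * 100;
  });
}

// Hypothetical prediction: P(class 0) = 0.25, P(class 1) = 0.75.
const [riskClass0, riskClass1] = toRiskScores([0.25, 0.75]);
// riskClass0 is 25 and riskClass1 is 75.
```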

🛠 Technology Stack

  • Frontend: React 18, TypeScript, Vite
  • UI Components: shadcn/ui, Tailwind CSS
  • Charts: Recharts
  • R Backend: caret, randomForest, e1071, xgboost, nnet, pROC

📁 Project Structure

├── scripts/
│   ├── multi_ML_classifier.R               # CV-based analysis script
│   └── multi_ML_classifier_full_training.R # Full training script
├── public/
│   ├── ml_classifier_env.yml               # Conda environment
│   └── demo_data/                          # Sample input files
├── src/
│   ├── components/                         # React components
│   ├── types/                              # TypeScript interfaces
│   └── data/                               # Demo data
└── README.md

📝 License

MIT License

🀝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
