Training

End-to-end training is orchestrated by scripts/train_all.py, which runs the baseline and the main model in sequence and then produces evaluation metrics, the calibration plot, and the global SHAP summary. The full run is CPU-only; no GPU is required or used at any stage.
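
In rough terms, the orchestrator just shells out to the individual stages (listed under Commands below) in order. A minimal sketch, assuming a plain subprocess sequence; the actual script may pass extra flags:

import subprocess
import sys

STAGES = [
    [sys.executable, "-m", "cardio_risk_rf.training.train", "model=rf"],    # baseline
    [sys.executable, "-m", "cardio_risk_rf.training.train", "model=lgbm"],  # main model
    [sys.executable, "-m", "cardio_risk_rf.evaluation.evaluate"],           # metrics + calibration
    [sys.executable, "-m", "cardio_risk_rf.explain",
     "--model", "artifacts/main/cardio_risk_lgbm.joblib"],                  # global SHAP
]

for cmd in STAGES:
    subprocess.run(cmd, check=True)  # abort the pipeline on the first failure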

Prerequisites

uv sync --all-groups
bash scripts/sync_data.sh
uv run python -m cardio_risk_rf.data.prepare --raw data/raw/framingham.csv --out data/processed

sync_data.sh pulls the Framingham CSV from Kaggle, falling back to an HF Datasets mirror (kiselyovd/framingham) if Kaggle is unavailable. prepare.py splits the data 70/15/15 into train/val/test, stratified by TenYearCHD, and writes train.parquet, val.parquet, and test.parquet under data/processed/.
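
A minimal sketch of the split, assuming prepare.py applies scikit-learn's train_test_split twice (the random_state here is illustrative, not a confirmed value):

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("data/raw/framingham.csv")

# 70% train vs 30% holdout, stratified on the target.
train, holdout = train_test_split(df, test_size=0.30,
                                  stratify=df["TenYearCHD"], random_state=42)
# Split the holdout 50/50 into val and test (15% of the data each).
val, test = train_test_split(holdout, test_size=0.50,
                             stratify=holdout["TenYearCHD"], random_state=42)

train.to_parquet("data/processed/train.parquet")
val.to_parquet("data/processed/val.parquet")
test.to_parquet("data/processed/test.parquet")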

Commands

Full training pipeline (baseline + main + evaluation + SHAP):

uv run python scripts/train_all.py

Individual stages:

# Main model only (LightGBM + Optuna)
uv run python -m cardio_risk_rf.training.train model=lgbm

# Baseline only (RandomForest + GridSearchCV)
uv run python -m cardio_risk_rf.training.train model=rf

# Evaluation (metrics + calibration plot)
uv run python -m cardio_risk_rf.evaluation.evaluate

# Global SHAP summary from main model
uv run python -m cardio_risk_rf.explain --model artifacts/main/cardio_risk_lgbm.joblib
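
Under the hood, the explain stage amounts to roughly the following sketch, assuming the joblib artifact exposes the fitted LGBMClassifier directly (if it is a Pipeline, pass its final estimator to the explainer):

import joblib
import matplotlib.pyplot as plt
import pandas as pd
import shap

model = joblib.load("artifacts/main/cardio_risk_lgbm.joblib")
X_val = pd.read_parquet("data/processed/val.parquet").drop(columns=["TenYearCHD"])

explainer = shap.TreeExplainer(model)
sv = explainer.shap_values(X_val)
if isinstance(sv, list):  # some shap versions return one array per class
    sv = sv[1]            # keep the positive (CHD) class
shap.summary_plot(sv, X_val, show=False)
plt.savefig("reports/shap_summary.png", dpi=150, bbox_inches="tight")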

Hyperparameters

LightGBM (main) — Optuna TPE, 50 trials, seed=42. Search ranges (see configs/model/lgbm.yaml):

  • num_leaves: 16–128
  • learning_rate: 0.01–0.2 (log)
  • max_depth: -1 or 3–12
  • min_child_samples: 5–60
  • reg_alpha: 1e-8–10 (log)
  • reg_lambda: 1e-8–10 (log)
  • feature_fraction: 0.6–1.0
  • bagging_fraction: 0.6–1.0

Class imbalance is handled via scale_pos_weight = N_neg / N_pos, computed on the training split. Early stopping monitors val ROC-AUC with a patience of 30 rounds. No feature scaling or imputation is applied; LightGBM handles NaN natively.
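
A sketch of the Optuna objective over the ranges above; the n_estimators cap and the data-loading details are assumptions, not confirmed values from configs/model/lgbm.yaml:

import lightgbm as lgb
import optuna
import pandas as pd
from sklearn.metrics import roc_auc_score

train = pd.read_parquet("data/processed/train.parquet")
val = pd.read_parquet("data/processed/val.parquet")
X_train, y_train = train.drop(columns=["TenYearCHD"]), train["TenYearCHD"]
X_val, y_val = val.drop(columns=["TenYearCHD"]), val["TenYearCHD"]

def objective(trial):
    # max_depth is either unlimited (-1) or searched in 3-12.
    max_depth = (trial.suggest_int("max_depth", 3, 12)
                 if trial.suggest_categorical("limit_depth", [True, False])
                 else -1)
    params = {
        "num_leaves": trial.suggest_int("num_leaves", 16, 128),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.2, log=True),
        "max_depth": max_depth,
        "min_child_samples": trial.suggest_int("min_child_samples", 5, 60),
        "reg_alpha": trial.suggest_float("reg_alpha", 1e-8, 10.0, log=True),
        "reg_lambda": trial.suggest_float("reg_lambda", 1e-8, 10.0, log=True),
        "feature_fraction": trial.suggest_float("feature_fraction", 0.6, 1.0),
        "bagging_fraction": trial.suggest_float("bagging_fraction", 0.6, 1.0),
        "bagging_freq": 1,  # bagging_fraction only applies when freq > 0
        # N_neg / N_pos on the training split, as described above.
        "scale_pos_weight": float((y_train == 0).sum() / (y_train == 1).sum()),
    }
    # n_estimators is a large cap; early stopping picks the actual count.
    model = lgb.LGBMClassifier(n_estimators=1000, random_state=42, **params)
    model.fit(
        X_train, y_train,
        eval_set=[(X_val, y_val)],
        eval_metric="auc",
        callbacks=[lgb.early_stopping(stopping_rounds=30)],  # patience 30 on val AUC
    )
    return roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])

study = optuna.create_study(direction="maximize",
                            sampler=optuna.samplers.TPESampler(seed=42))
study.optimize(objective, n_trials=50)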

RandomForest (baseline) — GridSearchCV, 5-fold stratified. Grid (see configs/model/rf.yaml):

  • n_estimators: [200, 500]
  • max_depth: [None, 8, 16]
  • min_samples_leaf: [1, 5, 10]
  • class_weight: "balanced"

Pipeline: SimpleImputer(strategy="median") → RandomForestClassifier. No StandardScaler (tree models are scale-invariant).
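
A sketch of the baseline search; the grid mirrors configs/model/rf.yaml, while the roc_auc scoring and random_state are assumptions:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline

train = pd.read_parquet("data/processed/train.parquet")
X, y = train.drop(columns=["TenYearCHD"]), train["TenYearCHD"]

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # median-impute NaNs for the forest
    ("rf", RandomForestClassifier(class_weight="balanced", random_state=42)),
])

grid = {
    "rf__n_estimators": [200, 500],
    "rf__max_depth": [None, 8, 16],
    "rf__min_samples_leaf": [1, 5, 10],
}

search = GridSearchCV(pipe, grid, scoring="roc_auc",
                      cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42))
search.fit(X, y)
print(search.best_params_, search.best_score_)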

Outputs

  • artifacts/main/cardio_risk_lgbm.joblib — main model pipeline.
  • artifacts/baseline/cardio_risk_rf.joblib — baseline pipeline.
  • reports/metrics_summary.json — ROC-AUC / PR-AUC / F1 / Brier on test (n≈636) for both models.
  • reports/calibration.png — reliability diagram on val.
  • reports/shap_summary.png — global SHAP summary bar + beeswarm on main model.
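
To inspect the metrics after a run (the JSON layout is whatever evaluate.py writes):

import json

with open("reports/metrics_summary.json") as f:
    print(json.dumps(json.load(f), indent=2))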

GPU notes

None — both LightGBM and scikit-learn RandomForest run on CPU. A typical Optuna 50-trial run completes in ~2–4 min on a modern laptop CPU. No CUDA, no torch, no accelerator config.

Framingham mirror upload (one-time, operator only)

hf repo create kiselyovd/framingham --repo-type dataset
hf upload kiselyovd/framingham data/raw/framingham.csv --repo-type dataset