
cardio-risk-rf

Production-grade tabular cardiovascular-risk classifier on the Framingham Heart Study — 10-year coronary heart disease risk prediction with LightGBM (main) and RandomForest (baseline), served through FastAPI with per-patient SHAP explanations.

At a glance

  • Dataset: sulianova Cardiovascular Disease Dataset — 70 000 patients × 11 clinical features, balanced target cardio (~50/50), stratified 70/15/15 split. Framingham (4240 rows) kept as secondary benchmark cohort.
  • Main model: LightGBM with native NaN handling, tuned by Optuna (50 trials, TPE sampler, early stopping on val).
  • Baseline: RandomForest with SimpleImputer(median) + GridSearchCV — gives a calibration reference for the main model.
  • Stack: Python 3.12 / 3.13 · scikit-learn · LightGBM · Optuna · SHAP · FastAPI · Hydra · DVC · MkDocs Material · uv.
  • Serving: FastAPI /predict returns {probability, class, threshold, shap_top5, model_version, request_id}. CPU-only — no GPU needed for training or inference.
  • Architecture: data flow, pipeline layout, Mermaid diagram, and the main-vs-baseline design decisions.
  • Training: CLI commands, Optuna/Grid hyperparameter notes, and the one-time Framingham mirror runbook.
  • Serving: /predict endpoint contract, Pydantic schemas, curl example.
  • API reference: mkdocstrings-generated reference for the cardio_risk_rf package.
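
To make the stratified 70/15/15 split from the list above concrete, here is a minimal, dependency-free sketch. It is not the project's implementation, which presumably relies on scikit-learn helpers. The function name `stratified_split`, the seed, and the toy labels are assumptions made for illustration only.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.70, 0.15, 0.15), seed=42):
    """Return (train, val, test) index lists that preserve class proportions.

    Hypothetical helper for illustration; not part of cardio_risk_rf.
    """
    rng = random.Random(seed)

    # Group row indices by class so each class is split independently.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(len(indices) * fractions[0])
        n_val = round(len(indices) * fractions[1])
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # the remainder goes to the test split
    return train, val, test


# Synthetic balanced target standing in for the `cardio` column (not real data).
labels = [0, 1] * 1000
train_idx, val_idx, test_idx = stratified_split(labels)
```

Splitting each class separately is what keeps the roughly 50/50 target ratio identical across the train, validation, and test partitions.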
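
The /predict response fields listed above can be pictured with a small standard-library sketch. The real service uses Pydantic schemas. Everything beyond the six field names is an assumption here: the field types, the shape of each shap_top5 entry, ranking by absolute SHAP value, the 0.5 default threshold, the rule that class is 1 when the probability reaches the threshold, and the placeholder version string.

```python
import uuid
from dataclasses import asdict, dataclass


@dataclass
class ShapContribution:
    feature: str
    value: float  # signed SHAP contribution (assumed entry shape)


@dataclass
class PredictResponse:
    probability: float
    predicted_class: int  # emitted as "class"; `class` is a Python keyword
    threshold: float
    shap_top5: list[ShapContribution]
    model_version: str
    request_id: str

    def to_payload(self) -> dict:
        """Convert to a JSON-ready dict using the documented field names."""
        payload = asdict(self)
        payload["class"] = payload.pop("predicted_class")
        return payload


def build_response(probability, shap_values, threshold=0.5,
                   model_version="0.0.0-example"):
    """Hypothetical helper: keep the five largest contributions by magnitude."""
    top5 = sorted(shap_values.items(), key=lambda kv: abs(kv[1]), reverse=True)[:5]
    return PredictResponse(
        probability=probability,
        predicted_class=int(probability >= threshold),
        threshold=threshold,
        shap_top5=[ShapContribution(name, value) for name, value in top5],
        model_version=model_version,
        request_id=str(uuid.uuid4()),
    )


# Made-up probability and SHAP values, for illustration only.
payload = build_response(
    0.62,
    {"age": 0.21, "ap_hi": 0.15, "cholesterol": 0.08,
     "gluc": -0.01, "smoke": 0.02, "active": -0.05},
).to_payload()
```

The Serving page remains the authoritative description of the contract; this sketch only shows how the six fields could fit together.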

Intended use and disclaimer

This model is a portfolio/research demo trained on public data: the sulianova Cardiovascular Disease Dataset, with the Framingham Heart Study subset as a secondary benchmark cohort. It is not a medical device and must not be used for clinical decision-making, diagnosis, or patient-facing risk communication. Calibration, fairness, and distribution shift have not been validated outside these cohorts. Use it only for education, ML-engineering review, and comparison against other baselines on the same datasets.