cardio-risk-rf
Production-grade tabular cardiovascular-risk classifier on the Framingham Heart Study — 10-year coronary heart disease risk prediction with LightGBM (main) and RandomForest (baseline), served through FastAPI with per-patient SHAP explanations.
At a glance
- Dataset: sulianova Cardiovascular Disease Dataset — 70 000 patients × 11 clinical features, balanced target
cardio(~50/50), stratified 70/15/15 split. Framingham (4240 rows) kept as secondary benchmark cohort. - Main model: LightGBM with native NaN handling, tuned by Optuna (50 trials, TPE sampler, early stopping on val).
- Baseline: RandomForest with
SimpleImputer(median)+GridSearchCV— gives a calibration reference for the main model. - Stack: Python 3.12 / 3.13 · scikit-learn · LightGBM · Optuna · SHAP · FastAPI · Hydra · DVC · MkDocs Material · uv.
- Serving: FastAPI
/predictreturns{probability, class, threshold, shap_top5, model_version, request_id}. CPU-only — no GPU needed for training or inference.
Navigation
- Architecture — data flow, pipeline layout, mermaid diagram, and the main-vs-baseline design decisions.
- Training — CLI commands, Optuna/Grid hyperparameter notes, and the one-time Framingham mirror runbook.
- Serving —
/predictendpoint contract, Pydantic schemas, curl example. - API reference — mkdocstrings-generated reference for the
cardio_risk_rfpackage.
Links
- GitHub: kiselyovd/cardio-risk-rf
- Hugging Face model: kiselyovd/cardio-risk-rf
- Russian README: README.ru.md
- Template: kiselyovd/ml-project-template
Intended use and disclaimer
This model is a portfolio/research demo trained on the public Framingham Heart Study subset. It is not a medical device and must not be used for clinical decision-making, diagnosis, or patient-facing risk communication. Calibration, fairness, and distribution shift have not been validated outside the original cohort. Use only for educational purposes, ML-engineering review, and to compare against other baselines on the same dataset.