API Reference
Reference for the cardio_risk_rf package, auto-generated with mkdocstrings. Each module below lists its public classes and functions with their docstrings and type signatures.
Package root
cardio_risk_rf
Production-grade cardiovascular risk tabular classifier.
Data
cardio
sulianova Cardiovascular Disease dataset loader + stratified split.
Canonical main-pipeline dataset for cardio-risk-rf. 70000 patients,
binary target cardio at ~50/50 positive rate. See
https://www.kaggle.com/datasets/sulianova/cardiovascular-disease-dataset
Raw columns (source CSV): id;age;gender;height;weight;ap_hi;ap_lo;cholesterol;gluc;smoke;alco;active;cardio
- id is dropped on load.
- age is converted from days (18000-23000) to integer years (32-65) during load so downstream pipelines, SHAP plots, and the serving API all work in a human-readable unit.
- gender is left as 1/2 (source encoding); LightGBM and RandomForest treat it as a numeric feature without issue.
- cholesterol and gluc are ordinal (1=normal, 2=above normal, 3=well above normal); kept as numeric for trees.
- smoke, alco, active are binary 0/1.
- ap_hi / ap_lo are systolic/diastolic BP (mmHg).
- Target cardio is 0 (no CVD) / 1 (CVD).
Functions
load_cardio
load_cardio(csv_path: str | Path) -> pd.DataFrame
Read the sulianova Cardiovascular Disease CSV.
The source file is semicolon-separated (;); some Kaggle forks re-export it
as comma-separated (,). Both delimiters are auto-detected. id is dropped,
age is converted from days to integer years, and the column order is
stabilised to FEATURES + [TARGET].
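The loading steps described above can be sketched roughly as follows. This is an illustration, not the package's implementation: the name load_cardio_sketch, the FEATURES/TARGET values (inferred from the raw column list), and the floor division by 365 for the age conversion are all assumptions.

```python
import pandas as pd

# Assumed constants; the real module defines its own FEATURES and TARGET.
FEATURES = [
    "age", "gender", "height", "weight", "ap_hi", "ap_lo",
    "cholesterol", "gluc", "smoke", "alco", "active",
]
TARGET = "cardio"


def load_cardio_sketch(csv_source) -> pd.DataFrame:
    """Hypothetical loader: sniff the delimiter, drop id, convert age to years."""
    # sep=None with the python engine asks pandas to sniff ';' vs ','.
    df = pd.read_csv(csv_source, sep=None, engine="python")
    # Drop the row identifier if the file carries one.
    df = df.drop(columns=["id"], errors="ignore")
    # Assumption: floor division by 365; the package may round differently.
    df["age"] = (df["age"] // 365).astype(int)
    # Fix the column order so downstream code can rely on it.
    return df[FEATURES + [TARGET]]
```

csv_source may be a path or any file-like object, so the same function works on an in-memory io.StringIO for quick checks.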
Source code in src/cardio_risk_rf/data/cardio.py
split_stratified
split_stratified(
df: DataFrame,
*,
seed: int = 42,
train_ratio: float = 0.7,
val_ratio: float = 0.15,
) -> tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]
Return (train, val, test) with stratification on the target column.
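A three-way stratified split with this signature can be sketched with two calls to scikit-learn's train_test_split. The function name, the target parameter, and the choice to convert ratios to row counts up front are assumptions, not the package's code.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_stratified_sketch(
    df: pd.DataFrame,
    *,
    target: str = "cardio",  # assumed target column name
    seed: int = 42,
    train_ratio: float = 0.7,
    val_ratio: float = 0.15,
) -> tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    """Hypothetical (train, val, test) split, stratified on `target`."""
    n = len(df)
    # Integer row counts avoid float round-off in the second split.
    n_train = round(train_ratio * n)
    n_val = round(val_ratio * n)
    if n - n_train - n_val <= 0:
        raise ValueError("train_ratio + val_ratio leaves no rows for test")
    # First split: carve out train, keeping the class balance.
    train, rest = train_test_split(
        df, train_size=n_train, stratify=df[target], random_state=seed
    )
    # Second split: divide the remainder into val and test.
    val, test = train_test_split(
        rest, train_size=n_val, stratify=rest[target], random_state=seed
    )
    return train, val, test
```

With the default ratios the test split receives the remaining 15% of the rows.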
Source code in src/cardio_risk_rf/data/cardio.py
dataset
Dataset implementations.
Functions
load_dataset
load_dataset(csv_path: Path | str) -> pd.DataFrame
Load a CSV into a dataframe.
Source code in src/cardio_risk_rf/data/dataset.py
framingham
Framingham Heart Study loader + stratified split.
Dataset columns (as published on Kaggle): male, age, education, currentSmoker, cigsPerDay, BPMeds, prevalentStroke, prevalentHyp, diabetes, totChol, sysBP, diaBP, BMI, heartRate, glucose, TenYearCHD (target).
Functions
load_framingham
load_framingham(csv_path: str | Path) -> pd.DataFrame
Read the Framingham CSV and return it with a stable column order.
Source code in src/cardio_risk_rf/data/framingham.py
split_stratified
split_stratified(
df: DataFrame,
*,
seed: int = 42,
train_ratio: float = 0.7,
val_ratio: float = 0.15,
) -> tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]
Return (train, val, test) with stratification on the target column.
Source code in src/cardio_risk_rf/data/framingham.py
prepare
CLI to produce data/processed/{train,val,test}.parquet from raw CSV.
Models
factory
Build the two production sklearn Pipelines: main (LGBM) + baseline (RF).
Functions
build_main
build_main(
*, scale_pos_weight: float, random_state: int = 42, **lgbm_overrides: Any
) -> Pipeline
LightGBM classifier; passthrough preprocessing — native NaN handling.
Source code in src/cardio_risk_rf/models/factory.py
build_baseline
build_baseline(*, random_state: int = 42, **rf_overrides: Any) -> Pipeline
RandomForest baseline; median imputation because RF cannot split on NaN.
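A baseline pipeline of this shape, median imputation followed by a random forest, might look like the sketch below. The step names ("impute", "clf") and the default n_estimators are assumptions; only the signature comes from the reference above.

```python
from typing import Any

from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline


def build_baseline_sketch(*, random_state: int = 42, **rf_overrides: Any) -> Pipeline:
    """Hypothetical RF baseline: impute medians, then fit a forest."""
    # Assumed defaults; caller-supplied overrides win.
    params: dict[str, Any] = {"n_estimators": 200, "random_state": random_state}
    params.update(rf_overrides)
    return Pipeline(
        steps=[
            # Replace NaN with the per-column median learned at fit time.
            ("impute", SimpleImputer(strategy="median")),
            ("clf", RandomForestClassifier(**params)),
        ]
    )
```

Keeping the imputer inside the Pipeline means the medians are learned from the training split only and reused unchanged at inference time.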
Source code in src/cardio_risk_rf/models/factory.py
sklearn_pipeline
scikit-learn pipeline builder.
Functions
build_pipeline
build_pipeline(model_name: str = 'lgbm', **model_params: Any) -> Pipeline
Build an sklearn Pipeline by model name.
Source code in src/cardio_risk_rf/models/sklearn_pipeline.py
Training
train
Training orchestration for main (LightGBM + Optuna) and baseline (RF + GridSearchCV).
Evaluation
calibration
Reliability diagram for probabilistic binary predictions.
Functions
save_calibration_plot
save_calibration_plot(
y_true: ndarray,
probs: ndarray,
out_path: str | Path,
*,
bins: int = 10,
title: str = "Calibration curve",
) -> None
Write a reliability-diagram PNG.
Source code in src/cardio_risk_rf/evaluation/calibration.py
evaluate
CLI: score a trained model on a Parquet split.
metrics
Binary classification metrics for the tabular pipeline.
Functions
compute_metrics
compute_metrics(
y_true: ndarray, probs: ndarray, *, threshold: float = 0.5
) -> dict[str, Any]
Return a flat dict with ROC-AUC / PR-AUC / F1 / Brier for reporting.
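The four metrics named above map directly onto scikit-learn functions. The sketch below assumes PR-AUC is computed as average precision and that the dict keys are roc_auc, pr_auc, f1, brier, and threshold; the package may use different names.

```python
from typing import Any

import numpy as np
from sklearn.metrics import (
    average_precision_score,
    brier_score_loss,
    f1_score,
    roc_auc_score,
)


def compute_metrics_sketch(
    y_true: np.ndarray, probs: np.ndarray, *, threshold: float = 0.5
) -> dict[str, Any]:
    """Hypothetical flat metrics dict for a binary classifier."""
    # Hard labels are needed only for F1; the other metrics use probabilities.
    preds = (probs >= threshold).astype(int)
    return {
        "roc_auc": float(roc_auc_score(y_true, probs)),
        # Assumption: PR-AUC reported as average precision.
        "pr_auc": float(average_precision_score(y_true, probs)),
        "f1": float(f1_score(y_true, preds)),
        "brier": float(brier_score_loss(y_true, probs)),
        "threshold": threshold,
    }
```

Only F1 depends on threshold; ROC-AUC, PR-AUC, and Brier are computed from the raw probabilities.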
Source code in src/cardio_risk_rf/evaluation/metrics.py
summary
Merge main + baseline metrics into a single summary JSON.
Inference
predict
Inference CLI — load a checkpoint and predict on input(s).
Functions
load_model
load_model(path: str | Path) -> Any
Load a joblib checkpoint from disk.
Source code in src/cardio_risk_rf/inference/predict.py
predict
predict(model: Any, features: dict[str, Any]) -> dict[str, Any]
Run inference on a single feature mapping and return pred + class probabilities.
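Single-row inference against a scikit-learn style model could be sketched as below. The output keys ("pred", "probs") and the function name are assumptions; the sketch only requires that the model expose predict_proba and classes_.

```python
from typing import Any

import numpy as np
import pandas as pd


def predict_sketch(model: Any, features: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical single-row inference returning label and class probabilities."""
    # A one-row DataFrame keeps feature names attached to the values.
    row = pd.DataFrame([features])
    probs = model.predict_proba(row)[0]
    # Map the highest-probability column back to its class label.
    pred = model.classes_[int(np.argmax(probs))]
    return {
        "pred": int(pred),
        "probs": {str(c): float(p) for c, p in zip(model.classes_, probs)},
    }
```

Converting NumPy scalars to plain int and float keeps the result JSON-serialisable, which matters for a CLI that prints JSON.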
Source code in src/cardio_risk_rf/inference/predict.py
main
main() -> None
CLI entry point — parse args, load model, predict, print JSON.
Source code in src/cardio_risk_rf/inference/predict.py
Explainability
explain
SHAP wrappers for the LightGBM main model and RF baseline.
TreeExplainer supports both natively. For global reports we pass a sample of val or test; for per-instance serving we pass a single row.
Serving
dependencies
Dependency injection — singleton model loader.
Functions
get_model
cached
get_model() -> Any
Singleton-load the checkpoint from MODEL_PATH.
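A cached singleton loader is commonly written with functools.lru_cache, as in this sketch. It assumes MODEL_PATH is read from an environment variable and that the default path shown is hypothetical; the package may resolve the path differently.

```python
import os
from functools import lru_cache
from typing import Any

import joblib


@lru_cache(maxsize=1)
def get_model_sketch() -> Any:
    """Hypothetical singleton loader: the checkpoint is read at most once."""
    # Assumption: MODEL_PATH is an environment variable; the default is made up.
    path = os.environ.get("MODEL_PATH", "models/main.joblib")
    return joblib.load(path)
```

Every call after the first returns the same object, so request handlers can depend on it without re-reading the file. Calling get_model_sketch.cache_clear() forces a reload, which is handy in tests.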
Source code in src/cardio_risk_rf/serving/dependencies.py
errors
Exception types and handlers.
main
FastAPI application.
Functions
lifespan
async
lifespan(app: FastAPI) -> AsyncIterator[None]
FastAPI lifespan — eagerly load the model (best-effort) on startup.
Source code in src/cardio_risk_rf/serving/main.py
add_request_id
async
add_request_id(
request: Request, call_next: Callable[[Request], Awaitable[Response]]
) -> Response
Middleware — inject a UUID request id and echo it back as X-Request-ID.
Source code in src/cardio_risk_rf/serving/main.py
routes
FastAPI routes: /health and /predict (main + baseline by query param).
schemas
Pydantic request/response schemas for /predict.
Fields match the sulianova Cardiovascular Disease Dataset schema
(11 features + binary cardio target). age is in years (source data
is in days; converted at load time in data/cardio.py).
Classes
PatientFeatures
Bases: BaseModel
Single-patient feature payload; 11 sulianova cardio-risk features.
ShapEntry
Bases: BaseModel
Single SHAP contribution for one feature.
PredictionResponse
Bases: BaseModel
Response schema for /predict.
Classes
Pydantic configuration — allow populating cls by field or alias.
Utilities
hf_hub
HuggingFace Hub helpers.
logging
Structured logging configuration.
Functions
configure_logging
configure_logging(level: str = 'INFO', json_output: bool = False) -> None
Initialise stdlib + structlog with JSON or console rendering.
Source code in src/cardio_risk_rf/utils/logging.py
get_logger
get_logger(name: str | None = None) -> structlog.stdlib.BoundLogger
Return a structlog bound logger; call after configure_logging.
Source code in src/cardio_risk_rf/utils/logging.py
seed
Deterministic seeding across libraries.
Functions
seed_everything
seed_everything(seed: int = 42) -> None
Seed Python, NumPy, and (optionally) PyTorch for deterministic behaviour.
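Seeding of this kind usually amounts to a few library calls. The sketch below treats PyTorch as optional by importing it lazily; the function name and the CUDA handling are assumptions.

```python
import random

import numpy as np


def seed_everything_sketch(seed: int = 42) -> None:
    """Hypothetical seeding helper for Python, NumPy, and optionally PyTorch."""
    random.seed(seed)
    np.random.seed(seed)
    try:
        import torch  # optional dependency
    except ImportError:
        return  # PyTorch not installed; nothing more to seed
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)
```

Calling the helper twice with the same seed should reproduce the same random draws from both random and numpy.random.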
Source code in src/cardio_risk_rf/utils/seed.py