Architecture
Two independent sklearn.Pipeline artefacts sharing the same {prob, class, shap_top5} contract.
flowchart LR
CSV[Framingham CSV]:::external --> Prep[prepare.py<br/>70/15/15 stratified]:::code
Prep --> Train[train.parquet]:::data
Prep --> Val[val.parquet]:::data
Prep --> Test[test.parquet]:::data
Train --> LGBM[LightGBM<br/>+ Optuna 50]:::model
Train --> RF[RandomForest<br/>+ GridSearchCV]:::model
Val --> LGBM
Val --> Calib[Calibration plot]:::artifact
LGBM --> MainArt[artifacts/main/*.joblib]:::artifact
RF --> BaseArt[artifacts/baseline/*.joblib]:::artifact
MainArt --> API[FastAPI /predict]:::serve
BaseArt --> API
MainArt --> SHAP[Global SHAP<br/>reports/*]:::serve
Test --> Score[compute_metrics]:::code
MainArt --> Score
BaseArt --> Score
classDef external fill:#FFEBEE,stroke:#E53935,color:#B71C1C
classDef data fill:#FFCDD2,stroke:#E53935,color:#B71C1C
classDef code fill:#EF9A9A,stroke:#C62828,color:#B71C1C
classDef model fill:#EF5350,stroke:#B71C1C,color:#fff
classDef artifact fill:#E53935,stroke:#B71C1C,color:#fff
classDef serve fill:#C62828,stroke:#B71C1C,color:#fff
Key design decisions:
- Main model uses LightGBM native NaN handling — no imputer in its pipeline. Missing values at inference (including /predict) are forwarded as-is.
- Baseline uses SimpleImputer(strategy="median") because RandomForest cannot split on NaN. No StandardScaler (tree models are scale-invariant).
- Class imbalance handled by scale_pos_weight = N_neg / N_pos (LGBM) and class_weight="balanced" (RF).
- Hyperparameter search: Optuna 50 trials for LGBM (TPE sampler, seed=42, early-stopping on val), GridSearchCV short grid for RF.
- Final metric table uses test set only; calibration plot uses val to avoid test-set contamination.