Benchmarks
All numbers are on the held-out Kaggle test split (n=624). Hardware: RTX 3080 10 GB. Inference is single-image, FP32, no batching.
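
The single-image, no-batching protocol above can be sketched as a small timing harness. This is a minimal sketch, not the script that produced the table: `predict_one` is a hypothetical callable standing in for one forward pass, and using the median with a short warm-up is an assumption rather than something the repo states.

```python
import statistics
import time


def time_single_image(predict_one, images, warmup=10):
    """Return the median per-image latency in milliseconds.

    predict_one: hypothetical callable that runs one forward pass on one image.
    images: a list of inputs, fed one at a time (batch size 1).
    """
    # Warm-up calls are not timed, so one-time setup cost does not skew the result.
    for image in images[:warmup]:
        predict_one(image)

    latencies_ms = []
    for image in images:
        start = time.perf_counter()
        predict_one(image)
        # On a GPU, predict_one has to block until the result is ready
        # (for example by calling torch.cuda.synchronize()); otherwise this
        # measures only the kernel launch, not the full forward pass.
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(latencies_ms)


# Stand-in workload: summing a list plays the role of the model call.
fake_images = [list(range(1000))] * 50
print(f"{time_single_image(sum, fake_images):.4f} ms/img")
```

Swapping `sum` for a real model call keeps the measurement loop unchanged.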
Main results
| Model | Accuracy | Macro F1 | AUROC | Params | Inference (ms/img, RTX 3080) |
|---|---|---|---|---|---|
| ConvNeXt-V2-Tiny (ours, main) | 91.3% | 90.3% | 97.5% | ~28 M | ~8 ms |
| DINOv2 ViT-S linear probe (ours, baseline) | 85.6% | 84.2% | 94.2% | ~22 M | ~12 ms |
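
For readers who want to check what the three metric columns mean, here is a minimal pure-Python sketch. It assumes a binary task with labels 0 and 1 and one positive-class score per image. The function names are illustrative and are not taken from this repo's evaluation code.

```python
def accuracy(y_true, y_pred):
    """Fraction of images whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for cls in sorted(set(y_true)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def auroc(y_true, y_score):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. Assumes labels are 0 (negative) and 1 (positive).
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data, not taken from the benchmark.
y_true = [0, 0, 1, 1, 1, 0]
y_score = [0.1, 0.6, 0.8, 0.4, 0.9, 0.2]
y_pred = [int(s >= 0.5) for s in y_score]

print(accuracy(y_true, y_pred), macro_f1(y_true, y_pred), auroc(y_true, y_score))
```

The pairwise AUROC is quadratic in the number of images, which is fine at n=624; a rank-based implementation would be the usual choice for larger test sets.
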
Literature context
Commonly reported accuracies on the Kaggle Pneumonia dataset. Test-set definitions and augmentation vary between papers, so treat these as approximate context rather than a like-for-like comparison:
| Model | Accuracy | Source |
|---|---|---|
| ResNet-50 fine-tuned | ~88.0% | Rajpurkar et al., 2017 (CheXNet-style setup) |
| DenseNet-121 | ~89.5% | Published replications 2018-2021 |
| EfficientNet-B0 | ~90.0% | Published replications 2020-2022 |
| ConvNeXt-V2-Tiny (ours) | 91.3% | This repo, v0.1.0 |
Our main model is competitive with the best reported numbers while being trained end-to-end on a single RTX 3080 in under 90 minutes with a 20-epoch budget.
Trade-offs
- Main vs baseline (ConvNeXt-V2 vs DINOv2 linear probe): the baseline is a deliberate "how far do frozen features get us" reference. It trails by 5.7pp on accuracy (85.6% vs 91.3%) but trains in under 10 minutes, which makes it a useful sanity check whenever you re-run the main training.
- ConvNeXt-V2 vs ViT: ConvNeXt-V2-Tiny is ~28M params vs a comparable ViT-S's ~22M, but the convolutional inductive bias helps on small (~5k-image) datasets like this one. We tested both; ConvNeXt-V2 wins by ~2pp.
- Why not an ensemble: a 2-3 model ensemble reliably adds ~1-2pp on this dataset, but at 2-3x the inference cost, which is not worth it for a portfolio deployment. If you need to push accuracy past 93%, start there.
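
The ensemble trade-off can be made concrete with a small sketch. It assumes soft voting, meaning each model's positive-class probabilities are averaged. The repo does not specify an ensembling method, so treat this as one common option rather than the recommended one. The model names and probabilities below are made up for illustration.

```python
def soft_vote(prob_lists):
    """Average per-model positive-class probabilities, image by image.

    prob_lists: one list of probabilities per model, all of the same length.
    """
    n_models = len(prob_lists)
    return [sum(ps) / n_models for ps in zip(*prob_lists)]


def to_labels(probs, threshold=0.5):
    """Threshold averaged probabilities into 0/1 predictions."""
    return [int(p >= threshold) for p in probs]


# Hypothetical outputs of three models on three images.
model_a = [0.9, 0.4, 0.2]
model_b = [0.7, 0.7, 0.1]
model_c = [0.8, 0.3, 0.4]

averaged = soft_vote([model_a, model_b, model_c])
print(to_labels(averaged))  # [1, 0, 0]
```

Every model in the list needs its own forward pass per image, which is where the 2-3x inference cost comes from.
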
Reproducing these numbers
See REPRODUCIBILITY.md for the one-command re-run. Expected variation: ± 0.5% from floating-point noise across CUDA driver versions.
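
To judge whether a re-run falls inside the expected variation, a check along these lines may help. It is a sketch that reads "± 0.5%" as ± 0.5 percentage points of accuracy, which is an assumption. The constants come from this page, and the function names are illustrative.

```python
N_TEST = 624          # test-split size stated at the top of this page
REPORTED_ACC = 0.913  # ConvNeXt-V2-Tiny accuracy from the main results table
TOLERANCE = 0.005     # assumed meaning of "+/- 0.5%": 0.5 percentage points


def within_tolerance(measured, reported=REPORTED_ACC, tol=TOLERANCE):
    """True if a re-run's accuracy is within the expected variation."""
    return abs(measured - reported) <= tol


def tolerance_in_images(n=N_TEST, tol=TOLERANCE):
    """How many test images the tolerance corresponds to."""
    return tol * n


print(round(tolerance_in_images(), 2))  # 3.12
print(within_tolerance(570 / N_TEST))   # True
print(within_tolerance(0.90))           # False
```

On a 624-image test set the tolerance corresponds to roughly three images, so a re-run that differs by more than that is worth investigating.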