Training
Prerequisites
uv sync --all-groups
bash scripts/sync_data.sh # CarFusion COCO export -> data/raw/
uv run python -m vehicle_keypoints.data.prepare \
--raw data/raw --out data/processed
Hardware: an RTX 3080 (10 GB VRAM) or better is recommended. The YOLO branch fits in ~6 GB at the default batch size; the ViTPose branch uses ~4 GB. Both branches expect data/raw/ to be populated (CarFusion annotations + images) and data/processed/ to have been generated by prepare.py (the YOLO layout).
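A quick pre-flight check can catch a missing data directory before a long run starts. The sketch below is not part of the repo: missing_prerequisites is a hypothetical helper, and it only assumes the two directory names mentioned above.

```python
from pathlib import Path


def missing_prerequisites(root: str = ".") -> list[str]:
    """Return the expected data directories that are missing or empty.

    Hypothetical helper; the directory names come from the prerequisites above.
    """
    expected = ["data/raw", "data/processed"]
    missing = []
    for rel in expected:
        path = Path(root) / rel
        # A directory counts as populated if it exists and has at least one entry.
        if not path.is_dir() or not any(path.iterdir()):
            missing.append(rel)
    return missing


if __name__ == "__main__":
    problems = missing_prerequisites()
    if problems:
        print("Missing or empty:", ", ".join(problems))
    else:
        print("Data directories look populated.")
```

It checks only that the directories exist and are non-empty, not that their contents are valid.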
Main - YOLO26-pose
uv run python -m vehicle_keypoints.training.train +experiment=sota trainer.max_epochs=100
Expected wall time: ~60-120 minutes on an RTX 3080 (batch 16, 640², 100 epochs). Checkpoints land in artifacts/<run>/weights/best.pt. Ultralytics writes its own results.csv and TensorBoard event files alongside - point TensorBoard at artifacts/ to browse live.
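To pull the best epoch out of results.csv without opening TensorBoard, something like the following may help. It is a sketch under assumptions: best_epoch is a hypothetical helper, and the default column name metrics/mAP50-95(P) is what Ultralytics pose runs typically write, so check the header of your own file. Header names are stripped because some Ultralytics versions pad them with spaces.

```python
import csv
from pathlib import Path


def best_epoch(results_csv, metric="metrics/mAP50-95(P)"):
    """Return (epoch, value) for the row with the highest `metric`.

    The default metric name is an assumption; verify it against your CSV header.
    """
    with Path(results_csv).open(newline="") as fh:
        rows = [
            # Strip padding around header names and cell values.
            {k.strip(): (v or "").strip() for k, v in row.items() if k is not None}
            for row in csv.DictReader(fh)
        ]
    if not rows:
        raise ValueError(f"{results_csv} contains no epochs")
    if metric not in rows[0]:
        raise KeyError(f"{metric!r} not found; columns are {sorted(rows[0])}")
    best = max(rows, key=lambda r: float(r[metric]))
    return int(float(best["epoch"])), float(best[metric])
```

Usage would look like best_epoch("artifacts/<run>/results.csv"), with <run> replaced by the actual run directory.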
Baseline - ViTPose-S
uv run python -m vehicle_keypoints.training.train_vitpose \
model=baseline \
data.vitpose.train_images=data/raw/train \
data.vitpose.train_annotations=data/raw/annotations/car_keypoints_train.json \
data.vitpose.val_images=data/raw/train \
data.vitpose.val_annotations=data/raw/annotations/car_keypoints_train.json \
trainer.max_epochs=30 \
trainer.output_dir=artifacts/baseline
Expected wall time: ~30-60 minutes on an RTX 3080 (batch 32, 256×192 top-down crops, 30 epochs). Checkpoint written to artifacts/baseline/checkpoints/best.ckpt.
MLflow tracking
MLflow is wired into the ViTPose branch only. Launch the UI with mlflow ui --backend-store-uri ./mlruns and browse to http://localhost:5000. Each Hydra run maps to one MLflow run, with the resolved config logged as params and train/loss, val/loss, val/pck, and val/oks logged as metrics.
Ultralytics (the YOLO branch) does not use MLflow by default - it emits its own results.csv + TensorBoard logs under artifacts/<run>/. This split is intentional: MLflow would mostly duplicate what Ultralytics already provides for YOLO runs, and leaving it out keeps the YOLO command free of unnecessary plumbing.
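On the ViTPose side, logging a resolved config as params usually means flattening the nested config into dotted keys first, since MLflow params are flat key/value pairs. The helper below is a hypothetical sketch of that step, not the code train_vitpose actually uses.

```python
def flatten_config(cfg: dict, prefix: str = "") -> dict:
    """Flatten a nested dict into dotted keys, e.g. {"trainer": {"max_epochs": 30}}
    becomes {"trainer.max_epochs": 30}. Hypothetical helper for illustration."""
    flat = {}
    for key, value in cfg.items():
        name = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            # Recurse into sub-sections, carrying the dotted prefix along.
            flat.update(flatten_config(value, name))
        else:
            flat[name] = value
    return flat


# Assumed wiring (not executed here): convert the Hydra config to a plain dict,
# flatten it, and hand it to MLflow.
#   params = flatten_config(OmegaConf.to_container(cfg, resolve=True))
#   mlflow.log_params(params)
```

The dotted keys match the override syntax used on the command line, which makes runs easier to compare in the MLflow UI.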
Hydra overrides (ViTPose branch)
| Override | Effect |
|---|---|
| trainer.max_epochs=50 | Longer training |
| data.batch_size=16 | Smaller batches for low-VRAM cards |
| model.lr=1e-4 | Different learning rate |
Multi-run sweep example:
uv run python -m vehicle_keypoints.training.train_vitpose -m \
model.lr=1e-5,3e-5,1e-4 \
data.batch_size=16,32
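With Hydra's default (basic) sweeper, -m expands comma-separated values into their Cartesian product, so the command above launches 3 × 2 = 6 jobs. The sketch below enumerates those combinations; expand_sweep is a hypothetical helper and the job order shown is illustrative.

```python
from itertools import product

# Values copied from the sweep command above.
sweep = {
    "model.lr": ["1e-5", "3e-5", "1e-4"],
    "data.batch_size": ["16", "32"],
}


def expand_sweep(sweep: dict) -> list[list[str]]:
    """Return one list of key=value overrides per grid point."""
    keys = list(sweep)
    return [
        [f"{k}={v}" for k, v in zip(keys, combo)]
        for combo in product(*(sweep[k] for k in keys))
    ]


jobs = expand_sweep(sweep)
for i, overrides in enumerate(jobs):
    print(i, " ".join(overrides))
```

Since each job is a full training run, the total wall time is roughly six times a single run unless jobs execute in parallel.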
GPU memory
| Branch | Config | Peak VRAM |
|---|---|---|
| YOLO26-pose (n size) | batch 16 × 640² | ~6 GB |
| ViTPose-S | batch 32 × 256×192 | ~4 GB |
Reduce data.batch_size (or trainer.batch= for YOLO) if you hit OOM on a smaller card.
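If you would rather not tune the batch size by hand, a retry loop that halves it on OOM is one option. This is a sketch, not something the repo provides: run_with_batch_backoff and train_fn are hypothetical names, and the check relies on PyTorch OOM errors being RuntimeError subclasses whose message contains "out of memory".

```python
def run_with_batch_backoff(train_fn, batch_size, min_batch_size=1):
    """Call train_fn(batch_size), halving the batch size after each OOM error.

    Returns (batch_size_used, result). Non-OOM errors are re-raised unchanged.
    """
    while True:
        try:
            return batch_size, train_fn(batch_size)
        except RuntimeError as err:
            is_oom = "out of memory" in str(err).lower()
            if not is_oom or batch_size // 2 < min_batch_size:
                raise
            # In a real run you would likely also call torch.cuda.empty_cache()
            # here before retrying.
            batch_size //= 2
```

A smaller batch changes the effective learning dynamics, so results from a backed-off run are not directly comparable to the default configuration.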
Transfer-learning note (ViTPose)
ViTPose-S ships pretrained on the COCO human 17-keypoint task. We re-head it to 14 car keypoints by replacing the final deconv layer with a fresh 14-channel head. Expect the first few epochs to show unstable loss while the new head learns from random init - this is normal. The backbone transfers well (car textures are in-distribution for a COCO-pretrained ViT), but the head must start from scratch, so give early stopping generous patience.
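One way to avoid stopping during the noisy first epochs is an early-stopping rule with a warmup window. The class below is a minimal sketch: the name WarmupEarlyStopping and the default patience and warmup values are assumptions, not settings from this repo.

```python
class WarmupEarlyStopping:
    """Early stopping on validation loss that ignores the first warmup epochs.

    Hypothetical helper; the defaults are illustrative, not tuned values.
    """

    def __init__(self, patience=10, warmup_epochs=5, min_delta=0.0):
        self.patience = patience
        self.warmup_epochs = warmup_epochs
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def should_stop(self, epoch: int, val_loss: float) -> bool:
        if val_loss < self.best - self.min_delta:
            # Improvement: record it and reset the counter.
            self.best = val_loss
            self.bad_epochs = 0
            return False
        if epoch < self.warmup_epochs:
            # The new head is still settling; do not count this epoch against patience.
            return False
        self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

In a training loop, call should_stop(epoch, val_loss) once per validation pass and break when it returns True.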