Training

Prerequisites

uv sync --all-groups
bash scripts/sync_data.sh            # CarFusion COCO export -> data/raw/
uv run python -m vehicle_keypoints.data.prepare \
  --raw data/raw --out data/processed

Hardware: an RTX 3080 (10 GB VRAM) or better is recommended. The YOLO branch fits in ~6 GB at the default batch size; the ViTPose branch uses ~4 GB. Both branches expect data/raw/ populated (CarFusion annotations + images) and data/processed/ generated by prepare.py for the YOLO layout.
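
Before starting a long run it can help to confirm that these inputs exist. The following is a minimal sketch, not part of the package: the helper name missing_inputs is hypothetical, and the path list is assumed from the commands on this page (data/raw/train, the CarFusion annotation file, and data/processed).

```python
from pathlib import Path

# Paths assumed from the commands on this page; the helper itself is hypothetical.
REQUIRED_PATHS = (
    "data/raw/train",
    "data/raw/annotations/car_keypoints_train.json",
    "data/processed",
)


def missing_inputs(project_root="."):
    """Return the required paths that do not exist under project_root."""
    root = Path(project_root)
    return [p for p in REQUIRED_PATHS if not (root / p).exists()]


if __name__ == "__main__":
    missing = missing_inputs()
    if missing:
        print("Missing inputs:", ", ".join(missing))
    else:
        print("All training inputs found.")
```

An empty result means both branches should find their data; anything listed points back to sync_data.sh or prepare.py.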

Main - YOLO26-pose

uv run python -m vehicle_keypoints.training.train +experiment=sota trainer.max_epochs=100

Expected wall time: ~60-120 minutes on an RTX 3080 (batch 16, 640², 100 epochs). The best checkpoint is written to artifacts/<run>/weights/best.pt. Ultralytics also writes its own results.csv and TensorBoard event files to the same run directory; point TensorBoard at artifacts/ to follow training live.
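
To pick out the strongest epoch without opening TensorBoard, results.csv can be parsed directly. This is a rough sketch under two assumptions: that the file has an epoch column, and that the pose metric column is named metrics/mAP50-95(P), which may differ between Ultralytics versions. The sample rows are synthetic and only demonstrate the parsing.

```python
import csv
import io


def best_epoch(csv_text, metric="metrics/mAP50-95(P)"):
    """Return (epoch, value) for the row with the highest value of `metric`."""
    reader = csv.DictReader(io.StringIO(csv_text))
    best = None
    for row in reader:
        # Some Ultralytics versions pad column names with spaces; strip them.
        clean = {k.strip(): v.strip() for k, v in row.items()}
        value = float(clean[metric])
        if best is None or value > best[1]:
            best = (int(float(clean["epoch"])), value)
    return best


# Synthetic three-epoch log, for illustration only (not real results).
sample = (
    "epoch, metrics/mAP50-95(P)\n"
    "1, 0.10\n"
    "2, 0.35\n"
    "3, 0.30\n"
)
print(best_epoch(sample))
```

For a real run, pass the contents of artifacts/<run>/results.csv instead of the sample string.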

Baseline - ViTPose-S

uv run python -m vehicle_keypoints.training.train_vitpose \
  model=baseline \
  data.vitpose.train_images=data/raw/train \
  data.vitpose.train_annotations=data/raw/annotations/car_keypoints_train.json \
  data.vitpose.val_images=data/raw/train \
  data.vitpose.val_annotations=data/raw/annotations/car_keypoints_train.json \
  trainer.max_epochs=30 \
  trainer.output_dir=artifacts/baseline

Expected wall time: ~30-60 minutes on an RTX 3080 (batch 32, 256×192 top-down crops, 30 epochs). Checkpoint written to artifacts/baseline/checkpoints/best.ckpt.

MLflow tracking

MLflow is wired into the ViTPose branch only. Each Hydra run maps to one MLflow run, with the resolved config logged as params and train/loss, val/loss, val/pck, and val/oks logged as metrics. To browse runs, start the UI with mlflow ui --backend-store-uri ./mlruns and open http://localhost:5000.
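
Logging a nested config as params usually means flattening it into dotted keys first. The sketch below shows one way to do that; flatten_config is a hypothetical helper rather than the project's actual code, and the config values are illustrative.

```python
def flatten_config(cfg, prefix=""):
    """Flatten a nested dict into {"a.b.c": value} pairs."""
    flat = {}
    for key, value in cfg.items():
        name = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            flat.update(flatten_config(value, name))
        else:
            flat[name] = value
    return flat


# Illustrative config reusing keys from the overrides on this page.
cfg = {
    "trainer": {"max_epochs": 30},
    "data": {"batch_size": 32},
    "model": {"lr": 1e-4},
}
params = flatten_config(cfg)
print(params)
# With MLflow installed, this dict could be passed to mlflow.log_params(params)
# inside a `with mlflow.start_run():` block.
```

The dotted keys match the Hydra override syntax, so a param seen in the MLflow UI can be copied straight into a command line.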

Ultralytics (the YOLO branch) does not use MLflow by default; it emits its own results.csv and TensorBoard logs under artifacts/<run>/. This split is intentional: MLflow would mostly duplicate what Ultralytics already provides for YOLO runs, and leaving it out keeps the YOLO command free of unnecessary plumbing.

Hydra overrides (ViTPose branch)

Override                 Effect
trainer.max_epochs=50    Longer training
data.batch_size=16       Smaller batches for low-VRAM cards
model.lr=1e-4            Different learning rate

Multi-run sweep example:

uv run python -m vehicle_keypoints.training.train_vitpose -m \
  model.lr=1e-5,3e-5,1e-4 \
  data.batch_size=16,32
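
Hydra's default sweeper runs the Cartesian product of the comma-separated values, so the command above launches 3 × 2 = 6 jobs. The sketch below enumerates those combinations in plain Python to make the cost of a sweep visible up front; expand_sweep is a hypothetical helper, and the numbering it prints is illustrative and may not match Hydra's own job order.

```python
from itertools import product

# Values copied from the sweep command above.
sweep = {
    "model.lr": ["1e-5", "3e-5", "1e-4"],
    "data.batch_size": ["16", "32"],
}


def expand_sweep(sweep):
    """Return one list of key=value overrides per combination in the sweep."""
    keys = list(sweep)
    return [
        [f"{k}={v}" for k, v in zip(keys, combo)]
        for combo in product(*(sweep[k] for k in keys))
    ]


jobs = expand_sweep(sweep)
for i, overrides in enumerate(jobs):
    print(i, " ".join(overrides))
print(len(jobs), "jobs")
```

At the 30-60 minute estimate for a single baseline run, a six-job sweep takes several hours when executed sequentially on one GPU.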

GPU memory

Branch                  Config                 Peak VRAM
YOLO26-pose (n size)    batch 16 × 640²        ~6 GB
ViTPose-S               batch 32 × 256×192     ~4 GB

If you hit an out-of-memory (OOM) error on a smaller card, reduce data.batch_size for ViTPose or trainer.batch= for YOLO.
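
Finding a batch size that fits can be automated by halving after each OOM failure. This is a minimal sketch, not something the training scripts do themselves: run_with_backoff and fake_train are hypothetical names, and the check assumes the OOM surfaces as a RuntimeError whose message contains "out of memory", which is how PyTorch typically reports it.

```python
def run_with_backoff(train_fn, batch_size, min_batch_size=1):
    """Call train_fn(batch_size), halving batch_size after each OOM error."""
    while batch_size >= min_batch_size:
        try:
            return batch_size, train_fn(batch_size)
        except RuntimeError as err:
            # Assumption: OOM shows up as a RuntimeError mentioning "out of memory".
            if "out of memory" not in str(err).lower():
                raise  # unrelated failure, do not hide it
            batch_size //= 2
    raise RuntimeError("out of memory even at the minimum batch size")


def fake_train(batch_size):
    # Stand-in for a real training call; pretends anything above 8 does not fit.
    if batch_size > 8:
        raise RuntimeError("CUDA out of memory (simulated)")
    return "ok"


print(run_with_backoff(fake_train, 32))
```

In practice train_fn would wrap one of the training commands above with the batch size passed through as an override.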

Transfer-learning note (ViTPose)

ViTPose-S ships pretrained on the COCO human 17-keypoint task. We re-head it to 14 car keypoints by replacing the final deconv layer with a fresh 14-channel head. Expect unstable loss for the first few epochs while the new head learns from random initialization; this is normal. The backbone transfers well (car textures are within-distribution for a COCO-pretrained ViT), but the head must start from scratch, so don't early-stop too aggressively.
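
One way to avoid stopping during that unstable phase is to give early stopping a warm-up window in which stagnant epochs are not counted. The sketch below is an assumption about how this could be done, not a description of the project's trainer; EarlyStopper, stop_epoch, and the loss values are all hypothetical.

```python
class EarlyStopper:
    """Stop when val loss has not improved for `patience` epochs, after a warm-up."""

    def __init__(self, patience=5, warmup_epochs=5):
        self.patience = patience
        self.warmup_epochs = warmup_epochs
        self.best = float("inf")
        self.bad_epochs = 0

    def should_stop(self, epoch, val_loss):
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        elif epoch >= self.warmup_epochs:
            # Only count stagnation once the warm-up window has passed.
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


def stop_epoch(losses, **kwargs):
    """Return the epoch at which training would stop, or None if it never does."""
    stopper = EarlyStopper(**kwargs)
    for epoch, loss in enumerate(losses):
        if stopper.should_stop(epoch, loss):
            return epoch
    return None


# Synthetic losses: noisy at first while the new head settles, then improving.
losses = [1.0, 1.4, 1.2, 1.3, 0.9, 0.8, 0.85, 0.86, 0.87]
print(stop_epoch(losses, patience=3, warmup_epochs=0))
print(stop_epoch(losses, patience=3, warmup_epochs=4))
```

On this synthetic curve, the run without a warm-up stops at epoch 3, before the loss starts to fall, while the run with a four-epoch warm-up continues to epoch 8.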