## Training
End-to-end training is orchestrated by `scripts/train_all.py`, which runs the baseline (ruBERT) and main model (XLM-RoBERTa) in sequence, then produces evaluation metrics, a confusion matrix, and a `metrics_summary.json`. GPU is recommended; the main model falls back to CPU, but training time increases roughly 10×.
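The orchestration itself is plain sequencing. A minimal sketch of the control flow, assuming `train_all.py` shells out to the module entry points listed under Commands below (the actual script contents may differ, e.g. it may pass extra Hydra overrides):

```python
# Hypothetical sketch of scripts/train_all.py's control flow: run the
# baseline and main trainings in sequence, then evaluate.
import subprocess
import sys

STAGES = [
    ["python", "-m", "grnti_text_classifier.training.train", "model=baseline"],
    ["python", "-m", "grnti_text_classifier.training.train", "model=main"],
    ["python", "-m", "grnti_text_classifier.evaluation.evaluate"],
]

for cmd in STAGES:
    print(f"--> {' '.join(cmd)}", flush=True)
    result = subprocess.run(cmd)
    if result.returncode != 0:
        sys.exit(result.returncode)  # abort the pipeline on the first failure
```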
### Prerequisites
```bash
uv sync --all-groups
bash scripts/sync_data.sh
uv run python -m grnti_text_classifier.data.prepare --raw data/raw --out data/processed
```
`sync_data.sh` calls `snapshot_download` for `ai-forever/ru-scibench-grnti-classification`. `prepare.py` writes `train.parquet`, `val.parquet`, `test.parquet`, and `label_encoder.json` under `data/processed/`.
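For reference, the download step boils down to a single `snapshot_download` call. A Python equivalent (the `local_dir` target is an assumption; check `sync_data.sh` for the actual destination):

```python
# Python equivalent of the sync_data.sh download step. local_dir is an
# assumed destination; the shell script defines the real one.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="ai-forever/ru-scibench-grnti-classification",
    repo_type="dataset",  # it is a dataset repo, not a model repo
    local_dir="data/raw",
)
```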
### Commands
Full pipeline (baseline + main + evaluation):
```bash
uv run python scripts/train_all.py
```
Individual stages:
```bash
# Main model only (XLM-RoBERTa-base)
uv run python -m grnti_text_classifier.training.train model=main

# Baseline only (ruBERT-base-cased)
uv run python -m grnti_text_classifier.training.train model=baseline

# Optuna sweep (10 trials, 3 epochs each)
uv run python -m grnti_text_classifier.training.sweep model=main

# Evaluation (top-k accuracy, F1, confusion matrix)
uv run python -m grnti_text_classifier.evaluation.evaluate
```
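The evaluation stage's top-k accuracy follows the standard definition: a prediction counts as correct if the true label is among the k highest-scoring classes. A minimal sketch in plain PyTorch (the evaluation module's actual implementation may use torchmetrics or similar instead):

```python
# Minimal top-k accuracy over a batch of logits; illustrative only.
import torch

def top_k_accuracy(logits: torch.Tensor, labels: torch.Tensor, k: int = 5) -> float:
    """logits: (N, num_classes), labels: (N,). Returns accuracy in [0, 1]."""
    topk = logits.topk(k, dim=-1).indices              # (N, k) best class ids
    hits = (topk == labels.unsqueeze(-1)).any(dim=-1)  # true label in top-k?
    return hits.float().mean().item()

logits = torch.randn(8, 28)           # 28 GRNTI classes in this project
labels = torch.randint(0, 28, (8,))
print(top_k_accuracy(logits, labels, k=5))
```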
### Hydra configuration layout
```text
configs/
├── train.yaml        # top-level: data paths, trainer, logger, seed
└── model/
    ├── main.yaml     # XLM-RoBERTa-base: lr, batch_size, max_length, warmup_ratio
    └── baseline.yaml # ruBERT-base-cased: same schema, different defaults
```
Override any parameter from the CLI without editing YAML:
```bash
uv run python -m grnti_text_classifier.training.train model=main model.lr=3e-5 trainer.max_epochs=5
```
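Overrides work because the entry point is a standard Hydra app. A minimal sketch of the pattern, assuming `config_path`/`config_name` values that match the layout above (the real train module may differ):

```python
# Sketch of a Hydra entry point matching the configs/ layout above.
import hydra
from omegaconf import DictConfig, OmegaConf

@hydra.main(config_path="../configs", config_name="train", version_base=None)
def main(cfg: DictConfig) -> None:
    # CLI overrides such as model.lr=3e-5 are already merged into cfg here.
    print(OmegaConf.to_yaml(cfg))

if __name__ == "__main__":
    main()
```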
### Optuna sweep
`training.sweep` runs 10 trials of 3 epochs each with the TPE sampler (`seed=42`). Search space:
| Parameter | Distribution | Range |
|---|---|---|
| `lr` | log-uniform | 1e-5 – 5e-5 |
| `weight_decay` | log-uniform | 1e-4 – 1e-1 |
| `warmup_ratio` | uniform | 0.0 – 0.15 |
The best trial's parameters are automatically written to `artifacts/best_params.json` and used as the starting point for the final full-epoch training run.
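The search space above maps one-to-one onto Optuna's `suggest_float` calls. A condensed sketch of the sweep loop (the sampler, seed, trial count, and artifact path follow the description above; the objective body is a hypothetical stand-in for the real 3-epoch training run):

```python
# Condensed sketch of the sweep: TPE sampler, seed 42, 10 trials.
import json
import optuna

def objective(trial: optuna.Trial) -> float:
    lr = trial.suggest_float("lr", 1e-5, 5e-5, log=True)
    weight_decay = trial.suggest_float("weight_decay", 1e-4, 1e-1, log=True)
    warmup_ratio = trial.suggest_float("warmup_ratio", 0.0, 0.15)
    # Train for 3 epochs and return the validation metric to maximise.
    return train_three_epochs(lr, weight_decay, warmup_ratio)  # hypothetical helper

study = optuna.create_study(
    direction="maximize",
    sampler=optuna.samplers.TPESampler(seed=42),
)
study.optimize(objective, n_trials=10)

with open("artifacts/best_params.json", "w") as f:
    json.dump(study.best_params, f, indent=2)
```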
### RTX 3080 notes
Reference hardware for the v0.1.0 run: RTX 3080 (10 GB VRAM), 32 GB RAM, Ubuntu 22.04.
| Setting | Value |
|---|---|
| `precision` | `bf16-mixed` |
| `batch_size` | 16 |
| `max_length` | 256 |
| `gradient_clip_val` | 1.0 |
| Estimated wall time (XLM-R, 5 epochs) | ~45–60 min |
| Estimated wall time (ruBERT, 5 epochs) | ~30–40 min |
`bf16-mixed` requires Ampere or newer (RTX 30xx / A-series). On older CUDA GPUs set `precision=16-mixed`; on CPU omit the flag entirely.
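If you want the fallback to be automatic rather than manual, a small capability check works. A sketch assuming the precision string is passed to a PyTorch Lightning `Trainer` (the project's actual trainer wiring may differ):

```python
# Pick a precision flag based on hardware capability.
import torch

if not torch.cuda.is_available():
    precision = "32-true"     # CPU: no mixed precision
elif torch.cuda.is_bf16_supported():
    precision = "bf16-mixed"  # Ampere (RTX 30xx) or newer
else:
    precision = "16-mixed"    # older CUDA GPUs

print(precision)  # e.g. pass as Trainer(precision=precision, ...)
```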
### Outputs
- `artifacts/main/hf/` — XLM-RoBERTa `save_pretrained` snapshot (model + tokenizer).
- `artifacts/baseline/hf/` — ruBERT `save_pretrained` snapshot.
- `artifacts/main/logs/version_0/metrics.csv` — CSVLogger epoch-level metrics (loss, top1_acc, top5_acc, f1_macro).
- `reports/metrics.json` — per-class precision/recall/F1 on the test set.
- `reports/metrics_summary.json` — top-1/top-5 accuracy + macro/weighted F1 summary for both models.
- `reports/confusion_matrix.png` — 28×28 normalised confusion matrix for the main model.
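The `hf/` snapshots are standard `save_pretrained` directories, so they load back with the usual transformers calls. For example:

```python
# Load the trained main model straight from the artifact directory.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("artifacts/main/hf")
model = AutoModelForSequenceClassification.from_pretrained("artifacts/main/hf")

inputs = tokenizer("Пример аннотации научной статьи", return_tensors="pt")
pred = model(**inputs).logits.argmax(-1).item()  # class id; map via label_encoder.json
```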
### HF mirror runbook
Once training completes and artefacts are validated, publish to the HF Hub:
```bash
# Ensure HUGGING_FACE_HUB_TOKEN is set (or run `huggingface-cli login`)
uv run python scripts/publish_to_hf.py \
    --main-dir artifacts/main/hf \
    --baseline-dir artifacts/baseline/hf \
    --repo-id kiselyovd/grnti-text-classifier
```
The script calls `push_to_hub` for the main model and attaches the model card generated in Task 17. The baseline is pushed to a separate `baseline` branch within the same repo.
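Pushing to a non-default branch requires the branch to exist first. A sketch of the baseline upload using `huggingface_hub` primitives (the real `publish_to_hf.py` may use `push_to_hub` directly, as stated above):

```python
# Sketch of the baseline upload to a separate branch. Assumes
# HUGGING_FACE_HUB_TOKEN is set in the environment.
from huggingface_hub import create_branch, upload_folder

repo_id = "kiselyovd/grnti-text-classifier"
create_branch(repo_id, branch="baseline", exist_ok=True)
upload_folder(
    repo_id=repo_id,
    folder_path="artifacts/baseline/hf",
    revision="baseline",  # target the baseline branch
    commit_message="Upload ruBERT baseline snapshot",
)
```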