Architecture
Two independent save_pretrained artefacts (the XLM-R main model and the ruBERT baseline) share the same response contract: {top1_label, top1_prob, top5, truncated, input_length_tokens, request_id, model_version}.
```mermaid
flowchart LR
A["data/raw/*.jsonl<br/>HF snapshot_download"]:::external -->|prepare.py| B["data/processed/<br/>train/val/test.parquet<br/>+ label_encoder.json"]:::data
B -->|GRNTIDataModule| C["XLM-R tokenizer<br/>padding=max_length<br/>truncation=True"]:::code
C --> D["GRNTIClassifier<br/>AdamW + linear warmup+decay<br/>bf16-mixed on CUDA"]:::model
D -->|train_one| E["artifacts/{main,baseline}/hf/<br/>save_pretrained"]:::artifact
E -->|evaluate.py| F["reports/metrics.json<br/>+ confusion_matrix.png<br/>+ metrics_summary.json"]:::artifact
E -.->|publish_to_hf.py| G["HF Hub<br/>kiselyovd/grnti-text-classifier"]:::external
E -.->|"FastAPI /classify"| H["JSON<br/>top-1 + top-5 + probs"]:::serve
classDef external fill:#E0F2F1,stroke:#00796B,color:#004D40
classDef data fill:#B2DFDB,stroke:#00796B,color:#004D40
classDef code fill:#80CBC4,stroke:#00695C,color:#004D40
classDef model fill:#4DB6AC,stroke:#004D40,color:#fff
classDef artifact fill:#26A69A,stroke:#004D40,color:#fff
classDef serve fill:#00897B,stroke:#004D40,color:#fff
```
Data flow
scripts/sync_data.sh calls snapshot_download from the HF Hub to pull ai-forever/ru-scibench-grnti-classification into data/raw/ as JSONL shards. grnti_text_classifier.data.prepare reads those shards, fits a LabelEncoder over the 28 GRNTI top-level class codes, writes stratified train/val/test.parquet splits under data/processed/, and serialises the encoder to data/processed/label_encoder.json. The GRNTIDataModule wraps those Parquet files: it tokenises on-the-fly with the HF AutoTokenizer (padding="max_length", truncation=True, max_length=256), returning input_ids, attention_mask, and labels tensors.
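The prepare step can be sketched as follows. This is a minimal, hedged illustration of the LabelEncoder fit and stratified split, not the actual `grnti_text_classifier.data.prepare` code: the column names (`text`, `grnti_code`), the 80/10/10 ratio, and the toy four-class data are assumptions, and the real pipeline writes Parquet rather than keeping frames in memory.

```python
# Sketch of the prepare step: fit a LabelEncoder over GRNTI codes and
# produce stratified splits. Column names and split ratios are assumptions.
import json

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "text": [f"abstract {i}" for i in range(100)],
    "grnti_code": ["27", "29", "34", "50"] * 25,  # toy subset of the 28 codes
})

encoder = LabelEncoder()
df["label"] = encoder.fit_transform(df["grnti_code"])

# Stratify both splits on the label so class proportions survive.
train, rest = train_test_split(df, test_size=0.2, stratify=df["label"], random_state=42)
val, test = train_test_split(rest, test_size=0.5, stratify=rest["label"], random_state=42)

# The real pipeline writes the splits as Parquet and the encoder to
# data/processed/label_encoder.json; only the encoder JSON is shown here.
label_encoder_json = json.dumps({"classes": encoder.classes_.tolist()})
```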
Why XLM-RoBERTa-base over ruBERT
XLM-RoBERTa-base was pre-trained on 2.5 TB of filtered CommonCrawl data covering 100 languages, including Russian. Its sub-word vocabulary (250k SentencePiece tokens) achieves much lower out-of-vocabulary rates on Russian scientific terminology than the 120k WordPiece vocabulary of ruBERT-base-cased. In practice this means fewer sub-word pieces per word, so each 256-token window covers more text, and better transfer on low-resource GRNTI sections. ruBERT-base-cased is retained as a Russian-only baseline: it is lighter, faster to fine-tune, and provides a meaningful reference point for any XLM-R gains.
Inverse-frequency class weights
The ru-scibench-grnti-classification dataset is designed to be class-balanced, but minor residual imbalances remain across the 28 sections. train_one passes class_weights (inverse normalised class frequencies computed on the train split) into CrossEntropyLoss, so the same weights apply to every batch. This has negligible effect on top-1 accuracy but meaningfully protects macro F1: without weighting, the rarest classes tend to be under-penalised, dragging the macro average below the weighted average.
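A hedged sketch of the weighting scheme described above: counts come from the train-split labels, weights are the inverse frequencies normalised to mean 1, and the result is handed to CrossEntropyLoss. The helper name and the toy three-class data are illustrative, not the actual train_one signature.

```python
# Inverse-frequency class weights, normalised so they average to 1.
# Assumes every class appears at least once in the train split.
import torch

def inverse_frequency_weights(labels: torch.Tensor, num_classes: int) -> torch.Tensor:
    counts = torch.bincount(labels, minlength=num_classes).float()
    weights = counts.sum() / (num_classes * counts)   # inverse frequency
    return weights * (num_classes / weights.sum())    # normalise to mean 1

train_labels = torch.tensor([0, 0, 0, 1, 1, 2])  # toy imbalanced split
weights = inverse_frequency_weights(train_labels, num_classes=3)
loss_fn = torch.nn.CrossEntropyLoss(weight=weights)  # rarest class penalised most
```

Because the weights are fixed tensors derived from the split, they can be computed once at setup time and reused for every batch.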
HF-native save_pretrained
Both models are serialised via the standard PreTrainedModel.save_pretrained / tokenizer.save_pretrained pattern into artifacts/{main,baseline}/hf/. This means any downstream consumer can load them with AutoModelForSequenceClassification.from_pretrained(repo_id) — zero extra config, no custom unpickling, and fully compatible with the HF Hub widget system for live inference demos on the model card.
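The round-trip can be demonstrated end to end. In this sketch a tiny, randomly initialised XLM-R config stands in for the fine-tuned checkpoint so the example runs without downloading weights; the real artefacts live in artifacts/{main,baseline}/hf/ and on the Hub.

```python
# save_pretrained / from_pretrained round-trip. The tiny config is a
# stand-in for the real fine-tuned checkpoint (28 GRNTI labels).
import tempfile

from transformers import (
    AutoModelForSequenceClassification,
    XLMRobertaConfig,
    XLMRobertaForSequenceClassification,
)

config = XLMRobertaConfig(
    vocab_size=128, hidden_size=32, num_hidden_layers=1,
    num_attention_heads=2, intermediate_size=64, num_labels=28,
)
model = XLMRobertaForSequenceClassification(config)

with tempfile.TemporaryDirectory() as out_dir:
    model.save_pretrained(out_dir)  # writes config.json + weights
    # Any downstream consumer reloads with zero custom code:
    reloaded = AutoModelForSequenceClassification.from_pretrained(out_dir)
```

Pointing `from_pretrained` at the Hub repo id instead of a local directory works identically, which is what enables the model-card inference widget.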