API Reference
Auto-generated reference for the grnti_text_classifier package via mkdocstrings. Each module below lists its public classes and functions with their docstrings and type signatures.
Package root
grnti_text_classifier
Production-grade Russian multi-class text classifier (GRNTI).
Data
datamodule
Lightning DataModule wrapping GRNTIDataset for HuggingFace tokenizers.
Classes
GRNTIDataset
GRNTIDataset(df: DataFrame, tokenizer, max_length: int = 256)
Bases: Dataset
Map-style dataset over a processed GRNTI DataFrame.
Source code in src/grnti_text_classifier/data/datamodule.py
GRNTIDataModule
GRNTIDataModule(
processed_dir: str | Path,
model_name: str,
batch_size: int = 16,
max_length: int = 256,
num_workers: int = 0,
seed: int = 42,
)
Bases: LightningDataModule
LightningDataModule that loads train/val/test parquet splits.
Source code in src/grnti_text_classifier/data/datamodule.py
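A minimal sketch of the map-style pattern GRNTIDataset follows, using a stub tokenizer in place of a HuggingFace one. The class and helper names here are illustrative, not the package's own:

```python
import pandas as pd
from torch.utils.data import Dataset, DataLoader

def stub_tokenizer(text, max_length=256):
    # Stand-in for a HuggingFace tokenizer: fixed-size integer id lists.
    ids = [hash(tok) % 1000 for tok in text.split()][:max_length]
    return {"input_ids": ids + [0] * (max_length - len(ids))}

class ToyGRNTIDataset(Dataset):
    """Map-style dataset over a DataFrame, mirroring GRNTIDataset's shape."""
    def __init__(self, df, tokenizer, max_length=8):
        self.df = df.reset_index(drop=True)
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.df)

    def __getitem__(self, i):
        row = self.df.iloc[i]
        enc = self.tokenizer(row["text"], max_length=self.max_length)
        return {**enc, "label": int(row["label"])}

df = pd.DataFrame({"text": ["квантовая физика", "теория графов"], "label": [0, 1]})
ds = ToyGRNTIDataset(df, stub_tokenizer)
batch = next(iter(DataLoader(ds, batch_size=2)))
```

GRNTIDataModule wraps datasets like this and builds the train/val/test DataLoaders from the parquet splits.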
dataset
Dataset implementations.
Classes
TextDataset
TextDataset(
csv_path: Path | str,
text_col: str = "text",
label_col: str = "label",
tokenizer: Callable[..., Any] | None = None,
max_length: int = 512,
)
Bases: Dataset[dict[str, Any]]
CSV-backed text classification dataset.
Source code in src/grnti_text_classifier/data/dataset.py
grnti
GRNTI dataset helpers: loader, label encoder, stratified split.
Classes
LabelEncoder
dataclass
LabelEncoder(
code_to_idx: dict[int, int],
idx_to_code: dict[int, int],
idx_to_text: dict[int, str],
num_classes: int,
)
Bidirectional map between raw GRNTI codes and dense 0..N-1 indices.
Functions
encode(labels: Series | list[int]) -> np.ndarray
Map raw codes → dense indices.
Source code in src/grnti_text_classifier/data/grnti.py
decode(idx: int) -> int
Map dense index → raw code.
Source code in src/grnti_text_classifier/data/grnti.py
decode_text(idx: int) -> str
Map dense index → human-readable Russian class name.
Source code in src/grnti_text_classifier/data/grnti.py
to_json_dict() -> dict[str, Any]
Return a plain JSON-serialisable dict.
Source code in src/grnti_text_classifier/data/grnti.py
classmethod
from_json_dict(d: dict[str, Any]) -> LabelEncoder
Reconstruct a LabelEncoder from a JSON dict.
Source code in src/grnti_text_classifier/data/grnti.py
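The bidirectional mapping can be sketched with plain dicts that mirror the dataclass fields above; the sample codes and Russian class names are invented for illustration:

```python
# Illustrative sketch of the LabelEncoder mapping; codes and names are examples.
raw_codes = [27, 20, 50]                      # raw GRNTI rubric codes
sorted_codes = sorted(raw_codes)              # ascending, as build_label_encoder does
code_to_idx = {c: i for i, c in enumerate(sorted_codes)}
idx_to_code = {i: c for c, i in code_to_idx.items()}
idx_to_text = {0: "Математика", 1: "Физика", 2: "Автоматика"}  # hypothetical names

def encode(codes):           # raw codes -> dense 0..N-1 indices
    return [code_to_idx[c] for c in codes]

def decode(idx):             # dense index -> raw code
    return idx_to_code[idx]
```

Round-tripping a code through encode and decode returns it unchanged, which is the property to_json_dict / from_json_dict preserve across serialisation.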
Functions
load_jsonl
load_jsonl(path: str | Path) -> pd.DataFrame
Read a JSONL file and return a DataFrame with FEATURES columns only.
Source code in src/grnti_text_classifier/data/grnti.py
build_label_encoder
build_label_encoder(df: DataFrame) -> LabelEncoder
Build a LabelEncoder from unique codes in df[LABEL_COL].
Codes are sorted ascending; idx = position in sorted order.
Source code in src/grnti_text_classifier/data/grnti.py
split_stratified_train_val
split_stratified_train_val(
df: DataFrame, *, val_fraction: float = 0.15, seed: int = 42
) -> tuple[pd.DataFrame, pd.DataFrame]
Split df into train and val subsets with stratification on LABEL_COL.
Source code in src/grnti_text_classifier/data/grnti.py
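The stratified split can be expressed directly with scikit-learn; this is an equivalent sketch, not the package's implementation, and the column name "label" stands in for LABEL_COL:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# 40 documents over 4 balanced classes, as a toy stand-in for GRNTI data.
df = pd.DataFrame({
    "text": [f"doc {i}" for i in range(40)],
    "label": [i % 4 for i in range(40)],
})
# Stratification keeps per-class proportions equal in train and val.
train_df, val_df = train_test_split(
    df, test_size=0.15, stratify=df["label"], random_state=42
)
```

With val_fraction=0.15 the validation split here holds 6 of the 40 rows, with each class represented in proportion to its frequency.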
prepare
Data preparation CLI: raw JSONL → processed Parquet + label_encoder.json.
Functions
prepare_data
prepare_data(
raw_dir: str | Path,
out_dir: str | Path,
*,
val_fraction: float = 0.15,
seed: int = 42,
) -> None
Load raw JSONL splits, build encoder, write Parquet + JSON artefacts.
Source code in src/grnti_text_classifier/data/prepare.py
transforms
Image transforms for training and inference.
Models
factory
Model factories — return a pretrained HuggingFace model ready for fine-tuning.
Functions
build_main
build_main(num_labels: int = 28) -> PreTrainedModel
Return XLM-RoBERTa-base configured for sequence classification.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| num_labels | int | Number of output classes (default 28 for GRNTI). | 28 |

Returns:

| Type | Description |
|---|---|
| PreTrainedModel | AutoModelForSequenceClassification instance. |
Source code in src/grnti_text_classifier/models/factory.py
build_baseline
build_baseline(num_labels: int = 28) -> PreTrainedModel
Return ruBERT-base-cased configured for sequence classification.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| num_labels | int | Number of output classes (default 28 for GRNTI). | 28 |

Returns:

| Type | Description |
|---|---|
| PreTrainedModel | AutoModelForSequenceClassification instance. |
Source code in src/grnti_text_classifier/models/factory.py
lightning_module
Lightning wrapper for GRNTI sequence classification models.
Classes
GRNTIClassifier
GRNTIClassifier(
model: PreTrainedModel,
class_weights: Tensor | None = None,
lr: float = 2e-05,
weight_decay: float = 0.01,
warmup_ratio: float = 0.1,
total_steps: int = 1000,
num_classes: int = 28,
)
Bases: LightningModule
Lightning module wrapping any HuggingFace sequence-classification model.
Source code in src/grnti_text_classifier/models/lightning_module.py
Training
optuna_sweep
Optuna hyper-parameter sweep over train_one for GRNTI classifiers.
Functions
run_sweep
run_sweep(
processed_dir: Path,
out_dir: Path,
*,
model_builder: Callable[..., Any],
model_name_for_tokenizer: str,
n_trials: int = 10,
seed: int = 42,
trial_epochs: int = 3,
batch_size: int = 16,
num_workers: int = 0,
) -> dict[str, Any]
Run an Optuna TPE sweep over learning-rate, weight-decay and warmup-ratio.
Parameters
processed_dir:
Pre-processed data directory (parquet splits + label_encoder.json).
out_dir:
Root directory; each trial writes to out_dir / "trial_<n>".
model_builder:
Callable (num_labels: int) -> PreTrainedModel.
model_name_for_tokenizer:
HuggingFace model name used to build the tokeniser.
n_trials:
Number of Optuna trials.
seed:
Seed for the TPE sampler (for reproducibility).
trial_epochs:
Max epochs per trial (use a small number for speed).
batch_size:
Mini-batch size forwarded to train_one.
num_workers:
DataLoader workers forwarded to train_one.
Returns
dict
{"best_params": {...}, "best_value": float}
Source code in src/grnti_text_classifier/training/optuna_sweep.py
train
Lightning Trainer entrypoint for GRNTI text classification.
Functions
train_one
train_one(
model_builder: Callable[..., Any],
model_name_for_tokenizer: str,
processed_dir: Path,
out_dir: Path,
*,
max_epochs: int = 5,
batch_size: int = 16,
lr: float = 2e-05,
weight_decay: float = 0.01,
warmup_ratio: float = 0.1,
patience: int = 2,
seed: int = 42,
max_length: int = 256,
num_workers: int = 0,
save_hf: bool = True,
) -> Path
Train a single GRNTI classifier run.
Parameters
model_builder:
Callable (num_labels: int) -> PreTrainedModel.
model_name_for_tokenizer:
HuggingFace model name used to load the tokenizer, e.g.
"FacebookAI/xlm-roberta-base".
processed_dir:
Directory containing train.parquet, val.parquet,
test.parquet, and label_encoder.json.
out_dir:
Root output directory for this run.
max_epochs:
Maximum training epochs.
batch_size:
Batch size for training and validation.
lr:
Peak learning rate for AdamW.
weight_decay:
AdamW weight-decay coefficient.
warmup_ratio:
Fraction of total steps used for linear warmup.
patience:
Early-stopping patience (in validation epochs).
seed:
Global random seed.
max_length:
Tokeniser max-sequence length.
num_workers:
DataLoader worker count.
save_hf:
When True (default) saves the best checkpoint as a HuggingFace
model directory and returns that path. When False, skips the HF
save and returns the raw checkpoint path instead (useful for sweeps).
Returns
Path
out_dir / "hf" when save_hf is True; otherwise the path to the
best Lightning checkpoint file.
Source code in src/grnti_text_classifier/training/train.py
Evaluation
confusion
Confusion matrix visualisation — saves a seaborn heatmap PNG.
Functions
save_confusion_matrix
save_confusion_matrix(
y_true: ndarray, preds: ndarray, labels: list[str], out_path: str | Path
) -> None
Save a row-normalised confusion matrix heatmap to out_path (PNG).
Parameters
y_true:
Ground-truth integer labels, shape (n,).
preds:
Predicted integer labels, shape (n,).
labels:
Human-readable class names (e.g. ["Математика", "Информатика"]).
Length must equal the number of classes.
out_path:
Destination file path. Parent directories are created if absent.
Source code in src/grnti_text_classifier/evaluation/confusion.py
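The row normalisation applied before plotting can be sketched as follows (the seaborn heatmap and PNG-saving steps are omitted here):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels for three classes; each row of the normalised matrix is the
# per-class distribution of predictions and sums to 1.
y_true = np.array([0, 0, 1, 1, 2, 2])
preds  = np.array([0, 1, 1, 1, 2, 0])
cm = confusion_matrix(y_true, preds, labels=[0, 1, 2]).astype(float)
cm_norm = cm / cm.sum(axis=1, keepdims=True)
```

Row normalisation makes the diagonal readable as per-class recall, which is what the saved heatmap visualises.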
evaluate
CLI: score a saved HF checkpoint on a processed parquet split.
Functions
main
main() -> None
Entry point for the scoring CLI.
Source code in src/grnti_text_classifier/evaluation/evaluate.py
metrics
Metrics computation for classification scoring.
Functions
compute_metrics
compute_metrics(
y_true: ndarray, logits: ndarray | object, num_classes: int
) -> dict[str, Any]
Return top-1/top-5 accuracy, macro/weighted F1, num_classes, and n.
Parameters
y_true:
Integer class indices, shape (n,).
logits:
Raw model outputs, shape (n, num_classes). Accepts either a
NumPy array or a torch.Tensor — tensors are converted to NumPy
automatically.
num_classes:
Total number of label classes.
Returns
dict with keys: top1_accuracy, top5_accuracy, macro_f1, weighted_f1, num_classes, n.
Source code in src/grnti_text_classifier/evaluation/metrics.py
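An equivalent computation of the documented outputs, written out with NumPy and scikit-learn; this is not the package's own implementation:

```python
import numpy as np
from sklearn.metrics import f1_score

def metrics_sketch(y_true, logits, num_classes, k=5):
    preds = logits.argmax(axis=1)
    top1 = float((preds == y_true).mean())
    # Top-k: the true class appears among the k highest-scoring logits.
    topk_idx = np.argsort(logits, axis=1)[:, -k:]
    topk = float(np.mean([t in row for t, row in zip(y_true, topk_idx)]))
    return {
        "top1_accuracy": top1,
        "top5_accuracy": topk,
        "macro_f1": f1_score(y_true, preds, average="macro"),
        "weighted_f1": f1_score(y_true, preds, average="weighted"),
        "num_classes": num_classes,
        "n": len(y_true),
    }

rng = np.random.default_rng(0)
m = metrics_sketch(np.array([0, 1, 2, 3]), rng.normal(size=(4, 10)), 10)
```

Since the argmax class is always among the top-5, top1_accuracy can never exceed top5_accuracy.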
report
Summary report builder — merges main and baseline metrics into a JSON file.
Functions
build_summary
build_summary(
main_metrics: dict[str, Any],
baseline_metrics: dict[str, Any],
*,
out_path: str | Path,
) -> dict[str, Any]
Write a flat JSON summary combining main and baseline scoring results.
Parameters
main_metrics:
Output of compute_metrics for the primary model.
baseline_metrics:
Output of compute_metrics for the baseline model.
out_path:
Destination path for the JSON file. Parent dirs are created if needed.
Returns
The summary dict that was written to disk.
Source code in src/grnti_text_classifier/evaluation/report.py
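A flat summary in the spirit of build_summary can be sketched with the stdlib; the "main_"/"baseline_" key-prefix scheme is an assumption, not necessarily the real output format:

```python
import json
from pathlib import Path

def summary_sketch(main_metrics, baseline_metrics, *, out_path):
    # Flatten both metric dicts into one namespace via key prefixes.
    summary = {f"main_{k}": v for k, v in main_metrics.items()}
    summary.update({f"baseline_{k}": v for k, v in baseline_metrics.items()})
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)  # create parent dirs
    out_path.write_text(json.dumps(summary, indent=2))
    return summary

s = summary_sketch({"macro_f1": 0.81}, {"macro_f1": 0.74}, out_path="out/summary.json")
```

Like build_summary, the sketch both writes the JSON file and returns the dict it wrote.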
Inference
predict
Inference CLI — load a checkpoint and predict on input(s).
Functions
load_model
load_model(checkpoint_path: str | Path) -> Any
Load a Lightning module from checkpoint, rebuilding the backbone from hparams.
Source code in src/grnti_text_classifier/inference/predict.py
predict
predict(model: Any, input_path: str | Path) -> dict[str, Any]
Run a single prediction. Returns a task-specific result dict.
Source code in src/grnti_text_classifier/inference/predict.py
Serving
dependencies
Dependency injection — singleton model loader.
errors
Exception types and handlers.
main
FastAPI application.
routes
GRNTI classifier routes — /health, /classify, /labels.
schemas
Pydantic request/response schemas for the /classify endpoint.
Classes
TextPayload
Bases: BaseModel
Request body for text classification — raw abstract + optional token budget.
LabelProb
Bases: BaseModel
GRNTI class identifier together with its human-readable name and probability.
ClassificationResponse
Bases: BaseModel
Response payload of /classify: top-1 plus top-5 probabilities and metadata.
LabelEntry
Bases: BaseModel
Label catalog entry returned by /labels — numeric id plus human-readable name.
Utilities
hf_hub
HuggingFace Hub helpers.
logging
Structured logging configuration.
seed
Deterministic seeding across libraries.