TODDScore is a 7-dimension weighted composite metric for evaluating fine-tuned language models. It compares a custom model's performance against a foundation-model baseline (Nova Pro), using an LLM judge (Nova Lite) with a strict rubric.
Evaluation Modes (3):
- Verbatim — training prompts sent as-is (overfit detection)
- Rephrase — training prompts reworded by a foundation model (near generalization)
- Novel Out-of-Domain — auto-generated prompts from unrelated domains (far generalization)
Quality Dimensions (5, scored 1-5 by LLM judge):
- Relevance, Accuracy, Completeness, Coherence, Specificity
- Judge uses a strict rubric with explicit red flags (generic filler, hedge words, circular reasoning, repetitive structure) and rewards (concrete numbers, named examples, precise technical terms, quantified comparisons)
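As a sketch, the judge call might assemble a prompt like the following. The dimension names, red flags, and rewards come from the rubric above; the function name and exact wording are illustrative assumptions, not the actual implementation:

```python
# Dimension names from the rubric; everything else here is an assumed sketch.
DIMENSIONS = ["Relevance", "Accuracy", "Completeness", "Coherence", "Specificity"]

def build_judge_prompt(question: str, answer: str) -> str:
    """Assemble a strict-rubric judging prompt for the LLM judge (e.g. Nova Lite)."""
    dims = "\n".join(f"- {d}: score 1-5" for d in DIMENSIONS)
    return (
        "Score the ANSWER to the QUESTION on each dimension (1-5):\n"
        f"{dims}\n"
        "Red flags (penalize): generic filler, hedge words, circular reasoning, "
        "repetitive structure.\n"
        "Rewards: concrete numbers, named examples, precise technical terms, "
        "quantified comparisons.\n\n"
        f"QUESTION:\n{question}\n\nANSWER:\n{answer}\n\n"
        "Reply with one integer per dimension."
    )
```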
TODDScore Dimensions (7, threshold mode out of 19 points):
| Dimension | Weight | What it measures |
|---|---|---|
| Overall quality | 0-3 | Trimmed mean of judge scores across all modes |
| Verbatim/rephrase delta | 0-4 | Quality vs baseline on training-adjacent prompts |
| Verbatim delta | 0-3 | Quality vs baseline on exact training prompts |
| Novel quality | 0-3 | Absolute quality on out-of-domain prompts |
| Text similarity | 0-2 | SequenceMatcher overlap with teacher responses (lower = better) |
| Template mimicry | 0-2 | Structural pattern matching (headings, lists, bold, paragraphs; lower = better) |
| Degradation delta | 0-2 | Quality dropoff from verbatim to novel (flatter = better) |
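The two structural dimensions can be sketched in Python. `SequenceMatcher` comes from the standard-library `difflib` (named above); `template_signature` is a hypothetical stand-in for the real pattern matcher, counting the four structural features listed:

```python
import re
from difflib import SequenceMatcher

def text_similarity(student: str, teacher: str) -> float:
    """Ratio in [0, 1]; lower means the fine-tuned model is not parroting the teacher."""
    return SequenceMatcher(None, student, teacher).ratio()

def template_signature(text: str) -> tuple:
    """Crude structural fingerprint: counts of headings, list items, bold spans,
    and paragraphs. An illustrative stand-in for the actual pattern matcher."""
    return (
        len(re.findall(r"^#+ ", text, re.M)),    # markdown headings
        len(re.findall(r"^[-*] ", text, re.M)),  # list items
        len(re.findall(r"\*\*.+?\*\*", text)),   # bold spans
        text.count("\n\n") + 1,                  # paragraph blocks
    )
```

Comparing signatures between student and teacher responses gives a rough mimicry measure: identical structure with different words still scores as template copying.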
Scoring Methodologies (2, toggle on dashboard):
- Threshold — step-function scoring with defined breakpoints, green/amber/red classification (≥75% / ≥50% / <50%)
- Continuous — smooth linear mapping between defined worst/best bounds per dimension, same weights
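A minimal sketch of the two methodologies; the breakpoints and bounds shown are hypothetical, only the shapes (step function vs clamped linear map) follow the description above:

```python
def threshold_score(value: float, breakpoints: list) -> float:
    """Step-function scoring. breakpoints is a list of (cutoff, points),
    checked highest-first; cutoffs here are illustrative, not the real ones."""
    for cutoff, points in breakpoints:
        if value >= cutoff:
            return points
    return 0

def continuous_score(value: float, worst: float, best: float, max_points: float) -> float:
    """Smooth linear map from [worst, best] onto [0, max_points], clamped at both ends."""
    frac = (value - worst) / (best - worst)
    return max_points * min(1.0, max(0.0, frac))
```

The continuous form avoids the clustering-at-breakpoints problem noted under Known Limitations, which is why it is the better choice for cross-experiment aggregation.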
Statistical Features:
- Trimmed mean (10%) on all aggregated metrics to reduce outlier impact
- 95% confidence intervals on quality scores
- CI-based "within margin of noise" detection for statistically indistinguishable models
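These three features can be sketched with the standard library. The normal-approximation interval and the CI-overlap test are assumptions about the exact implementation:

```python
import math
from statistics import mean, stdev

def trimmed_mean(xs: list, trim: float = 0.10) -> float:
    """Drop the lowest and highest trim fraction before averaging."""
    xs = sorted(xs)
    k = int(len(xs) * trim)
    core = xs[k:len(xs) - k] if k else xs
    return mean(core)

def ci95(xs: list) -> float:
    """Half-width of a 95% CI on the mean (normal approximation, an assumption)."""
    return 1.96 * stdev(xs) / math.sqrt(len(xs))

def within_noise(mean_a: float, ci_a: float, mean_b: float, ci_b: float) -> bool:
    """Treat two models as statistically indistinguishable when their CIs overlap."""
    return abs(mean_a - mean_b) <= ci_a + ci_b
```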
Flagging System (5 flag types for human spot-check):
- Individual outlier (>1σ below model's mean)
- Absolute floor (quality ≤ 2, e.g., content filter hits)
- Model divergence (delta diff ≥ 0.8 between models on same prompt)
- Both models spike low (both deltas ≤ -0.8 on same prompt)
- Extreme positive delta (either delta ≥ +1.5, baseline may have fumbled)
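The five rules can be sketched as one function. Thresholds come from the list above; how each model's per-prompt delta is defined is an assumption here (deviation from that model's own mean quality):

```python
def flag_prompt(q_custom: float, q_base: float,
                d_custom: float, d_base: float,
                sd_custom: float) -> list:
    """Return flag labels for one paired run. d_* are each model's deltas from
    its own mean score (an assumed definition); sd_custom is the custom model's sigma."""
    flags = []
    if d_custom < -sd_custom:
        flags.append("individual_outlier")      # >1 sigma below the model's mean
    if q_custom <= 2 or q_base <= 2:
        flags.append("absolute_floor")          # e.g. content-filter hits
    if abs(d_custom - d_base) >= 0.8:
        flags.append("model_divergence")        # models disagree on this prompt
    if d_custom <= -0.8 and d_base <= -0.8:
        flags.append("both_spike_low")          # likely a bad prompt
    if d_custom >= 1.5 or d_base >= 1.5:
        flags.append("extreme_positive_delta")  # baseline may have fumbled
    return flags
```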
Experiment Framework:
- Named experiments with persistent results
- Paired execution — same prompt sent to both models per step (apples-to-apples)
- 10 runs per mode, round-robin cadence
- Findings system — flagged items can be saved with notes for documentation
- Auto-refresh dashboard with moving average toggle (5-run window)
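The paired, round-robin cadence might look like the following sketch; the prompt sources and model callables are placeholders, and the mode names follow the three evaluation modes defined earlier:

```python
MODES = ["verbatim", "rephrase", "novel"]
RUNS_PER_MODE = 10

def schedule() -> list:
    """Round-robin cadence: interleave the three modes until each has 10 runs."""
    return [mode for _ in range(RUNS_PER_MODE) for mode in MODES]

def run_experiment(prompts_by_mode: dict, ask_custom, ask_baseline) -> list:
    """Paired execution: at each step the same prompt goes to both models,
    so every comparison is apples-to-apples."""
    results = []
    for step, mode in enumerate(schedule()):
        prompt = prompts_by_mode[mode][step // len(MODES)]
        results.append({
            "mode": mode,
            "prompt": prompt,
            "custom": ask_custom(prompt),
            "baseline": ask_baseline(prompt),
        })
    return results
```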
Known Limitations:
- LLM-as-judge introduces its own nondeterminism
- Threshold scoring susceptible to clustering at breakpoints (use continuous for cross-experiment aggregation)
- Content filter behavior is probabilistic and can produce false quality failures
- GPU floating-point nondeterminism means identical models on separate deployments will show score variation
- Not a substitute for formal evaluation frameworks (HELM, lm-eval-harness) — designed for rapid iterative assessment and teaching