TODDScore is a 7-dimension weighted composite metric for evaluating fine-tuned language models. It compares a custom model's performance against a foundation-model baseline (Nova Pro), using an LLM judge (Nova Lite) with a strict rubric.
Evaluation Modes (3):
- Verbatim — training prompts sent as-is (overfit detection)
- Rephrase — training prompts reworded by a foundation model (near generalization)
- Novel Out-of-Domain — auto-generated prompts from unrelated domains (far generalization)
Quality Dimensions (5, scored 1-5 by LLM judge):
- Relevance, Accuracy, Completeness, Coherence, Specificity
- Judge uses a strict rubric with explicit red flags (generic filler, hedge words, circular reasoning, repetitive structure) and rewards (concrete numbers, named examples, precise technical terms, quantified comparisons)
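As a sketch, the judge call might assemble a prompt like the following. The dimension names, red flags, and rewards come from the rubric above; the function name and exact wording are illustrative assumptions, not the actual implementation:

```python
# Dimension names from the rubric; everything else here is an assumed sketch.
DIMENSIONS = ["Relevance", "Accuracy", "Completeness", "Coherence", "Specificity"]

def build_judge_prompt(question: str, answer: str) -> str:
    """Assemble a strict-rubric judging prompt for the LLM judge (e.g. Nova Lite)."""
    dims = "\n".join(f"- {d}: score 1-5" for d in DIMENSIONS)
    return (
        "Score the ANSWER to the QUESTION on each dimension (1-5):\n"
        f"{dims}\n"
        "Red flags (penalize): generic filler, hedge words, circular reasoning, "
        "repetitive structure.\n"
        "Rewards: concrete numbers, named examples, precise technical terms, "
        "quantified comparisons.\n\n"
        f"QUESTION:\n{question}\n\nANSWER:\n{answer}\n\n"
        "Reply with one integer per dimension."
    )
```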
TODDScore Dimensions (7, threshold mode out of 19 points):
| Dimension | Weight | What it measures |
|---|---|---|
| Overall quality | 0-3 | Trimmed mean of judge scores across all modes |
| Verbatim/rephrase delta | 0-4 | Quality vs baseline on training-adjacent prompts |
| Verbatim delta | 0-3 | Quality vs baseline on exact training prompts |
| Novel quality | 0-3 | Absolute quality on out-of-domain prompts |
| Text similarity | 0-2 | SequenceMatcher overlap with teacher responses (lower = better) |
| Template mimicry | 0-2 | Structural pattern matching (headings, lists, bold, paragraphs; lower = better) |
| Degradation delta | 0-2 | Quality dropoff from verbatim to novel (flatter = better) |
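The two structural dimensions can be sketched in Python. `SequenceMatcher` comes from the standard-library `difflib` (named above); `template_signature` is a hypothetical stand-in for the real pattern matcher, counting the four structural features listed:

```python
import re
from difflib import SequenceMatcher

def text_similarity(student: str, teacher: str) -> float:
    """Ratio in [0, 1]; lower means the fine-tuned model is not parroting the teacher."""
    return SequenceMatcher(None, student, teacher).ratio()

def template_signature(text: str) -> tuple:
    """Crude structural fingerprint: counts of headings, list items, bold spans,
    and paragraphs. An illustrative stand-in for the actual pattern matcher."""
    return (
        len(re.findall(r"^#+ ", text, re.M)),    # markdown headings
        len(re.findall(r"^[-*] ", text, re.M)),  # list items
        len(re.findall(r"\*\*.+?\*\*", text)),   # bold spans
        text.count("\n\n") + 1,                  # paragraph blocks
    )
```

Comparing signatures between student and teacher responses gives a rough mimicry measure: identical structure with different words still scores as template copying.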
Scoring Methodologies (2, toggle on dashboard):
- Threshold — step-function scoring with defined breakpoints, green/amber/red classification (≥75% / ≥50% / <50%)
- Continuous — smooth linear mapping between defined worst/best bounds per dimension, same weights
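A minimal sketch of the two methodologies; the breakpoints and bounds shown are hypothetical, only the shapes (step function vs clamped linear map) follow the description above:

```python
def threshold_score(value: float, breakpoints: list) -> float:
    """Step-function scoring. breakpoints is a list of (cutoff, points),
    checked highest-first; cutoffs here are illustrative, not the real ones."""
    for cutoff, points in breakpoints:
        if value >= cutoff:
            return points
    return 0

def continuous_score(value: float, worst: float, best: float, max_points: float) -> float:
    """Smooth linear map from [worst, best] onto [0, max_points], clamped at both ends."""
    frac = (value - worst) / (best - worst)
    return max_points * min(1.0, max(0.0, frac))
```

The continuous form avoids the clustering-at-breakpoints problem noted under Known Limitations, which is why it is the better choice for cross-experiment aggregation.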
Statistical Features:
- Trimmed mean (10%) on all aggregated metrics to reduce outlier impact
- 95% confidence intervals on quality scores
- CI-based "within margin of noise" detection for statistically indistinguishable models
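These three features can be sketched with the standard library. The normal-approximation interval and the CI-overlap test are assumptions about the exact implementation:

```python
import math
from statistics import mean, stdev

def trimmed_mean(xs: list, trim: float = 0.10) -> float:
    """Drop the lowest and highest trim fraction before averaging."""
    xs = sorted(xs)
    k = int(len(xs) * trim)
    core = xs[k:len(xs) - k] if k else xs
    return mean(core)

def ci95(xs: list) -> float:
    """Half-width of a 95% CI on the mean (normal approximation, an assumption)."""
    return 1.96 * stdev(xs) / math.sqrt(len(xs))

def within_noise(mean_a: float, ci_a: float, mean_b: float, ci_b: float) -> bool:
    """Treat two models as statistically indistinguishable when their CIs overlap."""
    return abs(mean_a - mean_b) <= ci_a + ci_b
```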
Flagging System (5 flag types for human spot-check):
- Individual outlier (>1σ below model's mean)
- Absolute floor (quality ≤ 2, e.g., content filter hits)
- Model divergence (delta diff ≥ 0.8 between models on same prompt)
- Both models spike low (both deltas ≤ -0.8 on same prompt)
- Extreme positive delta (either delta ≥ +1.5, baseline may have fumbled)
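The five rules can be sketched as one function. Thresholds come from the list above; how each model's per-prompt delta is defined is an assumption here (deviation from that model's own mean quality):

```python
def flag_prompt(q_custom: float, q_base: float,
                d_custom: float, d_base: float,
                sd_custom: float) -> list:
    """Return flag labels for one paired run. d_* are each model's deltas from
    its own mean score (an assumed definition); sd_custom is the custom model's sigma."""
    flags = []
    if d_custom < -sd_custom:
        flags.append("individual_outlier")      # >1 sigma below the model's mean
    if q_custom <= 2 or q_base <= 2:
        flags.append("absolute_floor")          # e.g. content-filter hits
    if abs(d_custom - d_base) >= 0.8:
        flags.append("model_divergence")        # models disagree on this prompt
    if d_custom <= -0.8 and d_base <= -0.8:
        flags.append("both_spike_low")          # likely a bad prompt
    if d_custom >= 1.5 or d_base >= 1.5:
        flags.append("extreme_positive_delta")  # baseline may have fumbled
    return flags
```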
Experiment Framework:
- Named experiments with persistent results
- Paired execution — same prompt sent to both models per step (apples-to-apples)
- 10 runs per mode, round-robin cadence
- Findings system — flagged items can be saved with notes for documentation
- Auto-refresh dashboard with moving average toggle (5-run window)
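The paired, round-robin cadence might look like the following sketch; the prompt sources and model callables are placeholders, and the mode names follow the three evaluation modes defined earlier:

```python
MODES = ["verbatim", "rephrase", "novel"]
RUNS_PER_MODE = 10

def schedule() -> list:
    """Round-robin cadence: interleave the three modes until each has 10 runs."""
    return [mode for _ in range(RUNS_PER_MODE) for mode in MODES]

def run_experiment(prompts_by_mode: dict, ask_custom, ask_baseline) -> list:
    """Paired execution: at each step the same prompt goes to both models,
    so every comparison is apples-to-apples."""
    results = []
    for step, mode in enumerate(schedule()):
        prompt = prompts_by_mode[mode][step // len(MODES)]
        results.append({
            "mode": mode,
            "prompt": prompt,
            "custom": ask_custom(prompt),
            "baseline": ask_baseline(prompt),
        })
    return results
```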
Known Limitations:
- LLM-as-judge introduces its own nondeterminism
- Threshold scoring susceptible to clustering at breakpoints (use continuous for cross-experiment aggregation)
- Content filter behavior is probabilistic and can produce false quality failures
- GPU floating-point nondeterminism means identical models on separate deployments will show score variation
- Not a substitute for formal evaluation frameworks (HELM, lm-eval-harness) — designed for rapid iterative assessment and teaching