Evaluation

Evaluation Results

All models evaluated on 52 hand-labeled real test examples. Metrics include Macro-F1, per-class F1, ECE, and calibration analysis.
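For concreteness, the two headline metrics can be sketched as follows. The label names and arrays here are illustrative, not the actual test set.

```python
import numpy as np
from sklearn.metrics import f1_score

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin predictions by confidence, then take the weighted average
    gap between each bin's mean confidence and its empirical accuracy."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return ece

# Illustrative labels (not the real test set).
y_true = ["bot", "committed_leave", "comparison_shopping", "bot", "comparison_shopping"]
y_pred = ["bot", "comparison_shopping", "comparison_shopping", "bot", "comparison_shopping"]
macro_f1 = f1_score(y_true, y_pred, average="macro")  # unweighted mean of per-class F1
```

Macro-F1 weights every class equally, which matters here: a rare class like committed_leave counts as much as bot detection.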

The Fine-Tuning Delta

Every zero-shot model scored below 0.11 on Macro-F1. Fine-tuning on 884 synthetic examples transformed performance, and the top score came from Gemma 2B, the smallest of the high-performing models.

Full Results

All systems are scored on the same 52 hand-labeled real test examples. ECE is measured after temperature scaling applied to logprob-derived confidence.

System                    Params  Macro-F1  95% CI           ECE
Zero-shot Gemma 2B        2B      0.063     [0.040, 0.089]   0.755
Zero-shot Mistral 7B      7B      0.095     [0.063, 0.128]   0.645
Zero-shot Llama 3B        3B      0.108     [0.029, 0.186]   0.632
Llama 1B LoRA             1B      0.196     [0.117, 0.274]   0.154
Gemma 2B CF LoRA (r=8)    2B      0.249     [0.151, 0.336]   0.129
Mistral 7B CF LoRA        7B      0.760     [0.648, 0.852]   0.075
Llama 3.2 3B LoRA         3B      0.856     [0.764, 0.930]   0.094
Gemma 2B Full LoRA        2B      0.916     [0.813, 0.981]   0.056
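The text does not state how the 95% CIs were computed; one standard choice, sketched here under that assumption, is a percentile bootstrap over the 52 test examples:

```python
import numpy as np
from sklearn.metrics import f1_score

def bootstrap_macro_f1_ci(y_true, y_pred, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap: resample test examples with replacement,
    recompute Macro-F1 each time, and take the (alpha/2, 1 - alpha/2) quantiles."""
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    n = len(y_true)
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)
        scores.append(f1_score(y_true[idx], y_pred[idx], average="macro"))
    lo, hi = np.quantile(scores, [alpha / 2, 1 - alpha / 2])
    return lo, hi
```

With only 52 test examples the intervals are wide, which is why the table reports them: the Llama 3B and Gemma 2B adapters' intervals overlap.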

Per-Class F1 Across Models

Where each model excels and struggles. All three top adapters nail bot detection (1.0 or close), but committed_leave is universally the hardest class — its event traces overlap with comparison_shopping (both involve browsing without filling fields).

Calibration: ECE Comparison

Verbalized confidence ("the model says 0.85") is consistently overconfident. Extracting logprobs directly improves calibration. Temperature scaling pushes ECE below 0.06 for the Gemma 2B adapter — the model's probability distribution is more trustworthy than its words.
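Temperature scaling fits a single scalar T on a held-out split, dividing the logits by T before the softmax. A minimal sketch with SciPy; the logit values below are illustrative, not model outputs:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def fit_temperature(logits, labels):
    """Pick T > 0 minimizing the negative log-likelihood of softmax(logits / T)."""
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels)

    def nll(t):
        z = logits / t
        z -= z.max(axis=1, keepdims=True)  # stabilize the log-sum-exp
        log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(labels)), labels].mean()

    return minimize_scalar(nll, bounds=(0.05, 20.0), method="bounded").x

# Confident logits but chance accuracy: the fitted T comes out large,
# softening the probabilities toward 0.5.
logits = np.array([[4.0, 0.0]] * 4)
labels = np.array([0, 1, 0, 1])
T = fit_temperature(logits, labels)
```

Because T rescales all logits uniformly, it changes confidence without changing any prediction, so Macro-F1 is untouched while ECE drops.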

Reliability Diagram

The diagonal is perfect calibration. Before calibration, the model clusters predictions in the 0.8-1.0 range regardless of actual accuracy. After temperature scaling, predictions spread across the confidence range and track the diagonal more closely.
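The data behind such a diagram is just per-bin (mean confidence, accuracy) pairs, using the same binning as ECE. A sketch with illustrative inputs:

```python
import numpy as np

def reliability_curve(confidences, correct, n_bins=10):
    """Per-bin (mean confidence, accuracy) points; plotting accuracy against
    confidence, the line y = x is perfect calibration."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    points = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():  # skip empty bins rather than plotting them at zero
            points.append((confidences[mask].mean(), correct[mask].mean()))
    return points
```

Points below the diagonal are overconfident bins; the pre-calibration cluster in 0.8-1.0 sits well below it.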

Confusion Matrix

Gemma 2B Full LoRA on 52 real test examples. The main failure mode: comparison_shopping misclassified as committed_leave (3 cases). Both classes involve browsing without form engagement — the distinguishing signal is subtle.
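A matrix like this is a one-liner with scikit-learn. The labels below are an illustrative subset of the taxonomy, not the full label set:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

labels = ["bot", "comparison_shopping", "committed_leave"]  # illustrative subset
y_true = ["bot", "comparison_shopping", "comparison_shopping", "committed_leave"]
y_pred = ["bot", "committed_leave", "comparison_shopping", "committed_leave"]

# Rows are true classes, columns are predictions (in `labels` order);
# off-diagonal cells are the failure modes worth reading.
cm = confusion_matrix(y_true, y_pred, labels=labels)
```

Passing `labels` explicitly fixes the row/column order, so the comparison_shopping-as-committed_leave cell is always at the same position across runs.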

Training Loss Curves

All models converge by epoch 3 with no overfitting. The Gemma 2B CF adapter (r=8, constrained for Cloudflare) starts with much higher loss and never catches up — rank matters more than model size for this task.
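The rank constraint on the Cloudflare adapter amounts to one line of config. A sketch assuming Hugging Face peft; the target modules and alpha are illustrative, not taken from the actual training setup:

```python
from peft import LoraConfig

# r=8 caps the adapter rank (the Cloudflare constraint described above).
cf_config = LoraConfig(
    r=8,
    lora_alpha=16,                        # assumption: common 2x-rank default
    target_modules=["q_proj", "v_proj"],  # assumption: attention projections only
    task_type="CAUSAL_LM",
)
```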