Evaluation & Safety
Measuring whether models are good — and keeping them from being bad.
Evaluation Basics
Because "it looks good to me" is not a test suite.
LLM-as-a-Judge
Using models to grade models — and when to distrust the grader.
Benchmarks & Leaderboards
From MMLU to SWE-bench to LMArena: what the scores mean and when they lie.
Red Teaming & Jailbreaks
Attack your own AI before someone else does.
Alignment & Safety Basics
Why models refuse, how they're steered, and the bigger risk map.