Calibration bench
Every useful definition leaves test marks.
The bench is where a sentence is treated like a tool. It is turned, measured, stressed, and checked against the readers who will use it. The point is not to make LLM vocabulary sound more academic. The point is to keep practical language from collapsing into vague confidence.

Layer
Does the wording identify whether it describes a model, product interface, data process, policy claim, or user behavior?
Transfer
Can the definition move from a specialist note into a plain-language answer without losing its caveat?
Failure
What mistaken belief would appear if the caveat were removed or the phrase were quoted out of context?
Evidence
What kind of source would be needed to support the definition in a particular product or research setting?
Repair
How should the definition change when vendors rename features, evaluation norms shift, or model behavior improves?
The bench favors repairable language.
A brittle definition sounds exact until the product, model, or research habit changes. A repairable definition names its assumptions. It can say what is known, what is contested, and what must be checked in a specific deployment. That is why the bench keeps failure modes close to the final wording. The reader should understand not only what a term means, but why a looser version of the same term would cause practical trouble.
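One way to picture a repairable definition is as a record that carries its caveats alongside its wording. The sketch below is only an illustration under assumptions: the `Definition` class, its field names, and the `missing_parts` helper are hypothetical and come from no particular tool, and the sample entry is invented for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Definition:
    """Hypothetical record for one glossary entry (field names are assumptions)."""
    term: str
    wording: str
    layer: str = ""  # model, product interface, data process, policy claim, or user behavior
    assumptions: List[str] = field(default_factory=list)
    known: List[str] = field(default_factory=list)
    contested: List[str] = field(default_factory=list)
    must_check: List[str] = field(default_factory=list)     # per-deployment checks
    failure_modes: List[str] = field(default_factory=list)  # beliefs a looser wording invites


def missing_parts(d: Definition) -> List[str]:
    """Return the names of the caveat fields that are still empty."""
    parts = {
        "layer": d.layer,
        "assumptions": d.assumptions,
        "known": d.known,
        "contested": d.contested,
        "must_check": d.must_check,
        "failure_modes": d.failure_modes,
    }
    return [name for name, value in parts.items() if not value]


# A brittle entry: wording only, nothing that says how to repair it later.
brittle = Definition(
    term="context window",
    wording="How much text the model can read.",
)

# A repairable entry: same term, with its assumptions and checks written down.
repairable = Definition(
    term="context window",
    wording="The amount of input, counted in tokens, that a model can take into account at once.",
    layer="model",
    assumptions=["The limit is measured in tokens, not characters or pages."],
    known=["Input beyond the limit is not seen by the model."],
    contested=["How evenly a model uses material across a long input."],
    must_check=["The limit documented for the specific model and product in use."],
    failure_modes=["Assuming everything pasted into a chat is always read in full."],
)

print(missing_parts(brittle))
print(missing_parts(repairable))
```

An empty result from `missing_parts` does not show that a definition is correct. It shows only that the entry has somewhere to record a change when a vendor renames a feature or an evaluation norm shifts.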