
The AI safety illusion: why current safety datasets fool us on model safety
AI models today are increasingly trained to behave “safely,” meaning they decline requests that could lead to harmful outcomes. But what does it actually mean for a model to be safe? In most cases, safety is measured through safety benchmarks: curated collections of adversarial prompts designed to test whether a model refuses unsafe requests.


