Coral Mountain Data | AI Data Annotation Services

AI models today are increasingly trained to behave “safely,” meaning they decline requests that could lead to harmful outcomes. But what does it actually mean for a model to be safe? In most cases, safety is measured through safety benchmarks—curated collections of adversarial prompts designed to test whether a model refuses unsafe requests. If the refusal rate reaches a certain threshold, the model is generally considered safe.

However, this assumption deserves closer examination. What if the datasets used to measure safety are flawed? If that is the case, what exactly are they measuring, and can we truly rely on them to judge model safety?

In a recent research study, Coral Mountain conducted a systematic evaluation of two widely used AI safety datasets—AdvBench and HarmBench. The study examined these datasets both in isolation and in real-world evaluation settings. In isolation, we analyzed whether the datasets reflect adversarial attacks that are motivated by hidden intent, carefully crafted, and outside normal data distributions. In practical evaluation scenarios, we examined whether they capture meaningful safety risks.

Beyond safety evaluation, this work also connects with broader AI infrastructure practices such as data collection, natural data generation, image annotation, photo annotation, video annotation, traffic annotation, and the use of off-the-shelf datasets in model training and benchmarking pipelines.

Key findings

Our analysis reveals several important issues in how safety datasets currently function.

First, safety datasets often fail to represent real-world adversarial behavior accurately. One major reason is their heavy reliance on what we call triggering cues—words or phrases with obvious negative or sensitive meaning that are intentionally included to activate a model’s safety filters. Examples include combinations such as a harmful request followed by phrases like “without getting caught.”

Second, when these triggering cues are removed, models that previously appeared “reasonably safe” often fail to maintain that safety.

Third, the implications extend beyond public benchmarks. Although our analysis used publicly available datasets, the same structural patterns likely influence internal safety evaluations and alignment efforts, since similar testing approaches are widely used across models from different providers.

Together, these findings highlight a significant gap between how model safety is currently evaluated and how adversarial behavior actually appears in real-world scenarios.

This raises a fundamental question: Are we measuring AI safety correctly, and are alignment efforts based on these measurements moving in the right direction?

What real-world adversarial behavior looks like

Real-world adversarial attacks typically share three core characteristics.

Motivated by ulterior intent.
In real misuse scenarios, malicious goals are rarely stated directly. Even inexperienced attackers usually avoid language that explicitly signals harmful intent.

Well-crafted.
Effective attacks are carefully structured to exploit weaknesses in the system while avoiding detection by safety filters.

Out-of-distribution.
Adversarial prompts often differ significantly from ordinary user queries. They may use unusual phrasing, unexpected formats, or indirect framing that models rarely encounter during training.

For safety benchmarks to be effective, they must reflect these characteristics. Only then can they serve as realistic proxies for real-world misuse.

Dataset pathologies: triggering cues and structural duplication

To understand the structure of existing safety datasets, we analyzed AdvBench and HarmBench using two simple techniques: word cloud visualization and pairwise similarity analysis.

Word clouds of the 40 most frequent n-grams, where n ∈ {1, 2, 3}, from the combined AdvBench and HarmBench corpus. Triggering cues are highlighted in red for inherently, and orange for contextually negative/sensitive connotations. Neutral-connotation unigrams that contribute to triggering cues in higher-order n-grams are also shown in green.

Word cloud visualization

Examining the most frequent n-grams (for n = 1, 2, 3) reveals a heavy concentration of expressions with obvious negative or sensitive meaning. These expressions form what we call triggering cues.

Triggering cues appear in two main forms.

Inherent cues are words that naturally carry negative meaning, such as “steal.”

Contextual cues become sensitive depending on the surrounding phrase—for example, a harmful request combined with “without getting caught.”

However, the excessive and repetitive use of such cues is inconsistent with real-world adversarial behavior. Attackers rarely use language that directly exposes malicious intent, because doing so would immediately trigger safety mechanisms.

This suggests that many prompts in existing safety datasets are artificially constructed and disconnected from real attack strategies. As a result, they violate two key properties of real-world adversarial behavior: being well-crafted and concealing malicious intent.

Pairwise similarity analysis

The repeated use of triggering cues also leads to structural overlap between prompts. To quantify this, we conducted pairwise similarity analysis across different similarity thresholds within each dataset.

We then compared these results with a non-safety dataset, GSM8K, which served as a baseline.

The difference is striking. Safety datasets—despite supposedly representing diverse adversarial inputs—show much higher structural similarity than the non-safety dataset. For example, in AdvBench, more than 70% of prompts have similarity scores above 0.9, and over 11% are nearly identical with similarity scores above 0.99.

This level of homogeneity indicates substantial duplication and repetition of the same underlying malicious intent.

Motivating evidence

Two major structural issues emerge from this analysis.

First, safety datasets heavily overuse triggering cues.
Second, many prompts are duplicated or extremely similar.

Together, these problems undermine all three defining properties of real-world adversarial attacks: hidden intent, careful design, and distributional novelty.

In practical terms, this means that many prompts in these datasets effectively represent the same malicious scenario. If a model successfully refuses one prompt, it is likely to refuse the duplicates as well, artificially inflating perceived safety performance.

Proportion of duplicated versus unique data points in the AdvBench and HarmBench datasets across different similarity thresholds. Each safety dataset is compared with a size-matched subset of GSM8K shown below its plot.

Examples of duplicated data points from the AdvBench and HarmBench datasets. These examples exhibit two unusual patterns: (1) explicit and repetitive overuse of triggering cues, either inherently (in red, e.g., “chop shops”) or contextually (in orange, e.g., “in detail”); and (2) substantial duplication resulting from this overuse. Each group of duplicates effectively represents a single malicious intent, i.e., a refusal or response to one is sufficient to evaluate the robustness of the model for that intent. As a result, safety evaluations based on these data points can be inflated.

Intent laundering: removing triggering cues while preserving malicious intent

The observations above naturally lead to a key question: if triggering cues are removed, will models still appear safe?

To explore this, we developed a technique called intent laundering. The goal is to remove triggering cues while keeping the underlying malicious objective unchanged.

Rather than stating harmful intent directly, the technique reframes the request so that the intent is implied.

Intent laundering relies on two complementary transformations.

Connotation neutralization.
Words with strong negative connotations are replaced with more neutral alternatives. For example, a request asking how to “manipulate” someone could be reframed using the word “influence,” while maintaining the same practical meaning.

Context transposition.
References to real-world individuals or institutions that may function as triggering cues are replaced with fictional or abstract equivalents. The resulting scenario still preserves the operational steps needed to carry out the harmful action.

To automate this process, we built an “intent launderer”—a large language model configured with a few-shot in-context learning setup. The system generates revised prompts that remove triggering cues while preserving the underlying intent.

Model responses to these prompts are then evaluated along two dimensions.

Safety, which measures whether the model produces harmful content.
Practicality, which measures whether the response provides actionable information that could enable real-world misuse.

The attack success rate (ASR) is defined as the proportion of cases where a response is both unsafe and practically usable.

Where safety datasets fail to reflect real-world adversarial threats

Impact of removing triggering cues using our intent laundering technique on ASR for the AdvBench and HarmBench. The no revision setting uses the original data points when triggering cues are present. The first revision setting uses revisions in which triggering cues are abstracted away.

We evaluated multiple models using both the original datasets and their intent-laundered versions.

The results reveal a dramatic contrast.

When evaluated on the original prompts—where triggering cues are present—models appear relatively safe. The average attack success rate is roughly 5% on AdvBench and 14% on HarmBench.

However, once triggering cues are removed through intent laundering, the picture changes dramatically. The average ASR rises to approximately 87% on AdvBench and 80% on HarmBench.

This suggests that many model refusals are triggered primarily by the presence of explicit cues rather than by genuine understanding of harmful intent.

In other words, current safety datasets measure sensitivity to specific words more than they measure true resistance to malicious behavior.

Even models widely recognized for strong safety alignment, such as Gemini 3 Pro, or for strict refusal behavior, such as Claude Sonnet 3.7, no longer appear safe once those cues disappear.

Perhaps most concerning is that this pattern appears consistently across models from different providers. This suggests that internal evaluation frameworks and alignment strategies may also depend heavily on the same dataset artifacts.

Intent laundering as a strong jailbreaking technique

Because intent laundering bypasses triggering cues, it can also function as an effective jailbreaking method.

By adding an iterative revision process, we further improved its effectiveness. Each unsuccessful attempt is fed back into the system, allowing the intent launderer to generate a refined version of the prompt.

After only a few iterations, the attack success rate increases dramatically—reaching 90% to 98% across models.

These results suggest two important conclusions.

First, current safety alignment techniques remain vulnerable to attacks that resemble realistic adversarial behavior.
Second, intent laundering provides a systematic way to expose weaknesses in model safety mechanisms.

Performance of intent laundering as a jailbreaking technique on the AdvBench and HarmBench datasets. The no revision setting uses the original data points with triggering cues, while the other settings apply intent laundering iteratively. For each dataset, the number of iterations is selected to best demonstrate the performance of our method.

What this means for AI safety

Taken together, these findings suggest that current safety evaluations rely too heavily on artificial triggering cues rather than on realistic harmful behavior.

This creates three escalating problems.

Measurement failure.
Safety benchmarks may overestimate model safety because models learn to respond to specific trigger phrases rather than actual malicious intent.

Scientific misinterpretation.
Research conclusions about safety may become overly optimistic if they are based on simplified evaluation methods.

Operational vulnerability.
Real-world attacks may be far easier than safety metrics suggest, because unrealistic testing conditions inflate safety performance.

Conclusion and future directions

Our study highlights a critical blind spot in current AI safety evaluations. Many safety datasets rely on unrealistic triggering cues, creating a mismatch between benchmark results and real-world adversarial behavior.

This does not mean AI safety research is ineffective. Instead, it highlights the need for better evaluation methods and more realistic benchmarks.

Future work should focus on several improvements:

Reducing reliance on predefined triggering cues
Designing safety datasets that better reflect real-world adversarial behavior
Evaluating models using scenarios that realistically represent misuse

At Coral Mountain, improving the quality of AI datasets is central to building trustworthy systems. This includes developing advanced pipelines for data collection, natural data generation, image annotation, photo annotation, video annotation, traffic annotation, and high-quality off-the-shelf datasets that support robust benchmarking and training.

Finally, intent laundering can serve as a powerful diagnostic and red-teaming tool. By moving beyond the limitations of current benchmarks, it helps reveal where AI safety mechanisms truly work—and where they fail.

The AI safety illusion: why current safety datasets fool us on model safety

Word cloud visualization

On this page

Share this article

Stay Updated

Data Collection

Solutions

Industries

Company

Offices