For Policy’s Sake: Why We Must Distinguish AI Safety from AI Security in Regulatory Governance

post by Katalina Hernandez (katalina-hernandez) · 2025-04-04T09:16:20.712Z

Contents

  1. AI Safety: Protecting Humans from AI-generated Harms
        Core Concerns:
        Examples of Typical Techniques:
        Real-world Examples:
  2. AI Security: Protecting AI from Malicious Human Actors
        Core Concerns:
        Examples of Typical Techniques:
        Consequences if Security Fails:
        Real-world Examples:
  Where Safety and Security Intersect
      Why This Distinction Matters for AI Governance
  Connecting AI Safety to Specific AI Act Provisions
  Why This Matters for Governance Translation
    Why This Distinction Might Feel Artificial or Limiting

I am a policy researcher and data protection officer working in AI Governance for a European corporation. I do not have an ML background, and I am a lawyer... which is precisely why I am here, seeking your expertise.

Over the past few months, I’ve been working at the intersection of AI governance and technical alignment research, trying to understand how AI safety concepts can be translated into regulatory structures that preserve both nuance and enforceability.

In doing so, I’ve noticed recurring confusion around terminology, especially between what we mean by “AI Safety” vs. “AI Security”. 

These terms carry different connotations depending on whether you're an ML engineer, a policy advisor, or a safety researcher focused on x-risk. But as I read papers that aim to inform policy, I am increasingly convinced that a clearer distinction is needed for regulatory governance purposes.

This post outlines a simplified but actionable framework that I’ve found useful when mapping these concepts onto regulatory efforts (e.g., the EU AI Act). 

This distinction is not intended to replace deeper discourse. It is meant to be a starting point for productive cooperation between technical and governance stakeholders.

I’m particularly keen to hear from this community on where this mapping breaks down, or how it could better reflect the realities of working on interpretability, adversarial robustness, or scalable oversight from the ground up.
 

1. AI Safety: Protecting Humans from AI-generated Harms

Objective: Ensuring that AI systems behave in ways that avoid causing harm or unintended consequences to humans, society, or the environment.

What AI Safety Protects: Human well-being, societal values, fundamental rights, and environmental integrity.

Core Concerns:

Examples of Typical Techniques:

Human Role: Humans as beneficiaries. AI Safety ensures AI remains beneficial and protective of human interests.

Core Question:
"Will this AI system unintentionally or intentionally harm me or others?"

Consequences if Safety Fails: Direct human harm, ranging from physical injury, misinformation, and emotional manipulation to potentially catastrophic societal risks.

Real-world Examples:

2. AI Security: Protecting AI from Malicious Human Actors

Objective: Defending AI systems and their data against intentional attacks, unauthorized access, theft, manipulation, or exploitation.

What AI Security Protects: The integrity of AI systems, their data, and intellectual property (e.g., model weights and proprietary algorithms).

Core Concerns:

Examples of Typical Techniques:

Human Role: Humans here act as potential attackers, adversaries, or malicious users of the AI system.

Core Question:
"Can someone intentionally exploit, manipulate, or steal information from this AI system?"

Consequences if Security Fails:

Real-world Examples:

Where Safety and Security Intersect

It's essential for regulatory AI Governance to acknowledge the overlap here: a security failure, such as an adversarial attack tricking a self-driving car into not recognizing pedestrians, is not just a security concern; it becomes an immediate safety issue causing direct human harm.
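To make this concrete for readers coming from policy: the textbook version of such an attack is the fast gradient sign method (FGSM). Here is a minimal PyTorch sketch, illustrative only; real attacks on deployed perception systems are more involved:

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model: torch.nn.Module, x: torch.Tensor, y: torch.Tensor,
                eps: float = 0.03) -> torch.Tensor:
    """One-step adversarial perturbation (Goodfellow et al., 2015): nudge
    each input dimension in the direction that increases the model's loss."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    # The perturbed image can look unchanged to a human yet flip the label.
    return (x + eps * x.grad.sign()).detach()
```

The same perturbation is simultaneously a security event (an adversary exploiting the model) and a safety event (a pedestrian going undetected).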

Yet, despite this overlap, the fundamental intentions behind the two fields differ: Safety protects humans from the AI system, while Security protects the AI system from humans.

Why This Distinction Matters for AI Governance

AI governance, particularly in regulatory contexts like the EU AI Act, explicitly aims to safeguard individuals from AI-related harm. The AI Act defines its purpose in Art. 1 as:

“Ensuring a high level of protection of health, safety, fundamental rights […] against the harmful effects of AI systems.”

Given this objective, I believe that regulatory frameworks should explicitly incorporate and incentivize not only AI Security but also AI Safety research, including alignment, interpretability, and control.

Connecting AI Safety to Specific AI Act Provisions

To ground this in existing regulatory language, I will list a few provisions of the EU AI Act where AI Safety (rather than “security”) needs to be kept in mind:

Art. 13 - Transparency Obligations:

“High-risk AI systems shall be designed and developed in such a way as to ensure that their operation is sufficiently transparent to enable deployers to interpret a system’s output and use it appropriately.”

This provision isn't just about documentation; it calls for meaningful transparency into model behavior.

While traditional “explainability” tools (e.g., SHAP, LIME) offer surface-level insights, mechanistic interpretability aims to go further: it investigates the internal reasoning structures of the model (circuits, attention heads, representations) to explain why a model behaved a certain way, not just what it did.
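As a concrete illustration of the "surface-level" end of that spectrum, here is a minimal SHAP sketch, assuming the shap and scikit-learn libraries; the dataset and model are placeholders:

```python
# Surface-level explainability: attribute one prediction to input features.
import shap
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
model = RandomForestClassifier(random_state=0).fit(X, y)

explainer = shap.Explainer(model)   # dispatches to a tree-based explainer here
attributions = explainer(X[:1])     # "feature i pushed this score up/down"
print(attributions.values.shape)    # per-feature attributions, nothing about internals
```

A mechanistic analysis would instead inspect the model's internal computation (with tooling such as TransformerLens for transformer models) to ask why the output was produced, which is arguably the depth of transparency Art. 13 gestures at.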

“Sufficient transparency” is currently undefined. Without standards that include interpretability research, this requirement risks being satisfied by shallow explainability: presenting outputs with plausible reasoning, without surfacing the actual mechanisms behind them.

Art. 14 - Human Oversight:

“Human oversight shall aim to prevent or minimise the risks to health, safety or fundamental rights... in particular where such risks persist despite the application of other requirements.”

Human oversight is about preventing AI from causing harm even when the system complies with every other requirement.

But meeting the underlying objective of minimising risks to health, safety or fundamental rights still requires model alignment: ensuring that systems produce outcomes consistent with human intent and values. It also hinges on interpretability, because oversight without insight is just observation.

“Such measures shall enable the oversight person to... correctly interpret the high-risk AI system’s output, taking into account, for example, the interpretation tools and methods available.”

Oversight is not just about who is watching, but how they're empowered to understand and intervene. This refers directly to interpretability and control tools: methods that help humans not only interpret outputs, but intervene when the system behaves unexpectedly. Alignment research (e.g., RLHF, Constitutional AI) is foundational here, as are control techniques like steering via reward models, logit regularization / activation steering, or rejection sampling / output filtering.
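As a sketch of what the last of these looks like in practice (generate and safety_score are hypothetical stand-ins for a model call and a learned or rule-based filter, not any particular library's API):

```python
from typing import Callable, Optional

def filtered_generate(
    generate: Callable[[str], str],         # hypothetical: one model completion
    safety_score: Callable[[str], float],   # hypothetical: higher = safer output
    prompt: str,
    threshold: float = 0.9,
    max_tries: int = 8,
) -> Optional[str]:
    """Rejection sampling: draw completions until one passes the output filter."""
    for _ in range(max_tries):
        candidate = generate(prompt)
        if safety_score(candidate) >= threshold:
            return candidate
    return None  # caller falls back to a refusal or human escalation
```

Mechanisms like this give the oversight person of Art. 14 concrete levers: tighten the threshold, inspect rejected candidates, or escalate persistent failures to a human.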

 

Art. 15 - Accuracy, Robustness, and Cybersecurity:

These are the core AI Security obligations, explicitly covering adversarial robustness, cybersecurity, and data protection: critical for preventing external exploitation or manipulation of AI systems.

Also relevant: Art. 55’s model evaluation obligations for providers of general-purpose AI models with systemic risk.

Why This Matters for Governance Translation

The distinctions I've outlined aren't just semantics. I find them useful when thinking about governance frameworks and the allocation of responsibilities among stakeholders. When looking at a given problem (for example, the human oversight issue in Article 14) and its possible solutions, are we emphasizing the intended, positive outcome we expect for humans (Safety), or are we focusing on the integrity of the AI system and the confidentiality trade-offs arising from model auditing (Security)?

Clarifying the boundaries between AI Safety and AI Security is practical and necessary for robust AI governance. Misunderstanding these terms leads to confusion and misplaced regulatory expectations, ultimately reducing the effectiveness of governance efforts.

 

Why This Distinction Might Feel Artificial or Limiting

As I present this simplified framework distinguishing AI Safety from AI Security, I anticipate (and welcome!) pushback, particularly from ML engineers and researchers.

Some valid critiques might include, for example, that adversarial robustness spans both categories, or that the boundary blurs wherever a security failure becomes a safety failure.

If, by any chance, you've come across my Substack, you know how much I recognize and value these critiques.

We know that the complexities of real-world engineering and research rarely fit neatly into conceptual categories. My intention with this isn't to ignore these overlaps or nuances.

My goal is simply to provide clarity and structure that supports policy-makers, regulatory professionals, and enterprise risk experts in translating technical insights into effective governance.

I strongly believe AI governance should be driven by open dialogue between policy and technical communities.

So, I am asking engineers, researchers, and security specialists to please challenge, critique, and refine this framework.

The last thing we need is for AI governance to impose arbitrary distinctions.

We need practical solutions that reflect real-world technical realities.

But for governance to work, we need ways to bring key safety and security breakthroughs into policy, and that means understanding which research is relevant to which regulatory goal.

Your expertise is essential to get this right!

1 comment


comment by Katalina Hernandez (katalina-hernandez) · 2025-04-04T09:24:00.551Z

I’m aware this “safety vs. security” distinction isn’t clean in real-world ML work (e.g., I understand that adversarial robustness spans both).

But it’s proven useful for communicating with policy teams who are trying to assign accountability across domains.

I’m not arguing against existential AI Safety framing, just using the regulatory lens where “safety” often maps to preventing tangible human harms, and “security” refers to model integrity and defense against malicious actors.

If you’ve found better framings or language that have worked across engineering/policy interfaces, I’d love to hear them. 

Especially if you think interpretability or control work gets misclassified in governance discourse.

Grateful for your thoughts, please tell me where this falls short of your technical experience.