For Policy’s Sake: Why We Must Distinguish AI Safety from AI Security in Regulatory Governance
post by Katalina Hernandez (katalina-hernandez) · 2025-04-04T09:16:20.712Z
I am a policy researcher and data protection officer working in AI Governance for a European corporation. I do not have an ML background, and I am a lawyer... which is precisely why I am here, seeking your expertise.
Over the past few months, I’ve been working at the intersection of AI governance and technical alignment research, trying to understand how AI safety concepts can be translated into regulatory structures that preserve both nuance and enforceability.
In doing so, I’ve noticed recurring confusion around terminology, especially between what we mean by “AI Safety” vs. “AI Security”.
These terms carry different connotations depending on whether you're an ML engineer, a policy advisor, or a safety researcher focused on x-risk. But as I read papers that aim to inform policy, I am increasingly convinced that a clearer distinction is needed for regulatory governance purposes.
This post outlines a simplified but actionable framework that I’ve found useful when mapping these concepts onto regulatory efforts (e.g., the EU AI Act).
This distinction is not intended to replace deeper discourse. It is meant to be a starting point for productive cooperation between technical and governance stakeholders.
I’m particularly keen to hear from this community on where this mapping breaks down, or how it could better reflect the realities of working on interpretability, adversarial robustness, or scalable oversight from the ground up.
1. AI Safety: Protecting Humans from AI-generated Harms
Objective: Ensuring that AI systems behave in ways that avoid causing harm or unintended consequences to humans, society, or the environment.
What AI Safety Seeks to Protect: Human well-being, societal values, fundamental rights, and environmental integrity.
Core Concerns:
- Alignment: Ensuring AI systems' objectives and behaviors are in line with human intentions, values, and ethics.
- Interpretability: Understanding how and why AI models reach their decisions, particularly through research avenues like mechanistic interpretability.
- Preventing Catastrophic Failures: Anticipating and mitigating scenarios where AI could inadvertently cause large-scale harm.
- Avoiding Unintended Behavior: Identifying and correcting subtle ways AI might deviate from intended purposes, even without malicious intent.
Examples of Typical Techniques:
- Mechanistic interpretability (think: feature steering, sparse autoencoders, or dictionary learning; see the sketch after this list)
- Reinforcement Learning from Human Feedback (RLHF)
- Constitutional AI
- Scalable oversight mechanisms for “human in the loop” mandates.
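To make the interpretability item concrete, here is a minimal sparse-autoencoder sketch in the dictionary-learning style used in mechanistic interpretability work. It assumes PyTorch; the dimensions and the batch of activations are illustrative placeholders, not anything from a real model:

```python
# A minimal sketch of a sparse autoencoder (SAE) for dictionary learning
# over model activations. In practice, `acts` would be collected from a
# real model's hidden layers; here it is a random placeholder.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)  # activations -> dictionary features
        self.decoder = nn.Linear(d_dict, d_model)  # features -> reconstructed activations

    def forward(self, x: torch.Tensor):
        features = torch.relu(self.encoder(x))     # non-negative, encouraged to be sparse
        reconstruction = self.decoder(features)
        return reconstruction, features

sae = SparseAutoencoder(d_model=512, d_dict=4096)
acts = torch.randn(64, 512)                        # placeholder batch of activations

recon, feats = sae(acts)
# Loss = reconstruction error + L1 penalty pushing features toward sparsity,
# so each activation is explained by a few interpretable dictionary features.
l1_coeff = 1e-3
loss = nn.functional.mse_loss(recon, acts) + l1_coeff * feats.abs().mean()
```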
Human Role: Humans as beneficiaries. AI Safety ensures AI remains beneficial and protective of human interests.
Core Question:
"Will this AI system unintentionally or intentionally harm me or others?"
Consequences if Safety Fails: Direct human harm, ranging from physical injury, misinformation, and emotional manipulation to potentially catastrophic societal risks.
Real-world Examples:
- Preventing medical AI from giving harmful advice.
- Ensuring autonomous vehicles don't endanger pedestrians.
- Avoiding algorithmic amplification of extremist content.
- Preventing chatbots from offering harmful mental health advice.
- Avoiding deceptive or manipulative behavior in goal-directed agents.
2. AI Security: Protecting AI from Malicious Human Actors
Objective: Defending AI systems and their data against intentional attacks, unauthorized access, theft, manipulation, or exploitation.
What AI Security Protects: The integrity of AI systems, their data, and intellectual property (e.g., model weights and proprietary algorithms).
Core Concerns:
- Cybersecurity for AI: Protecting AI infrastructure from external attacks.
- Adversarial Robustness: Defending AI systems against attacks specifically designed to mislead or deceive models.
- Confidentiality and information security: Using techniques such as secure enclaves, encryption, differential privacy, and secure multiparty computation to protect sensitive data.
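As one concrete example of these confidentiality techniques, here is a minimal sketch of the Laplace mechanism from differential privacy. The query and parameter values are illustrative placeholders:

```python
# A minimal sketch of the Laplace mechanism from differential privacy:
# noise calibrated to the query's sensitivity masks any single individual's
# contribution to the released statistic.
import numpy as np

def laplace_mechanism(true_value: float, sensitivity: float, epsilon: float) -> float:
    """Release `true_value` with noise scaled to sensitivity / epsilon."""
    scale = sensitivity / epsilon
    return true_value + np.random.laplace(loc=0.0, scale=scale)

# Example: privately release a count query (sensitivity 1) at epsilon = 0.5.
# Smaller epsilon means stronger privacy and noisier answers.
private_count = laplace_mechanism(true_value=42, sensitivity=1.0, epsilon=0.5)
```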
Examples of Typical Techniques:
- Adversarial Robustness Training (see the sketch after this list)
- Model Watermarking
- Input Validation and Sanitization
- Differential Privacy, Secure Multiparty Computation, Homomorphic Encryption.
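To ground the first of these, here is a minimal sketch of one adversarial-training step using the Fast Gradient Sign Method (FGSM). It assumes PyTorch; `model`, `images`, `labels`, and `optimizer` are placeholders for a real classifier and data pipeline:

```python
# A minimal sketch of one adversarial-training step via FGSM:
# perturb inputs in the direction that maximizes the loss, then
# train the model on the perturbed inputs.
import torch
import torch.nn as nn

def adversarial_training_step(model, images, labels, optimizer, epsilon=0.03):
    loss_fn = nn.CrossEntropyLoss()

    # 1. Compute gradients of the loss with respect to the inputs.
    images = images.clone().detach().requires_grad_(True)
    loss = loss_fn(model(images), labels)
    loss.backward()

    # 2. Craft adversarial inputs with the sign of the input gradient (FGSM).
    adv_images = (images + epsilon * images.grad.sign()).detach()

    # 3. Train the model on the adversarial inputs.
    optimizer.zero_grad()
    adv_loss = loss_fn(model(adv_images), labels)
    adv_loss.backward()
    optimizer.step()
    return adv_loss.item()
```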
Human Role: Humans here act as potential attackers, adversaries, or malicious users of the AI system.
Core Question:
"Can someone intentionally exploit, manipulate, or steal information from this AI system?"
Consequences if Security Fails:
- Misuse or weaponization of AI systems by adversaries.
- Breaches of proprietary data leading to competitive losses, confidentiality breaches, or indirect societal harms.
Real-world Examples:
- Preventing model theft via API scraping or reverse-engineering
- Defending facial recognition systems from adversarial patches or spoofing attacks
- Stopping autonomous agents from self-replicating or exfiltrating their own code
- Preventing jailbreaks of AI safety guardrails via obfuscation tricks
Where Safety and Security Intersect
It's essential for regulatory AI Governance to acknowledge the overlap here: a security failure, such as an adversarial attack tricking a self-driving car into not recognizing pedestrians, is not just a security concern; it can become an immediate safety issue causing direct human harm.
Yet, despite this overlap, the fundamental intentions behind these two fields differ:
- AI Safety primarily addresses direct human harm caused by AI’s internal behavior.
- AI Security focuses on external threats exploiting AI systems, which can then indirectly cause harm.
Why This Distinction Matters for AI Governance
AI governance, particularly in regulatory contexts like the EU AI Act, explicitly aims at safeguarding individuals from AI-related harm. The AI Act defines its purpose in Art. 1 as:
“Ensuring a high level of protection of health, safety, fundamental rights […] against the harmful effects of AI systems.”
Given this objective, I believe that regulatory frameworks should explicitly incorporate and incentivize not only AI Security but also AI Safety research, including alignment, interpretability, and control.
- Alignment ensures AI outputs genuinely reflect human intentions and ethical standards. Without alignment, even secure systems might produce harmful outcomes.
- Interpretability helps us directly investigate how AI models reason internally, allowing us to audit and improve alignment, beyond merely documenting outputs.
- Control helps prevent models from producing unintended harmful behaviors in the first place.
Connecting AI Safety to Specific AI Act Provisions
To ground this in existing regulatory language, I will list a few provisions of the EU AI Act where AI Safety (rather than “security”) needs to be kept in mind:
Art. 13 - Transparency Obligations:
“High-risk AI systems shall be designed and developed in such a way as to ensure that their operation is sufficiently transparent to enable deployers to interpret a system’s output and use it appropriately.”
This provision isn't just about documentation; it calls for meaningful transparency into model behavior.
While traditional “explainability” tools (e.g., SHAP, LIME) offer surface-level insights, mechanistic interpretability aims to go further: it investigates the internal reasoning structures of the model (circuits, attention heads, representations) to explain why a model behaved a certain way, not just what it did.
“Sufficient transparency” is currently undefined. Without standards that include interpretability research, this requirement risks being satisfied by shallow explainability: presenting outputs with plausible reasoning, without surfacing the actual mechanisms behind them.
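To illustrate what "shallow explainability" looks like in practice, here is a hedged sketch using SHAP on a toy classifier. It attributes predictions to input features but says nothing about the model's internal mechanisms, which is exactly the gap mechanistic interpretability aims to fill. The dataset and model are placeholders:

```python
# A sketch of surface-level explainability with SHAP: per-feature
# attributions for individual predictions, with no view into the
# model's internal circuits or representations.
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X, y)

explainer = shap.Explainer(clf.predict, X)  # model-agnostic, output-level explanation
shap_values = explainer(X[:5])              # feature attributions for 5 predictions
```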
Art. 14 - Human Oversight:
“Human oversight shall aim to prevent or minimise the risks to health, safety or fundamental rights... in particular where such risks persist despite the application of other requirements.”
Human oversight is about preventing AI from causing harm despite compliance with other measures.
But the underlying objective of minimising risks to health, safety or fundamental rights still requires model alignment: ensuring that systems produce outcomes consistent with human intent and values. It also hinges on interpretability, because oversight without insight is just observation.
“Such measures shall enable the oversight person to... correctly interpret the high-risk AI system’s output, taking into account, for example, the interpretation tools and methods available.”
Oversight is not just about who is watching, but how they're empowered to understand and intervene. This refers directly to interpretability and control tools: methods that help humans not only interpret outputs, but intervene when the system behaves unexpectedly. Alignment research (e.g., RLHF, Constitutional AI) is foundational here, as are control techniques like steering via reward models, logit regularization / activation steering, or rejection sampling / output filtering (sketched below).
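As an illustration of the last of these, here is a minimal rejection-sampling / output-filtering sketch. `generate` and `safety_score` are hypothetical stand-ins for a real model's sampling function and a real safety classifier, not any particular library's API:

```python
# A minimal sketch of rejection sampling as a control technique:
# sample several candidate outputs, return the first one the safety
# classifier accepts, and refuse if none passes the threshold.
def controlled_generate(prompt, generate, safety_score,
                        n_samples=8, threshold=0.9):
    for _ in range(n_samples):
        candidate = generate(prompt)          # hypothetical sampling function
        if safety_score(candidate) >= threshold:  # hypothetical safety classifier
            return candidate
    return "I can't help with that."          # fallback when no candidate passes
```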
Art. 15 - Accuracy, Robustness, and Cybersecurity:
These are the core AI Security obligations, explicitly covering adversarial robustness, cybersecurity, and data protection, all critical for preventing external exploitation or manipulation of AI systems.
Also relevant: Art. 55's model evaluation obligations for general-purpose AI models with systemic risk.
Why This Matters for Governance Translation
The distinctions I've outlined aren't just semantics. I find them useful when thinking about governance frameworks and the allocation of responsibilities among stakeholders: when looking at a given problem (for example, human oversight under Article 14) and its possible solutions, are we emphasizing the intended, positive outcome for humans (Safety), or are we focusing on the integrity of the AI system and the confidentiality tradeoffs that arise from model auditing (Security)?
Clarifying the boundaries between AI Safety and AI Security is practical and necessary for robust AI governance. Misunderstanding these terms leads to confusion and misplaced regulatory expectations, ultimately reducing the effectiveness of governance efforts.
Why This Distinction Might Feel Artificial or Limiting
As I present this simplified framework distinguishing AI Safety from AI Security, I anticipate (and welcome!) pushback, particularly from ML engineers and researchers.
Some valid critiques might include:
- Overlapping Realities: In practical engineering, the distinction between safety and security often blurs. For instance, adversarial robustness (typically categorized as security) directly impacts safety, making strict categorization feel artificial or overly simplistic.
- Operational Constraints: Engineers might argue that, in reality, teams work simultaneously on security and safety. For example, an engineer improving model robustness might concurrently address alignment concerns, challenging the notion of separate domains.
- Risk of Silos: Creating rigid conceptual distinctions could inadvertently reinforce organizational silos, potentially hindering interdisciplinary collaboration that is crucial for addressing complex AI risks.
- Terminological Confusion: Some may find that introducing yet another set of distinctions adds to confusion rather than resolving it, particularly given the diverse usage of these terms across academia, policy circles, and industry.
If, by any chance, you've come across my Substack, you know how much I recognize and value these critiques.
We know that the complexities of real-world engineering and research rarely fit neatly into conceptual categories. My intention with this isn't to ignore these overlaps or nuances.
My goal is simply to provide clarity and structure that supports policy-makers, regulatory professionals, and enterprise risk experts in translating technical insights into effective governance.
I strongly believe AI governance should be driven by open dialogue between policy and technical communities.
So, I am asking engineers, researchers, and security specialists to please challenge, critique, and refine this framework.
- How might your practical experience refine these conceptual boundaries in ways that are both accurate and actionable?
- Are there better ways to frame these distinctions that more authentically capture your day-to-day realities while still meeting governance needs?
The last thing we need is for AI governance to impose arbitrary distinctions.
We need practical solutions that reflect real-world technical realities.
But for governance to work, we need ways to bring key safety and security breakthroughs into policy, and that means understanding which research is relevant to which regulatory goal.
Your expertise is essential to get this right!
Comments
comment by Katalina Hernandez (katalina-hernandez) · 2025-04-04T09:24:00.551Z
I’m aware this “safety vs. security” distinction isn’t clean in real-world ML work (e.g., I understand that adversarial robustness spans both).
But it’s proven useful for communicating with policy teams who are trying to assign accountability across domains.
I’m not arguing against existential AI Safety framing, just using the regulatory lens where “safety” often maps to preventing tangible human harms, and “security” refers to model integrity and defense against malicious actors.
If you’ve found better framings or language that have worked across engineering/policy interfaces, I’d love to hear them.
Especially if you think interpretability or control work gets misclassified in governance discourse.
Grateful for your thoughts, please tell me where this falls short of your technical experience.