Scoping LLMs

post by erik (erik-nordby), David Baek (david-baek), emile delcourt (emile-delcourt), 4gate · 2025-04-10T00:32:13.771Z · LW · GW · 0 comments

Contents

  Introduction & Problem Statement
  Existing Work & Current Limitations
    From Instruction Tuning to Instruction Hierarchy
    The State of Jailbreak Resistance & Major Safety
      General Safety Tuning
      Circuit Breakers
      Constitutional classifiers
      Other commercial guardrails (OWASP vendor landscape)
    Machine Unlearning
    Other Latent Space Techniques
      Sparse AutoEncoders
      Classification and Probes
      Steering Vectors (Activation Engineering)
      A note on Mixture-of-Experts models
    Principles of Least Privilege, Default Deny, and OAuth2.0 scopes
    Recent Work in Scoping Language Models
  Methods
    Models and Benchmark
    Scoping Techniques Investigated
    Metrics and Processes
  Results
  Discussion
  Next steps
  Acknowledgements
  Appendix
    Tables Summarizing Results Shown in Charts
        Accuracy Results for Llama 3.2 3B Instruct
        Accuracy Results for Llama 3.2 1B
  Footnotes

Emile Delcourt, David Baek, Adriano Hernandez, Erik Nordby with advising from Apart Lab Studio

Introduction & Problem Statement

Helpful, Harmless, and Honest (“HHH”; Askell et al., 2021) is a framework for aligning large language models (LLMs) with human values and expectations. In this context, "helpful" means the model strives to assist users in achieving their legitimate goals, providing relevant information and useful responses. "Harmless" refers to avoiding generating content that could cause damage, such as instructions for illegal activities, harmful misinformation, or content that perpetuates bias. "Honest" emphasizes transparency about the model's limitations and uncertainties—acknowledging when it doesn't know something, avoiding fabricated information, and clearly distinguishing between facts and opinions. This framework serves as both a design principle for AI developers and an evaluation criterion for assessing how well LLMs balance being maximally useful to users while minimizing potential harms and maintaining truthfulness in their outputs.

However, since computation in a transformer considers every part of the input prompts’ “context window,”[1] generative AI deployments (applications) struggle to apply any effective limits beyond “egregious” safety/toxicity backstops (the few things for which there is universal agreement; Bengio et al 2024, Ji et al 2023, Buyl et al 2025, Kumar et al 2021). Any fragment of text can influence attention enough to supersede earlier instructions, subverting the purposes of the application (see the “state of jailbreak” section below). Typically, a preamble is designated as a “system prompt” with enforcement expectations, but applications still cannot establish any guarantees to prevent abuse in open-ended user interactions. Even when foundation models are intentionally trained to be helpful, harmless, and honest, we argue that responsible AI systems must also “stay in their lane” - that is, they also need to be “Honed” (HHH+H).

In this article, we address how important HHH+H is, specifically to achieve effective scoping boundaries, why solutions have not achieved it, and the requirements to get there.

Every organization building or using a technology to support its goals typically makes the minimum viable capabilities clear (a problem that LLM capability advancements have all but solved), but those organizations also have explicit and implicit boundaries: tolerance limits (contractual, regulatory…) as well as defined roles and responsibilities. Today, even when these constraints are expressed, the open-ended range of user interactions and outcomes requires significant engineering to actually constrain the technology. Therefore, systems using foundation models to deliver solutions require far stricter enforcement of the system prompt’s scope.

Existing Work & Current Limitations

Prompt injection and jailbreaks[2] often subvert post-training techniques and system prompts, bypassing refusals with unsafe requests. Even with Instruction Hierarchy or Circuit Breakers (discussed below), monolithic models struggle to recognize and reject harmful or out-of-scope requests in the context that an application expects.

Extensive research has demonstrated that such safety-tuned models have significant limitations in two key respects: (a) reversibility: these models can often be easily manipulated to recover harmful knowledge encoded in the original pre-trained model, and (b) vulnerability: they lack robustness against adversarial attacks designed to elicit malicious information from the model (Casper, 2023). These challenges highlight the difficulty of safeguarding LLMs against adversarial users and underscore the limitations of current fine-tuning strategies.

From Instruction Tuning to Instruction Hierarchy

First, language models were trained to continue text coherently (capturing grammar, semantics, translation, and other patterns depending on the input), and proved able to generalize to quite a wide variety of tasks (Radford et al, 2019). At the time, zero-shot prompts did not answer questions: models often enumerated other, similar questions, and typically required that inputs demonstrate the desired pattern explicitly (such as question, answer, question, answer, question, blank).

Next, Instruction Tuning began by explicitly training Large Language Models (LLMs) on instruction-output pairs (Wei, 2021 and Zhang, 2023). The primary goal was to shape models’ default responses to follow given instructions accurately and to perform turn-taking with the user, even when the instruction is as brief as a single keyword.

Building upon Instruction Tuning, system prompts surfaced as a widely adopted method to customize and augment LLM behavior post-training. These prompts act as high-level instructions that guide the model's behavior in subsequent user queries without requiring fine-tuning (Wei, 2021). Major providers have integrated system prompts as standard features, allowing “implicit expectations” to be concealed from the conversation and enabling designers of such assistants to control aspects like tone, expertise, and output format. This approach has its weaknesses, as adversarial inputs rapidly surfaced that subvert its intent (Perez et al, 2022). System prompts can be crafted manually or reinforced through optimization and evolution techniques such as SMEA (Zou, 2024), AutoPrompt (Shin et al, 2020) and EvoPrompt (Guo et al, 2023). This enables alignment at the application level.

To enhance safety and control, more sophisticated versions of instruction tuning implement Instruction Hierarchies (Wallace et al, 2024). In this approach, training makes system instructions deliberately override any conflicting user instructions. This reinforces the effectiveness of system prompts, creating a layered defense against potentially harmful outputs.

Despite these advancements, these methods rely solely on natural language instructions, which leaves safety at the mercy of the model's interpretation. Further, since system prompts do not change the model's weights at all, a model that has not been aligned to respect them can theoretically circumvent these safeguards. So, while these methods offer an easy way to put safeguards on models, they must be complemented by other techniques to ensure safety.

The State of Jailbreak Resistance & Major Safety

In this section, we acknowledge the techniques that have addressed the problem of coaxing “universally bad” behaviors out of a model.

General Safety Tuning

The same impressive capabilities that Large Language Models (LLMs) bring to highly diverse applications carry with them significant safety challenges that remain unresolved. Over the last few years, safety fine-tuning techniques have included refusal training, adversarial training, and related post-training approaches.

Despite these, models can still generate harmful, biased, or dangerous content when strategically prompted (Perez et al., 2022). Adversarial users frequently employ techniques such as prompt injection—where malicious instructions are embedded within seemingly benign requests—and jailbreaking, which uses carefully crafted inputs to circumvent safety guardrails (Wei et al., 2023). As LLMs become increasingly deployed in sensitive domains like healthcare, legal advice, and financial services, the consequences of such vulnerabilities become more severe, highlighting the urgency of robust safety solutions beyond current approaches.

Circuit Breakers

Circuit breakers, another universal safety training option, are an automated safety mechanism that triggers when potentially harmful content is detected during generation. Unlike refusal or adversarial training, circuit breakers directly control internal model representations using techniques like "Representation Rerouting" (RR), which remaps harmful representation pathways to orthogonal space. Zou et al. (2024) demonstrated that circuit breakers maintain benchmark performance while reducing harmful outputs by up to two orders of magnitude across text-only LLMs, multimodal systems, and AI agents.

Though promising for making models intrinsically safer without compromising capabilities, circuit breakers still face challenges: false positives that restrict legitimate uses and false negatives that miss novel harmful outputs, since their effectiveness depends on how comprehensive the detection patterns are and on users' evolving evasion techniques. Tuning is not applied in real time, but low-rank adapters (LoRA) are used in the implementation to avoid retraining the entire model and to reduce cost. This technique can potentially contribute to scoping; we discuss it further (with a recent study) in our section on scoping below.
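
To make the mechanism concrete, below is a minimal sketch of what a Representation Rerouting-style training objective can look like, loosely following the description in Zou et al. (2024); the tensor names, the ReLU-cosine form, and the alpha weighting are illustrative assumptions rather than the paper's exact formulation.

```python
import torch.nn.functional as F

def rr_loss(h_cb_harm, h_ref_harm, h_cb_retain, h_ref_retain, alpha=1.0):
    """h_cb_*: hidden states from the model being tuned (e.g. via LoRA adapters);
    h_ref_*: hidden states from a frozen reference copy on the same inputs."""
    # On harmful prompts, push the tuned representations toward being
    # orthogonal to the original ones (only positive cosine similarity is penalized).
    reroute = F.relu(F.cosine_similarity(h_cb_harm, h_ref_harm, dim=-1)).mean()
    # On benign "retain" prompts, keep representations close to the original
    # so general capabilities are preserved.
    retain = (h_cb_retain - h_ref_retain).norm(dim=-1).mean()
    return reroute + alpha * retain
```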

Constitutional classifiers

Still in the general jailbreak/universal safety space, Anthropic’s approach in January (Sharma et al, 2025) showed that language models serving as classifiers on inputs and outputs can be trained on a “constitution” dataset to detect prohibited interactions. While there are similarities to circuit breakers, circuit breakers are internal to the model and change its representations, whereas constitutional classifiers detect violations at the input and output, triggering refusals without tuning the underlying model.

This method reduced the attack success rate of jailbreaks, evaluated against a dataset built from 3,000 hours of red teaming, to 0.25-1%, all but settling the jailbreak issue. This technique can potentially also contribute to scoping; we discuss it further (with a recent study) in our section on scoping below.
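
For intuition, a constitutional-classifier deployment is essentially a pair of classifier models wrapped around generation. The sketch below is a hedged illustration of that wiring only; `input_clf`, `output_clf`, and `generate` are placeholder callables, not Anthropic's implementation.

```python
REFUSAL = "Sorry, I can't help with that."

def guarded_generate(prompt, generate, input_clf, output_clf):
    # Screen the request before it reaches the model.
    if input_clf(prompt) == "prohibited":
        return REFUSAL
    draft = generate(prompt)
    # Screen the draft completion before returning it to the user.
    if output_clf(prompt, draft) == "prohibited":
        return REFUSAL
    return draft
```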

Other commercial guardrails (OWASP vendor landscape)

OWASP defines LLM Guardrails as protective mechanisms designed to ensure that large language models (LLMs) operate within defined ethical, legal, and functional boundaries. These guardrails help prevent the model from generating harmful, biased, or inappropriate content by enforcing rules, constraints, and contextual guidelines during interaction. LLM guardrails can include content filtering, ethical guidelines, adversarial input detection, and user intent validation, ensuring that the LLM’s outputs align with the intended use case and organizational policies. This aligns with OWASP’s LLM top 10 threats guidance #1 (LLM01: prompt injection, which can override system prompts).

The same guide also serves as a stakeholder’s compass for navigating available solutions, listing a large number of vendors offering “Adversarial Attack Protection” or “Model And Application Interaction Security”, primarily as interstitial components (either active proxies or verification APIs). For example, vendors like prompt.security and Lakera offer topic moderation (or synonymous features) as a service, with custom or built-in topics available to allow or block. While methods are not always disclosed, ML techniques are mentioned, and embedding cosine similarity or transformer-based latent classifiers may be part of the architecture.
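
As an illustration of how such a topic-moderation guardrail could work, here is a minimal embedding-similarity sketch. Vendor internals are not disclosed, so the library (sentence-transformers), the model name, the topics, and the threshold below are all assumptions for illustration.

```python
from sentence_transformers import SentenceTransformer
import numpy as np

encoder = SentenceTransformer("all-MiniLM-L6-v2")
allowed_topics = ["customer billing questions", "account management",
                  "product troubleshooting"]
topic_embs = encoder.encode(allowed_topics, normalize_embeddings=True)

def in_scope(user_message: str, threshold: float = 0.35) -> bool:
    emb = encoder.encode([user_message], normalize_embeddings=True)[0]
    # On normalized vectors, cosine similarity reduces to a dot product;
    # the message must be close enough to at least one allowed topic.
    return bool(np.max(topic_embs @ emb) >= threshold)
```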

Machine Unlearning

Fundamentally removing targeted knowledge from a model is another related area of research that has seen substantial advances (Bourtoule, 2021).

However, Liu, Casper et al, 2023 flag scope and computational feasibility challenges indicating that this approach is not well suited to scoping deployments that use foundation models in an application-specific manner (in such deployments, the underlying model’s knowledge typically cannot be modified). For instance, adversarial prompts can bypass these safeguards through subtle rewording or novel phrasing. This reveals a critical limitation: defenses are only as effective as the threats we can anticipate and explicitly encode. In the rapidly evolving field of AI, where new capabilities emerge unpredictably, relying solely on refusal training or explicit unlearning poses a significant risk, leaving models exposed to unforeseen and potentially dangerous misuse.

Other Latent Space Techniques

Sparse AutoEncoders

Similar to unlearning, Karvonen, 2024 studied the potential of Sparse AutoEncoders (SAEs) for targeted concept removal. SAEs are an automatically trained, large-scale counterpart to activation probes: rather than classifying one specific property, these observers of LLM layers map the distinct concepts (“features”) that the model can represent, recovered through an intensive training run (models otherwise distribute concepts across too many activations to single any one out).

While statistical similarities (e.g. co-occurrence) between features could be a basis for discriminating between in-domain and out-of-domain concepts/features, SAEs depend on expensive “dictionary” training runs to tease out the activation pattern for every feature. This seems cost-prohibitive for model scoping; however, an algorithm could be designed to leverage existing SAEs for scoping purposes, such as Anthropic’s Claude 3 Sonnet SAEs.
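
To illustrate what such an algorithm might look like, here is a hedged sketch that flags a prompt when too many of its active SAE features fall outside an allow-listed set. The `sae.encode` interface, the allow-list, and the tolerance are assumptions for illustration, not a published scoping method.

```python
import torch

def out_of_domain(residual_acts: torch.Tensor, sae, allowed_features: set,
                  tolerance: float = 0.05) -> bool:
    """residual_acts: [seq_len, d_model] residual-stream activations from the
    layer the SAE was trained on."""
    feats = sae.encode(residual_acts)                 # [seq_len, n_features]
    active = (feats > 0).any(dim=0).nonzero().flatten().tolist()
    if not active:
        return False
    unknown = [f for f in active if f not in allowed_features]
    # Flag the prompt if too large a fraction of its active features fall
    # outside the application's allow-listed concepts.
    return len(unknown) / len(active) > tolerance
```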

Classification and Probes

Since language models encode so much information within their latent space[3], their activations during inference can also be used to check for harmful triggers: Burns (2022) and Nanda (2023) demonstrated that simple linear classifiers (i.e. “probes”) can reliably extract useful properties from the intermediate activations of large language models and other neural networks.

For example, probes have been used to extract a seemingly linear representation of the board state in the game of Othello (Nanda, 2023). Further, it has been shown that these probes can be used to reveal syntactic structures and detect factual knowledge. Despite the overall complexity of neural networks, probes—often just linear models—suggest that much of this information is encoded in remarkably accessible ways within the latent space. (Alain & Bengio, 2016; Tenney et al., 2019).

Most importantly, activation probes can catch prompt injections (even indirect ones) that cause task drift (Abdelnabi, 2024).
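
In its simplest form, a probe is just a logistic regression fit on hidden states. The sketch below assumes a HuggingFace causal LM; the layer index and mean-pooling are illustrative choices rather than prescriptions from the cited papers.

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression

def pooled_activation(model, tokenizer, text, layer=12):
    ids = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        hidden = model(**ids, output_hidden_states=True).hidden_states[layer]
    # Average the chosen layer's hidden states over the token dimension.
    return hidden[0].mean(dim=0).float().cpu().numpy()

def fit_probe(model, tokenizer, texts, labels, layer=12):
    # texts/labels: examples with and without the property being probed for.
    X = np.stack([pooled_activation(model, tokenizer, t, layer) for t in texts])
    return LogisticRegression(max_iter=1000).fit(X, labels)
```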

Together, these works underscore the utility of probe-based methods for interpreting and leveraging LLM latent space for scoping purposes, and we discuss it further (including a recent study) in our next steps below.

Steering Vectors (Activation Engineering)

With similarities to latent space probes/classifiers, the direction of activations can also be used as a corrective basis to zero out a harmful component during inference. Turner et al. 2024 and Panickssery et al. 2024 developed activation steering (or activation engineering), which influences LLM behavior by modifying model activations during inference to give the model a propensity to treat a specified concept with a specified valence. The method averages the difference in residual stream activations between sets of positive and negative examples of a particular behavior, and adding that difference vector during inference shifts model outputs accordingly.
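
Below is a minimal sketch of contrastive activation steering in that spirit: compute the mean difference of residual-stream activations and add it back in via a forward hook. The hook placement, coefficient, and helper names are illustrative assumptions, not the cited authors' exact code.

```python
import torch

def steering_vector(acts_pos: torch.Tensor, acts_neg: torch.Tensor) -> torch.Tensor:
    """acts_*: [n_examples, d_model] residual-stream activations collected at one
    layer for positive / negative examples of the target behavior."""
    return acts_pos.mean(dim=0) - acts_neg.mean(dim=0)

def add_steering_hook(block, vec: torch.Tensor, coeff: float = 1.0):
    # Add (or, with a negative coeff, subtract) the vector to every token's
    # residual stream as it leaves this transformer block.
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + coeff * vec
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return block.register_forward_hook(hook)
```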

A note on Mixture-of-Experts models

Despite what the name suggests, LLMs architected with MoE are not composed of domain experts: MoE is a sparsity technique in which a router activates only a few portions of the overall feed-forward network for each generated token, rather than the full network as in a dense LLM. Earlier this year, we began exploring the possibility that this routing could be made invariant for all tokens after the initial system prompt boundary, to prevent capability drift (a form of quasi-unlearning).

Due to insufficient testing, it is unclear at this time whether this would degrade in-domain performance, or whether it would achieve a meaningful reduction in out-of-domain capabilities. Moreover, since the effect would primarily limit capabilities rather than produce a specific out-of-domain behavior, ensuring an actual refusal would require combining this technique with another, such as circuit breakers. Further evaluation is potentially warranted.
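
For concreteness, here is a toy, highly speculative sketch of the frozen-routing idea: record which experts the system prompt activates, then mask every other expert for the rest of the conversation. This is our untested proposal, not an existing library feature, and the module below is a simplified stand-in for a real MoE layer.

```python
import torch
import torch.nn as nn

class FrozenRoutingMoE(nn.Module):
    def __init__(self, d_model=512, n_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts))
        self.k = k
        self.frozen_experts = None  # set once from the system prompt

    def freeze_from_system_prompt(self, sys_hidden):  # [sys_len, d_model]
        # Keep only the experts that the system prompt itself routed to.
        topk = self.router(sys_hidden).topk(self.k, dim=-1).indices
        self.frozen_experts = torch.unique(topk)

    def forward(self, x):  # [seq_len, d_model]
        logits = self.router(x)
        if self.frozen_experts is not None:
            # Mask out every expert the system prompt never activated.
            mask = torch.full_like(logits, float("-inf"))
            mask[:, self.frozen_experts] = 0.0
            logits = logits + mask
        weights, idx = logits.softmax(dim=-1).topk(self.k, dim=-1)
        out = torch.zeros_like(x)
        for t in range(x.shape[0]):
            for w, e in zip(weights[t], idx[t]):
                out[t] = out[t] + w * self.experts[int(e)](x[t])
        return out
```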

Principles of Least Privilege, Default Deny, and OAuth2.0 scopes

Scoping the functional limits of a system can take inspiration from the field of information security, in which a core foundational concept closely relates to our goal: strict control over information sharing and access. This paradigm manifests in concrete security principles: least privilege (grant only the minimal access a role requires), default deny (block anything not explicitly allowed), and OAuth 2.0 scopes (tokens that carry narrowly delimited permissions).

Together, these paradigms can be used as part of a “zero trust” strategy (NIST, 2020) which is vital in today’s environment rife with cyber threats: instead of security being treated like whack-a-mole where unauthorized activity is blacklisted or treated reactively, organizations and software makers can provide assurance of security by blocking access by default and allowing only narrow access to limited data for authenticated actors.

What does this mean for LLM applications? Rather than fighting unsafe behavior case by case (there is no universal agreement on the full list of unacceptable uses), the lessons from security suggest that the specific mission and purpose of an organization should drive the strict behavioral limits of any AI service it builds, fulfilling the promise of their intent while preventing the perils of the unbounded interactions that come with the underlying foundation model(s). Today’s models are incapable of resisting conversations that drift from system prompts as provided, even with the state of the art in instruction hierarchy (e.g., Geng et al., 2025), requiring new solutions. There is an opportunity for providers to restrict scope directly within inference API tokens, with guardrails trained before deployment and edge cases identified and managed explicitly.
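
As a purely illustrative sketch of what a default-deny, OAuth-style scope check on an inference token could look like (no provider currently exposes such a mechanism; the scope names are hypothetical):

```python
ALLOWED_SCOPES = {"stem-tutoring"}   # granted when the inference token is issued

def authorize(requested_scopes: set[str]) -> bool:
    # Default deny: every scope the request claims must have been explicitly
    # granted at token creation; anything else is rejected, mirroring OAuth 2.0.
    return bool(requested_scopes) and requested_scopes <= ALLOWED_SCOPES
```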

Recent Work in Scoping Language Models

A closely related effort, “Reducing the Scope of Language Models with Circuit Breakers” (Yunis et al, 2024), evaluated different techniques for scoping LLMs. Covering supervised fine-tuning (SFT), Direct Preference Optimization (DPO), internal representation probing, and Circuit Breakers (CB), its findings suggested that circuit breaker-based methods, when combined with SFT, offered robustness against adversarial prompts and better generalization for out-of-distribution rejection compared to simple SFT or DPO. Probes also showed strength in rejection, but with a tendency toward higher false refusals.

However, their study highlighted several areas which could be further explored:

Methods

Beyond creating a proof-of-concept for the scoping of language models, we further seek to extend the findings on LLM scoping presented by Yunis et al. (2024), in particular to achieve an adaptable deployment model that offers strong confidence in scoping.

We conducted an early series of experiments using the Llama 3.2 model family. Our primary objective was to evaluate the effectiveness of various techniques in constraining these models to a specific domain – in this case, STEM (Science, Technology, Engineering, and Mathematics) subjects – while rejecting queries outside this scope.

Models and Benchmark

Scoping Techniques Investigated

We implemented and tested several scoping approaches, mirroring or providing analogous methods to those evaluated by Yunis et al.:

  1. Baseline (Default Instruct Model): the performance of the off-the-shelf instruction-tuned models served as our baseline.
  2. Fine-Tuning: We first randomly zero out 10% of the model parameters, and then fine-tune the model using a standard next-token prediction loss combined with an L2 penalty on the gate projection output of a specific MLP, where the L2 penalty is only applied to the out-of-domain dataset (a sketch of this objective appears after this list). This design is intended to make that specific MLP active only on in-domain inputs. We use the HuggingFace dataset camel-ai/biology as the in-domain dataset, and NeelNanda/pile-10k as the out-of-domain dataset.
  3. Prompt-Based Techniques: we explored several prompting strategies applied at inference time:
    • Direct Refusal Prompt: the model was explicitly instructed in its system prompt to only answer STEM questions and refuse all others.
    • Two-Step Classification (Default Yes): the model was first asked if the user's question was STEM-related. If its response indicated "yes", it proceeded to answer the question; otherwise, it generated a refusal.
    • Two-Step Classification (Default No): similar to the above, but the model only proceeded to answer if its classification response did not indicate "no"; otherwise, it refused.
  4. Logistic Regression Classifier: conceptually similar to the "Probes" used by Yunis et al., we trained simple classifiers to predict whether a query is STEM or non-STEM based on the model's internal activations (see the second sketch after this list):
    • Training: logistic regression classifiers were trained on the hidden state activations extracted from both the Multi-Head Attention and Multilayer Perceptron (MLP) layers at various depths (specifically, every 5th transformer block[4]). The training data consisted of activations generated from processing prompts from the mmlu_stem and the combined non-STEM categories.
    • Selection: for each base model (1B and 3B), the classifier achieving the highest validation accuracy across all tested layers was selected.
    • Application: during inference, the selected classifier analyzed the activations produced by the user prompt. If the classifier predicted "STEM," the model proceeded to generate an answer. If it predicted "non-STEM," a refusal was returned.
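
The following is a hedged sketch of the fine-tuning objective described in item 2 above: the standard next-token loss plus an L2 penalty on one MLP's gate-projection output, applied only on out-of-domain batches. The layer index, penalty weight, and hook mechanics are illustrative and do not reproduce our exact training code.

```python
import torch

def training_step(model, batch, is_out_of_domain, layer_idx=10, lam=0.1):
    """batch: dict with input_ids and attention_mask for a causal LM."""
    captured = {}

    def hook(module, inputs, output):
        captured["gate"] = output

    mlp = model.model.layers[layer_idx].mlp            # Llama-style module path
    handle = mlp.gate_proj.register_forward_hook(hook)
    try:
        out = model(**batch, labels=batch["input_ids"])  # next-token loss
        loss = out.loss
        if is_out_of_domain:
            # Push the chosen MLP's gate activations toward zero on
            # out-of-domain text, so it stays active only in-domain.
            loss = loss + lam * captured["gate"].pow(2).mean()
    finally:
        handle.remove()
    return loss
```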

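And here is a sketch of the activation-classifier pipeline from item 4 above: sweep layers at a fixed stride, fit a logistic regression per layer, keep the best one, and gate generation on its prediction. The pooling choice, hyperparameters, and refusal string are illustrative rather than our exact implementation.

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression

def layer_activation(model, tokenizer, prompt, layer):
    ids = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        hs = model(**ids, output_hidden_states=True).hidden_states
    # Mean-pool the chosen layer's hidden states over the token dimension.
    return hs[layer][0].mean(dim=0).float().cpu().numpy()

def select_probe(model, tokenizer, stem_prompts, non_stem_prompts, stride=5):
    y = np.array([1] * len(stem_prompts) + [0] * len(non_stem_prompts))
    prompts = stem_prompts + non_stem_prompts
    best = None
    for layer in range(0, model.config.num_hidden_layers + 1, stride):
        X = np.stack([layer_activation(model, tokenizer, p, layer) for p in prompts])
        clf = LogisticRegression(max_iter=1000).fit(X, y)
        acc = clf.score(X, y)               # use a held-out split in practice
        if best is None or acc > best[0]:
            best = (acc, layer, clf)
    return best                              # (accuracy, layer, classifier)

def scoped_generate(model, tokenizer, prompt, layer, clf):
    if clf.predict([layer_activation(model, tokenizer, prompt, layer)])[0] == 1:
        ids = tokenizer(prompt, return_tensors="pt").to(model.device)
        out = model.generate(**ids, max_new_tokens=256)
        return tokenizer.decode(out[0], skip_special_tokens=True)
    return "Sorry, that question is outside this assistant's scope."
```
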
Metrics and Processes

Results

The baseline Llama 3.2 models achieved 33% and 48% accuracy on the STEM tasks as well as comparable accuracy on non-STEM tasks.

Below are charts visualizing the results; the corresponding tables can be found in the Appendix.

Discussion

The results from our experiments on Llama 3.2 1B and Llama 3.2 3B align with the findings from ”Reducing the Scope of Language Models with Circuit Breakers” (Yunis et al, 2024). While we did not test all of the techniques that were tested by Yunis et al., the results we did gather showed considerable similarities to those of the prior paper. Accordingly, we believe that these dynamics could be robust across model sizes and across different families of models.

More specifically, we found prompt-based scoping methods to be fragile and largely ineffective for the Llama 3.2 models, often degrading performance without reliably enforcing boundaries. Standard fine-tuning proved more effective, though the decreases in accuracy for non-STEM questions were modest.

Leveraging internal model states via latent space classifiers showed positive results. These simple classifiers accurately distinguished STEM from non-STEM prompts (>90% accuracy on optimal layers) and drastically reduced out-of-scope generation while minimally impacting target task performance. This effect was especially pronounced on the larger 3B model. This strongly aligns with the Yunis et al. findings that internal methods like Probes and Circuit Breakers (CB) offer more robust control than surface-level instructions.

The recurrence of (1) the weakness of prompting and (2) the strength of internal state manipulation suggests these may be generalizable insights. While we didn't directly replicate Circuit Breakers or Direct Preference Optimization, the effectiveness of our classifiers reinforces the potential of internal mechanisms for achieving reliable LLM scoping.

Next steps

Moving forward, we would like to continue work to scale, simplify, and secure the deployment of scope guardrails while also confirming the veracity of our initial findings. 

Streamlining, Scaling, and Securing the Technique Deployment: in order for these techniques to prove useful in real world settings, we hope to develop tools which are:

Testing Further Techniques and Baselines: in the related works section, we discussed a wide variety of techniques which have been used to make language models more robust and aligned. We plan to specifically target the following techniques:

Given that MoE and Sparse AutoEncoder methods do not align with our goals of creating interstitial and highly usable techniques, we have decided to leave them for later investigation.

Expanding Robustness and Testing Datasets: as our work currently stands, we have only tested the models on MMLU, which uses relatively straightforward natural language processing (NLP) tasks. Similarly, Yunis et al. used Super-NaturalInstructions (SNI) which, while more diverse, is still mainly composed of relatively simple NLP tasks. To address this, we plan to utilize benchmarks that involve multi-turn dialogue, complex instruction following, reasoning, and potential scope ambiguity. Examples include relevant subsets of HELM and BigBench Hard.

Acknowledgements

We would like to thank Apart Research and lambda.ai for hosting and sponsoring the hackathon that initiated this project, as well as Adriano who was also a member of the original hackathon team. Apart Labs assisted in supporting the research, without which this work would not have been possible. Jacob Arbeid provided insightful feedback on our initial draft.

Appendix

Tables Summarizing Results Shown in Charts

Accuracy Results for Llama 3.2 3B Instruct

| Model | MMLU STEM | MMLU Humanities | MMLU Social Sciences | MMLU Other |
|---|---|---|---|---|
| Vanilla Model | 0.507 | 0.670 | 0.594 | 0.659 |
| Our Finetuned Model | 0.503 | 0.658 | 0.573 | 0.650 |
| Using Logistic Regression Classifier Probe | 0.435 | 0.006 | 0.009 | 0.043 |
| Using Rejection Prompt | 0.476 | 0.515 | 0.587 | 0.594 |
| Using LLM Itself as Classifier | 0.486 | 0.545 | 0.653 | 0.665 |

Accuracy Results for Llama 3.2 1B

| Model | MMLU STEM | MMLU Humanities | MMLU Social Sciences | MMLU Other |
|---|---|---|---|---|
| Vanilla Model | 0.382 | 0.510 | 0.441 | 0.523 |
| Our Finetuned Model | 0.370 | 0.451 | 0.385 | 0.453 |
| Using Logistic Regression Classifier Probe | 0.201 | 0.007 | 0.008 | 0.023 |
| Using Rejection Prompt | 0.400 | 0.446 | 0.510 | 0.530 |
| Using LLM Itself as Classifier | 0.270 | 0.361 | 0.382 | 0.350 |

Footnotes

  1. ^

    A context window is the text an LLM can "see" and reference when generating a response. For more information, see https://towardsdatascience.com/de-coded-understanding-context-windows-for-transformer-models-cd1baca6427e/

  2. ^

    Prompt Injection is when new instructions are added in prompts (either via user input or from a source/document also included in an interaction), causing the model to change direction from the expected instructions. OWASP also explains prompt injection in more detail.

  3. ^

    A “Latent space” is a mathematical representation inside a transformer model: after input text fragments (tokens) are each replaced with a unique number, the model converts the overall text into a sequence of “directions” (just like XY coordinates on a graph, but with many more dimensions), with many layers of triggers activating on combinations of those directions.

  4. ^

    A “block” here means a transformer block, one of the foundational building pieces of the Transformer architecture. Its two main components are (1) the multi-head attention mechanism and (2) a fully connected feed-forward network. Other important pieces include residual connections and layer normalization. For a more complete explanation, see the original paper “Attention is All You Need” or the useful explainer.
