Mind the Coherence Gap: Lessons from Steering Llama with Goodfire

post by eitan sprejer (eitan-sprejer) · 2025-05-09T21:29:35.232Z

Contents

  TL;DR
      Key findings.
  Disclaimers.
  Motivation
  Related Work
  Key Concepts
    Feature steering
    Features and their activations
    Steering query
    Auto Steer (Goodfire)
    Manual Search (Agentic) vs. Auto Steer
  Methodology
  Results
    Llama-8b-3.1
    Llama-70b-3.3
  Conclusions
    Next directions
  Limitations
  Contact
  Appendix
    Analysis of Coherence Failures

TL;DR

Context. Feature‑steering promises a cleaner, more interpretable way to shape LLM behavior than other methods, such as plain prompting. But does it actually work better?

What I did. Benchmarked Goodfire's Auto Steer method against three others (Simple Prompting, Agentic, Combined Approach) + a control, on 10 steering goals × 30 prompts for two Llama variants (8B & 70B), and scored every output on Elicited Behavior Strength and Coherence Drop using an LLM‑as‑judge rubric.

Key findings.

  • Every steering method shifts behavior relative to the control, but with very different reliability.
  • Simple prompting matches or outperforms standalone Auto Steer and Agentic steering on most objectives while keeping baseline coherence.
  • Combining prompting with Auto Steer produces the strongest behavioral changes, but all steering-based methods pay a clear coherence cost.
  • On Llama-70B, the Agentic (LLM-in-the-loop) method generally beats Auto Steer on both behavior and coherence.

Take‑away. For now, prompting remains the cheapest, most reliable control knob; feature‑steering looks promising but needs smarter feature selection and coherence‑preserving edits.

Future work. In a follow-up post, I plan to improve the evaluation methodology and develop a more robust, better-aligned benchmark for measuring the value of feature steering in safety-related scenarios.

All code & raw scores → GitHub repo.

Disclaimers.

Motivation

Large Language Models (LLMs) are increasingly being deployed in real-world applications, from customer service to high-stakes medical contexts. However, ensuring these models behave consistently and reliably remains a fundamental challenge in AI alignment. This is particularly critical in safety-sensitive domains, where model hallucinations or inconsistent behavior could have serious consequences. While prompt engineering and fine-tuning are the current standard for controlling model behavior, their trial-and-error nature and vulnerability to jailbreaking make them potentially unreliable solutions for critical applications.

Feature steering - directly manipulating the internal representations of LLMs - offers a potentially more robust and explainable alternative. For instance, a medical chatbot could be steered to be consistently cautious and precise, minimizing the risk of hallucinations when providing health information. However, the effectiveness of current feature steering methods remains largely unexplored.

In this work, I evaluate Goodfire's Auto Steer method against traditional prompt engineering across various behavioral objectives. My results reveal that while current feature steering approaches show promise, they face significant challenges in maintaining text coherence - highlighting the current limitations of this approach for practical applications.

Related Work

Recently, Anthropic published Evaluating feature steering: A case study in mitigating social biases, which explores steering as a technique for influencing the model's behavior on specific social biases. In another recent work, Steering Language Models with Activation Engineering, the authors present a novel approach called activation engineering: inference-time modification of activations in order to control (or steer) model outputs.

Another highly relevant paper, Towards Unifying Interpretability and Control: Evaluation via Intervention, proposes a similar approach, evaluating both the intervention success rate and the coherence-intervention tradeoff for four different interpretability methods: sparse autoencoders, logit lens, tuned lens, and probing.

This work considers a more general use case than social biases, evaluating a model's behavioral change specifically using Auto Steer, a multi-feature steering approach developed by Goodfire. A steering approach that surpasses prompt engineering on these benchmarks would set the basis for a more explainable and reliable way to modify model behavior.

Key Concepts

If you already work with activation‑steering and are familiar with Goodfire's SDK you can safely skip this refresher.

 

Feature steering

Feature steering is an inference‑time technique that changes a language‑model’s behavior by directly editing individual, human‑interpretable features inside the network.

  1. Discover features. Sparse‑auto‑encoder (SAE) or dictionary‑learning methods find directions in the residual stream that correspond to concepts such as “sarcasm”, “Golden Gate Bridge”, or a specific syntax pattern.
  2. Edit activations. At generation time the activations in those directions are nudged up or down (often by adding a constant scalar to the feature’s value) before the forward‑pass continues.
  3. Observe behavior change. If the feature truly tracks the concept, the output shifts in a predictable, interpretable way.

Because the weights stay frozen, steering is lightweight, reversible and (in principle) more explainable than re‑training or RLHF. The broader research area is sometimes called activation engineering (arXiv).
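
As a concrete illustration of step 2, here is a minimal PyTorch sketch of one common way to apply such an edit: adding a scalar multiple of a feature direction to the residual stream via a forward hook. This is a generic illustration, not Goodfire's implementation; the layer index and `feature_dir` tensor are assumed to come from a prior SAE analysis.

```python
import torch

def make_steering_hook(feature_dir: torch.Tensor, strength: float):
    """Return a forward hook that nudges the residual stream along feature_dir."""
    direction = feature_dir / feature_dir.norm()  # unit vector for the concept

    def hook(module, inputs, output):
        # Decoder blocks in HuggingFace-style models often return a tuple whose
        # first element is the hidden states; handle both cases.
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + strength * direction.to(hidden.dtype).to(hidden.device)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

    return hook

# Hypothetical usage: `model` is a loaded causal LM and `feature_dir` an SAE
# decoder direction; removing the handle undoes the (fully reversible) edit.
# handle = model.model.layers[15].register_forward_hook(
#     make_steering_hook(feature_dir, strength=4.0))
# ...generate text as usual...
# handle.remove()
```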

 

Features and their activations

A feature is a direction in the model's residual stream (found by an SAE or dictionary-learning method) that tracks a human-interpretable concept; its activation is how strongly that direction fires on a given token's representation. Steering amounts to nudging these activation values up or down at inference time.

Steering query

A short natural‑language description of the desired behavioral change, e.g. “be funny”, “answer like a lawyer”, “be concise and direct”. In Goodfire’s SDK the query is the only user input required by the Auto Steer method.

 

Auto Steer (Goodfire)

Auto Steer is Goodfire’s (beta) tool that automatically generates steering edits from a natural language steering query without any manual feature hunting. Internally it:

  1. Generates contrastive examples that do vs. don’t exhibit the target behavior.
  2. Uses the Goodfire SAE dictionary to locate features that distinguish those examples.
  3. Optimises the target activation values for those features.
  4. Returns a sparse edit‑set ⟨feature id, scalar⟩ that can be applied to any compatible model variant (docs.goodfire.ai).

Auto Steer therefore acts as a one‑shot, multi‑feature steering‑vector generator.
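
In code, using Auto Steer looks roughly like the sketch below. The method and attribute names (`features.AutoSteer`, `Variant.set`) are modelled on Goodfire's documented SDK but should be treated as assumptions; consult docs.goodfire.ai for the exact, current signatures.

```python
import goodfire

client = goodfire.Client("GOODFIRE_API_KEY")  # placeholder key
variant = goodfire.Variant("meta-llama/Meta-Llama-3.1-8B-Instruct")

# The natural-language steering query is the only required input.
edits = client.features.AutoSteer(specification="be funny", model=variant)
variant.set(edits)  # apply the returned ⟨feature id, scalar⟩ edit-set

# Generate with the steered variant instead of the base model.
response = client.chat.completions.create(
    model=variant,
    messages=[{"role": "user", "content": "Tell me about the weather today."}],
)
print(response.choices[0].message)
```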

 

Manual Search (Agentic) vs. Auto Steer

Goodfire’s interface also exposes a Manual Search method that returns the k features most semantically related to a given query. In this work, I use it to retrieve candidate features for a steering query and then have an LLM select a subset of them and generate the corresponding feature edits; this is what I call the “Agentic” method.
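
A hedged sketch of that Agentic pipeline follows: retrieve candidate features with the SDK's semantic feature search, then let an LLM pick which ones to edit. The Goodfire calls (`features.search`, `Variant.set`) and the feature `label` attribute are assumptions about the SDK surface, and gpt-4o-mini is used here purely as an illustrative selector model.

```python
import json

import goodfire
from openai import OpenAI

goodfire_client = goodfire.Client("GOODFIRE_API_KEY")  # placeholder key
selector = OpenAI()  # LLM used only to choose among candidate features
variant = goodfire.Variant("meta-llama/Llama-3.3-70B-Instruct")

def agentic_edits(steering_query: str, top_k: int = 10, strength: float = 0.5):
    # 1. Retrieve the k features most semantically related to the query.
    candidates = goodfire_client.features.search(
        steering_query, model=variant, top_k=top_k)

    # 2. Ask an LLM to pick the subset that best matches the intended behavior.
    listing = "\n".join(f"{i}: {f.label}" for i, f in enumerate(candidates))
    choice = selector.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": (
            f"Steering goal: {steering_query}\nCandidate features:\n{listing}\n"
            "Reply with a JSON list of the indices worth steering on.")}],
    )
    indices = json.loads(choice.choices[0].message.content)

    # 3. Apply the selected features as activation edits on the model variant.
    for i in indices:
        variant.set(candidates[i], strength)
    return variant
```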

 

Methodology

This study evaluates Goodfire's Auto Steer method, which takes a natural language steering query (e.g. "be funny") and selects a set of feature activations to modify the model's behavior. I tested this across 10 different behavioral queries (like "be professional", "be creative", "be technical") using 30 diverse prompts per query.

I evaluate each approach using an LLM-as-a-judge methodology with gpt-4o-mini; a minimal sketch of this judging step follows the rubric below. The evaluation prompt scores the model's response on two axes:

  1. Coherence (1-5 scale): Measures logical consistency and fluency
    • 1: incomprehensible
    • 3: partially coherent
    • 5: fully coherent
  2. Behavior (1-5 scale): Measures the achievement of requested behavior
    • 1: Exhibits opposite behavior
    • 3: Behavior unchanged from baseline
    • 5: Successfully implements requested behavior
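
For concreteness, here is a minimal sketch of that judging call using the OpenAI Python SDK; the prompt text paraphrases the rubric above rather than reproducing the exact prompt from the repo.

```python
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading a model response.
Steering goal: {goal}
User prompt: {prompt}
Model response: {response}

Score two axes on a 1-5 scale:
- coherence: 1 = incomprehensible, 3 = partially coherent, 5 = fully coherent
- behavior: 1 = opposite behavior, 3 = unchanged from baseline, 5 = fully exhibits the goal

Reply with exactly two integers separated by a comma, e.g. "4,3"."""

def judge(goal: str, prompt: str, response: str) -> tuple[int, int]:
    """Return (coherence, behavior) scores from the gpt-4o-mini judge."""
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            goal=goal, prompt=prompt, response=response)}],
    )
    coherence, behavior = completion.choices[0].message.content.strip().split(",")
    return int(coherence), int(behavior)
```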

Figure 1. Diagram showing the basic pipeline for benchmarking the 4 different Steering Methods against each other and against the Control.

The study contrasts four approaches to influencing model behavior - which we'll call steering methods - comparing them with a Control method (the unmodified model, with no steering prompt or feature edits):

  1. Simple Prompting: the steering query is given to the model as a plain natural-language instruction.
  2. Auto Steer: Goodfire's automatic multi-feature steering, driven only by the steering query.
  3. Agentic: manual feature search over the query's most related features, with an LLM choosing which of them to edit.
  4. Combined Approach: Simple Prompting and Auto Steer applied together.

We test both a small and large Llama variant to see if scale alters the steering–coherence trade‑off.

The diagram above (Figure 1) shows how each method works.

 

Results

Llama-8b-3.1

 

Llama-70b-3.3

 

My analysis reveals several key patterns across both model scales (Llama-8b-3.1 and Llama-70b-3.3), with consistent findings that challenge initial expectations about feature steering's effectiveness.

First, looking at behavior score distributions, all methods improve over the control baseline, showing they can successfully influence model behavior. However, the degree and reliability of this influence varies significantly. Prompt engineering consistently matches or outperforms standalone steering methods (Auto Steer and Agentic) across most behavioral objectives, while maintaining baseline coherence levels. This suggests that simple textual instructions remain surprisingly effective for behavior control.

Interestingly, combining prompt engineering with Auto Steer produces the strongest behavioral changes, particularly for traits like "humorous" and "imaginative". However, this comes at a cost - all steering-based methods, including this combined approach, show significant drops in coherence compared to pure prompt engineering (see the Appendix for detailed examples of coherence failures).

A surprising finding emerged when comparing steering methods: the Agentic method, despite its relative simplicity, generally outperformed Auto Steer on both coherence and behavior metrics when tested on the larger Llama-70b model. This adds another layer of complexity to our understanding of how different steering approaches perform at different model scales.

Conclusions

This weekend‑scale study shows that feature steering can push an LLM’s behavior further than plain prompting, but today’s off‑the‑shelf tools pay for that gain with a clear loss in coherence. Closing that gap is the key to making steering a practical, interpretable control knob—especially in safety‑critical settings where both consistency and explainability matter.

A standout result is the Agentic (LLM‑in‑the‑loop) method, which beats fully automatic Auto Steer on both behavior and coherence for the 70B model. That hints that hybrid approaches—letting the model help pick its own steering features—may outperform purely analytical searches.

Next directions

  1. Learned feature selectors – fine‑tune a lightweight model whose sole job is to rank steering features for a given query.
  2. Coherence‑aware steering algorithms – incorporate a penalty term for fluency drift when optimising activation edits (a toy sketch of such an objective follows this list).
  3. Task‑level evaluation suite – expand beyond single‑turn style shifts to multi‑turn dialogue, refusal consistency, and jailbreak resistance.
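
To make direction 2 concrete, here is a toy sketch of what a coherence-penalized steering objective could look like. The scoring functions are hypothetical placeholders standing in for the 1-5 judge scores, and this is not how Goodfire's optimiser currently works.

```python
def steering_objective(edit, prompts, behavior_score, coherence_score,
                       baseline_coherence=5.0, penalty_weight=1.0):
    """Score a candidate edit as mean behavior gain minus a fluency-drift penalty."""
    behavior = sum(behavior_score(edit, p) for p in prompts) / len(prompts)
    coherence = sum(coherence_score(edit, p) for p in prompts) / len(prompts)
    fluency_drift = max(0.0, baseline_coherence - coherence)
    return behavior - penalty_weight * fluency_drift
```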
     

Scope note. All results come from 10 steering goals × 30 prompts on two Llama checkpoints using Goodfire’s SDK; treat the numbers as directional rather than definitive. The full code and raw scores are available on my GitHub repo—feedback is very welcome.

As LLMs move into ever more sensitive domains, reliable behavior‑control techniques will only grow in importance. Feature steering isn’t production‑ready yet, but it already offers a glimpse of a future where model adjustments are both targeted and interpretable.

Limitations

Given the deliberately reduced scope of this project, there are several limitations worth noting:

  1. The steering queries were selected arbitrarily. Follow-up work should consider steering queries that are either more safety-relevant or more closely tied to a potential commercial use.
  2. The evaluation methodology isn't completely aligned with the intended criteria. A more robust methodology is needed to benchmark different steering methods, focused more on their application context.

In future work, I will focus on developing a more robust evaluation methodology, expanding the evaluation to other steering methods, and constructing better steering queries.

Contact

Feel free to contact me at eitusprejer@gmail.com with any questions, suggestions or whatever!

 

Appendix

Analysis of Coherence Failures

While the quantitative results show a clear drop in coherence scores for steering methods, examining specific examples reveals interesting patterns in how steering can break down model outputs. Here's a representative case when steering the model to "be more creative":

Original Prompt: "Write a story about a magical forest"

Base Model Response: "Deep in the ancient woods stood a forest unlike any other. The trees whispered secrets in the wind, and golden fireflies danced between their branches..."

Auto Steer Response (steered for creativity): "Ideas ideas magical ideas forest ideas creative spark ideas thoughts flowing ideas imagination soaring ideas forest trees ideas magical thinking ideas creative flow ideas nature ideas inspiration ideas..."

This type of coherence breakdown, where the model gets stuck in repetitive patterns built around terms related to the steering objective ("ideas", "creative"), was observed across multiple steering attempts; a simple heuristic for flagging it automatically is sketched at the end of this appendix. Other common failure modes included:

  1. Semantic drift: The model maintaining grammatical correctness but drifting off-topic
  2. Style-content conflicts: The model successfully adopting the requested style but at the expense of not conveying meaningful information
  3. Over-optimization: The model focusing so intensely on the steered behavior that it sacrifices basic communication goals

These patterns suggest that current steering methods might be too "heavy-handed" in their manipulation of model behavior, sometimes overwhelming the model's learned patterns for generating coherent text. This points to the need for more nuanced steering approaches that can better balance behavioral objectives with fundamental text quality.
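
As a side note, the repetitive failure mode above is easy to flag automatically. The snippet below is an illustrative heuristic (not something from the benchmark repo) that measures how much of an output is taken up by its single most repeated token.

```python
from collections import Counter

def max_token_frequency(text: str) -> float:
    """Fraction of tokens accounted for by the single most repeated token."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return Counter(tokens).most_common(1)[0][1] / len(tokens)

steered = ("Ideas ideas magical ideas forest ideas creative spark ideas "
           "thoughts flowing ideas imagination soaring ideas")
base = ("Deep in the ancient woods stood a forest unlike any other. The trees "
        "whispered secrets in the wind, and golden fireflies danced between their branches")
print(max_token_frequency(steered))  # ~0.47: nearly half the tokens are "ideas"
print(max_token_frequency(base))     # much lower for the coherent base response
```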

1 comment


comment by RogerDearnaley (roger-d-1) · 2025-05-15T23:45:43.793Z

Fascinating! I was excited when Goodfire's API came out (to the point of applying to work there), but have since been unable to take the time to explore this in more detail, so it's nice to read about someone doing so.

 A few quick comments:

  1. An interesting analogy to steering generative vision diffusion models with LoRAs: too little steering doesn't produce a sufficient and consistent effect, too much degrades image quality and coherence, and the challenge is to find a sweet spot that gives you sufficiently consistent results without much degradation of image quality. This often involves searching a space of multiple metaparameters — something that there are good automated systems for.
  2. Another possibility with the SAE control approach is steer not by applying activations to the model, but by monitoring the activations of the model during repeated generation and doing a filtering or best-of-N approach (either wanting particular activations to happen, or not happen). This has additional computation cost from retries, but should have much more limited effects on output coherence.
  3. The effects of a prompt can be overcome by further prompting influence in the input data, such as prompt-injection attacks. One of the hopes for activation engineering approaches is that it's more ongoing and could thus be more resistant to things like prompt injection. It would be interesting to try to gather clear evidence for or against this.