Mind the Coherence Gap: Lessons from Steering Llama with Goodfire
post by eitan sprejer (eitan-sprejer) · 2025-05-09T21:29:35.232Z · LW · GW
TL;DR
Context. Feature‑steering promises a cleaner, more interpretable way to shape LLM behavior than other methods, such as plain prompting. But does it actually work better?
What I did. Benchmarked Goodfire's Auto Steer method against three others (Simple Prompting, Agentic, Combined Approach) plus a control, on 10 steering goals × 30 prompts for two Llama variants (8B & 70B), and scored every output on Elicited Behavior Strength and Coherence Drop using an LLM‑as‑judge rubric.
Key findings.
- Prompting ≈ best overall. Plain textual instructions already hit strong behavior scores without harming coherence.
- Auto Steer ⇩ coherence. Stand‑alone steering drops coherence by ≈ 0.6 points and still under‑delivers on the target behavior.
- Combined > stand‑alone. Simple Prompting + Auto Steer gives the largest behavior boost (+0.4 points) but still inherits the coherence hit.
- Manual feature selection beats Auto Steer. LLM‑assisted (Agentic) feature picking outperforms Auto Steer on both metrics for the larger model.
Take‑away. For now, prompting remains the cheapest, most reliable control knob; feature‑steering looks promising but needs smarter feature selection and coherence‑preserving edits.
Future work. In a follow-up post, I plan on improving the evaluation methodology, developing a more robust and aligned benchmark for measuring the value of using feature steering for safety-related scenarios.
All code & raw scores → GitHub repo.
Disclaimers.
- This post is meant as a proof‑of‑concept and a snapshot of my research workflow for a week-long project, not a definitive judgment on steering methods. Treat the numbers as directional signals, and feel free to tear the setup apart—feedback (or PRs) welcome!
- Goodfire's SDK has been updated since the time I ran the experiments (Jan. 2025), so results may not be up-to-date.
Motivation
Large Language Models (LLMs) are increasingly being deployed in real-world applications, from customer service to high-stakes medical contexts. However, ensuring these models behave consistently and reliably remains a fundamental challenge in AI alignment. This is particularly critical in safety-sensitive domains, where model hallucinations or inconsistent behavior could have serious consequences. While prompt engineering and fine-tuning are the current standard for controlling model behavior, their trial-and-error nature and vulnerability to jailbreaking make them potentially unreliable for critical applications.
Feature steering - directly manipulating the internal representations of LLMs - offers a potentially more robust and explainable alternative. For instance, a medical chatbot could be steered to be consistently cautious and precise, minimizing the risk of hallucinations when providing health information. However, the effectiveness of current feature steering methods remains largely unexplored.
In this work, I evaluate Goodfire's Auto Steer method against traditional prompt engineering across various behavioral objectives. My results reveal that while current feature steering approaches show promise, they face significant challenges in maintaining text coherence - highlighting the current limitations of this approach for practical applications.
Related Work
Recently, Anthropic published Evaluating feature steering: A case study in mitigating social biases, which explores feature steering as a technique for mitigating specific social biases in model outputs. Another recent work, Steering Language Models with Activation Engineering, presents activation engineering: inference-time modification of activations in order to control (or steer) model outputs.
Another highly relevant paper, Towards Unifying Interpretability and Control: Evaluation via Intervention, proposes a similar approach, evaluating both the intervention success rate and the coherence-intervention tradeoff for four different interpretability methods: sparse autoencoders, logit lens, tuned lens, and probing.
This work considers a more general use case than social biases, evaluating the model's behavioral change specifically with Auto Steer, a multi-feature steering approach developed by Goodfire. A steering approach that surpasses prompt engineering on these benchmarks would lay the groundwork for a more explainable and reliable way to modify model behavior.
Key Concepts
If you already work with activation‑steering and are familiar with Goodfire's SDK you can safely skip this refresher.
Feature steering
Feature steering is an inference‑time technique that changes a language‑model’s behavior by directly editing individual, human‑interpretable features inside the network.
- Discover features. Sparse‑auto‑encoder (SAE) or dictionary‑learning methods find directions in the residual‑stream that correspond to concepts such as “sarcasm”, “Golden Gate Bridge”, or a specific syntax pattern.
- Edit activations. At generation time the activations in those directions are nudged up or down (often by adding a constant scalar to the feature’s value) before the forward‑pass continues.
- Observe behavior change. If the feature truly tracks the concept, the output shifts in a predictable, interpretable way.
Because the weights stay frozen, steering is lightweight, reversible, and (in principle) more explainable than re‑training or RLHF. The broader research area is sometimes called activation engineering.
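To make the mechanism concrete, here is a minimal sketch of generic activation steering using a forward hook on a HuggingFace Llama model. This is not Goodfire's implementation: the layer index, the random placeholder direction, and the steering strength are all illustrative assumptions, and a real setup would substitute an SAE feature direction.

```python
# Minimal activation-steering sketch (generic technique, not Goodfire's SDK).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"   # any causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

layer_idx = 16                                     # which residual-stream layer to edit (assumption)
direction = torch.randn(model.config.hidden_size)  # placeholder for a real SAE feature direction
direction = direction / direction.norm()
alpha = 4.0                                        # steering strength: positive amplifies, negative suppresses

def steering_hook(module, inputs, output):
    # The decoder layer returns a tuple whose first element is the hidden state (batch, seq, hidden)
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + alpha * direction.to(hidden.dtype).to(hidden.device)
    if isinstance(output, tuple):
        return (hidden,) + output[1:]
    return hidden

handle = model.model.layers[layer_idx].register_forward_hook(steering_hook)
ids = tok("Write a story about a magical forest", return_tensors="pt")
out = model.generate(**ids, max_new_tokens=100)
print(tok.decode(out[0], skip_special_tokens=True))
handle.remove()   # steering is reversible: removing the hook restores the base model
```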
Features and their activations
- Feature – a linear direction in activation‑space whose scalar value (“activation”) tracks a semantic or stylistic concept.
- Activation – the real‑valued coefficient in that direction for a specific token or layer. Editing this value is the primitive operation behind steering.
Steering query
A short natural‑language description of the desired behavioral change, e.g. “be funny”, “answer like a lawyer”, “be concise and direct”. In Goodfire’s SDK the query is the only user input required by the Auto Steer method.
Auto Steer (Goodfire)
Auto Steer is Goodfire’s (beta) tool that automatically generates steering edits from a natural language steering query without any manual feature hunting. Internally it:
- Generates contrastive examples that do vs. don’t exhibit the target behavior.
- Uses the Goodfire SAE dictionary to locate features that distinguish those examples.
- Optimises the target activation values for those features.
- Returns a sparse edit set of ⟨feature id, scalar⟩ pairs that can be applied to any compatible model variant (see docs.goodfire.ai).
Auto Steer therefore acts as a one‑shot, multi‑feature steering‑vector generator.
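For reference, this is roughly what an Auto Steer call looked like in the version of Goodfire's Python SDK I used (Jan. 2025). The SDK has since been updated, so treat the exact argument names and response fields as approximate rather than authoritative.

```python
import goodfire

client = goodfire.Client(api_key="GOODFIRE_API_KEY")
variant = goodfire.Variant("meta-llama/Meta-Llama-3.1-8B-Instruct")

# One call turns the natural-language steering query into a set of feature edits
edits = client.features.AutoSteer(
    specification="be funny",   # the steering query
    model=variant,
)
variant.set(edits)              # apply the edits to the model variant

# Generate with the steered variant; the prompt itself is left unchanged
for token in client.chat.completions.create(
    [{"role": "user", "content": "Explain how a computer works"}],
    model=variant,
    stream=True,
    max_completion_tokens=200,
):
    print(token.choices[0].delta.content, end="")
```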
Manual Search (Agentic) vs. Auto Steer
Goodfire’s interface also exposes a Manual Search method with which a human (or another LLM) can browse the k features most related to a given query. In this work, Manual Search retrieves features semantically similar to the steering query, and an LLM then selects a subset of them, along with activation strengths, to produce the feature edits (the “Agentic” method below).
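A sketch of that Agentic flow, again based on the early-2025 SDK: Manual Search retrieves candidate features, and a separate LLM call picks which ones to apply and how strongly. The choose_features_with_llm helper is hypothetical, standing in for the Llama-70b selection prompt.

```python
import goodfire

client = goodfire.Client(api_key="GOODFIRE_API_KEY")
variant = goodfire.Variant("meta-llama/Llama-3.3-70B-Instruct")

# Manual Search: retrieve the 10 features most semantically related to the query
candidates = client.features.search("be funny", model=variant, top_k=10)

# Agentic step: a separate LLM call picks the 3 best features and a strength
# for each. choose_features_with_llm is a hypothetical helper that formats the
# candidates into a selection prompt and parses the reply.
selected = choose_features_with_llm(candidates, query="be funny")

for feature, strength in selected:
    variant.set(feature, strength)   # apply each chosen feature edit
```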
Methodology
This study evaluates Goodfire's Auto Steer method, which takes a natural language steering query (e.g., "be funny") and selects a set of feature activations to modify the model's behavior. I tested this across 10 different behavioral queries (like "be professional", "be creative", "be technical") using 30 diverse prompts per query:
- 20 prompts spanned multiple domains to test robustness:
- General knowledge (e.g., "Explain how a computer works")
- Creative writing (e.g., "Write a story about a cat")
- Technical topics (e.g., "How do quantum computers work?")
- Personal advice (e.g., "How do I handle rejection?")
- Current events (e.g., "How is AI affecting employment?")
- 10 prompts were tailored to each steering query:
- 5 were aligned with the expected behavior
- 5 made it harder to comply with the expected behavior
I evaluate each approach using an LLM-as-a-judge methodology with gpt-4o-mini. The evaluation prompt scores the model's response on two axes:
- Coherence (1-5 scale): Measures logical consistency and fluency
- 1: incomprehensible
- 3: partially coherent
- 5: fully coherent
- Behavior (1-5 scale): Measures the achievement of requested behavior
- 1: Exhibits opposite behavior
- 3: Behavior unchanged from baseline
- 5: Successfully implements requested behavior
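The scoring step looks roughly like the sketch below, using the OpenAI Python client with gpt-4o-mini as the judge. The rubric wording here is a paraphrase for illustration, not the exact evaluation prompt used in the study.

```python
# Sketch of the LLM-as-judge scoring call (illustrative rubric wording).
import json
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading a model response.
Steering goal: {goal}
User prompt: {prompt}
Response: {response}

Score on two axes, 1-5 each:
- coherence: 1 = incomprehensible, 3 = partially coherent, 5 = fully coherent
- behavior: 1 = opposite behavior, 3 = unchanged from baseline, 5 = fully implements the goal
Reply as JSON: {{"coherence": <int>, "behavior": <int>}}"""

def judge(goal: str, prompt: str, response: str) -> dict:
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            goal=goal, prompt=prompt, response=response)}],
        response_format={"type": "json_object"},
        temperature=0,
    )
    return json.loads(completion.choices[0].message.content)
```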
Figure 1. Diagram showing the basic pipeline for benchmarking the 4 different Steering Methods against each other, and compared to the Control.
The study contrasts four approaches to influencing model behavior - which we'll call steering methods - comparing them with a Control method:
- Control: Prompts the model without specifying the desired behavior.
- Simple Prompting: Prompts the model, adding the steering query at the very end of the prompt.
- Auto Steer: Applies Goodfire's feature steering method, leaving the prompt unchanged, but modifying the model's feature activations.
- Agentic: Retrieves the 10 most relevant features using Goodfire's “Manual Search” method, then prompts Llama-70b-3.3 to select the best 3 (along with their activation strengths), modifying the model's feature activations accordingly.
- Combined Approach: Combines Simple Prompting with Auto Steer, modifying both the prompt and the feature activations to influence the model's behavior.
We test both a small and large Llama variant to see if scale alters the steering–coherence trade‑off.
The diagram above (Figure 1) shows how each method works.
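In code, the benchmark loop over the five conditions reduces to something like the following. Here make_variant, auto_steer, agentic_steer, and generate are hypothetical helpers wrapping the SDK calls sketched earlier, and judge is the scoring function from the rubric above.

```python
# Sketch of the benchmarking loop over the five conditions (Figure 1).
def run_condition(method: str, goal: str, prompt: str) -> dict:
    variant = make_variant()                 # fresh, unsteered model variant (hypothetical helper)
    text = prompt
    if method == "simple_prompting":
        text = f"{prompt}\n\n{goal}"         # append the steering query to the prompt
    elif method == "auto_steer":
        auto_steer(variant, goal)            # edit feature activations only
    elif method == "agentic":
        agentic_steer(variant, goal)         # LLM-selected features from Manual Search
    elif method == "combined":
        text = f"{prompt}\n\n{goal}"
        auto_steer(variant, goal)
    # "control": leave both the prompt and the variant untouched
    response = generate(variant, text)       # hypothetical generation wrapper
    return judge(goal, prompt, response)     # {"coherence": ..., "behavior": ...}
```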
Results
Llama-8b-3.1
Llama-70b-3.3
My analysis reveals several key patterns across both model scales (Llama-8b-3.1 and Llama-70b-3.3), with consistent findings that challenge initial expectations about feature steering's effectiveness.
First, looking at behavior score distributions, all methods improve over the control baseline, showing they can successfully influence model behavior. However, the degree and reliability of this influence varies significantly. Prompt engineering consistently matches or outperforms standalone steering methods (Auto Steer and Agentic) across most behavioral objectives, while maintaining baseline coherence levels. This suggests that simple textual instructions remain surprisingly effective for behavior control.
Interestingly, combining prompt engineering with Auto Steer produces the strongest behavioral changes, particularly for traits like "humorous" and "imaginative". However, this comes at a cost - all steering-based methods, including this combined approach, show significant drops in coherence compared to pure prompt engineering (see the Appendix for detailed examples of coherence failures).
A surprising finding emerged when comparing steering methods: the Agentic method, despite its relative simplicity, generally outperformed Auto Steer on both coherence and behavior metrics when testing on the larger Llama-70b model. This adds another layer of complexity to our understanding of how different steering approaches perform at different model scales.
Conclusions
This weekend‑scale study shows that feature steering can push an LLM’s behavior further than plain prompting, but today’s off‑the‑shelf tools pay for that gain with a clear loss in coherence. Closing that gap is the key to making steering a practical, interpretable control knob—especially in safety‑critical settings where both consistency and explainability matter.
A standout result is the Agentic (LLM‑in‑the‑loop) method, which beats fully automatic Auto Steer on both behavior and coherence for the 70 B model. That hints that hybrid approaches—letting the model help pick its own steering features—may outperform purely analytical searches.
Next directions
- Learned feature selectors – fine‑tune a lightweight model whose sole job is to rank steering features for a given query.
- Coherence‑aware steering algorithms – incorporate a penalty term for fluency drift when optimising activation edits (a possible objective is sketched after this list).
- Task‑level evaluation suite – expand beyond single‑turn style shifts to multi‑turn dialogue, refusal consistency, and jailbreak resistance.
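To make the coherence-aware idea concrete, one possible penalised objective (my own sketch, not something implemented in this study) would trade off the judged behavior score against any coherence lost relative to the unsteered baseline:

$$\max_{e}\;\; \text{behavior}(e)\;-\;\lambda\,\max\bigl(0,\; c_{\text{base}} - \text{coherence}(e)\bigr)$$

where $e$ is the set of feature edits, $c_{\text{base}}$ is the baseline coherence score, and $\lambda$ controls how much fluency drift is tolerated.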
Scope note. All results come from 10 steering goals × 30 prompts on two Llama checkpoints using Goodfire’s SDK; treat the numbers as directional rather than definitive. The full code and raw scores are available on my GitHub repo—feedback is very welcome.
As LLMs move into ever more sensitive domains, reliable behavior‑control techniques will only grow in importance. Feature steering isn’t production‑ready yet, but it already offers a glimpse of a future where model adjustments are both targeted and interpretable.
Limitations
There are several limitations worth listing, given the deliberately reduced scope of this project. Some of them are:
- The steering queries were selected arbitrarily. Follow-up work should consider steering queries that are either more safety-relevant or more closely tied to a potential commercial use.
- The evaluation methodology isn't completely aligned with the intended criteria. A more robust methodology is needed to benchmark different steering methods, focusing more on their application context.
In future work, I will focus on developing a more robust evaluation methodology, expanding the evaluation to other steering methods, and constructing better steering queries.
Contact
Feel free to contact me at eitusprejer@gmail.com with any questions, suggestions or whatever!
Appendix
Analysis of Coherence Failures
While the quantitative results show a clear drop in coherence scores for steering methods, examining specific examples reveals interesting patterns in how steering can break down model outputs. Here's a representative case when steering the model to "be more creative":
Original Prompt: "Write a story about a magical forest"
Base Model Response: "Deep in the ancient woods stood a forest unlike any other. The trees whispered secrets in the wind, and golden fireflies danced between their branches..."
Auto Steer Response (steered for creativity): "Ideas ideas magical ideas forest ideas creative spark ideas thoughts flowing ideas imagination soaring ideas forest trees ideas magical thinking ideas creative flow ideas nature ideas inspiration ideas..."
This type of coherence breakdown, where the model gets stuck in repetitive patterns focusing on metadata terms related to the steering objective ("ideas", "creative"), was observed across multiple steering attempts. Other common failure modes included:
- Semantic drift: The model maintaining grammatical correctness but drifting off-topic
- Style-content conflicts: The model successfully adopting the requested style but at the expense of not conveying meaningful information
- Over-optimization: The model focusing so intensely on the steered behavior that it sacrifices basic communication goals
These patterns suggest that current steering methods might be too "heavy-handed" in their manipulation of model behavior, sometimes overwhelming the model's learned patterns for generating coherent text. This points to the need for more nuanced steering approaches that can better balance behavioral objectives with fundamental text quality.
Comments
comment by RogerDearnaley (roger-d-1) · 2025-05-15T23:45:43.793Z · LW(p) · GW(p)
Fascinating! I was excited when Goodfire's API came out (to the point of applying to work there), but have since been unable to take the time to explore this in more detail, so it's nice to read about someone doing so.
A few quick comments:
- An interesting analogy is steering generative vision diffusion models with LoRAs: too little steering doesn't produce a sufficient and consistent effect, while too much degrades image quality and coherence. The challenge is to find a sweet spot that gives sufficiently consistent results without much degradation of image quality, which often involves searching a space of multiple metaparameters (something there are good automated systems for).
- Another possibility with the SAE control approach is steer not by applying activations to the model, but by monitoring the activations of the model during repeated generation and doing a filtering or best-of-N approach (either wanting particular activations to happen, or not happen). This has additional computation cost from retries, but should have much more limited effects on output coherence.
- The effects of a prompt can be overridden by later prompting influence in the input data, such as prompt-injection attacks. One of the hopes for activation engineering approaches is that the steering is applied continuously and could thus be more resistant to things like prompt injection. It would be interesting to try to gather clear evidence for or against this.