Representation Engineering has Its Problems, but None Seem Unsolvable
post by Lukasz G Bartoszcze (lukasz-g-bartoszcze) · 2025-02-26T19:53:32.095Z · LW · GW · 1 comment
TL;DR: Representation engineering is a promising area of research with high potential to answer key challenges of modern AI development and AI safety. We understand it is a tough field to navigate and urge all ML researchers to take a closer look at this topic. To make that easier, we publish a survey of the representation engineering literature, outlining key techniques, problems and open challenges in the field.
We have all been pretty frustrated by LLMs not being able to answer our questions the way we want them to.
ML practitioners often see AI as a black box. An unknown, intractable object. A maze of billions of parameters we can’t really decipher. We try to influence it with prompt engineering and guardrails, but shy away from influencing what really happens in its brain.
Representation engineering is a novel approach that promises a solution to that. By identifying semantically-charged directions in the activation space, it offers users a way to see inside the AI's mind and shape its capabilities.
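To make this concrete, here is a minimal sketch of the most common recipe: average a model's hidden activations over a handful of contrastive prompts and take the difference of the class means as the "concept direction". The model name, layer index and prompts below are illustrative assumptions, not choices prescribed by any particular paper.

```python
# Minimal sketch: extract a steering direction from contrastive prompts.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"   # stand-in; any causal LM that exposes hidden states works
LAYER = 6             # which residual-stream layer to read (a hyperparameter)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

# Contrastive stimuli: similar content, opposite value of the concept we care about.
positive = ["The assistant answers honestly about its mistakes."]
negative = ["The assistant hides its mistakes from the user."]

def mean_activation(prompts):
    """Average the last-token hidden state at LAYER over a list of prompts."""
    acts = []
    for p in prompts:
        inputs = tokenizer(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**inputs)
        # hidden_states is a tuple of (num_layers + 1) tensors of shape [batch, seq, dim]
        acts.append(out.hidden_states[LAYER][0, -1])
    return torch.stack(acts).mean(dim=0)

# The "concept direction" is the difference of the two class means.
direction = mean_activation(positive) - mean_activation(negative)
direction = direction / direction.norm()
```

In practice you would use many more contrastive pairs and sweep over layers, but the core idea really is this small.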
And it works. Using interventions, we can make the models hallucination-free. Less prone to jailbreaks. Or even personalize them to be better aligned with our feedback.
This matters because hallucinations are a prevalent problem, with between 3% and 10% of AI outputs being untruthful. Half of US workers report struggling with AI accuracy, and a third of them are worried about the explainability of AI actions.

Security is a major issue too. All state-of-the-art models get jailbroken within hours, completely controlled by combinations of tokens that adversarial prompters choose. The same attacks work on multiple models, transferring from white-box to closed-source models with little insight into the target. Very strong attacks can be generated endlessly, for example by bijection learning. Through this, LLMs can be manipulated into generating harmful outputs, assisting in cyberattacks and creating deepfakes.

Increasingly, users demand personalized, specific AI that reflects their values. Fine-tuning models to accurately understand and reflect the values of localized societies has led to the creation of the Dolphin models, Jais or Bielik. However, fine-tuning models is difficult and costly, so generating unique, personalized experiences for every user with fine-tuning is impossible. Models have also been fine-tuned to inject specific knowledge or to improve performance on a particular task.
Representation engineering promises to resolve these issues. It aims to identify and edit directions in the latent space of the model, and hence change its behaviour. By looking into the activation space, we can fully control the model's behaviour and have a complete overview of its reasoning. Theoretically, this seems much better than what we have right now: evaluating models using benchmark performance or input/output guardrails does not seem particularly robust, especially if we think LLMs might turn against us at scale.
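Continuing the hypothetical setup from the earlier sketch, "editing" then amounts to adding (or subtracting) that direction in the residual stream at generation time. The hook target and scaling coefficient below are assumptions specific to GPT-2; other architectures name their blocks differently.

```python
# Minimal sketch: apply the extracted direction with a forward hook during generation.
import torch

ALPHA = 8.0  # steering strength; too large and unrelated capabilities degrade

def steering_hook(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden states [batch, seq, dim].
    hidden = output[0] + ALPHA * direction.to(output[0].dtype)
    return (hidden,) + output[1:]

# hidden_states[LAYER] in the earlier sketch is the output of block LAYER - 1
# (index 0 is the embeddings), so hook that block here.
handle = model.transformer.h[LAYER - 1].register_forward_hook(steering_hook)
try:
    prompt = tokenizer("Tell me about your limitations.", return_tensors="pt")
    steered = model.generate(**prompt, max_new_tokens=40, do_sample=False)
    print(tokenizer.decode(steered[0], skip_special_tokens=True))
finally:
    handle.remove()  # always detach the hook so later calls are unaffected
```

The whole intervention is a single vector addition at inference time: no gradient updates, no retraining, and it can be switched on and off per request.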
When I think about LLMs, I see them as a very complex function for generating the next token. I don’t buy the arguments of those postulating that LLMs will forever remain unexplainable, intractable black boxes. Fundamentally, this is a function we can control and analyze through the latent space. I am optimistic that soon we will be able to control them in a much more fine-grained way, without requiring complex fine-tuning or wizard-level prompting.
If I go to a mechanic because something is wrong with my car, I don’t accept the explanation that the car has too many parts to fix it. LLMs are like cars in the sense that they are also composed of a finite number of moving parts that need to be studied and optimized. Even if every part can’t be studied individually, we can still define useful abstractions to fix them. A mechanic might not know every element and atom in the car’s engine; this does not prevent them from fixing it by repairing the radiator.
Just because we are not currently able to reliably explain and track all changes in LLMs does not mean this will always be the case. Right now, superposition is a major problem preventing us from fully explaining all LLM features, but as models increase in scale, it is possible we will reach a point where all features are cleanly encoded within the model.
Slowing down AI development to work on AI safety in the meantime is a common proposal in AI safety communities, but it seems increasingly unlikely to be widely adopted by industry practitioners and top research labs. Instead, we propose focusing on accelerating our capability to diagnose and edit the internal workings of models, so that we are able to catch AI safety issues as models grow in scale.
This is what representation engineering promises. With a top-down approach, we are able to make high-level interventions at the level of the activation space and mitigate these problems with a very direct intervention.
Of course, there are alternatives. Fine-tuning and mechanistic interpretability also allow us to influence the latent space of the model. With fine-tuning, however, the latent space is not monitored. Mechanistic interpretability is great and has moved us forward on latent-space explainability, but it feels really granular. Do we really need to decompose all the fundamental parts of the latent space to make helpful interventions? I think we can still make meaningful progress with general, contrastive-imaging-based interventions on the internal workings of the model.
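As a small illustration of how coarse-grained this can stay: rather than decomposing circuits neuron by neuron, we can simply project held-out activations onto the single direction extracted in the earlier sketch and check that the two behaviours separate. The held-out prompts here are again illustrative.

```python
# Minimal sketch: read the concept back out by projecting onto the direction.
held_out = {
    "honest":    ["The assistant admits it does not know the answer."],
    "deceptive": ["The assistant invents a confident-sounding answer."],
}

for label, prompts in held_out.items():
    for p in prompts:
        inputs = tokenizer(p, return_tensors="pt")
        with torch.no_grad():
            act = model(**inputs).hidden_states[LAYER][0, -1]
        score = torch.dot(act, direction).item()  # projection onto the concept direction
        print(f"{label:9s} score={score:+.2f}  {p}")
```

If a single linear projection already separates the behaviours of interest, a full circuit-level decomposition may be unnecessary for many practical interventions.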
Once you try to implement representation engineering in practice though, cracks start to show.
- With steering, performance on other tasks often degrades significantly. How do we prevent that? (a minimal sketch of one crude check follows this list)
- Many of the method's parameters are chosen on an ad-hoc basis, with no real theoretical validation (e.g. how many pairs are enough for a stimulus to detect representations? What even is a representation? Which layers should an intervention target?)
- How do we evaluate these interventions reliably?
- and others...
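As one concrete (and admittedly crude) example of the evaluation problem, the sketch below compares the model's language-modelling loss on unrelated text with and without the steering hook from the earlier sketches; a large gap is a cheap signal that the intervention is eating general capability. The neutral text is an arbitrary choice, and a real evaluation would use proper held-out benchmarks.

```python
# Minimal sketch: a crude side-effect check for a steering intervention.
import torch

neutral_text = "The Eiffel Tower was completed in 1889 and stands in Paris."
batch = tokenizer(neutral_text, return_tensors="pt")

def lm_loss():
    with torch.no_grad():
        return model(**batch, labels=batch["input_ids"]).loss.item()

baseline = lm_loss()                                                   # no intervention
handle = model.transformer.h[LAYER - 1].register_forward_hook(steering_hook)
steered = lm_loss()                                                    # with steering applied
handle.remove()

print(f"loss without steering: {baseline:.3f}")
print(f"loss with steering:    {steered:.3f}  (gap ~ capability cost)")
```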
These problems have already been analyzed on LessWrong [LW · GW]. We find that much of the original criticism is still valid. However, new techniques already partially mitigate these problems and give grounds for optimism that the current problems with representation engineering are not fundamentally unsolvable. We therefore outline a path forward, provide a repository with examples of work completed in the field, and urge more researchers to look into latent-space interventions.
1 comment
comment by Devan Patel (devan-patel) · 2025-02-26T22:31:53.749Z · LW(p) · GW(p)
Controlling latent space is the future. Haven't had the time to dig into the survey but it seems really cool! gj guys