Posts

[Linkpost] MindEye2: Shared-Subject Models Enable fMRI-To-Image With 1 Hour of Data 2024-03-10T01:30:46.477Z
Inducing human-like biases in moral reasoning LMs 2024-02-20T16:28:11.424Z
AISC project: How promising is automating alignment research? (literature review) 2023-11-28T14:47:29.372Z
[Linkpost] OpenAI's Interim CEO's views on AI x-risk 2023-11-20T13:00:40.589Z
[Linkpost] Concept Alignment as a Prerequisite for Value Alignment 2023-11-04T17:34:36.563Z
[Linkpost] Generalization in diffusion models arises from geometry-adaptive harmonic representation 2023-10-11T17:48:24.500Z
[Linkpost] Large language models converge toward human-like concept organization 2023-09-02T06:00:45.504Z
[Linkpost] Robustified ANNs Reveal Wormholes Between Human Category Percepts 2023-08-17T19:10:39.553Z
[Linkpost] Personal and Psychological Dimensions of AI Researchers Confronting AI Catastrophic Risks 2023-08-12T22:02:09.895Z
[Linkpost] Applicability of scaling laws to vision encoding models 2023-08-05T11:10:35.599Z
[Linkpost] Multimodal Neurons in Pretrained Text-Only Transformers 2023-08-04T15:29:16.957Z
[Linkpost] Deception Abilities Emerged in Large Language Models 2023-08-03T17:28:19.193Z
[Linkpost] Interpreting Multimodal Video Transformers Using Brain Recordings 2023-07-21T11:26:39.497Z
Bogdan Ionut Cirstea's Shortform 2023-07-13T22:29:07.851Z
[Linkpost] A shared linguistic space for transmitting our thoughts from brain to brain in natural conversations 2023-07-01T13:57:56.021Z
[Linkpost] Rosetta Neurons: Mining the Common Units in a Model Zoo 2023-06-17T16:38:16.906Z
[Linkpost] Mapping Brains with Language Models: A Survey 2023-06-16T09:49:23.043Z
[Linkpost] The neuroconnectionist research programme 2023-06-12T21:58:57.722Z
[Linkpost] Large Language Models Converge on Brain-Like Word Representations 2023-06-11T11:20:09.078Z
[Linkpost] Scaling laws for language encoding models in fMRI 2023-06-08T10:52:16.400Z

Comments

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Bogdan Ionut Cirstea's Shortform · 2024-05-08T20:45:38.190Z · LW · GW

Contra both the 'doomers' and the 'optimists' on (not) pausing. Rephrased: RSPs (done right) seem right.

Contra 'doomers'. Oversimplified, 'doomers' (e.g. PauseAI, FLI's letter, Eliezer) ask(ed) for pausing now or even earlier (e.g. the Pause Letter). I expect this would be / have been very much suboptimal, even purely in terms of solving technical alignment. For example, Some thoughts on automating alignment research suggests that timing the pause so that we can make use of automated AI safety research could mean that '[...] each month of lead that the leader started out with would correspond to 15,000 human researchers working for 15 months.' We clearly don't have such automated AI safety R&D capabilities now, suggesting that pausing later, when AIs are closer to having the required automated AI safety R&D capabilities, would be better. At the same time, current models seem very unlikely to be x-risky (e.g. they're still very bad at passing dangerous capabilities evals), which is another reason to think pausing now would be premature.

Contra 'optimists'. I'm more unsure here, but the vibe I'm getting from e.g. AI Pause Will Likely Backfire (Guest Post) is roughly something like 'no pause ever', largely based on arguments that current systems seem easy to align / control. While I agree that current systems do seem easy to align / control, and I could even see this holding all the way up to ~human-level automated AI safety R&D, I can easily see scenarios where, around that time, things get scary quickly without any pause. For example, similar arguments to those about the scalability of automated AI safety R&D suggest automated AI capabilities R&D could also be scaled up significantly. In particular, figures like those in Before smart AI, there will be many mediocre or specialized AIs suggest very large populations of ~human-level automated AI capabilities researchers could be deployed (e.g. 100x larger than the current [human] population of AI researchers). Given that even with the current relatively small population, algorithmic progress seems to double LM capabilities ~every 8 months, it seems like algorithmic progress could be much faster with 100x larger populations, potentially leading to new setups (e.g. new AI paradigms, new architectures, new optimizers, synthetic data, etc.) which could quite easily break the properties that make current systems seem relatively easy / safe to align. In this scenario, pausing to get this right (especially since automated AI safety R&D would also be feasible) seems like it could be crucial.

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Bogdan Ionut Cirstea's Shortform · 2024-05-08T19:23:29.270Z · LW · GW

Also a positive update for me on interdisciplinary conceptual alignment being automatable differentially soon; this has seemed plausible to me for a long time, since LLMs have 'read the whole internet' and interdisciplinary insights often seem (to me) to require relatively few inferential hops (plausibly because it's hard for humans to have [especially deep] expertise in many different domains), making them potentially feasible for LLMs differentially early (reliably making long inferential chains still seems among the harder things for LLMs).

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Bogdan Ionut Cirstea's Shortform · 2024-05-08T18:01:21.101Z · LW · GW

Decomposability seems like a fundamental assumption for interpretability and a condition for it to succeed. E.g. from Toy Models of Superposition:

'Decomposability: Neural network activations which are decomposable can be decomposed into features, the meaning of which is not dependent on the value of other features. (This property is ultimately the most important – see the role of decomposition in defeating the curse of dimensionality.) [...]

The first two (decomposability and linearity) are properties we hypothesize to be widespread, while the latter two (non-superposition and basis-aligned) are properties we believe only sometimes occur.'

If this assumption holds, it seems favorable to the prospects of safely automating large parts of interpretability work (e.g. using [V]LM agents like MAIA) differentially sooner than many other alignment research subareas, and sooner than the most consequential capabilities research (and likely before AIs become x-risky, e.g. before they pass many dangerous capabilities evals). For example, in a t-AGI framework, using an interpretability LM agent to search for the feature corresponding to a certain semantic direction should be a much shorter-horizon task than e.g. coming up with a new conceptual alignment agenda or a new ML architecture (as well as having much faster feedback loops than e.g. training a SOTA LM using a new architecture).

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Bogdan Ionut Cirstea's Shortform · 2024-05-08T17:10:32.677Z · LW · GW

Selected fragments (though not really cherry-picked, no reruns) of a conversation with Claude Opus on operationalizing something like Activation vector steering with BCI by applying the methodology of Concept Algebra for (Score-Based) Text-Controlled Generative Models to the model from High-resolution image reconstruction with latent diffusion models from human brain activity (website with nice illustrations of the model).

My prompts bolded:

'Could we do concept algebra directly on the fMRI of the higher visual cortex?
Yes, in principle, it should be possible to perform concept algebra directly on the fMRI signals from the higher visual cortex. The key idea would be to identify subspaces in the fMRI signal space that correspond to specific semantic concepts, and then manipulate these subspaces to control the reconstructed image.
Here's a hypothetical workflow:
 

  1. Collect fMRI data while the subject views a large set of images that vary along certain semantic dimensions (e.g., gender, emotion, object category, scene type, etc.).
  2. For each semantic dimension, identify a set of fMRI samples that differ only along that dimension. For example, for the gender dimension, you might compare the fMRI signals for "man" vs "woman", keeping other aspects of the image content constant.
  3. Use these samples to estimate a "semantic subspace" for each dimension in the fMRI signal space, similar to how concept algebra estimates these subspaces in the latent space of the diffusion model. You could use techniques like PCA or linear discriminant analysis to identify the directions in fMRI space that best capture the variation along each semantic dimension.
  4. To reconstruct an image, first predict the full fMRI signal from the early visual cortex (similar to predicting z in Takagi et al.).
  5. Then, to manipulate a specific semantic concept, project the predicted fMRI signal onto the orthogonal complement of the corresponding semantic subspace (to remove the original semantic information), and add the desired semantic component from a reference fMRI signal (similar to the concept algebra manipulation of z_c).
  6. Finally, use this modified fMRI signal from the higher visual cortex to condition the diffusion model, reconstructing an image with the desired semantic manipulation.

The main challenges would be:
 

  • Collecting a sufficiently large and diverse fMRI dataset to robustly estimate the semantic subspaces.
  • Ensuring that the estimated subspaces are truly orthogonal and capture the desired semantic variation.
  • Translating the manipulated fMRI signals into a format that can effectively condition the diffusion model.

Nonetheless, this is an intriguing direction that could potentially unite the biological understanding from Takagi et al.'s work with the interpretable semantic manipulation of concept algebra. It suggests a way to manipulate images directly from brain-space, without the need for text prompts. This could provide insights into how the brain itself represents and manipulates semantic concepts.'

'How might the above be relevant to the AI alignment problem?
The hypothetical approach of performing concept algebra directly on fMRI signals from the higher visual cortex could potentially offer some interesting insights relevant to the AI alignment problem:
 

  1. Understanding AI systems: By comparing the semantic subspaces identified in fMRI signals with those found in the latent spaces of AI models (like the diffusion model in concept algebra), we could gain insights into how AI systems represent and manipulate semantic concepts, and how this compares to the human brain. This could help in designing AI systems that reason in ways more analogous to humans, which could be beneficial for alignment.
  2. Interpretability and control: The ability to identify and manipulate specific semantic concepts directly in the space of brain activity could provide a more direct and interpretable way of controlling the output of AI systems. If we can align the semantic subspaces of the AI with those of the human brain, it might become possible to control the AI's outputs by manipulating these subspaces, leading to more interpretable and controllable AI systems.
  3. Value alignment: By understanding how human values and preferences are represented in the brain, and how they relate to the semantic concepts identified through concept algebra, we might gain insights into how to align AI systems with human values. If we can identify the neural subspaces corresponding to human values and ethics, we could potentially use this information to constrain the AI's outputs to align with these values.
  4. Cognitive modeling: This approach could also contribute to the development of more accurate cognitive models of human reasoning and decision-making. By understanding how semantic concepts are represented and manipulated in the brain, we can build AI systems that more closely mimic human cognition, which could be useful for predicting and understanding human behavior in the context of AI alignment.
  5. Neuro-symbolic AI: The idea of manipulating semantic concepts directly in the space of brain activity is reminiscent of neuro-symbolic AI approaches, which aim to combine the strengths of neural networks and symbolic reasoning. Insights from this fMRI-based concept algebra could potentially inform the development of neuro-symbolic AI systems that are more interpretable and aligned with human cognition.

However, it's important to note that these are speculative connections, and there would be significant technical and ethical challenges in realizing this approach. Nonetheless, exploring the parallels between the structure of human thought and the structure of AI representations could be a valuable direction for informing AI alignment research.'

'How would this be relevant in the context of this proposal? [note: the proposal being Activation vector steering with BCI]
The proposed concept algebra on fMRI signals could be highly relevant to the goals outlined in this funding proposal. Specifically:
 

  1. The proposal aims to find mappings between human brain states and the latent representations of language models. Applying concept algebra directly on fMRI signals, as discussed, could provide a more direct and interpretable way to establish these mappings. By identifying semantic subspaces in the fMRI signals and aligning them with subspaces in the language model's latent space, we could potentially "steer" the language model to exhibit brain states similar to specific people.
  2. The proposal suggests that reward models could be trained to reproduce humanlike brain states during evaluation, to improve their generalization. If concept algebra on fMRI can successfully manipulate semantic concepts in the reconstructed outputs (e.g., text or images), this could provide a way to directly optimize the reward models to produce outputs that match desired brain states.
  3. The proposal aims to advance scientific understanding of analogies between LLM behavior and human behavior. Comparing the semantic subspaces identified in fMRI with those in LLMs, as enabled by concept algebra, could provide insights into these analogies at a more fundamental, representational level.
  4. The proposed fMRI experiments could potentially provide the necessary data for attempting concept algebra on brain signals. The proposal includes plans for designing and implementing fMRI data-collection protocols, recruiting subjects, and administering the experiments. This data could be used to estimate the semantic subspaces needed for concept algebra.
  5. The ML experiments proposed, including the fMRI feature extraction pipeline and architecture modifications, could potentially be adapted to implement concept algebra on the fMRI signals, in addition to the original goal of mapping brain states to LLM representations.

In summary, while the original proposal does not explicitly mention concept algebra, the ideas discussed around applying concept algebra to fMRI signals could potentially enhance and extend the proposal's goals of connecting human brain states with AI representations, improving reward model generalization, and advancing scientific understanding of human-AI analogies. The proposed fMRI experiments could provide the necessary data, and the planned ML experiments could be extended to include concept algebra on the brain signals.'

'Above you wrote: 'Use these samples to estimate a "semantic subspace" for each dimension in the fMRI signal space, similar to how concept algebra estimates these subspaces in the latent space of the diffusion model. You could use techniques like PCA or linear discriminant analysis to identify the directions in fMRI space that best capture the variation along each semantic dimension.' What if instead I just took a large diverse dataset of fMRI of images of men and fMRI of images of women, took the means of the two, subtracted them, and used that as the sex direction in concept algebra?
Your proposed approach of taking the difference between the mean fMRI signals for men and women to define the "sex direction" is a great idea and could indeed work as a simple alternative to more complex methods like PCA or LDA.
Here's why this approach makes sense:
 

  1. If the fMRI signals for men and women differ primarily along a single direction in the high-dimensional fMRI space, then the difference between their means will point along this direction.
  2. By using a large and diverse dataset, you're essentially averaging out other sources of variation (e.g., individual differences, noise), leaving only the consistent difference between men and women.
  3. This is analogous to how concept algebra identifies concept directions in the latent space of the diffusion model: by taking the difference between the latent representations of prompts that differ only in the target concept.

To implement this:
 

  1. Collect a large dataset of fMRI signals for images of men and women, ensuring diversity in other aspects (age, race, emotion, background, etc.).
  2. Compute the mean fMRI signal for men (m_men) and women (m_women) separately.
  3. Compute the "sex direction" as: v_sex = m_women - m_men.
  4. To manipulate the sex of a reconstructed image, you could then project the fMRI signal onto the orthogonal complement of v_sex (to remove the original sex information), and then add or subtract a scaled version of v_sex to introduce the desired sex information.

The main advantages of this approach over PCA or LDA are simplicity and interpretability. It doesn't require any complex eigendecompositions or model fitting, and the resulting direction has a clear interpretation as the difference between men and women.
Potential limitations to keep in mind:
 

  1. If there are multiple independent directions along which men and women differ, this approach will only capture their average, not the individual directions.
  2. If there are significant interactions between sex and other factors (e.g., if the sex difference is itself different for different ages or emotions), this simple difference may not fully capture the sex effect.
  3. The approach assumes linearity - that the sex effect is a simple additive shift in fMRI space. If there are nonlinear interactions, this linear shift may not fully capture the effect.

Despite these potential limitations, I think this is a promising and elegantly simple approach to try for implementing concept algebra directly on fMRI signals. It's a great starting point that could be refined with more sophisticated methods if needed.'
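To make the mean-difference recipe above concrete, here's a minimal numpy sketch of the 'compute the direction, project it out, add back a scaled amount' steps; all names, shapes and the alpha scaling are illustrative assumptions on my part, not anything from Takagi et al. or the concept algebra paper:

```python
import numpy as np

def concept_direction(X_a: np.ndarray, X_b: np.ndarray) -> np.ndarray:
    """Unit-norm direction from the mean of class A to the mean of class B.

    X_a, X_b: (n_samples, n_voxels) arrays of (hypothetical) higher-visual-cortex
    fMRI features for images of class A / class B.
    """
    v = X_b.mean(axis=0) - X_a.mean(axis=0)
    return v / np.linalg.norm(v)

def edit_along_direction(x: np.ndarray, v: np.ndarray, alpha: float) -> np.ndarray:
    """Remove x's component along the unit vector v, then add back alpha * v.

    Mirrors the 'project onto the orthogonal complement of the semantic
    direction, then add the desired semantic component' step described above.
    """
    x_orth = x - np.dot(x, v) * v  # strip the original information along v
    return x_orth + alpha * v      # insert the desired amount of the concept

# Hypothetical usage: X_men, X_women are (n, d) arrays of fMRI features, and
# x is a single predicted fMRI signal to be edited before conditioning the decoder.
# v_sex = concept_direction(X_men, X_women)
# x_edited = edit_along_direction(x, v_sex, alpha=1.5)
```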

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Mechanistically Eliciting Latent Behaviors in Language Models · 2024-05-06T13:52:42.466Z · LW · GW

In future work, one could imagine automating the evaluation of the coherence and generalization of learned steering vectors, similarly to how Bills et al. (2023) automate interpretability of neurons in language models. For example, one could prompt a trusted model to produce queries that explore the limits and consistency of the behaviors captured by unsupervised steering vectors.

Probably even better to use interpretability agents (e.g. MAIA, AIA) for this, especially since they can do (iterative) hypothesis testing. 

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Bogdan Ionut Cirstea's Shortform · 2024-05-06T13:24:44.830Z · LW · GW

I wonder how much near-term interpretability [V]LM agents (e.g. MAIA, AIA) might help with finding better probes and better steering vectors (e.g. by iteratively testing counterfactual hypotheses against potentially spurious features, a major challenge for Contrast-consistent search (CCS)). 

This seems plausible since MAIA can already find spurious features, and future interpretability [V]LM agents could have much lengthier hypothesis-iteration cycles (compared to current [V]LM agents and perhaps even to human researchers).

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Bogdan Ionut Cirstea's Shortform · 2024-05-05T19:12:45.505Z · LW · GW

Also, TC0 is quite limited; see e.g. this presentation.

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Bogdan Ionut Cirstea's Shortform · 2024-05-05T19:11:14.628Z · LW · GW

I think I remember William Merrill (in a video) pointing out that the rational inputs assumption seems very unrealistic (would require infinite memory); and, from what I remember, https://arxiv.org/abs/2404.15758 and related papers made a different assumption about the number of bits of memory per parameter and per input. 

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Mechanistically Eliciting Latent Behaviors in Language Models · 2024-05-04T09:50:17.480Z · LW · GW

TruthX: Alleviating Hallucinations by Editing Large Language Models in Truthful Space seems to be using a contrastive approach for steering vectors (I've only skimmed though), it might be worth having a look.

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Mechanistically Eliciting Latent Behaviors in Language Models · 2024-05-02T17:35:31.172Z · LW · GW

Unsupervised Feature Detection: There is a rich literature on unsupervised feature detection in neural networks.

It might be interesting to add (some of) the literature doing unsupervised feature detection in GANs and in diffusion models (e.g. see recent work from Pinar Yanardag and citation trails). 

Relatedly, I wonder if, instead of / separately from the L2 distance, using something like a contrastive loss (similar to how it was used in NoiseCLR or in LatentCLR) might produce interesting / different results.
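For concreteness, a minimal sketch (under my own assumptions, not the post's actual setup) of an InfoNCE-style objective over the activation changes induced by a set of candidate steering vectors, roughly in the spirit of LatentCLR / NoiseCLR; all names and shapes are hypothetical:

```python
import torch
import torch.nn.functional as F

def contrastive_steering_loss(deltas: torch.Tensor,
                              direction_ids: torch.Tensor,
                              temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE-style loss over activation differences induced by steering vectors.

    deltas: (N, d) activation changes (steered minus unsteered run), hypothetical.
    direction_ids: (N,) index of the candidate steering vector behind each delta.
    Changes caused by the same direction are pulled together; changes caused by
    different directions are pushed apart (as in LatentCLR / NoiseCLR).
    """
    z = F.normalize(deltas, dim=-1)
    sim = z @ z.T / temperature                       # (N, N) cosine similarities
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float('-inf'))   # ignore self-similarity
    pos = (direction_ids.unsqueeze(0) == direction_ids.unsqueeze(1)) & ~self_mask
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    return -(log_prob[pos].sum() / pos.sum().clamp(min=1))
```

One would then optimize the candidate steering vectors against this loss, either instead of or alongside the L2-distance objective.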

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Interpreting the Learning of Deceit · 2024-04-29T18:27:49.906Z · LW · GW

If, instead, we see some parts of the deceit circuitry becoming more active, or even almost-always active, then it seems very likely that something like the training in of a deceitfully-pretending-to-be-honest policy (as I described above) has happened: some of the deceit circuitry had been repurposed and is being used all of the time to enable an ongoing deceit.

A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity seems to me very related in terms of methodology.

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Discriminating Behaviorally Identical Classifiers: a model problem for applying interpretability to scalable oversight · 2024-04-29T14:12:43.361Z · LW · GW

Any thoughts on how helpful it might be to try to automate the manual inspection and evaluation (of task-relevancy for each feature in the circuit) part from section 4 of the paper, using e.g. a future version of MAIA (to reduce human costs / make the proposal more scalable)?

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Bogdan Ionut Cirstea's Shortform · 2024-04-29T13:59:06.784Z · LW · GW

Language model agents for interpretability (e.g. MAIA, FIND) seem to be making fast progress, to the point where I expect it might be feasible to safely automate large parts of interpretability workflows soon.

Given the above, it might be high-value to start testing the integration of more interpretability tools into interpretability (V)LM agents like MAIA, and maybe even to consider randomized controlled trials to test for any productivity improvements they could already be providing.

For example, probing / activation steering workflows seem to me relatively short-horizon and at least somewhat standardized, to the point that I wouldn't be surprised if MAIA could already automate very large chunks of that work (with proper tool integration). (Disclaimer: I haven't done much probing / activation steering hands-on work [though I do follow such work quite closely and have supervised related projects], so my views here might be inaccurate).

While I'm not sure I can tell any 'pivotal story' about such automation, if I imagine e.g. 10x more research on probing and activation steering per year per researcher as a result of it, that still seems like it could be a huge win. Such work could also e.g. provide much more evidence for (or against) the linear representation hypothesis.

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Refusal in LLMs is mediated by a single direction · 2024-04-29T13:14:25.502Z · LW · GW

You might be interested in Concept Algebra for (Score-Based) Text-Controlled Generative Models, which uses both a somewhat similar empirical methodology for their concept editing and also provides theoretical reasons to expect the linear representation hypothesis to hold (I'd also interpret the findings here and those from other recent works, like Anthropic's sleeper probes, as evidence towards the linear representation hypothesis broadly).

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Bogdan Ionut Cirstea's Shortform · 2024-04-29T13:11:02.018Z · LW · GW

Thanks, seen it; see also the exchanges in the thread here: https://twitter.com/jacob_pfau/status/1784446572002230703

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Bogdan Ionut Cirstea's Shortform · 2024-04-26T18:02:22.018Z · LW · GW

Notably, the mainline approach for catching doesn't involve any internals usage at all, let alone labeling a bunch of things.

This was indeed my impression (except for potentially using steering vectors, which I think are mentioned in one of the sections in 'Catching AIs red-handed'), but I think not using any internals might be overconservative / might increase the monitoring / safety tax too much (I think this is probably true more broadly of the current control agenda framing).

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Bogdan Ionut Cirstea's Shortform · 2024-04-26T10:26:38.127Z · LW · GW

Hey Jacques, sure, I'd be happy to chat!  

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Bogdan Ionut Cirstea's Shortform · 2024-04-26T10:24:39.559Z · LW · GW

Yeah, I'm unsure if I can tell any 'pivotal story' very easily (e.g. I'd still be pretty skeptical of enumerative interp even with GPT-5-MAIA). But I do think, intuitively, GPT-5-MAIA might e.g. make 'catching AIs red-handed' using methods like those in this comment significantly easier / cheaper / more scalable.

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Bogdan Ionut Cirstea's Shortform · 2024-04-25T11:35:38.397Z · LW · GW

I expect large parts of interpretability work could be safely automated very soon (e.g. on GPT-5 timelines) using (V)LM agents; see A Multimodal Automated Interpretability Agent for a prototype.

Notably, MAIA (GPT-4V-based) seems approximately human-level on a bunch of interp tasks, while (overwhelmingly likely) being non-scheming (e.g. current models are bad at situational awareness and out-of-context reasoning) and basically-not-x-risky (e.g. bad at ARA).

Given the potential scalability of automated interp, I'd be excited to see plans to use large amounts of compute on it (including e.g. explicit integrations with agendas like superalignment or control; for example, given non-dangerous-capabilities, MAIA seems framable as a 'trusted' model in control terminology).

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on [Full Post] Progress Update #1 from the GDM Mech Interp Team · 2024-04-21T12:56:37.943Z · LW · GW

A proxy that may be slightly less imperfect is auto-interp, a technique introduced by Bills et al. We take the text that highly activates a proposed feature, and have an LLM like GPT-4 or Gemini Ultra try to find an explanation for the common pattern in these texts. We then give the LLM some new text, and this natural language explanation, and have it predict the activations (often quantized to integers between 0 and 10) on this new text, and score it on those predictions

This seems conceptually very related to cycle-consistency and backtranslation losses, on which there are large existing literatures that might be worth a look, including e.g. theoretical results like those in On Translation and Reconstruction Guarantees of the Cycle-Consistent Generative Adversarial Networks or Towards Identifiable Unsupervised Domain Translation: A Diversified Distribution Matching Approach.
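For concreteness, a minimal sketch of the quoted auto-interp loop, with the two LLM calls left as hypothetical callables (this is not the actual Bills et al. or GDM implementation, just the cycle it describes):

```python
from typing import Callable, Sequence
import numpy as np

def auto_interp_score(
    explain: Callable[[Sequence[str], Sequence[np.ndarray]], str],
    simulate: Callable[[str, str], np.ndarray],
    top_activating_texts: Sequence[str],
    top_activations: Sequence[np.ndarray],
    held_out_texts: Sequence[str],
    held_out_activations: Sequence[np.ndarray],
) -> float:
    """Cycle-consistency-style scoring of a proposed feature.

    explain: hypothetical LLM call mapping (texts, activations) -> natural-language explanation.
    simulate: hypothetical LLM call mapping (explanation, text) -> predicted per-token activations.
    Returns the correlation between simulated and true activations on held-out text.
    """
    explanation = explain(top_activating_texts, top_activations)
    preds, truths = [], []
    for text, acts in zip(held_out_texts, held_out_activations):
        preds.append(simulate(explanation, text))
        truths.append(acts)
    preds = np.concatenate(preds)
    truths = np.concatenate(truths)
    return float(np.corrcoef(preds, truths)[0, 1])
```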

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Bogdan Ionut Cirstea's Shortform · 2024-04-18T18:40:08.219Z · LW · GW

Recent long-context LLMs seem to exhibit scaling laws with respect to context length - e.g. fig. 6 on page 8 of Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context, and fig. 1 on page 1 of Effective Long-Context Scaling of Foundation Models.

The long contexts also seem very helpful for in-context learning, e.g. Many-Shot In-Context Learning.

This seems differentially good for safety (e.g. vs. models with larger forward passes but shorter context windows to achieve the same perplexity), since longer context and in-context learning are differentially transparent.

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Why Would Belief-States Have A Fractal Structure, And Why Would That Matter For Interpretability? An Explainer · 2024-04-18T18:14:03.323Z · LW · GW

With that in mind, the real hot possibility is the inverse of what Shai and his coresearchers did. Rather than start with a toy model with some known nice latents, start with a net trained on real-world data, and go look for self-similar sets of activations in order to figure out what latent variables the net models its environment as containing. The symmetries of the set would tell us something about how the net updates its distributions over latents in response to inputs and time passing, which in turn would inform how the net models the latents as relating to its inputs, which in turn would inform which real-world structures those latents represent.

The theory-practice gap here looks substantial. Even on this toy model, the fractal embedded in the net is clearly very very noisy, which would make it hard to detect the self-similarity de novo. And in real-world nets, everything would be far higher dimensional, and have a bunch of higher-level structure in it (not just a simple three-state hidden Markov model). Nonetheless, this is the sort of problem where finding a starting point which could solve the problem even in principle is quite difficult, so this one is potentially a big deal.

This seems very much related to agendas like How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme and Searching for a model's concepts by their shape – a theoretical framework.

Some (additional) hope for locating the latent representations might come from recent theoretical results around convergence to approximate causal world models and linear representations of causally separable / independent variables in such world models; see this comment. E.g. in Concept Algebra for (Score-Based) Text-Controlled Generative Models they indeed seem able to locate some relevant latents (in Stable Diffusion) using the linearity assumptions.

There's also a large literature out there of using unsupervised priors / constraints, e.g. to look for steering directions inside diffusion models or GANs, including automatically. See for example many recent papers from Pinar Yanardag.

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Why Would Belief-States Have A Fractal Structure, And Why Would That Matter For Interpretability? An Explainer · 2024-04-18T18:04:55.444Z · LW · GW

One obvious guess there would be that the factorization structure is exploited, e.g. independence and especially conditional independence/DAG structure. And then a big question is how distributions of conditionally independent latents in particular end up embedded.

There are some theoretical reasons to expect linear representations for variables which are causally separable / independent. See recent work from Victor Veitch's group, e.g. Concept Algebra for (Score-Based) Text-Controlled Generative Models, The Linear Representation Hypothesis and the Geometry of Large Language Models, On the Origins of Linear Representations in Large Language Models.

Separately, there are theoretical reasons to expect convergence to approximate causal models of the data generating process, e.g. Robust agents learn causal world models.

Linearity might also make it (provably) easier to find the concepts, see Learning Interpretable Concepts: Unifying Causal Representation Learning and Foundation Models.

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Current AIs Provide Nearly No Data Relevant to AGI Alignment · 2024-04-16T19:52:26.130Z · LW · GW

The weak single forward passes argument also applies to SSMs like Mamba for very similar theoretical reasons.

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Bogdan Ionut Cirstea's Shortform · 2024-04-16T19:50:06.340Z · LW · GW

Like transformers, SSMs like Mamba also have weak single forward passes: The Illusion of State in State-Space Models (summary thread). As suggested previously in The Parallelism Tradeoff: Limitations of Log-Precision Transformers, this may be due to a fundamental tradeoff between parallelizability and expressivity:

'We view it as an interesting open question whether it is possible to develop SSM-like models with greater expressivity for state tracking that also have strong parallelizability and learning dynamics, or whether these different goals are fundamentally at odds, as Merrill & Sabharwal (2023a) suggest.'

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on What convincing warning shot could help prevent extinction from AI? · 2024-04-16T09:06:45.983Z · LW · GW

For example, a researcher I've been talking to, when asked what they would need to update, answered, "An AI takes control of a data center." This would be probably too late.

Very much to avoid, but I'm skeptical it 'would be probably too late' (especially if I assume humans are aware of the data center takeover); see e.g. from What if self-exfiltration succeeds?:

'Most likely the model won’t be able to compete on making more capable LLMs, so its capabilities will become stale over time and thus it will lose relative influence. Competing on the state of the art of LLMs is quite hard: the model would need to get access to a sufficiently large number of GPUs and it would need to have world-class machine learning skills. It would also mean that recursive self-improvement is already possible and could be done by the original model owner (as long as they have sufficient alignment techniques). The model could try fine-tuning itself to be smarter, but it’s not clear how to do this and the model would need to worry about currently unsolved alignment problems.'

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on What convincing warning shot could help prevent extinction from AI? · 2024-04-16T08:56:27.785Z · LW · GW

A 10-year global pause would allow for a lot of person-years-equivalents of automated AI safety R&D. E.g. from Some thoughts on automating alignment research (under some assumptions mentioned in the post): 'each month of lead that the leader started out with would correspond to 15,000 human researchers working for 15 months.' And for different assumptions the numbers could be [much] larger still: 'For a model trained with 1000x the compute, over the course of 4 rather than 12 months, you could 100x as many models in parallel.[9] You’d have 1.5 million researchers working for 15 months.'

This would probably obsolete all previous AI safety R&D.

Of course, this assumes you'd be able to use automated AI safety R&D safely and productively. I'm relatively optimistic that a world which would be willing to enforce a 10-year global pause would also invest enough in e.g. a mix of control and superalignment to do this.

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Experiments with an alternative method to promote sparsity in sparse autoencoders · 2024-04-16T08:50:01.551Z · LW · GW

There's also an entire literature on variations of [e.g. sparse or disentangled] autoencoders, with different losses and priors, that might be worth looking at and that I suspect SAE interp people have barely explored; some of it is literally decades old. E.g. as a potential starting point, https://lilianweng.github.io/posts/2018-08-12-vae/ and the citation trails to and from e.g. k-sparse autoencoders.
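As one concrete example from that older literature, here's a minimal sketch of a k-sparse autoencoder (keep only the top-k latent activations per sample, rather than using an L1 penalty); the dimensions and usage below are hypothetical:

```python
import torch
import torch.nn as nn

class KSparseAutoencoder(nn.Module):
    """Minimal k-sparse autoencoder: keep only the top-k latent activations per
    sample, as an alternative way of promoting sparsity."""

    def __init__(self, d_input: int, d_hidden: int, k: int):
        super().__init__()
        self.encoder = nn.Linear(d_input, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_input)
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = torch.relu(self.encoder(x))
        # zero out everything except the k largest activations per sample
        topk = torch.topk(h, self.k, dim=-1)
        h_sparse = torch.zeros_like(h).scatter_(-1, topk.indices, topk.values)
        return self.decoder(h_sparse)

# Hypothetical usage on activation vectors of dimension 768:
# model = KSparseAutoencoder(d_input=768, d_hidden=8192, k=32)
# loss = torch.nn.functional.mse_loss(model(acts), acts)
```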

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on AXRP Episode 27 - AI Control with Buck Shlegeris and Ryan Greenblatt · 2024-04-11T21:57:21.223Z · LW · GW

YouTube link

Error message: "Video unavailable
This video is private"

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Bogdan Ionut Cirstea's Shortform · 2024-03-30T09:06:22.150Z · LW · GW

Examples of reasons to expect (approximate) convergence to the same causal world models in various setups: theorem 2 in Robust agents learn causal world models; from Deep de Finetti: Recovering Topic Distributions from Large Language Models: 'In particular, given the central role of exchangeability in our analysis, this analysis would most naturally be extended to other latent variables that do not depend heavily on word order, such as the author of the document [Andreas, 2022] or the author’s sentiment' (this assumption might be expected to be approximately true for quite a few alignment-relevant-concepts); results from Victor Veitch: Linear Structure of (Causal) Concepts in Generative AI.

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Before smart AI, there will be many mediocre or specialized AIs · 2024-03-30T08:50:00.970Z · LW · GW

From https://epochai.org/blog/optimally-allocating-compute-between-inference-and-training, seems consistent with this post's main assumption: 'If it is feasible to trade off inference and training compute, we find that it is optimal for AI labs to spend similar amounts on training and inference.' 

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on LawrenceC's Shortform · 2024-03-20T12:26:01.705Z · LW · GW

Seems relevant - RNNs are not Transformers (Yet): The Key Bottleneck on In-context Retrieval:

'Our theoretical analysis reveals that CoT improves RNNs but is insufficient to close the gap with Transformers. A key bottleneck lies in the inability of RNNs to perfectly retrieve information from the context, even with CoT: for several tasks that explicitly or implicitly require this capability, such as associative recall and determining if a graph is a tree, we prove that RNNs are not expressive enough to solve the tasks while Transformers can solve them with ease. Conversely, we prove that adopting techniques to enhance the in-context retrieval capability of RNNs, including Retrieval-Augmented Generation (RAG) and adding a single Transformer layer, can elevate RNNs to be capable of solving all polynomial-time solvable problems with CoT, hence closing the representation gap with Transformers.'

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Many arguments for AI x-risk are wrong · 2024-03-19T20:52:47.569Z · LW · GW

Note that even if RLHF takes a non-scheming pretrained model and converts it to a schemer, if this RLHF doesn't teach the model new things, then we can always train another copy of the pretrained model to be a monitor for the RLHF'd model and this monitor will be just as competent as the RLHF'd model. So scheming seems like substantially less of a problem in this case. (We'd need to use this monitor for all potentially dangerous actions to get safety properties.) (This is similar to the proposal in Appendix G of the weak-to-strong generalization paper, but with this addition that you deploy the reward model as a monitor which is required for any interesting guarantees.)

I'd personally like to see this written up in more detail (or a reference). Also, is it Appendix G of the weak-to-strong generalization paper? I looked briefly and it didn't seem very related.

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Current AIs Provide Nearly No Data Relevant to AGI Alignment · 2024-03-19T12:37:39.426Z · LW · GW

One additional, probably important distinction / nuance: there are also theoretical results for why CoT should help not just with one-forward-pass expressivity but also with learning. E.g. the result in Auto-Regressive Next-Token Predictors are Universal Learners is about learning; similarly for Sub-Task Decomposition Enables Learning in Sequence to Sequence Tasks, Why Can Large Language Models Generate Correct Chain-of-Thoughts?, and Why think step by step? Reasoning emerges from the locality of experience.

The learning aspect could be strategically crucial with respect to what the first transformatively-useful AIs should look like (also see e.g. discussion here and here), in the sense that it adds further reasons to think the first such AIs should probably (differentially) benefit from learning from data using intermediate outputs like CoT; or at least have a pretraining-like phase involving such intermediate outputs, even if these might later be distilled or modified some other way - e.g. replaced with [less transparent] recurrence.

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Bogdan Ionut Cirstea's Shortform · 2024-03-19T12:05:07.984Z · LW · GW

- For scheming, the model could reason about "should I still stay undercover", "what should I do in case I should stay undercover" and "what should I do in case it's time to attack" in parallel, finally using only one serial step to decide on its action.

I am also very interested in e.g. how one could operationalize the number of hops of out-of-context reasoning required for various types of scheming, especially scheming in one forward pass; and especially in the context of automated AI safety R&D.

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Bogdan Ionut Cirstea's Shortform · 2024-03-19T12:02:16.928Z · LW · GW

I would expect, generally, solving tasks in parallel to be fundamentally hard in one forward pass for pretty much all current SOTA architectures (especially Transformers and modern RNNs like Mamba). See e.g. this comment of mine, and other related works like https://twitter.com/bohang_zhang/status/1664695084875501579, https://twitter.com/bohang_zhang/status/1664695108447399937 (video presentation), Sub-Task Decomposition Enables Learning in Sequence to Sequence Tasks, and RNNs are not Transformers (Yet): The Key Bottleneck on In-context Retrieval.

There might be more such results I'm currently forgetting about, but they should be relatively easy to find by e.g. following citation trails (to and from the above references) with Google Scholar (or by looking at my recent comments / short forms).

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Neuroscience and Alignment · 2024-03-19T09:58:21.497Z · LW · GW

First is that I don't really expect us to come up with a fully general answer to this problem in time. I wouldn't be surprised if we had to trade off some generality for indexing on the system in front of us - this gets us some degree of non-robustness, but hopefully enough to buy us a lot more time before stuff like the problem behind deep deception breaks a lack of True Names. Hopefully then we can get the AI systems to solve the harder problem for us in the time we've bought, with systems more powerful than us. The relevance here is that if this is the case, then trying to generalize our findings to an entirely non-ML setting, while definitely something we want, might not be something we get, and maybe it makes sense to index lightly on a particular paradigm if the general problem seems really hard.

yes, e.g. https://www.lesswrong.com/posts/wr2SxQuRvcXeDBbNZ/bogdan-ionut-cirstea-s-shortform?commentId=GRjfMwLDFgw6qLnDv 

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Bogdan Ionut Cirstea's Shortform · 2024-03-19T09:55:54.106Z · LW · GW

Or maybe not; apparently LLMs are (mostly) not helped by filler tokens.

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Neuroscience and Alignment · 2024-03-19T09:29:59.661Z · LW · GW

When I think of how I approach solving this problem in practice, I think of interfacing with structures within ML systems that satisfy an increasing list of desiderata for values, covering the rest with standard mech interp techniques, and then steering them with human preferences. I certainly think it's probable that there are valuable insights to this process from neuroscience, but I don't think a good solution to this problem (under the constraints I mention above) requires that it be general to the human brain as well. We steer the system with our preferences (and interfacing with internal objectives seems to avoid the usual problems with preferences) - while something that allows us to actually directly translate our values from our system to theirs would be great, I expect that constraint of generality to make it harder than necessary.

I think the seeds of an interdisciplinary agenda on this are already there, see e.g. https://manifund.org/projects/activation-vector-steering-with-bci, https://www.lesswrong.com/posts/GfZfDHZHCuYwrHGCd/without-fundamental-advances-misalignment-and-catastrophe?commentId=WLCcQS5Jc7NNDqWi5, https://www.lesswrong.com/posts/GfZfDHZHCuYwrHGCd/without-fundamental-advances-misalignment-and-catastrophe?commentId=D6NCcYF7Na5bpF5h5, https://www.lesswrong.com/posts/wr2SxQuRvcXeDBbNZ/bogdan-ionut-cirstea-s-shortform?commentId=A8muL55dYxR3tv5wp and maybe my other comments on this post. 

I might have a shortform going into more detail on this soon, or at least by the time of https://foresight.org/2024-foresight-neurotech-bci-and-wbe-for-safe-ai-workshop/

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Neuroscience and Alignment · 2024-03-19T09:22:14.128Z · LW · GW

If I had access to a neuroscience or tech lab (and the relevant skills), I'd be doing that rather than ML.

It sounds to me like you should seriously consider doing work which might look like e.g. https://www.lesswrong.com/posts/eruHcdS9DmQsgLqd4/inducing-human-like-biases-in-moral-reasoning-lms, Getting aligned on representational alignment, or Training language models to summarize narratives improves brain alignment; also see a lot of recent work from Ilia Sucholutsky and from this workshop.

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Neuroscience and Alignment · 2024-03-19T08:59:47.000Z · LW · GW

Why is value alignment different from these? Because we have a working example of a value-aligned system right in front of us: The human brain. This permits an entirely scientific approach, requiring minimal philosophical deconfusion. And in contrast to corrigibility solutions, biological and artificial neural networks are based upon the same fundamental principles, so there's a much greater chance that insights from the one easily work in the other.

The similarities go even deeper, I'd say; see e.g. The neuroconnectionist research programme for a review and quite a few of my past linkposts (e.g. on representational alignment and how it could be helpful for value alignment, on evidence of [by default] (some) representational alignment between LLMs and humans, etc.); and https://www.lesswrong.com/posts/eruHcdS9DmQsgLqd4/inducing-human-like-biases-in-moral-reasoning-lms

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Neuroscience and Alignment · 2024-03-19T08:54:03.060Z · LW · GW

I've been in many conversations where I've mentioned the idea of using neuroscience for outer alignment, and the people who I'm talking to usually seem pretty confused about why I would want to do that. Well, I'm confused about why one wouldn't want to do that, and in this post I explain why.

I've had related thoughts (and still do, though it's become more of a secondary research agenda); might be interesting to chat (more) during https://foresight.org/2024-foresight-neurotech-bci-and-wbe-for-safe-ai-workshop/

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on More people getting into AI safety should do a PhD · 2024-03-15T04:19:35.467Z · LW · GW

Perhaps even better: 'more people who already have PhDs should be recruited to do AI safety'.

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Many arguments for AI x-risk are wrong · 2024-03-15T03:09:09.499Z · LW · GW

It seems like the question you're asking is close to (2) in my above decomposition.

Yup.

Aren't you worried that long before human obsoleting AI (or AI safety researcher obsoleting AI), these architectures are very uncompetitive and thus won't be viable given realistic delay budgets?

Quite uncertain about all this, but I have short timelines and expect that likely not many more OOMs of effective compute will be needed to get to e.g. something which can 30x AI safety research (as long as we really try). I expect shorter timelines / OOM 'gaps' to come along with e.g. fewer architectural changes, all else equal. There are also broader reasons why I think it's quite plausible the high-level considerations might not change much even given some architectural changes, discussed in the weak-forward-pass comment (e.g. 'the parallelism tradeoff').

I have some hope for a plan like:

  • Control early transformative AI
  • Use these AIs to make a safer approach much more competitive. (Maybe the approach consists of AIs which are too weak to scheme in a forward pass but which are combined into some crazy bureaucracy and made very cheap to run.)
  • Use those next AIs to do something.

(This plan has the downside that it probably requires doing a bunch of general purpose capabilities which might make the situation much more unstable and volatile due to huge compute overhang if fully scaled up in the most performant (but unsafe) way.)

Sounds pretty good to me; I guess the crux (as hinted at during some personal conversations too) might be that I'm just much more optimistic about this being feasible without huge capabilities pushes (again, some arguments are in the weak-forward-pass comment, e.g. about CoT distillation seeming to work decently - which helps ensure that, for a fixed level of capabilities, more of it comes from scaffolding and less from one forward pass; or about CoT length / inference complexity tradeoffs).

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Many arguments for AI x-risk are wrong · 2024-03-14T05:03:36.321Z · LW · GW

A big part of how I'm thinking about this is a very related version:

If you keep scaling up networks with pretraining, light RLHF and various differentially transparent scaffolding, what comes first: obsoleting AI safety researchers, or scheming?

One can also focus on various types of scheming and when / in which order they'd happen. E.g. I'd be much more worried about scheming in one forward pass than about scheming in CoT (which seems more manageable using e.g. control methods), but I also expect the former to happen later, where 'later' can be operationalized in terms of requiring more effective compute.

I think The direct approach could provide a potential (rough) upper bound for the effective compute required to obsolete AI safety researchers. Though I'd prefer more substantial empirical evidence based on e.g. something like scaling laws on automated AI safety R&D evals.

Similarly, one can imagine evals for different capabilities which would be prerequisites for various types of scheming and doing scaling laws on those; e.g. for out-of-context reasoning, where multi-hop out-of-context reasoning seems necessary for instrumental deceptive reasoning in one forward pass (as a prerequisite for scheming in one forward pass).

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Many arguments for AI x-risk are wrong · 2024-03-14T04:23:53.762Z · LW · GW

I think of "light RLHF" as "RLHF which doesn't teach the model qualitatively new things, but instead just steers the model at a high level". In practice, a single round of DPO on <100,000 examples surely counts, but I'm unsure about the exact limits.

(In principle, a small amount of RL can update a model very far, I don't think we see this in practice.)

Empirical evidence about this indeed being the case for DPO.

Note that even if RLHF takes a non-scheming pretrained model and converts it to a schemer, if this RLHF doesn't teach the model new things, then we can always train another copy of the pretrained model to be a monitor for the RLHF'd model and this monitor will be just as competent as the RLHF'd model. So scheming seems like substantially less of a problem in this case. (We'd need to use this monitor for all potentially dangerous actions to get safety properties.) (This is similar to the proposal in Appendix G of the weak-to-strong generalization paper, but with this addition that you deploy the reward model as a monitor which is required for any interesting guarantees.)

Also see Interpreting the learning of deceit for another proposal/research agenda to deal with this threat model.

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Bogdan Ionut Cirstea's Shortform · 2024-03-11T16:55:43.919Z · LW · GW

Hmm, unsure about this. E.g. the development models of many in the alignment community before GPT-3 (often heavily focused on RL or even on GOFAI) seem quite substantially worse in retrospect than those of some of the most famous deep learning people (e.g. LeCun's cake); of course, this may be an unfair / biased comparison using hindsight. I'm unsure how much theoretical results were influencing the famous deep learning people (and e.g. classic learning theory results would probably have been misleading), but it doesn't seem obvious they had zero influence. For example, Bengio has multiple at least somewhat conceptual / theoretical (including review) papers motivating deep / representation learning; e.g. Representation Learning: A Review and New Perspectives.

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Bogdan Ionut Cirstea's Shortform · 2024-03-11T07:10:56.285Z · LW · GW

Some more (somewhat) related papers:

Rethinking Model-based, Policy-based, and Value-based Reinforcement Learning via the Lens of Representation Complexity ('We first demonstrate that, for a broad class of Markov decision processes (MDPs), the model can be represented by constant-depth circuits with polynomial size or Multi-Layer Perceptrons (MLPs) with constant layers and polynomial hidden dimension. However, the representation of the optimal policy and optimal value proves to be NP-complete and unattainable by constant-layer MLPs with polynomial size. This demonstrates a significant representation complexity gap between model-based RL and model-free RL, which includes policy-based RL and value-based RL. To further explore the representation complexity hierarchy between policy-based RL and value-based RL, we introduce another general class of MDPs where both the model and optimal policy can be represented by constant-depth circuits with polynomial size or constant-layer MLPs with polynomial size. In contrast, representing the optimal value is P-complete and intractable via a constant-layer MLP with polynomial hidden dimension. This accentuates the intricate representation complexity associated with value-based RL compared to policy-based RL. In summary, we unveil a potential representation complexity hierarchy within RL -- representing the model emerges as the easiest task, followed by the optimal policy, while representing the optimal value function presents the most intricate challenge.').

On Representation Complexity of Model-based and Model-free Reinforcement Learning ('We prove theoretically that there exists a broad class of MDPs such that their underlying transition and reward functions can be represented by constant depth circuits with polynomial size, while the optimal Q-function suffers an exponential circuit complexity in constant-depth circuits. By drawing attention to the approximation errors and building connections to complexity theory, our theory provides unique insights into why model-based algorithms usually enjoy better sample complexity than model-free algorithms from a novel representation complexity perspective: in some cases, the ground-truth rule (model) of the environment is simple to represent, while other quantities, such as Q-function, appear complex. We empirically corroborate our theory by comparing the approximation error of the transition kernel, reward function, and optimal Q-function in various Mujoco environments, which demonstrates that the approximation errors of the transition kernel and reward function are consistently lower than those of the optimal Q-function. To the best of our knowledge, this work is the first to study the circuit complexity of RL, which also provides a rigorous framework for future research.').

Demonstration-Regularized RL ('Incorporating expert demonstrations has empirically helped to improve the sample efficiency of reinforcement learning (RL). This paper quantifies theoretically to what extent this extra information reduces RL's sample complexity. In particular, we study the demonstration-regularized reinforcement learning that leverages the expert demonstrations by KL-regularization for a policy learned by behavior cloning. Our findings reveal that using NE expert demonstrations enables the identification of an optimal policy at a sample complexity of order O˜(Poly(S,A,H)/(ε^2 * N^E)) in finite and O˜(Poly(d,H)/(ε^2 * N^E)) in linear Markov decision processes, where ε is the target precision, H the horizon, A the number of action, S the number of states in the finite case and d the dimension of the feature space in the linear case. As a by-product, we provide tight convergence guarantees for the behaviour cloning procedure under general assumptions on the policy classes. Additionally, we establish that demonstration-regularized methods are provably efficient for reinforcement learning from human feedback (RLHF). In this respect, we provide theoretical evidence showing the benefits of KL-regularization for RLHF in tabular and linear MDPs. Interestingly, we avoid pessimism injection by employing computationally feasible regularization to handle reward estimation uncertainty, thus setting our approach apart from the prior works.').

Limitations of Agents Simulated by Predictive Models ('There is increasing focus on adapting predictive models into agent-like systems, most notably AI assistants based on language models. We outline two structural reasons for why these models can fail when turned into agents. First, we discuss auto-suggestive delusions. Prior work has shown theoretically that models fail to imitate agents that generated the training data if the agents relied on hidden observations: the hidden observations act as confounding variables, and the models treat actions they generate as evidence for nonexistent observations. Second, we introduce and formally study a related, novel limitation: predictor-policy incoherence. When a model generates a sequence of actions, the model's implicit prediction of the policy that generated those actions can serve as a confounding variable. The result is that models choose actions as if they expect future actions to be suboptimal, causing them to be overly conservative. We show that both of those failures are fixed by including a feedback loop from the environment, that is, re-training the models on their own actions. We give simple demonstrations of both limitations using Decision Transformers and confirm that empirical results agree with our conceptual and formal analysis. Our treatment provides a unifying view of those failure modes, and informs the question of why fine-tuning offline learned policies with online learning makes them more effective.').

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Bogdan Ionut Cirstea's Shortform · 2024-03-11T06:19:08.529Z · LW · GW

A brief list of resources with theoretical results which seem to imply RL is much more difficult (e.g. sample-efficiency-wise) than IL (imitation learning). (I don't feel like I have enough theoretical RL expertise or time to closely scrutinize the arguments, but would love for others to pitch in.) Probably at least somewhat relevant w.r.t. discussions of what the first AIs capable of obsoleting humanity could look like:

Paper: Is a Good Representation Sufficient for Sample Efficient Reinforcement Learning? (quote: 'This work shows that, from the statistical viewpoint, the situation is far subtler than suggested by the more traditional approximation viewpoint, where the requirements on the representation that suffice for sample efficient RL are even more stringent. Our main results provide sharp thresholds for reinforcement learning methods, showing that there are hard limitations on what constitutes good function approximation (in terms of the dimensionality of the representation), where we focus on natural representational conditions relevant to value-based, model-based, and policy-based learning. These lower bounds highlight that having a good (value-based, model-based, or policy-based) representation in and of itself is insufficient for efficient reinforcement learning, unless the quality of this approximation passes certain hard thresholds. Furthermore, our lower bounds also imply exponential separations on the sample complexity between 1) value-based learning with perfect representation and value-based learning with a good-but-not-perfect representation, 2) value-based learning and policy-based learning, 3) policy-based learning and supervised learning and 4) reinforcement learning and imitation learning.')

Talks (very likely with some redundancy): 

  • Is a Good Representation Sufficient for Sample Efficient Reinforcement Learning - Sham Kakade
  • What is the Statistical Complexity of Reinforcement Learning? (and another two versions)

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Going Beyond Linear Mode Connectivity: The Layerwise Linear Feature Connectivity · 2024-03-09T06:14:43.564Z · LW · GW

Seems relevant (but I've only skimmed): Training-Free Pretrained Model Merging.