Posts

[Linkpost] MindEye2: Shared-Subject Models Enable fMRI-To-Image With 1 Hour of Data 2024-03-10T01:30:46.477Z
Inducing human-like biases in moral reasoning LMs 2024-02-20T16:28:11.424Z
AISC project: How promising is automating alignment research? (literature review) 2023-11-28T14:47:29.372Z
[Linkpost] OpenAI's Interim CEO's views on AI x-risk 2023-11-20T13:00:40.589Z
[Linkpost] Concept Alignment as a Prerequisite for Value Alignment 2023-11-04T17:34:36.563Z
[Linkpost] Generalization in diffusion models arises from geometry-adaptive harmonic representation 2023-10-11T17:48:24.500Z
[Linkpost] Large language models converge toward human-like concept organization 2023-09-02T06:00:45.504Z
[Linkpost] Robustified ANNs Reveal Wormholes Between Human Category Percepts 2023-08-17T19:10:39.553Z
[Linkpost] Personal and Psychological Dimensions of AI Researchers Confronting AI Catastrophic Risks 2023-08-12T22:02:09.895Z
[Linkpost] Applicability of scaling laws to vision encoding models 2023-08-05T11:10:35.599Z
[Linkpost] Multimodal Neurons in Pretrained Text-Only Transformers 2023-08-04T15:29:16.957Z
[Linkpost] Deception Abilities Emerged in Large Language Models 2023-08-03T17:28:19.193Z
[Linkpost] Interpreting Multimodal Video Transformers Using Brain Recordings 2023-07-21T11:26:39.497Z
Bogdan Ionut Cirstea's Shortform 2023-07-13T22:29:07.851Z
[Linkpost] A shared linguistic space for transmitting our thoughts from brain to brain in natural conversations 2023-07-01T13:57:56.021Z
[Linkpost] Rosetta Neurons: Mining the Common Units in a Model Zoo 2023-06-17T16:38:16.906Z
[Linkpost] Mapping Brains with Language Models: A Survey 2023-06-16T09:49:23.043Z
[Linkpost] The neuroconnectionist research programme 2023-06-12T21:58:57.722Z
[Linkpost] Large Language Models Converge on Brain-Like Word Representations 2023-06-11T11:20:09.078Z
[Linkpost] Scaling laws for language encoding models in fMRI 2023-06-08T10:52:16.400Z

Comments

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on [Full Post] Progress Update #1 from the GDM Mech Interp Team · 2024-04-21T12:56:37.943Z · LW · GW

A proxy that may be slightly less imperfect is auto-interp, a technique introduced by Bills et al. We take the text that highly activates a proposed feature, and have an LLM like GPT-4 or Gemini Ultra try to find an explanation for the common pattern in these texts. We then give the LLM some new text, and this natural language explanation, and have it predict the activations (often quantized to integers between 0 and 10) on this new text, and score it on those predictions

This seems conceptually very related to cycle consistency and backtranslation losses, on which there are large existing literatures it might be worth having a look at, including e.g. theoretical results like in On Translation and Reconstruction Guarantees of the Cycle-Consistent Generative Adversarial Networks or Towards Identifiable Unsupervised Domain Translation: A Diversified Distribution Matching Approach
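To make the analogy concrete, here is a minimal, hypothetical sketch of the explain-then-predict loop from the quote above, written so that the cycle-consistency structure (activations → natural-language explanation → reconstructed activations) is explicit; `llm_explain` and `llm_simulate` are stand-ins for LLM calls, not any particular API.

```python
# Hypothetical sketch of auto-interp scoring, highlighting its cycle-consistency
# structure: feature activations -> explanation -> predicted activations,
# scored against the true activations. llm_explain / llm_simulate are stand-ins.
import numpy as np

def llm_explain(top_activating_texts: list[str]) -> str:
    """Ask an LLM for the common pattern in texts that highly activate a feature."""
    raise NotImplementedError  # e.g. a GPT-4 / Gemini call in practice

def llm_simulate(explanation: str, tokens: list[str]) -> list[int]:
    """Ask an LLM to predict quantized activations (0-10) for new tokens,
    given only the natural-language explanation."""
    raise NotImplementedError

def autointerp_score(true_acts: np.ndarray, top_activating_texts: list[str],
                     held_out_tokens: list[str]) -> float:
    explanation = llm_explain(top_activating_texts)                   # "encode" into language
    predicted = np.array(llm_simulate(explanation, held_out_tokens))  # "decode" back
    true_quantized = np.round(10 * true_acts / (true_acts.max() + 1e-8))
    # Agreement between simulated and true activations plays the role of a
    # reconstruction / cycle-consistency objective.
    return float(np.corrcoef(predicted, true_quantized)[0, 1])
```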

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Bogdan Ionut Cirstea's Shortform · 2024-04-18T18:40:08.219Z · LW · GW

Recent long-context LLMs seem to exhibit scaling-law-like gains from longer contexts - e.g. fig. 6 on page 8 of Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context, and fig. 1 on page 1 of Effective Long-Context Scaling of Foundation Models.

The long contexts also seem very helpful for in-context learning, e.g. Many-Shot In-Context Learning.

This seems differentially good for safety (e.g. vs. models with larger forward passes but shorter context windows to achieve the same perplexity), since longer context and in-context learning are differentially transparent.

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Why Would Belief-States Have A Fractal Structure, And Why Would That Matter For Interpretability? An Explainer · 2024-04-18T18:14:03.323Z · LW · GW

With that in mind, the real hot possibility is the inverse of what Shai and his coresearchers did. Rather than start with a toy model with some known nice latents, start with a net trained on real-world data, and go look for self-similar sets of activations in order to figure out what latent variables the net models its environment as containing. The symmetries of the set would tell us something about how the net updates its distributions over latents in response to inputs and time passing, which in turn would inform how the net models the latents as relating to its inputs, which in turn would inform which real-world structures those latents represent.

The theory-practice gap here looks substantial. Even on this toy model, the fractal embedded in the net is clearly very very noisy, which would make it hard to detect the self-similarity de novo. And in real-world nets, everything would be far higher dimensional, and have a bunch of higher-level structure in it (not just a simple three-state hidden Markov model). Nonetheless, this is the sort of problem where finding a starting point which could solve the problem even in principle is quite difficult, so this one is potentially a big deal.

This seems very much related to agendas like How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme and Searching for a model's concepts by their shape – a theoretical framework

Some (additional) hope for locating the latent representations might come from recent theoretical results around convergence to approximate causal world models and linear representations of causally separable / independent variables in such world models; see this comment. E.g. in Concept Algebra for (Score-Based) Text-Controlled Generative Models they indeed seem able to locate some relevant latents (in Stable Diffusion) using the linearity assumptions.

There's also a large literature out there of using unsupervised priors / constraints, e.g. to look for steering directions inside diffusion models or GANs, including automatically. See for example many recent papers from Pinar Yanardag.

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Why Would Belief-States Have A Fractal Structure, And Why Would That Matter For Interpretability? An Explainer · 2024-04-18T18:04:55.444Z · LW · GW

One obvious guess there would be that the factorization structure is exploited, e.g. independence and especially conditional independence/DAG structure. And then a big question is how distributions of conditionally independent latents in particular end up embedded.

There are some theoretical reasons to expect linear representations for variables which are causally separable / independent. See recent work from Victor Veitch's group, e.g. Concept Algebra for (Score-Based) Text-Controlled Generative Models, The Linear Representation Hypothesis and the Geometry of Large Language Models, On the Origins of Linear Representations in Large Language Models.

Separately, there are theoretical reasons to expect convergence to approximate causal models of the data generating process, e.g. Robust agents learn causal world models.

Linearity might also make it (provably) easier to find the concepts, see Learning Interpretable Concepts: Unifying Causal Representation Learning and Foundation Models.

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Current AIs Provide Nearly No Data Relevant to AGI Alignment · 2024-04-16T19:52:26.130Z · LW · GW

The weak single forward passes argument also applies to SSMs like Mamba for very similar theoretical reasons.

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Bogdan Ionut Cirstea's Shortform · 2024-04-16T19:50:06.340Z · LW · GW

Like transformers, SSMs like Mamba also have weak single forward passes: The Illusion of State in State-Space Models (summary thread). As suggested previously in The Parallelism Tradeoff: Limitations of Log-Precision Transformers, this may be due to a fundamental tradeoff between parallelizability and expressivity:

'We view it as an interesting open question whether it is possible to develop SSM-like models with greater expressivity for state tracking that also have strong parallelizability and learning dynamics, or whether these different goals are fundamentally at odds, as Merrill & Sabharwal (2023a) suggest.'

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on What convincing warning shot could help prevent extinction from AI? · 2024-04-16T09:06:45.983Z · LW · GW

For example, a researcher I've been talking to, when asked what they would need to update, answered, "An AI takes control of a data center." This would be probably too late.

Very much something to avoid, but I'm skeptical it 'would be probably too late' (especially if I assume humans are aware of the data center takeover); see e.g. from What if self-exfiltration succeeds?:

'Most likely the model won’t be able to compete on making more capable LLMs, so its capabilities will become stale over time and thus it will lose relative influence. Competing on the state of the art of LLMs is quite hard: the model would need to get access to a sufficiently large number of GPUs and it would need to have world-class machine learning skills. It would also mean that recursive self-improvement is already possible and could be done by the original model owner (as long as they have sufficient alignment techniques). The model could try fine-tuning itself to be smarter, but it’s not clear how to do this and the model would need to worry about currently unsolved alignment problems.'

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on What convincing warning shot could help prevent extinction from AI? · 2024-04-16T08:56:27.785Z · LW · GW

A 10-year global pause would allow for a lot of person-years-equivalents of automated AI safety R&D. E.g. from Some thoughts on automating alignment research (under some assumptions mentioned in the post): 'each month of lead that the leader started out with would correspond to 15,000 human researchers working for 15 months.' And for different assumptions the numbers could be [much] larger still: 'For a model trained with 1000x the compute, over the course of 4 rather than 12 months, you could 100x as many models in parallel.[9] You’d have 1.5 million researchers working for 15 months.'
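To make the quoted numbers concrete, here's a back-of-the-envelope calculation using only the assumptions quoted above (treating the 10-year pause as 120 months of 'lead' is my own rough reading, not something the post states):

```python
# Back-of-the-envelope, using only the quoted assumptions:
# each month of lead ~ 15,000 automated researchers working for 15 months.
researchers_per_month_of_lead = 15_000
months_of_work_each = 15
pause_months = 10 * 12  # treating a 10-year pause as 120 months of lead

researcher_months = researchers_per_month_of_lead * months_of_work_each * pause_months
print(f"{researcher_months / 12:,.0f} researcher-years")  # 2,250,000 researcher-years
```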

This would probably obsolete all previous AI safety R&D.

Of course, this assumes you'd be able to use automated AI safety R&D safely and productively. I'm relatively optimistic that a world which would be willing to enforce a 10-year global pause would also invest enough in e.g. a mix of control and superalignment to do this.

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Experiments with an alternative method to promote sparsity in sparse autoencoders · 2024-04-16T08:50:01.551Z · LW · GW

There's also an entire literature on variations of [e.g. sparse or disentangled] autoencoders, with different losses and priors, that might be worth looking at and that I suspect SAE interp people have barely explored; some of it is literally decades old. E.g. https://lilianweng.github.io/posts/2018-08-12-vae/ could be a potential starting point, along with the citation trails to and from e.g. k-sparse autoencoders.
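For concreteness, here's a minimal k-sparse autoencoder sketch (in the spirit of Makhzani & Frey's k-sparse autoencoders), where sparsity comes from a hard top-k constraint rather than an L1 penalty; this is just an illustrative sketch, not code from any particular SAE setup.

```python
# Minimal k-sparse autoencoder sketch: sparsity is enforced by keeping only the
# top-k latent activations per example, with no explicit sparsity penalty.
import torch
import torch.nn as nn

class KSparseAutoencoder(nn.Module):
    def __init__(self, d_in: int, d_hidden: int, k: int):
        super().__init__()
        self.encoder = nn.Linear(d_in, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_in)
        self.k = k

    def forward(self, x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        z = torch.relu(self.encoder(x))
        # Zero out all but the k largest activations per example.
        topk = torch.topk(z, self.k, dim=-1)
        mask = torch.zeros_like(z).scatter_(-1, topk.indices, 1.0)
        z_sparse = z * mask
        return self.decoder(z_sparse), z_sparse

# training would then just minimize reconstruction error, e.g.
# loss = ((model(x)[0] - x) ** 2).mean()
```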

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on AXRP Episode 27 - AI Control with Buck Shlegeris and Ryan Greenblatt · 2024-04-11T21:57:21.223Z · LW · GW

YouTube link

Error message: "Video unavailable
This video is private"

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Bogdan Ionut Cirstea's Shortform · 2024-03-30T09:06:22.150Z · LW · GW

Examples of reasons to expect (approximate) convergence to the same causal world models in various setups: theorem 2 in Robust agents learn causal world models; from Deep de Finetti: Recovering Topic Distributions from Large Language Models: 'In particular, given the central role of exchangeability in our analysis, this analysis would most naturally be extended to other latent variables that do not depend heavily on word order, such as the author of the document [Andreas, 2022] or the author’s sentiment' (this assumption might be expected to be approximately true for quite a few alignment-relevant-concepts); results from Victor Veitch: Linear Structure of (Causal) Concepts in Generative AI.

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Before smart AI, there will be many mediocre or specialized AIs · 2024-03-30T08:50:00.970Z · LW · GW

https://epochai.org/blog/optimally-allocating-compute-between-inference-and-training seems consistent with this post's main assumption: 'If it is feasible to trade off inference and training compute, we find that it is optimal for AI labs to spend similar amounts on training and inference.'

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on LawrenceC's Shortform · 2024-03-20T12:26:01.705Z · LW · GW

Seems relevant - RNNs are not Transformers (Yet): The Key Bottleneck on In-context Retrieval:

'Our theoretical analysis reveals that CoT improves RNNs but is insufficient to close the gap with Transformers. A key bottleneck lies in the inability of RNNs to perfectly retrieve information from the context, even with CoT: for several tasks that explicitly or implicitly require this capability, such as associative recall and determining if a graph is a tree, we prove that RNNs are not expressive enough to solve the tasks while Transformers can solve them with ease. Conversely, we prove that adopting techniques to enhance the in-context retrieval capability of RNNs, including Retrieval-Augmented Generation (RAG) and adding a single Transformer layer, can elevate RNNs to be capable of solving all polynomial-time solvable problems with CoT, hence closing the representation gap with Transformers.'

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Many arguments for AI x-risk are wrong · 2024-03-19T20:52:47.569Z · LW · GW

Note that even if RLHF takes a non-scheming pretrained model and converts it to a schemer, if this RLHF doesn't teach the model new things, then we can always train another copy of the pretrained model to be a monitor for the RLHF'd model and this monitor will be just as competent as the RLHF'd model. So scheming seems like substantially less of a problem in this case. (We'd need to use this monitor for all potentially dangerous actions to get safety properties.) (This is similar to the proposal in Appendix G of the weak-to-strong generalization paper, but with this addition that you deploy the reward model as a monitor which is required for any interesting guarantees.)

I'd personally like to see this written up in more detail (or a reference). Also, is it Appendix G of the weak-to-strong generalization paper? I looked briefly and it didn't seem very related.

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Current AIs Provide Nearly No Data Relevant to AGI Alignment · 2024-03-19T12:37:39.426Z · LW · GW

One additional probably important distinction / nuance: there are also theoretical results for why CoT shouldn't just help with one-forward-pass expressivity, but also with learning. E.g. the result in Auto-Regressive Next-Token Predictors are Universal Learners is about learning; similarly for Sub-Task Decomposition Enables Learning in Sequence to Sequence Tasks, Why Can Large Language Models Generate Correct Chain-of-Thoughts?, Why think step by step? Reasoning emerges from the locality of experience.

The learning aspect could be strategically crucial with respect to what the first transformatively-useful AIs should look like; also see e.g. discussion here and here. It adds further reasons to think the first such AIs should probably (differentially) benefit from learning from data using intermediate outputs like CoT, or at least have a pretraining-like phase involving such intermediate outputs, even if these might later be distilled or modified in some other way - e.g. replaced with [less transparent] recurrence.

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Bogdan Ionut Cirstea's Shortform · 2024-03-19T12:05:07.984Z · LW · GW

- For scheming, the model could reason about "should I still stay undercover", "what should I do in case I should stay undercover" and "what should I do in case it's time to attack" in parallel, finally using only one serial step to decide on its action.

I am also very interested in e.g. how one could operationalize the number of hops of inference of out-of-context reasoning required for various types of scheming, especially scheming in one-forward-pass; and especially in the context of automated AI safety R&D.

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Bogdan Ionut Cirstea's Shortform · 2024-03-19T12:02:16.928Z · LW · GW

I would generally expect solving tasks in parallel to be fundamentally hard in one forward pass for pretty much all current SOTA architectures (especially Transformers and modern RNNs like Mamba). See e.g. this comment of mine, and other related works like https://twitter.com/bohang_zhang/status/1664695084875501579, https://twitter.com/bohang_zhang/status/1664695108447399937 (video presentation), Sub-Task Decomposition Enables Learning in Sequence to Sequence Tasks, and RNNs are not Transformers (Yet): The Key Bottleneck on In-context Retrieval.

There might be more such results I'm currently forgetting about, but they should be relatively easy to find by e.g. following citation trails (to and from the above references) with Google Scholar (or by looking at my recent comments / short forms).

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Neuroscience and Alignment · 2024-03-19T09:58:21.497Z · LW · GW

First is that I don't really expect us to come up with a fully general answer to this problem in time. I wouldn't be surprised if we had to trade off some generality for indexing on the system in front of us - this gets us some degree of non-robustness, but hopefully enough to buy us a lot more time before stuff like the problem behind deep deception breaks a lack of True Names. Hopefully then we can get the AI systems to solve the harder problem for us in the time we've bought, with systems more powerful than us. The relevance here is that if this is the case, then trying to generalize our findings to an entirely non-ML setting, while definitely something we want, might not be something we get, and maybe it makes sense to index lightly on a particular paradigm if the general problem seems really hard.

yes, e.g. https://www.lesswrong.com/posts/wr2SxQuRvcXeDBbNZ/bogdan-ionut-cirstea-s-shortform?commentId=GRjfMwLDFgw6qLnDv 



Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Bogdan Ionut Cirstea's Shortform · 2024-03-19T09:55:54.106Z · LW · GW

Or maybe not, apparently LLMs are (mostly) not helped by filler tokens.

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Neuroscience and Alignment · 2024-03-19T09:29:59.661Z · LW · GW

When I think of how I approach solving this problem in practice, I think of interfacing with structures within ML systems that satisfy an increasing list of desiderata for values, covering the rest with standard mech interp techniques, and then steering them with human preferences. I certainly think it's probable that there are valuable insights to this process from neuroscience, but I don't think a good solution to this problem (under the constraints I mention above) requires that it be general to the human brain as well. We steer the system with our preferences (and interfacing with internal objectives seems to avoid the usual problems with preferences) - while something that allows us to actually directly translate our values from our system to theirs would be great, I expect that constraint of generality to make it harder than necessary.

I think the seeds of an interdisciplinary agenda on this are already there, see e.g. https://manifund.org/projects/activation-vector-steering-with-bci, https://www.lesswrong.com/posts/GfZfDHZHCuYwrHGCd/without-fundamental-advances-misalignment-and-catastrophe?commentId=WLCcQS5Jc7NNDqWi5, https://www.lesswrong.com/posts/GfZfDHZHCuYwrHGCd/without-fundamental-advances-misalignment-and-catastrophe?commentId=D6NCcYF7Na5bpF5h5, https://www.lesswrong.com/posts/wr2SxQuRvcXeDBbNZ/bogdan-ionut-cirstea-s-shortform?commentId=A8muL55dYxR3tv5wp and maybe my other comments on this post. 

I might have a shortform going into more detail on this soon, or at least by the time of https://foresight.org/2024-foresight-neurotech-bci-and-wbe-for-safe-ai-workshop/

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Neuroscience and Alignment · 2024-03-19T09:22:14.128Z · LW · GW

If I had access to a neuroscience or tech lab (and the relevant skills), I'd be doing that rather than ML.

It sounds to me like you should seriously consider doing work which might look like e.g.  https://www.lesswrong.com/posts/eruHcdS9DmQsgLqd4/inducing-human-like-biases-in-moral-reasoning-lms, Getting aligned on representational alignment, Training language models to summarize narratives improves brain alignment; also see a lot of recent works from Ilia Sucholutsky and from this workshop.

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Neuroscience and Alignment · 2024-03-19T08:59:47.000Z · LW · GW

Why is value alignment different from these? Because we have working example of a value-aligned system right in front of us: The human brain. This permits an entirely scientific approach, requiring minimal philosophical deconfusion. And in contrast to corrigibility solutions, biological and artificial neural-networks are based upon the same fundamental principles, so there's a much greater chance that insights from the one easily work in the other.

The similarities go even deeper, I'd say; see e.g. The neuroconnectionist research programme for a review and quite a few of my past linkposts (e.g. on representational alignment and how it could be helpful for value alignment, on evidence of [by default] (some) representational alignment between LLMs and humans, etc.); and https://www.lesswrong.com/posts/eruHcdS9DmQsgLqd4/inducing-human-like-biases-in-moral-reasoning-lms

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Neuroscience and Alignment · 2024-03-19T08:54:03.060Z · LW · GW

I've been in many conversations where I've mentioned the idea of using neuroscience for outer alignment, and the people who I'm talking to usually seem pretty confused about why I would want to do that. Well, I'm confused about why one wouldn't want to do that, and in this post I explain why.

I've had related thoughts (and still do, though it's become more of a secondary research agenda); might be interesting to chat (more) during https://foresight.org/2024-foresight-neurotech-bci-and-wbe-for-safe-ai-workshop/

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on More people getting into AI safety should do a PhD · 2024-03-15T04:19:35.467Z · LW · GW

Perhaps even better: 'more people who already have PhDs should be recruited to do AI safety'.

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Many arguments for AI x-risk are wrong · 2024-03-15T03:09:09.499Z · LW · GW

It seems like the question you're asking is close to (2) in my above decomposition.

Yup.

Aren't you worried that long before human obsoleting AI (or AI safety researcher obsoleting AI), these architectures are very uncompetitive and thus won't be viable given realistic delay budgets?

Quite uncertain about all this, but I have short timelines and expect that likely not many more OOMs of effective compute will be needed to get to e.g. something which can 30x AI safety research (as long as we really try). I expect shorter timelines / OOM 'gaps' to come along with e.g. fewer architectural changes, all else equal. There are also broader reasons why I think it's quite plausible the high level considerations might not change much even given some architectural changes, discussed in the weak-forward-pass comment (e.g. 'the parallelism tradeoff').

I have some hope for a plan like:

  • Control early transformative AI
  • Use these AIs to make a safer approach much more competitive. (Maybe the approach consists of AIs which are too weak to scheme in a forward pass but which are combined into some crazy bureaucracy and made very cheap to run.)
  • Use those next AIs to do something.

(This plan has the downside that it probably requires doing a bunch of general purpose capabilities which might make the situation much more unstable and volatile due to huge compute overhang if fully scaled up in the most performant (but unsafe) way.)

Sounds pretty good to me; I guess the crux (as hinted at during some personal conversations too) might be that I'm just much more optimistic about this being feasible without huge capabilities pushes (again, some arguments in the weak-forward-pass comment, e.g. about CoT distillation seeming to work decently - helping with, for a fixed level of capabilities, more of it coming from scaffolding and less from one forward pass; or on CoT length / inference complexity tradeoffs).

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Many arguments for AI x-risk are wrong · 2024-03-14T05:03:36.321Z · LW · GW

A big part of how I'm thinking about this is a very related version:

If you keep scaling up networks with pretraining and light RLHF and various differentially transparent scaffolding, what comes first, AI safety researcher obsoleting or scheming?

One can also focus on various types of scheming and when / in which order they'd happen. E.g. I'd be much more worried about scheming in one forward pass than about scheming in CoT (which seems more manageable using e.g. control methods), but I also expect that to happen later. Where 'later' can be operationalized in terms of requiring more effective compute.

I think The direct approach could provide a potential (rough) upper bound for the effective compute required to obsolete AI safety researchers. Though I'd prefer more substantial empirical evidence based on e.g. something like scaling laws on automated AI safety R&D evals.

Similarly, one can imagine evals for different capabilities which would be prerequisites for various types of scheming and doing scaling laws on those; e.g. for out-of-context reasoning, where multi-hop out-of-context reasoning seems necessary for instrumental deceptive reasoning in one forward pass (as a prerequisite for scheming in one forward pass).

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Many arguments for AI x-risk are wrong · 2024-03-14T04:23:53.762Z · LW · GW

I think of "light RLHF" as "RLHF which doesn't teach the model qualitatively new things, but instead just steers the model at a high level". In practice, a single round of DPO on <100,000 examples surely counts, but I'm unsure about the exact limits.

(In principle, a small amount of RL can update a model very far, I don't think we see this in practice.)

Empirical evidence about this indeed being the case for DPO.

Note that even if RLHF takes a non-scheming pretrained model and converts it to a schemer, if this RLHF doesn't teach the model new things, then we can always train another copy of the pretrained model to be a monitor for the RLHF'd model and this monitor will be just as competent as the RLHF'd model. So scheming seems like substantially less of a problem in this case. (We'd need to use this monitor for all potentially dangerous actions to get safety properties.) (This is similar to the proposal in Appendix G of the weak-to-strong generalization paper, but with this addition that you deploy the reward model as a monitor which is required for any interesting guarantees.)

Also see Interpreting the learning of deceit for another proposal/research agenda to deal with this threat model.

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Bogdan Ionut Cirstea's Shortform · 2024-03-11T16:55:43.919Z · LW · GW

Hmm, unsure about this. E.g. the development models of many in the alignment community before GPT-3 (often heavily focused on RL or even on GOFAI) seem quite substantially worse in retrospect than those of some of the most famous deep learning people (e.g. LeCun's cake); of course, this may be an unfair/biased comparison using hindsight. I'm unsure how much theoretical results influenced the famous deep learners (and e.g. classic learning theory results would probably have been misleading), but it doesn't seem obvious they had 0 influence? For example, Bengio has multiple at least somewhat conceptual / theoretical (including review) papers motivating deep/representation learning; e.g. Representation Learning: A Review and New Perspectives.

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Bogdan Ionut Cirstea's Shortform · 2024-03-11T07:10:56.285Z · LW · GW

Some more (somewhat) related papers:

Rethinking Model-based, Policy-based, and Value-based Reinforcement Learning via the Lens of Representation Complexity ('We first demonstrate that, for a broad class of Markov decision processes (MDPs), the model can be represented by constant-depth circuits with polynomial size or Multi-Layer Perceptrons (MLPs) with constant layers and polynomial hidden dimension. However, the representation of the optimal policy and optimal value proves to be NP-complete and unattainable by constant-layer MLPs with polynomial size. This demonstrates a significant representation complexity gap between model-based RL and model-free RL, which includes policy-based RL and value-based RL. To further explore the representation complexity hierarchy between policy-based RL and value-based RL, we introduce another general class of MDPs where both the model and optimal policy can be represented by constant-depth circuits with polynomial size or constant-layer MLPs with polynomial size. In contrast, representing the optimal value is P-complete and intractable via a constant-layer MLP with polynomial hidden dimension. This accentuates the intricate representation complexity associated with value-based RL compared to policy-based RL. In summary, we unveil a potential representation complexity hierarchy within RL -- representing the model emerges as the easiest task, followed by the optimal policy, while representing the optimal value function presents the most intricate challenge.').

On Representation Complexity of Model-based and Model-free Reinforcement Learning ('We prove theoretically that there exists a broad class of MDPs such that their underlying transition and reward functions can be represented by constant depth circuits with polynomial size, while the optimal Q-function suffers an exponential circuit complexity in constant-depth circuits. By drawing attention to the approximation errors and building connections to complexity theory, our theory provides unique insights into why model-based algorithms usually enjoy better sample complexity than model-free algorithms from a novel representation complexity perspective: in some cases, the ground-truth rule (model) of the environment is simple to represent, while other quantities, such as Q-function, appear complex. We empirically corroborate our theory by comparing the approximation error of the transition kernel, reward function, and optimal Q-function in various Mujoco environments, which demonstrates that the approximation errors of the transition kernel and reward function are consistently lower than those of the optimal Q-function. To the best of our knowledge, this work is the first to study the circuit complexity of RL, which also provides a rigorous framework for future research.').

Demonstration-Regularized RL ('Incorporating expert demonstrations has empirically helped to improve the sample efficiency of reinforcement learning (RL). This paper quantifies theoretically to what extent this extra information reduces RL's sample complexity. In particular, we study the demonstration-regularized reinforcement learning that leverages the expert demonstrations by KL-regularization for a policy learned by behavior cloning. Our findings reveal that using N^E expert demonstrations enables the identification of an optimal policy at a sample complexity of order Õ(Poly(S,A,H)/(ε^2 · N^E)) in finite and Õ(Poly(d,H)/(ε^2 · N^E)) in linear Markov decision processes, where ε is the target precision, H the horizon, A the number of actions, S the number of states in the finite case and d the dimension of the feature space in the linear case. As a by-product, we provide tight convergence guarantees for the behaviour cloning procedure under general assumptions on the policy classes. Additionally, we establish that demonstration-regularized methods are provably efficient for reinforcement learning from human feedback (RLHF). In this respect, we provide theoretical evidence showing the benefits of KL-regularization for RLHF in tabular and linear MDPs. Interestingly, we avoid pessimism injection by employing computationally feasible regularization to handle reward estimation uncertainty, thus setting our approach apart from the prior works.').

Limitations of Agents Simulated by Predictive Models ('There is increasing focus on adapting predictive models into agent-like systems, most notably AI assistants based on language models. We outline two structural reasons for why these models can fail when turned into agents. First, we discuss auto-suggestive delusions. Prior work has shown theoretically that models fail to imitate agents that generated the training data if the agents relied on hidden observations: the hidden observations act as confounding variables, and the models treat actions they generate as evidence for nonexistent observations. Second, we introduce and formally study a related, novel limitation: predictor-policy incoherence. When a model generates a sequence of actions, the model's implicit prediction of the policy that generated those actions can serve as a confounding variable. The result is that models choose actions as if they expect future actions to be suboptimal, causing them to be overly conservative. We show that both of those failures are fixed by including a feedback loop from the environment, that is, re-training the models on their own actions. We give simple demonstrations of both limitations using Decision Transformers and confirm that empirical results agree with our conceptual and formal analysis. Our treatment provides a unifying view of those failure modes, and informs the question of why fine-tuning offline learned policies with online learning makes them more effective.').

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Bogdan Ionut Cirstea's Shortform · 2024-03-11T06:19:08.529Z · LW · GW

A brief list of resources with theoretical results which seem to imply RL is much more difficult (e.g. sample-efficiency-wise) than IL - imitation learning (I don't feel like I have enough theoretical RL expertise or time to scrutinize the arguments closely, but would love for others to pitch in). Probably at least somewhat relevant w.r.t. discussions of what the first AIs capable of obsoleting humanity could look like: 

Paper: Is a Good Representation Sufficient for Sample Efficient Reinforcement Learning? (quote: 'This work shows that, from the statistical viewpoint, the situation is far subtler than suggested by the more traditional approximation viewpoint, where the requirements on the representation that suffice for sample efficient RL are even more stringent. Our main results provide sharp thresholds for reinforcement learning methods, showing that there are hard limitations on what constitutes good function approximation (in terms of the dimensionality of the representation), where we focus on natural representational conditions relevant to value-based, model-based, and policy-based learning. These lower bounds highlight that having a good (value-based, model-based, or policy-based) representation in and of itself is insufficient for efficient reinforcement learning, unless the quality of this approximation passes certain hard thresholds. Furthermore, our lower bounds also imply exponential separations on the sample complexity between 1) value-based learning with perfect representation and value-based learning with a good-but-not-perfect representation, 2) value-based learning and policy-based learning, 3) policy-based learning and supervised learning and 4) reinforcement learning and imitation learning.')

Talks (very likely with some redundancy): 

Is a Good Representation Sufficient for Sample Efficient Reinforcement Learning - Sham Kakade
What is the Statistical Complexity of Reinforcement Learning? (and another two versions)

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Going Beyond Linear Mode Connectivity: The Layerwise Linear Feature Connectivity · 2024-03-09T06:14:43.564Z · LW · GW

Seems relevant (but I've only skimmed): Training-Free Pretrained Model Merging.

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Many arguments for AI x-risk are wrong · 2024-03-06T08:15:50.632Z · LW · GW

Some thoughts:

The further imitation+ light RL (competitively) goes the less important other less safe training approaches are.

I'd probably add: + differentially-transparent scaffolding (externalized-reasoning-like; e.g. CoT, in-context learning [though I'm a bit more worried about the amount of parallel compute even in one forward pass with very long context windows], [text-y] RAG, many tools, explicit task decomposition, sampling-and-voting). I suspect this combo adds up to a lot, if e.g. labs were cautious enough and there were less / no race dynamics, etc. (I think I'm at > 90% that you'd get all the way to human obsolescence).

What do you think about the fact that to reach somewhat worse than best human performance, AlphaStar needed a massive amount of RL? It's not a huge amount of evidence and I think intuitions from SOTA llms are more informative overall, but it's still something interesting. (There is a case that AlphaStar is more analogous as it involves doing a long range task and reaching comparable performance to top tier human professionals which LLMs arguably don't do in any domain.)

I'm not sure how good of an analogy AlphaStar is, given e.g. its specialization, the relatively easy availability of a reward signal, and the comparatively much smaller availability and use of imitation data (including for transfer from close domains) vs. the LLM case. Also, even AlphaStar was bootstrapped with imitation learning.

One specific way to operationalize this is how much effective compute improvement you get from RL on code. For current SOTA models (e.g. claude 3), I would guess a central estimate of 2-3x effective compute multiplier from RL, though I'm extremely unsure. (I have no special knowledge here, just a wild guess based on eyeballing a few public things.)(Perhaps the deepseek code paper would allow for finding better numbers?)

This review of effective compute gains without retraining might come closest to something like an answer, but it's been a while since I last looked at it.

FWIW, think a high fraction of the danger from the exact setup I outlined isn't imitation, but is instead deep serial (and recurrent) reasoning in non-interpretable media.

I think I (still) largely hold the intuition mentioned here, that deep serial (and recurrent) reasoning in non-interpretable media won't be (that much more) competitive versus more chain-of-thought-y / tools-y-transparent reasoning, at least before human obsolescence. E.g. based on the [CoT] length complexity - computational complexity tradeoff from Auto-Regressive Next-Token Predictors are Universal Learners and on arguments like those in Before smart AI, there will be many mediocre or specialized AIs, I'd expect the first AIs which can massively speed up AI safety R&D to be probably somewhat subhuman-level in a forward pass (including in terms of serial depth / recurrence) and to compensate for that with CoT, explicit task decompositions, sampling-and-voting, etc. This seems borne out by other results too, e.g. More Agents Is All You Need (on sampling-and-voting) or Sub-Task Decomposition Enables Learning in Sequence to Sequence Tasks ('We show that when concatenating intermediate supervision to the input and training a sequence-to-sequence model on this modified input, unlearnable composite problems can become learnable. We show that this is true for any family of tasks which on the one hand, are unlearnable, and on the other hand, can be decomposed into a polynomial number of simple sub-tasks, each of which depends only on O(1) previous sub-task results').
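As a concrete illustration of one of these components, here's a minimal sampling-and-voting sketch in the spirit of More Agents Is All You Need (majority vote over independently sampled answers); `sample_answer` is a hypothetical stand-in for an LLM call, not any specific API.

```python
# Minimal sampling-and-voting sketch: sample n independent answers from a model
# and return the most common one. sample_answer is a hypothetical LLM call.
from collections import Counter

def sample_answer(prompt: str) -> str:
    raise NotImplementedError  # e.g. an LLM API call with temperature > 0

def sample_and_vote(prompt: str, n_samples: int = 16) -> str:
    answers = [sample_answer(prompt) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```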

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Many arguments for AI x-risk are wrong · 2024-03-06T01:46:30.536Z · LW · GW

"Training setups where we train generally powerful AIs with deep serial reasoning (similar to the internal reasoning in a human brain) for an extremely long time on rich outcomes based RL environment until these AIs learn how to become generically agentic and pursue specific outcomes in a wide variety of circumstances."

My intuition goes something like: this doesn't matter that much if e.g. it happens (sufficiently) after you'd get ~human-level automated AI safety R&D with safer setups, e.g. imitation learning and no/less RL fine-tuning. And I'd expect, e.g. based on current scaling laws, but also on theoretical arguments about the difficulty of imitation learning vs. RL, that the most efficient way to gain new capabilities will still be imitation learning at least all the way up to very close to human-level. Then, the closer you get to ~human-level automated AI safety R&D with just imitation learning, the less of a 'gap' you'd need to 'cover for' with e.g. RL. And the less RL fine-tuning you might need, the less likely it might be that the weights / representations change much (e.g. they don't seem to change much with current DPO). This might all be conceptually operationalizable in terms of effective compute.

Currently, most capabilities indeed seem to come from pre-training, and fine-tuning only seems to 'steer' / 'wrap around' them - to the degree that even in-context learning can be competitive at this steering; similarly, see 'on understanding how reasoning emerges from language model pre-training'.

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Deep learning models might be secretly (almost) linear · 2024-03-03T23:41:00.109Z · LW · GW

Also: Language Models Represent Beliefs of Self and Others.

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Interpreting the Learning of Deceit · 2024-03-02T00:39:27.859Z · LW · GW

Also relevant: Language Models Represent Beliefs of Self and Others.

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Bogdan Ionut Cirstea's Shortform · 2024-02-29T05:26:04.612Z · LW · GW

I might have updated at least a bit against the weakness of single forward passes, based on intuitions about the amount of compute that huge context windows (e.g. Gemini 1.5 - 1 million tokens) might provide to a single forward pass, even if limited serially.
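A rough way to see that intuition (all numbers hypothetical, using a standard dense-transformer FLOP approximation rather than Gemini's actual, unknown architecture):

```python
# Rough, illustrative forward-pass FLOP estimate for a dense transformer:
# ~2 * params * n_ctx for the matmul (parameter) part, plus an attention term
# ~2 * n_layers * n_ctx^2 * d_model. All numbers below are hypothetical.
def forward_flops(params, n_ctx, n_layers, d_model):
    return 2 * params * n_ctx + 2 * n_layers * n_ctx**2 * d_model

params, n_layers, d_model = 70e9, 80, 8192   # a hypothetical ~70B dense model
short = forward_flops(params, 8_000, n_layers, d_model)
long = forward_flops(params, 1_000_000, n_layers, d_model)
print(f"{long / short:,.0f}x")  # on the order of 1,000x more compute per forward pass
```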

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Bogdan Ionut Cirstea's Shortform · 2024-02-27T19:01:28.583Z · LW · GW

Thanks! AFAICT though, the link you posted seems about automated AI capabilities R&D evals, rather than about automated AI safety / alignment R&D evals (I do expect transfer between the two, but they don't seem like the same thing). I've also chatted to some people from both METR and UK AISI and got the impression from all of them that there's some focus on automated AI capabilities R&D evals, but not on safety.

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Bogdan Ionut Cirstea's Shortform · 2024-02-27T18:26:09.252Z · LW · GW

I'm not aware of anybody currently working on coming up with concrete automated AI safety R&D evals, while there seems to be so much work going into e.g. DC evals or even (more recently) scheminess evals. This seems very suboptimal in terms of portfolio allocation.

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Deconfusing In-Context Learning · 2024-02-26T09:17:57.168Z · LW · GW

Here's an interesting paper related to (the potential order of emergence of) the three scenarios: Pretraining task diversity and the emergence of non-Bayesian in-context learning for regression.

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Protocol evaluations: good analogies vs control · 2024-02-22T00:14:37.731Z · LW · GW

Shouldn't e.g. existing theoretical ML results also inform (at least somewhat, and conditioned on assumptions about architecture, etc.) p(scheming)? E.g. I provide an argument / a list of references on why one might expect a reasonable likelihood that 'AI will be capable of automating R&D while also being incapable of doing non-trivial consequentialist reasoning in a forward pass' here.

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Protocol evaluations: good analogies vs control · 2024-02-21T23:42:39.927Z · LW · GW

I think this line of reasoning plausibly also suggests trying to differentially advance architectures / 'scaffolds' / 'scaffolding' techniques which are less likely to result in scheming.

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Going Beyond Linear Mode Connectivity: The Layerwise Linear Feature Connectivity · 2024-02-20T23:49:25.255Z · LW · GW

Quoting from @zhanpeng_zhou's latest work - Cross-Task Linearity Emerges in the Pretraining-Finetuning Paradigm: 'i) Model averaging takes the average of weights of multiple models, which are finetuned on the same dataset but with different hyperparameter configurations, so as to improve accuracy and robustness. We explain the averaging of weights as the averaging of features at each layer, building a stronger connection between model averaging and logits ensemble than before. ii) Task arithmetic merges the weights of models, that are finetuned on different tasks, via simple arithmetic operations, shaping the behaviour of the resulting model accordingly. We translate the arithmetic operation in the parameter space into the operations in the feature space, yielding a feature-learning explanation for task arithmetic. Furthermore, we delve deeper into the root cause of CTL and underscore the impact of pretraining. We empirically show that the common knowledge acquired from the pretraining stage contributes to the satisfaction of CTL. We also take a primary attempt to prove CTL and find that the emergence of CTL is associated with the flatness of the network landscape and the distance between the weights of two finetuned models. In summary, our work reveals a linear connection between finetuned models, offering significant insights into model merging/editing techniques. This, in turn, advances our understanding of underlying mechanisms of pretraining and finetuning from a feature-centric perspective.' 
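As a concrete illustration of the weight-space operations being discussed (model averaging and task arithmetic), here's a minimal sketch over state dicts; purely illustrative, not code from the paper.

```python
# Minimal sketch of model averaging and task arithmetic over {name: tensor} state dicts.
import torch

def average_weights(state_dicts):
    """Model averaging: average weights of models finetuned on the same dataset."""
    return {k: torch.stack([sd[k] for sd in state_dicts]).mean(dim=0)
            for k in state_dicts[0]}

def task_arithmetic(theta_pre, finetuned, coeffs):
    """Task arithmetic: add scaled task vectors (theta_ft - theta_pre) to the base model."""
    merged = {k: v.clone() for k, v in theta_pre.items()}
    for theta_ft, lam in zip(finetuned, coeffs):
        for k in merged:
            merged[k] += lam * (theta_ft[k] - theta_pre[k])
    return merged
```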

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Without fundamental advances, misalignment and catastrophe are the default outcomes of training powerful AI · 2024-02-17T02:07:30.189Z · LW · GW

Haven't read it as deeply as I'd like to, but Learning Interpretable Concepts: Unifying Causal Representation Learning and Foundation Models seems like potentially significant progress towards formalizing / operationalizing (some of) the above.

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Interpreting the Learning of Deceit · 2024-02-11T22:37:01.567Z · LW · GW

It seems to me like there might be a more general insight to draw here, something along the following lines: as long as we're still in the current paradigm, where most model capabilities (including undesirable ones) come from pre-training and (mostly) only get "wrapped" by fine-tuning, the pre-trained models (with appropriate prompting - or even other elicitation tools) can serve as "model organisms" for about any elicitable misbehavior.

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on What's the theory of impact for activation vectors? · 2024-02-11T22:20:10.561Z · LW · GW

I tried to write one story here. Notably, activation vectors don't need to scale all the way to superintelligence, e.g. them scaling up to ~human-level automated AI safety R&D would be ~enough.

Also, the ToI could be disjunctive too, so it doesn't have to be only one of those necessarily.

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Davidad's Provably Safe AI Architecture - ARIA's Programme Thesis · 2024-02-02T22:50:18.901Z · LW · GW

'theoretical paper of early 2023 that I can't find right now' -> perhaps you're thinking of Fundamental Limitations of Alignment in Large Language Models? I'd also recommend LLM Censorship: A Machine Learning Challenge or a Computer Security Problem?.

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Bogdan Ionut Cirstea's Shortform · 2024-02-01T23:52:22.555Z · LW · GW

I've been / am on the lookout for related theoretical results on why grounding a la Grounded language acquisition through the eyes and ears of a single child works (e.g. with contrastive learning methods) - e.g. some recent works: Understanding Transferable Representation Learning and Zero-shot Transfer in CLIP, Contrastive Learning is Spectral Clustering on Similarity Graph, Optimal Sample Complexity of Contrastive Learning. (More speculatively) I'm also interested in how this might intersect with alignment, e.g. whether alignment-relevant concepts might be 'groundable' in fMRI data (and then 'pointable to') - e.g. https://medarc-ai.github.io/mindeye/ uses contrastive learning on fMRI - image pairs.
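For reference, here's a minimal sketch of the kind of symmetric contrastive (CLIP-style / InfoNCE) objective these works analyze; pairing fMRI with image embeddings (as in MindEye) is just one assumed instantiation.

```python
# Minimal CLIP-style symmetric contrastive (InfoNCE) loss sketch: paired embeddings
# (e.g. fMRI and image, or image and text) are pulled together, all other pairings
# in the batch are pushed apart. Purely illustrative.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(emb_a: torch.Tensor, emb_b: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    emb_a = F.normalize(emb_a, dim=-1)
    emb_b = F.normalize(emb_b, dim=-1)
    logits = emb_a @ emb_b.T / temperature              # [batch, batch] similarities
    targets = torch.arange(emb_a.shape[0], device=emb_a.device)
    # Symmetric cross-entropy: the diagonal entries are the positive pairs.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2
```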

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Bogdan Ionut Cirstea's Shortform · 2024-01-30T23:26:48.426Z · LW · GW

This seems pretty good for safety (as RAG is comparatively at least a bit more transparent than fine-tuning): https://twitter.com/cwolferesearch/status/1752369105221333061 

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Without fundamental advances, misalignment and catastrophe are the default outcomes of training powerful AI · 2024-01-30T05:08:44.653Z · LW · GW

What follows will all be pretty speculative, but I still think it should probably provide some substantial evidence for more optimism.

I think that we basically have no way of ensuring that we get this nice “goals based on pointers to the correct concepts”/corrigible alignment thing using behavioral training. This seems like a super specific way to set up the AI, and there are so many degrees of freedom that behavioral training doesn’t distinguish.

The results in Robust agents learn causal world models suggest that models robust to distribution shifts (arguably, this should be the case for ~all substantially x-risky models) should converge towards learning (approximately) the same causal world models. This talk suggests theoretical reasons to expect that the causal structure of the world (model) will be reflected in various (activation / rep engineering-y, linear) properties inside foundation models (e.g. LLMs), usable to steer them.

For the Representation Engineering thing, I think the “workable” version of this basically looks like “Retarget the Search”, where you somehow do crazy good interp and work out where the “optimizer” is, and then point that at the right concepts which you also found using interp. And for some reason, the AI is set up such that you can "retarget it" without breaking everything.

I don't think the "optimizer" ontology necessarily works super-well with LLMs / current SOTA (something like simulators seems to me much more appropriate); with that caveat, e.g. In-Context Learning Creates Task Vectors, Function Vectors in Large Language Models (also nicely summarized here), and A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity seem to me like (early) steps in this direction already. Also, if you buy the previous theoretical claims (of convergence towards causal world models, with linear representations / properties), you might quite reasonably expect such linear methods to potentially work even better in more powerful / more robust models.
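As a concrete illustration of the kind of linear intervention these works point at, here's a minimal activation-addition / steering-vector sketch; the layer, scale, and the way the vector was obtained (e.g. as a difference of mean activations) are hypothetical assumptions, not details from the cited papers.

```python
# Minimal activation-steering sketch: add a fixed steering / task vector to the
# residual stream at one layer via a forward hook. Layer choice and scale are
# hypothetical; the vector could e.g. be a difference of mean activations.
import torch

def add_steering_hook(layer_module: torch.nn.Module,
                      steering_vector: torch.Tensor, scale: float = 5.0):
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + scale * steering_vector  # broadcasts over batch and positions
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return layer_module.register_forward_hook(hook)

# hypothetical usage: handle = add_steering_hook(model.transformer.h[20], v)
# ... generate as usual ...; handle.remove()
```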

I definitely don’t expect to be able to representation engineer our way into building an AI that is corrigible aligned, and remains that way even when it is learning a bunch of new things and is in very different distributions. (I do think that actually solving this problem would solve a large amount of the alignment problem).

The activation / representation engineering methods might not necessarily need to scale that far in terms of robustness, especially if e.g. you can complement them with more control-y methods / other alignment methods / Swiss cheese models of safety more broadly; and also plausibly because they'd "only" need to scale to ~human-level automated alignment researchers / scaffolds of more specialized such automated researchers, etc. And again, based on the above theoretical results, future models might actually be more robustly steerable 'by default' / 'for free'.

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Bogdan Ionut Cirstea's Shortform · 2024-01-28T21:14:49.883Z · LW · GW

Larger LMs seem to benefit differentially more from tools: 'Absolute performance and improvement-per-turn (e.g., slope) scale with model size.' https://xingyaoww.github.io/mint-bench/. This seems pretty good for safety, to the degree that tool usage is often more transparent than model internals.