Comment by cfoster0 on tailcalled's Shortform · 2024-03-29T15:59:51.149Z · LW · GW

You want to learn an embedding of the opportunities you have in a given state (or for a given state-action), rather than just its potential rewards. Rewards are too sparse of a signal.

More formally, let's say instead of the Q function, we consider what I would call the Hope function, which, given a state-action pair (s, a), gives you a distribution over states it expects to visit, weighted by the rewards it will get. This can still be phrased using the Bellman equation:

Hope(s, a) = r(s') + γ · Hope(s', a')

The "successor representation" is somewhat close to this. It encodes the distribution over future states a particular policy expects to visit from a particular starting state, and can be learned via the Bellman equation / TD learning.
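For concreteness, here's a toy sketch (mine, not from the original discussion) of learning a tabular successor representation with TD updates for a random-walk policy on a small chain MDP; a Hope-style value for any reward vector then falls out as a matrix product:

```python
import numpy as np

# M(s, s') estimates the discounted expected number of future visits
# to s' starting from s, under the current (here: random-walk) policy.
n_states, gamma, alpha = 5, 0.9, 0.1
M = np.zeros((n_states, n_states))

rng = np.random.default_rng(0)
s = 0
for _ in range(20000):
    # Random-walk policy on a ring of states.
    s_next = (s + rng.choice([-1, 1])) % n_states
    onehot = np.eye(n_states)[s]
    # TD update: M(s, .) += alpha * (1_s + gamma * M(s', .) - M(s, .))
    M[s] += alpha * (onehot + gamma * M[s_next] - M[s])
    s = s_next

# Values for any reward vector r follow as V = M @ r.
r = np.zeros(n_states)
r[3] = 1.0
V = M @ r
```

The point being that M is learned from state visitation alone, so the sparse reward signal only enters at the final matrix product.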

Comment by cfoster0 on AI #42: The Wrong Answer · 2023-12-14T19:01:02.673Z · LW · GW

On reflection these were bad thresholds, should have used maybe 20 years and a risk level of 5%, and likely better defined transformational. The correlation is certainly clear here, the upper right quadrant is clearly the least popular, but I do not think the 4% here is lizardman constant.

Wait, what? Correlation between what and what? 20% of your respondents chose the upper right quadrant (transformational/safe). You meant the lower left quadrant, right?

Comment by cfoster0 on Apocalypse insurance, and the hardline libertarian take on AI risk · 2023-11-28T03:20:45.992Z · LW · GW

Very surprised there's no mention here of Hanson's "Foom Liability" proposal:

Comment by cfoster0 on Thoughts on open source AI · 2023-11-03T16:22:57.428Z · LW · GW

I appreciate that you are putting thought into this. Overall I think that "making the world more robust to the technologies we have" is a good direction.

In practice, how does this play out?

Depending on the exact requirements, I think this would most likely amount to an effective ban on future open-sourcing of generalist AI models like Llama2 even when they are far behind the frontier. Three reasons that come to mind:

  1. The set of possible avenues for "novel harms" is enormous, especially if the evaluation involves "the ability to finetune [...], external tooling which can be built on top [...], and API calls to other [SOTA models]". I do not see any way to clearly establish "no novel harms" with such a boundless scope. Heck, I don't even expect proprietary, closed-source models to be found safe in this way.
  2. There are many, many actors in the open-source space, working on many, many AI models (even just fine-tunes of LLaMA/Llama2). That is kind of the point of open sourcing! It seems unlikely that outside evaluators would be able to evaluate all of these, or for all these actors to do high-quality evaluation themselves. In that case, this requirement turns into a ban on open-sourcing for all but the largest & best-resourced actors (like Meta).
  3. There aren't incentives for others to robustify existing systems or to certify "OK you're allowed to open-source now", the way there are for responsible disclosure. By default, I expect those steps to just not happen, & for that to chill open-sourcing.
Comment by cfoster0 on Propaganda or Science: A Look at Open Source AI and Bioterrorism Risk · 2023-11-02T21:17:23.021Z · LW · GW

If we are assessing the impact of open-sourcing LLMs, it seems like the most relevant counterfactual is the "no open-source LLM" one, right?

Comment by cfoster0 on Propaganda or Science: A Look at Open Source AI and Bioterrorism Risk · 2023-11-02T21:01:33.539Z · LW · GW

Noted! I think there is substantial consensus within the AIS community on a central claim that the open-sourcing of certain future frontier AI systems might unacceptably increase biorisks. But I think there is not much consensus on a lot of other important claims, like for which (future or even current) AI systems open-sourcing is acceptable and for which ones it unacceptably increases biorisks.

Comment by cfoster0 on Propaganda or Science: A Look at Open Source AI and Bioterrorism Risk · 2023-11-02T20:00:02.790Z · LW · GW

(explaining my disagree reaction)

The open source community seems to consistently assume the case that the concerns are about current AI systems and the current systems are enough to lead to significant biorisk. Nobody serious is claiming this

I see a lot of rhetorical equivocation between risks from existing non-frontier AI systems, and risks from future frontier or even non-frontier AI systems. Just this week, an author of the new "Will releasing the weights of future large language models grant widespread access to pandemic agents?" paper was asserting that everyone on Earth has been harmed by the release of Llama2 (via increased biorisks, it seems). It is very unclear to me which future systems the AIS community would actually permit to be open-sourced, and I think that uncertainty is a substantial part of the worry from open-weight advocates.

Comment by cfoster0 on Snapshot of narratives and frames against regulating AI · 2023-11-01T18:07:22.903Z · LW · GW

Note that the outlook from MIRI folks appears to somewhat agree with this, that there does not exist an authority that can legibly and correctly regulate AI, except by stopping it entirely.

Comment by cfoster0 on [deleted post] 2023-10-31T15:38:34.948Z

At face value it seems not-credible that "everyone on Earth" has been actually harmed by the release of Llama2. You could try to make a case that there's potential for future harms downstream of Llama2's release, and that those speculative future harms could impact everyone on Earth, but I have no idea how one would claim that they have already been harmed.

Comment by cfoster0 on AI as a science, and three obstacles to alignment strategies · 2023-10-26T17:18:03.526Z · LW · GW

I agree that they are related. In the context of this discussion, the critical difference between SGD and evolution is somewhat captured by your Assumption 1:

Fixed 'fitness function' or objective function mapping genome to continuous 'fitness score'

Evolution does not directly select/optimize the content of minds. Evolution selects/optimizes genomes based (in part) on how they distally shape what minds learn and what minds do (to the extent that impacts reproduction), with even more indirection caused by selection's heavy dependence on the environment. All of that creates a ton of optimization "slack", such that large-brained human minds with language could steer optimization far faster & more decisively than natural selection could. This is what 1a3orn was pointing to earlier with

evolution does not grow minds, it grows hyperparameters for minds. When you look at the actual process for how we actually start to like ice-cream -- namely, we eat it, and then we get a reward, and that's why we like it -- then the world looks a lot less hostile, and misalignment a lot less likely.

SGD does not have that slack by default. It acts directly on cognitive content (associations, reflexes, decision-weights), without slack or added indirection. If you control the training dataset/environment, you control what is rewarded and what is penalized, and if you are using SGD, then this lets you directly mold the circuits in the model's "brain" as desired. That is one of the main alignment-relevant intuitions that gets lost when blurring the evolution/SGD distinction.
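To make the slack difference concrete, here's a toy sketch (all names and numbers mine, purely illustrative): the inner loop is SGD acting directly on the weights with per-example credit assignment, while the outer "evolution"-style loop never touches the weights at all, and only scores whole "genomes" (here, a single hyperparameter) by aggregate fitness:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w

def inner_sgd(lr, steps=1000):
    # Inner learner: SGD acts directly on the weights ("cognitive
    # content"), with per-example credit assignment from the gradient.
    w = np.zeros(3)
    for t in range(steps):
        i = t % len(X)
        grad = 2 * (X[i] @ w - y[i]) * X[i]
        w -= lr * grad
    return w

def fitness(lr):
    # The outer loop never sees w or gradients, only aggregate fitness.
    w = inner_sgd(lr)
    return -np.mean((X @ w - y) ** 2)

# "Evolution"-style selection over genomes (here just a learning rate).
population = [10 ** rng.uniform(-4, -1.5) for _ in range(20)]
best_lr = max(population, key=fitness)
w_best = inner_sgd(best_lr)
```

The outer loop's only lever is which hyperparameters to keep, i.e. it grows hyperparameters for learners rather than growing the learned weights themselves.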

Comment by cfoster0 on Buck's Shortform · 2023-10-26T00:01:06.151Z · LW · GW

An attempt was made last year, as an outgrowth of some assorted shard theory discussion, but I don't think it got super far:

Comment by cfoster0 on AI as a science, and three obstacles to alignment strategies · 2023-10-25T23:57:43.698Z · LW · GW

Note: I just watched the videos. I personally would not recommend the first video as an explanation to a layperson if I wanted them to come away with accurate intuitions around how today's neural networks learn / how we optimize them. What it describes is a very different kind of optimizer, one explicitly patterned after natural selection, such as a genetic algorithm or population-based training, and the follow-up video more or less admits this. I would personally recommend they opt for these videos instead:

Comment by cfoster0 on unRLHF - Efficiently undoing LLM safeguards · 2023-10-13T16:49:53.120Z · LW · GW

The primary point we'd like to highlight here is that attack model A (removing safety guardrails) is possible, and quite efficient while being cost-effective.

Definitely. Despite my frustrations, I still upvoted your post because I think exploring cost-effective methods to steer AI systems is a good thing.

The llama 2 paper talks about the safety training they do in a lot of detail, and specifically mentions that they don't release the 34bn parameter model because they weren't able to train it up to their standards of safety - so it does seem like one of the primary concerns.

I understand you as saying (1) "[whether their safety guardrails can be removed] does seem like one of the primary concerns". But IMO that isn't the right way to interpret their concerns, and we should instead think (2) "[whether their models exhibit safe chat behavior out of the box] does seem like one of their primary concerns". Interpretation 2 explains the decisions made by the Llama2 authors, including why they put safety guardrails on the chat-tuned models but not the base models, as well as why they withheld the 34B one (since they could not get it to exhibit safe chat behavior out of the box). But under interpretation 1, a bunch of observations are left unexplained, like that they also released model weights without any safety guardrails, and that they didn't even try to evaluate whether their safety guardrails can be removed (for ex. by fine-tuning the weights). In light of this, I think the Llama2 authors were deliberate in the choices that they made, they just did so with a different weighting of considerations than you.

Comment by cfoster0 on unRLHF - Efficiently undoing LLM safeguards · 2023-10-13T01:30:49.394Z · LW · GW

In doing so, we hope to demonstrate a failure mode of releasing model weights - i.e., although models are often red-teamed before they are released (for example, Meta's LLaMA 2 "has undergone testing by external partners and internal teams to identify performance gaps and mitigate potentially problematic responses in chat use cases",) adversaries can modify the model weights in a way that makes all the safety red-teaming in the world ineffective.

I feel a bit frustrated about the way this work is motivated, specifically the way it assumes a very particular threat model. I suspect that if you had asked the Llama2 researchers whether they were trying to make end-users unable to modify the model in unexpected and possibly-harmful ways, they would have emphatically said "no". The rationale for training + releasing in the manner they did is to give everyday users a convenient model they can have normal/safe chats with right out of the box, while still letting more technical users arbitrarily modify the behavior to suit their needs. Heck, they released the base model weights to make this even easier! From their perspective, calling the cheap end-user modifiability of their model a "failure mode" seems like an odd framing.

EDIT: On reflection, I think my frustration is something like "you show that X is vulnerable under attack model A, but it was designed for a more restricted attack model B, and that seems like an unfair critique of X". I would rather you just argue directly for securing against the more pessimistic attack model.

Comment by cfoster0 on Why aren't more people in AIS familiar with PDP? · 2023-09-01T19:11:34.111Z · LW · GW

Parallel distributed processing (as well as "connectionism") is just an early name for the line of work that was eventually rebranded as "deep learning". They're the same research program.

Comment by cfoster0 on Introducing the Center for AI Policy (& we're hiring!) · 2023-08-30T19:16:40.319Z · LW · GW

Credit for changing the wording, but I still feel this does not adequately convey how sweeping the impact of the proposal would be if implemented as-is. Foundation model-related work is a sizeable and rapidly growing chunk of active AI development. Of the 15K pre-print papers posted on arXiv under the cs.AI category this year, 2K appear to be related to language models. The most popular Llama2 model weights alone have north of 500K downloads to date, and foundation model-related repos have been trending on GitHub for months. "People working with [a few technical labs'] models" is a massive community containing many thousands of developers, researchers, and hobbyists. It is important to be honest about how they will likely be impacted by this proposed regulation.

Comment by cfoster0 on Measuring and Improving the Faithfulness of Model-Generated Reasoning · 2023-07-31T15:27:46.830Z · LW · GW

If you have checkpoints from different points in training of the same models, you could do a comparison between different-size models at the same loss value (performance). That way, you're actually measuring the effect of scale alone, rather than scale confounded by performance.

Comment by cfoster0 on How LLMs are and are not myopic · 2023-07-29T01:38:03.467Z · LW · GW

It would definitely move the needle for me if y'all are able to show this behavior arising in base models without forcing, in a reproducible way.

Comment by cfoster0 on GPT-2's positional embedding matrix is a helix · 2023-07-21T20:18:52.280Z · LW · GW

Good question. I don't have a tight first-principles answer. The helix puts a bit of positional information in the variable magnitude (otherwise it'd be an ellipse, which would alias different positions) and a bit in the variable rotation, whereas the straight line is the far extreme of putting all of it in the magnitude. My intuition is that (in a transformer, at least) encoding information through the norm of vectors + acting on it through translations is "harder" than encoding information through (almost-) orthogonal subspaces + acting on it through rotations.

Relevant comment from Neel Nanda:

Comment by cfoster0 on GPT-2's positional embedding matrix is a helix · 2023-07-21T14:58:14.804Z · LW · GW

Very cool! I believe this structure allows expressing the "look back N tokens" operation (perhaps even for different Ns across different heads) via a position-independent rotation (and translation?) of the positional subspace of query/key vectors. This sort of operation is useful if many patterns in the dataset depend on the relative arrangement of tokens (for ex. common n-grams) rather than their absolute positions. Since all these models use absolute positional embeddings, the positional embeddings have to contort themselves to make this happen.
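To illustrate with a minimal 2-d sketch (mine, not from the post): if the positional embeddings trace out the rotational part of a helix, then a single fixed, position-independent rotation implements "look back N tokens" at every position:

```python
import numpy as np

# p_t = (cos(w t), sin(w t)): the circular component of a helix.
w, N = 0.3, 5
pos = lambda t: np.array([np.cos(w * t), np.sin(w * t)])

# One fixed rotation by -N*w maps every p_t onto p_{t-N}, regardless
# of t, so a QK circuit can attend N tokens back with a single matrix.
theta = -N * w
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

for t in [7, 20, 113]:
    assert np.allclose(R @ pos(t), pos(t - N))
```

Different heads could in principle use different rotation angles to look back different distances.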

Comment by cfoster0 on [deleted post] 2023-07-19T21:57:11.856Z

It's absolutely fine if you want to use AI to help summarize content, and then you check that content and endorse it.

I still ask if you could please flag it as such, so the reader can make an informed decision about how to read/respond to the content?

Comment by cfoster0 on [deleted post] 2023-07-19T16:57:18.560Z

Is this an AI summary (or your own writing)? If so, would you mind flagging it as such?

Comment by cfoster0 on Winners of AI Alignment Awards Research Contest · 2023-07-13T19:06:13.712Z · LW · GW

The main takeaway (translated to standard technical language) is it would be useful to have some structured representation of the relationship between terminal values and instrumental values (at many recursive “layers” of instrumentality), analogous to how Bayes nets represent the structure of a probability distribution. That would potentially be more useful than a “flat” representation in terms of preferences/utility, much like a Bayes net is more useful than a “flat” probability distribution.

That’s an interesting and novel-to-me idea. That said, the paper offers [little] technical development of the idea.

I believe Yoav Shoham has done a bit of work on this, attempting to create a formalism & graphical structure similar to Bayes nets for reasoning about terminal/instrumental value. See these two papers:

Comment by cfoster0 on Foom Liability · 2023-07-01T00:12:43.564Z · LW · GW

I think we're more or less on the same page now. I am also confused about the applicability of existing mechanisms. My lay impression is that there isn't much clarity right now.

For example, this uncertainty about who's liable for harms from AI systems came up multiple times during the recent AI hearings before the US Senate, in the context of Section 230's shielding of computer service providers from certain liabilities and to what extent it & other laws extend here. In response to Senator Graham asking about this, Sam Altman straight up said "We're claiming we need to work together to find a totally new approach. I don't think Section 230 is even the right framework."

Comment by cfoster0 on Foom Liability · 2023-06-30T23:02:09.241Z · LW · GW

I see. The liability proposal isn't aimed at near-miss scenarios with no actual harm. It is aimed at scenarios with actual harm, but where that actual harm falls short of extinction + the conditions contributing to the harm were of the sort that might otherwise contribute to extinction.

You said no one had named "a specific actionable harm that's less than extinction" and I offered one (the first that came to mind) that seemed plausible, specific, and actionable under Hanson's "negligent owner monitoring" condition.

To be clear, though, if I thought that governments could just prevent negligent owner monitoring (& likewise with some of the other conditions) as you suggested, I would be in favor of that!

EDIT: Someone asked Hanson to clarify what he meant by "near-miss" such that it'd be an actionable threshold for liability, and he responded:

Any event where A causes a hurt to B that A had a duty to avoid, the hurt is mediated by an AI, and one of those eight factors I list was present.

Comment by cfoster0 on Foom Liability · 2023-06-30T22:19:30.258Z · LW · GW

Can you re-state that? I find the phrasing of your question confusing.

(Are you saying there is no harm in the near-miss scenarios, so liability doesn't help? If so I disagree.)

Comment by cfoster0 on Foom Liability · 2023-06-30T20:21:08.546Z · LW · GW

Hanson does not ignore this, he is very clear about it

it seems plausible that for every extreme scenario like [extinction by foom] there are many more “near miss” scenarios which are similar, but which don’t reach such extreme ends. For example, where the AI tries but fails to hide its plans or actions, where it tries but fails to wrest control or prevent opposition, or where it does these things yet its abilities are not broad enough for it to cause existential damage. So if we gave sufficient liability incentives to AI owners to avoid near-miss scenarios, with the liability higher for a closer miss, those incentives would also induce substantial efforts to avoid the worst-case scenarios.

The purpose of this kind of liability is to provide an incentive gradient pushing actors away from the preconditions of harm. Many of those preconditions are applicable to harms at differing scales. For example, if an actor allowed AI systems to send emails in an unconstrained and unmonitored way, that negligence is an enabler for both automated spear-phishing scams (a "lesser harms") and for AI-engineered global pandemics.

Comment by cfoster0 on Why Not Subagents? · 2023-06-24T18:56:24.560Z · LW · GW

As I understand this, the rough sketch of this approach is basically to realize that incomplete preferences are compatible with a family of utility functions rather than a single one (since they don't specify how to trade-off between incomparable outcomes), and that you can use randomization to select within this family (implemented via contracts), thereby narrowing in on completed preferences / a utility function. Is that description on track?

If so, is it a problem that the subagents/committee/market may have preferences that are a function of this dealmaking process, like preferences about avoiding the coordination/transaction costs involved, or preferences about how to do randomization? Like, couldn't you end up with a situation where "completing the preferences" is dispreferred, such that the individual subagents do not choose to aggregate into a single utility maximizer?

Comment by cfoster0 on Critiques of prominent AI safety labs: Conjecture · 2023-06-13T00:28:38.528Z · LW · GW

Having known some of Conjecture's founders and their previous work in the context of "early-stage EleutherAI", I share some[1] of the main frustrations outlined in this post. At the organizational level, even setting aside the departure of key researchers, I do not think that Conjecture's existing public-facing research artifacts have given much basis for me to recommend the organization to others (aside from existing personal ties). To date, only[2] a few posts like their one on the polytope lens and their one on circumventing interpretability were at the level of quality & novelty I expected from the team. Maybe that is a function of the restrictive information policies, maybe a function of startup issues, maybe just the difficulty of research. In any case, I think that folks ought to require more rigor and critical engagement from their future research outputs[3].

  1. ^

    I didn't find the critiques of Connor's "character and trustworthiness" convincing, but I already consider him a colleague & a friend, so external judgments like these don't move the needle for me.

  2. ^

The main other post I have in mind was their one on simulators. AFAICT the core of "simulator theory" predated Conjecture (dating to mid-2021, at least), and yet even with a year of additional incubation, the framework was not brought to a sufficient level of technical quality.

  3. ^

    For example, the "cognitive emulation" work may benefit from review by outside experts, since the nominal goal seems to be to do cognitive science entirely inside of Conjecture.

Comment by cfoster0 on Think carefully before calling RL policies "agents" · 2023-06-02T16:19:45.367Z · LW · GW

I’m a bit confused about what you’re proposing. AlphaZero has an input (board state) and an output (move). Are you proposing to call this input-output function “a policy”?

If so, sure we can say that, but I think people would find it confusing—because there’s a tree search in between the input and output, and one ingredient of the tree search is the “policy network” (or maybe just “policy head”, I forget), but here the relation between the “policy network” and the final input-output function is very indirect, such that it seems odd to use (almost) the same term for them.

In my head, a policy is just a situation-dependent way of acting. Sometimes that way of acting makes use of foresight, sometimes that way of acting is purely reflexive. I mentally file the AlphaZero policy network + tree search combination as a "policy", one separate from the "reactive policy" defined by just using the policy network without tree search. Looking back at Sutton & Barto, they define "policy" similarly:

A policy defines the learning agent’s way of behaving at a given time. Roughly speaking, a policy is a mapping from perceived states of the environment to actions to be taken when in those states. It corresponds to what in psychology would be called a set of stimulus–response rules or associations. In some cases the policy may be a simple function or lookup table, whereas in others it may involve extensive computation such as a search process. The policy is the core of a reinforcement learning agent in the sense that it alone is sufficient to determine behavior. In general, policies may be stochastic, specifying probabilities for each action.

(emphasis mine) along with this later description of planning in a model-based RL context:

The word planning is used in several different ways in different fields. We use the term to refer to any computational process that takes a model as input and produces or improves a policy for interacting with the modeled environment

which seems compatible with thinking of planning algorithms like MCTS as components of an improved policy at runtime (not just in training).

That being said, looking at the AlphaZero paper, a quick search did not turn up usages of the term "policy" in this way. So maybe this usage is less widespread than I had assumed.
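To make the usage I have in mind concrete, here's a toy sketch (all functions are illustrative stand-ins, not real AlphaZero code): both the bare network and the network + search combination expose the same state → action mapping, so both count as "policies" in the Sutton & Barto sense:

```python
def policy_net(state):
    # Stand-in for a learned network: scores for actions 0 and 1.
    return {0: state % 3, 1: (state + 1) % 3}

def reactive_policy(state):
    # Policy as a simple state -> action mapping (no search).
    scores = policy_net(state)
    return max(scores, key=scores.get)

def search_policy(state, model, depth=2):
    # Policy that "involves extensive computation such as a search
    # process": toy depth-limited lookahead through a transition model,
    # scoring leaves with the network. Still just state -> action.
    def value(s, d):
        if d == 0:
            return max(policy_net(s).values())
        return max(value(model(s, a), d - 1) for a in (0, 1))
    return max((0, 1), key=lambda a: value(model(state, a), depth - 1))

model = lambda s, a: s + a + 1  # toy deterministic transition model
```

On this framing, AlphaZero's policy network + MCTS is one policy and the bare policy network is another; the disagreement is only over whether common usage reserves "policy" for the latter.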

Comment by cfoster0 on Yudkowsky vs Hanson on FOOM: Whose Predictions Were Better? · 2023-06-02T03:34:20.592Z · LW · GW

I think requiring a "common initialization + early training trajectory" is a pretty huge obstacle to knowledge sharing, and would de-facto make knowledge sharing among the vast majority of large language models infeasible.

Agreed. That part of my comment was aimed only at the claim about weight averaging only working for diffusion/image models, not about knowledge sharing more generally.

I do think stuff like stitching via cross-attention is kind of interesting, but it feels like a non-scalable way of knowledge sharing, unless I am misunderstanding how it works.

Not sure I see any particular argument against the scalability of knowledge exchange between LLMs in general or via cross-attention, though. Especially if we're comparing the cost of transfer to the cost of re-running the original training. That's why people are exploring this, especially smaller/independent researchers. There are a bunch of concurrent recent efforts to take frozen unimodal models and stitch them into multimodal ones (one example came out just a few days ago). Heck, the dominant approach in the community of LLM hobbyists seems to be transferring behaviors and knowledge from GPT-4 into LLaMA variants via targeted synthetic data generation. What kind of scalability are you thinking of?

Comment by cfoster0 on Yudkowsky vs Hanson on FOOM: Whose Predictions Were Better? · 2023-06-02T01:58:07.596Z · LW · GW

The part where you can average weights is unique to diffusion models, as far as I can tell, which makes sense because the 2-d structure of the images is very local, and so this establishes a strong preferred basis for the representations of different networks.

Exchanging knowledge between two language models currently seems approximately impossible? Like, you can train on the outputs, but I don't think there is really any way for two language models to learn from each other by exchanging any kind of cognitive content, or to improve the internal representations of a language model by giving it access to the internal representations of another language model.

There's a pretty rich literature on this stuff, transferring representational/functional content between neural networks.

Averaging weights to transfer knowledge is not unique to diffusion models. It works on image models trained with non-diffusion setups, as well as on non-image tasks such as language modeling. Exchanging knowledge between language models via weight averaging is possible provided that the models share a common initialization + early training trajectory. And if you allow for more methods than weight averaging, simple stuff like knowledge distillation or stitching via cross-attention are tricks known to work for transferring knowledge.
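As a minimal sketch of the weight-averaging trick (toy arrays standing in for real checkpoints; this is only known to work in the shared-init regime, and independently trained models generally need alignment tricks first):

```python
import numpy as np

def average_state_dicts(sd_a, sd_b, alpha=0.5):
    # Parameter-wise interpolation; assumes identical architectures and,
    # crucially, a shared init + early training trajectory.
    return {k: alpha * sd_a[k] + (1 - alpha) * sd_b[k] for k in sd_a}

# Toy "checkpoints" branched from the same initialization.
init = {"w": np.ones((2, 2)), "b": np.zeros(2)}
ckpt_a = {k: v + 0.1 for k, v in init.items()}  # stand-in: fine-tuned on task A
ckpt_b = {k: v - 0.1 for k, v in init.items()}  # stand-in: fine-tuned on task B

merged = average_state_dicts(ckpt_a, ckpt_b)
```

The merged dict can then be loaded back into the shared architecture, which is the basic move behind the "model soup" style results.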

Comment by cfoster0 on Steering GPT-2-XL by adding an activation vector · 2023-05-15T17:30:18.368Z · LW · GW

I wonder if this is related to how GPT-J runs the attention and MLP sublayers in parallel, as opposed to sequentially?
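Schematically (my sketch, with toy stand-ins for the real sublayers and with norms omitted), the difference is whether the MLP reads the attention output or the same input as attention:

```python
import numpy as np

def attn(x):  # stand-in for the attention sublayer
    return 0.5 * x

def mlp(x):  # stand-in for the MLP sublayer
    return np.tanh(x)

def sequential_block(x):
    # Standard GPT-2-style block: MLP sees the attention output.
    x = x + attn(x)
    return x + mlp(x)

def parallel_block(x):
    # GPT-J-style block: both sublayers read the same input,
    # and their outputs are summed into the residual stream.
    return x + attn(x) + mlp(x)
```

If steering vectors interact with the residual stream differently in the two layouts, that could plausibly show up as the behavior difference noted above.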

Comment by cfoster0 on A Case for the Least Forgiving Take On Alignment · 2023-05-08T02:16:43.620Z · LW · GW

I didn't mean "learning from experience" to be restrictive in that way. Animals learn by observing others & from building abstract mental models too. But unless one acquires abstracted knowledge via communication, learning requires some form of experience: even abstracted knowledge is derived from experience, whether actual or imagined. Moreover, I don't think that some extra/different planning machinery was required for language itself, beyond the existing abstraction and model-based RL capabilities that many other animals share. But ultimately that's an empirical question.

Hmm, we may have reached the point from which we're not going to move on without building mathematical frameworks and empirically testing them, or something.

Yeah I am probably going to end my part of the discussion tree here.

My overall take remains:

  • There may be general purpose problem-solving strategies that humans and non-human animals alike share, which explain our relative capability gains when combined with the unlocks that came from language/culture.
  • We don't need any human-distinctive "general intelligence" property to explain the capability differences among human-, non-human animal-, and artificial systems, so we shouldn't assume that there's any major threshold ahead of us corresponding to it.
Comment by cfoster0 on A Case for the Least Forgiving Take On Alignment · 2023-05-08T01:17:29.620Z · LW · GW

I guess I don't see much support for such mutual dependence. Other animals have working memory + finite state control, and learn from experience in flexible ways. It appears pretty useful to them despite the fact they don't have language/culture. The vast majority of our useful computing is done by systems that have Turing-completeness but not language/cultural competence. Language models sure look like they have language ability without Turing-completeness and without having picked up some "universal planning algorithm" that would render our previous work w/ NNs ~useless.

Why choose a theory like "the capability gap between humans and other animals is because the latter is missing language/culture and also some binary GI property" over one like "the capability gap between humans and other animals is just because the latter is missing language/culture"? IMO the latter is simpler and better fits the evidence.

Comment by cfoster0 on A Case for the Least Forgiving Take On Alignment · 2023-05-07T23:58:11.386Z · LW · GW

I think what I'm trying to get at, here, is that the ability to use these better, self-derived abstractions for planning is nontrivial, and requires a specific universal-planning algorithm to work. Animals et al. learn new concepts and their applications simultaneously: they see e. g. a new fruit, try eating it, their taste receptors approve/disapprove of it, and they simultaneously learn a concept for this fruit and a heuristic "this fruit is good/bad". They also only learn new concepts downstream of actual interactions with the thing; all learning is implemented by hard-coded reward circuitry.

Humans can do more than that. As in my example, you can just describe to them e. g. a new game, and they can spin up an abstract representation of it and derive heuristics for it autonomously, without engaging hard-coded reward circuitry at all, without doing trial-and-error even in simulations. They can also learn new concepts in an autonomous manner, by just thinking about some problem domain, finding a connection between some concepts in it, and creating a new abstraction/chunking them together.

Hmm I feel like you're underestimating animal cognition / overestimating how much of what humans can do comes from unique algorithms vs. accumulated "mental content". Non-human animals don't have language, culture, and other forms of externalized representation, including the particular human representations behind "learning the rules of a game". Without these in place, even if one was using the "universal planning algorithm", they'd be precluded from learning through abstract description and from learning through manipulation of abstract game-structure concepts. All they've got is observation, experiment, and extrapolation from their existing concepts. But lacking the ability to receive abstract concepts via communication doesn't mean that they cannot synthesize new abstractions as situations require. I think there's good evidence that other animals can indeed do that.

General intelligence is an algorithm for systematic derivation of such "other changes".

Does any of that make sense to you?

I get what you're saying but disbelieve the broader theory. I think the "other changes" (innovations/useful context-specific improvements) we see in reality aren't mostly attributable to the application of some simple algorithm, unless we abstract away all of the details that did the actual work. There are general purpose strategies (for ex. the "scientific method" strategy, which is an elaboration of the "model-based RL" strategy, which is an elaboration of the "trial and error" strategy) that are widely applicable for deriving useful improvements. But those strategies are at a very high level of abstraction, whereas the bulk of improvement comes from using strategies to accumulate lower-level concrete "content" over time, rather than merely from adopting a particular strategy.

(Would again recommend Hanson's blog on "The Betterness Explosion" as expressing my side of the discussion here.)

Comment by cfoster0 on A Case for the Least Forgiving Take On Alignment · 2023-05-06T22:53:08.700Z · LW · GW

Ok I think this at least clears things up a bit.

To become universally capable, a system needs two things:

  1. "Turing-completeness": A mechanism by which it can construct arbitrary mathematical objects to describe new environments (including abstract environments).
  2. "General intelligence": an algorithm that can take in any arbitrary mathematical object produced by (1), and employ it for planning.

General intelligence isn't Turing-completeness itself. Rather, it's a planning algorithm that has Turing-completeness as a prerequisite. Its binariness is inherited from the binariness of Turing-completeness.

Based on the above, I don't understand why you expect what you say you're expecting. We blew past the Turing-completeness threshold decades ago with general purpose computers, and we've combined them with planning algorithms in lots of ways.

Take AIXI, which uses the full power of Turing-completeness to do model-based planning with every possible abstraction/model. To my knowledge, switching over to that kind of fully-general planning (or any of its bounded approximations) hasn't actually produced corresponding improvements in quality of outputs, especially compared to the quality gains we get from other changes. I think our default expectation should be that the real action is in accumulating those "other changes". On the theory that the gap between human- and nonhuman animal- cognition is from us accumulating better "content" (world model concepts, heuristics, abstractions, etc.) over time, it's no surprise that there's no big phase change from combining Turing machines with planning!
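For reference (my own rendering of the standard definition, not something quoted from the thread), AIXI's action rule in Hutter's formulation selects actions by maximizing expected reward over all computable environments, i.e. all programs q for a universal machine U, weighted by their complexity penalty 2^(-ℓ(q)):

```latex
a_k \;=\; \arg\max_{a_k} \sum_{o_k r_k} \cdots\, \max_{a_m} \sum_{o_m r_m}
\bigl[\, r_k + \cdots + r_m \,\bigr]
\sum_{q \,:\; U(q,\, a_1 \ldots a_m) \,=\, o_1 r_1 \ldots o_m r_m} 2^{-\ell(q)}
```

The point being made in the text is visible in the formula: the "full power of Turing-completeness" enters only through the sum over all programs q, and that sum is exactly what makes the scheme incomputable and its bounded approximations expensive.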

General intelligence is the capability that makes this possible, the algorithm you employ for this "abstract analysis". As I'd stated, its main appeal is that it doesn't require practical experience with the problem domain (simulated or otherwise) — only knowledge of its structure.


I think what you describe here and in the content prior is more or less "model-based reinforcement learning with state/action abstraction", which is the class of algorithms that answer the question "What if we did planning towards goals but with learned/latent abstractions?" As far I can tell, other animals do this as well. Yes, it takes a more impressive form in humans because language (and the culture + science it enabled) has allowed us to develop more/better abstractions to plan with, but I see no need to posit some novel general capability in addition.

Comment by cfoster0 on A Case for the Least Forgiving Take On Alignment · 2023-05-06T00:00:50.198Z · LW · GW

I think I am confused where you're thinking the "binary/sharp threshold" is.

Are you saying there's some step-change in the architecture of the mind, in the basic adaption/learning algorithms that the architecture runs, in the content those algorithms learn? (or in something else?)

If you're talking about...

  • ... an architectural change → Turing machines and their neural equivalents, for example, over, say, DFAs and simple associative memories. There is a binary threshold going from non-general to general architectures, where the latter can support programs/algorithms that the former cannot emulate. This includes whatever programs implement "understanding an arbitrary new domain" as you mentioned. But once we cross that very minimal threshold (namely, combining memory with finite state control), remaining improvements come mostly from increasing memory capacity and finding better algorithms to run, neither of which are a single binary threshold. Humans and many non-human animals alike seem to have similarly general architectures, and likewise general artificial architectures have existed for a long time, so I would say "there indeed is a binary/sharp threshold [in architectures] but it is very low, such that we've already crossed it".
  • ... a change in algorithm → Model-based RL, General Problem Solver, AIXI, the Gödel machine algorithm, gradient descent over sufficiently massive datasets are candidates for algorithms that can do or learn to do "general-purpose problem-solving". But none of these are efficient in general, and I don't see any reason to think that there's some secret-sauce algorithm like them distinguishing human thinking from that of non-human animals. Other animals remember their experiences, pursue goals, creatively experiment with different strategies, etc. It seems much more plausible to me that other animals (incl. our primate cousins) are running similar base learning/processing algorithms on similar (but possibly smaller capacity) hardware, & the game-changer was that humans were able to accumulate more/better learned content for those algorithms to leverage.
  • ... a change in content→ I agree that there was a massive change here, and I think this is responsible for the apparent differences in capabilities. Earlier I claimed that this happened because the advent of language & culture allowed content to accumulate in ways that were previously not feasible. But the accumulation of content was a continuous process, we didn't acquire some new binary property. Moreover, these continuous changes in content as a function of our learning process + data are exactly the kind of changes that we're already used to supervising in ML, & exactly where we are already expending our efforts. Why will this blindside us?
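To make the architectural threshold in the first bullet concrete, here's a toy sketch of my own (hypothetical code, not from the original discussion): a DFA with any fixed number of states must conflate nesting depths beyond some bound, so it cannot recognize balanced parentheses in general, while finite control plus a single unbounded counter — the minimal "memory + finite state control" combination — handles arbitrary depth:

```python
def dfa_balanced(s, max_states=3):
    """A DFA can only track nesting depth up to a fixed bound: with k
    states it must conflate all depths >= k, so it fails on deeply
    nested input."""
    state = 0
    for ch in s:
        if ch == "(":
            state = min(state + 1, max_states)  # depth info saturates here
        elif ch == ")":
            state = max(state - 1, 0)
    return state == 0

def counter_balanced(s):
    """Finite control + one unbounded counter: correctly recognizes
    balanced parentheses at any nesting depth."""
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0

print(dfa_balanced("(((()))"))      # → True (wrong: 4 opens, 3 closes)
print(counter_balanced("(((()))"))  # → False (correct)
```

Crossing that threshold is cheap; as the bullet says, the remaining gains come from memory capacity and better algorithms, not from further "generality".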

Consider an algorithm implementing a simple arithmetic calculator, or a symbolic AI from a FPS game, or LLMs as they're characterized in this post. These cognitive systems do not have the machinery to arrive at understanding this way. There are no execution-paths of their algorithms such that they arrive at understanding; no algorithm-states that correspond to "this system has just learned a new physics discipline". [...]

If true generality doesn't exist, it would stand to reason that humans are the same. There should be aspects of reality such that there's no brain-states of us that correspond to us understanding them; there should only be a limited range of abstract objects our mental representations can support.

When you say "machinery" here it makes me think you're talking about architecture, but in that case the lack of execution-paths that arrive at learning new physics seems like it is explained by "simple arithmetic calculators + FPS AIs + LLMs are not Turing-complete systems / have too little memory / are not running learning algorithms at all", without the need to hypothesize a separate "general intelligence" variable.

(Incidentally, it doesn't seem obvious to me that scaffolded LLMs are particularly non-general in their understanding 🤔 Especially if we are willing to say yes to questions like "Can humans understand how 16-dimensional space works?" despite the fact that we cannot natively/reliably manipulate those in our minds whereas there are computer programs that can.)

To me, this sounds like you're postulating the existence of a simple algorithm for general-purpose problem-solving which is such that it would be convergently learned by circa-1995 RNNs. Rephrasing, this hypothetical assumes that the same algorithm can be applied to efficiently solve a wide variety of problems, and that it can usefully work even at the level of complexity at which 1995-RNNs were operating.

Sounds like I miscommunicated here. No, my position (and what I was asking about in the hypothetical) is that there are general-purpose architectures + general-purpose problem-solving algorithms that run on those architectures, that they are simple and inefficient (especially given their huge up-front fixed costs), that they aren't new or mysterious (the architectures are used already, far predating humans, & the algorithms are simple), and that we already can see that this sort of generality is not really "where the action is", so to speak.

Conversely, my position is that the algorithm for general intelligence is only useful if it's operating on a complicated world-model + suite of heuristics: there's a threshold of complexity and compute requirements (which circa-1995 RNNs were far below), and general intelligence is an overkill to use for simple problems (so RNNs wouldn't have convergently learned it; they would've learned narrow specialized algorithms instead).

Agreed? This is compatible with an alternative theory, that many other animals do have "the algorithm for general intelligence" you refer to, but that they're running it with less impressive content (world models & heuristics). And likewise with a theory that AI folks already had/have the important discrete generalist algorithmic insights, & instead what they need is a continuous pileup of good cognitive content. Why do you prefer the theory that in both cases, there is some missing binary thing?

Comment by cfoster0 on A Case for the Least Forgiving Take On Alignment · 2023-05-03T18:00:43.552Z · LW · GW

Thanks! Appreciate that you were willing to go through with this exercise.

I would expect to observe much greater diversity in cognitive capabilities of animals, for humans to generalize poorer, and for the world overall to be more incomprehensible to us.


we'd look at the world, and see some systemic processes that are not just hard to understand, but are fundamentally beyond reckoning.

Per reductionism, nothing should be fundamentally incomprehensible or fundamentally beyond reckoning, unless we posit some binary threshold of reckoning-generality. Everything that works reliably operates by way of lawful/robust mechanisms, so arriving at comprehension should look like gradually unraveling those mechanisms, searching for the most important pockets of causal/computational reducibility. That requires investment in the form of time and cumulative mental work.

I think that the behavior of other animals & especially the universe as a whole in fact did start off as very incomprehensible to us, just as incomprehensible as it was to other species. In my view, what caused the transformation from incomprehensibility to comprehensibility was not humans going over a sharp cognitive/architectural threshold, such that on one side their minds were fundamentally unable to understand these things and on the other they were able. Rather, the advent of language & cultural transmission enabled humans over time to pool/chain together their existing abilities to observe the world, retain knowledge, & build better tools such as mental models and experimental instruments. (I believe these "lifetime learning abilities" are shared with many other animals despite their lacking language.) That accumulation of mental work over time is what enabled the seemingly-sharp change relative to historical timescales when humans entered the scene, in my view.

Yup. But I think there are some caveats here. General intelligence isn't just "some cognitive system that has a Turing-complete component inside it", it's "a Turing-complete system for manipulating some specific representations". [...] (Though it may not be a good idea to discuss the specifics publicly.)

I don't think I understand you here, but it sounds like this is something sensitive so I won't probe.

What I would expect to observe if that weren't the case... I would expect GOFAI to have worked. If universally-capable cognition is not only conceptually simple at a high level (which I believe it is), but also doesn't require a mountain of complexly-formatted data on which to work, I'd expect us to have cracked it last century. No need for all this ML business.

(emphasis mine) Hold on: why is that particular additional assumption relevant? A low threshold for generality does not imply that cognitive capabilities are easy or efficient to acquire once you've crossed the threshold. It does not imply that you just have to turn on one of these "universally-capable cognition" machines, without requiring additional hard/expensive/domain-specific work (trial & error, gradient descent over lots of examples, careful engineering, cultural accumulation, etc.) to search for useful cognitive strategies to run on that machine. Analogously, the fact that even very minimal systems can act as Universal Turing Machines does not mean that it is easy to find programs for those systems that exhibit a desired behavior, or that Turing completeness provides some sort of efficient/general-purpose shortcut.

For the record, I think GOFAI did/does work! We now have high-level programming languages, planning algorithms, proof assistants and computer algebra systems, knowledge graphs, decision trees, constraint solvers, etc. etc. all of which are working + economically productive and fall under symbolic AI. It just turned out that different cognitive capabilities benefit from different algorithms, so as we crack different capabilities, the boundaries of "AI" are redrawn to focus on problems that haven't been automated yet.

Comment by cfoster0 on A Case for the Least Forgiving Take On Alignment · 2023-05-03T00:37:16.775Z · LW · GW

Fair. I think this is indeed a nitpick. 😊 In case it wasn't clear, the point remains something like: When we observe/build computational systems in our world that are "better" along some axis than other systems, that "betterness" is not generally derived from having gone over a new threshold of "even more general" computation (they definitely aren't deriving it from hypercomputation, and often aren't even deriving it from universal Turing computation), but through being better suited to the capability in question.

Comment by cfoster0 on A Case for the Least Forgiving Take On Alignment · 2023-05-02T23:29:38.254Z · LW · GW

Agreed that this (or something near it) appears to be a relatively central difference between people's models, and probably at the root of a lot of our disagreement. I think this disagreement is quite old; you can see bits of it crop up in Hanson's posts on the "AI foom" concept way back when. I would put myself in the camp of "there is no such binary intelligence property left for us to unlock". What would you expect to observe, if a binary/sharp threshold of generality did not exist?

A possibly-relevant consideration in the analogy to computation is that the threshold of Turing completeness is in some sense extremely low (see one-instruction set computer, Turing tarpits, Rule 110), and is the final threshold. Rather than a phase shift at the high end, where one must accrue a bunch of major insights before one has a system that they can learn about "computation in general" from, with Turing completeness, one can build very minimal systems and then--in a sense--learn everything that there is to learn from the more complex systems. It seems plausible to me that cognition is just like this.

This raises an additional question beyond the first: What would you expect to observe, if there indeed is a binary/sharp threshold but it is very low, such that we've already crossed it? (Say, if circa-1995 recurrent neural nets already had the required stuff to be past the threshold.) That would be compatible with thinking that insights from interpretability etc. work on pre-threshold systems wouldn't generalize to post-threshold systems, but also compatible with believing that we can do iterative design right now.
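As a concrete illustration of how low that threshold sits (toy code of my own, not from the thread): Rule 110, a one-dimensional cellular automaton proven Turing-complete, fits in a few lines:

```python
RULE = 110

def step(cells):
    """One update of elementary cellular automaton Rule 110. Each cell's
    next value depends only on its left/self/right neighbors (with
    wraparound); the bits of the rule number encode the lookup table."""
    n = len(cells)
    return [
        (RULE >> (cells[(i - 1) % n] * 4 + cells[i] * 2 + cells[(i + 1) % n])) & 1
        for i in range(n)
    ]

# Run a few steps from a single live cell.
cells = [0] * 15 + [1]
for _ in range(5):
    print("".join(".#"[c] for c in cells))
    cells = step(cells)
```

The system that clears the "computation in general" bar is this trivial; all the difficulty lives in finding programs/initial conditions that do anything useful, which is the analogy to cognition being suggested.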

Re: LLMs, I dunno if I buy your story. At face value, what we've seen appears like another instance of the pattern where capabilities we once thought required some core of generality (doing logic & math, planning, playing strategic games, understanding language, creating art, etc.) turned out to be de-composable as any other technology is. That this pattern continues again and again over the decades makes me skeptical that we'll be unable to usefully/safely get the capabilities we want out of AI systems due to the sort of sharp threshold you imagine.

Comment by cfoster0 on Maze-solving agents: Add a top-right vector, make the agent go to the top-right · 2023-04-11T15:44:57.769Z · LW · GW

Super interesting. Did some quick and dirty investigations with LLMs following up on this, to test some hunches. In any case I'm excited to see y'all's subsequent posts.

Comment by cfoster0 on Understanding and controlling a maze-solving policy network · 2023-03-28T02:07:39.302Z · LW · GW

Comment by cfoster0 on Understanding and controlling a maze-solving policy network · 2023-03-28T02:06:02.942Z · LW · GW

This sequence of mental moves, where one boils talk about "motivations" or "goals" or "trying" down into non-motivational, purely mechanical circuit and feedback control patterns, and then also the reverse sequence of mental moves, where one reassembles motivational abstractions out of primitive operations, is possibly the biggest thing I wish I could get folks to learn. I think this is a pretty central pattern in "shard theory" discussions that feels missing from many other places.

Comment by cfoster0 on Remarks 1–18 on GPT (compressed) · 2023-03-21T00:29:06.539Z · LW · GW

It is hard for me to give an honest, thorough, and charitable response to this post. Possibly no fault of the author: this has been a persistent problem for me on many Simulator Theory posts. I always come away with an impression of "I see interesting intuitions mixed with some conceptually/mathematically obscure restatements of existing content mixed with a lot of strained analogy-making mixed with a handful of claims that seem wrong or at least quite imprecise." I'll try to think about how to tease out these different components and offer better feedback, but I figured it'd be worth expressing my frustrated state of mind more directly first.

Comment by cfoster0 on The algorithm isn't doing X, it's just doing Y. · 2023-03-17T01:34:29.146Z · LW · GW

I'm confused what you mean to claim. Understood that a language model factorizes the joint distribution over tokens autoregressively, into the product of next-token distributions conditioned on their prefixes. Also understood that it is possible to instead factorize the joint distribution over tokens into a conditional distribution over tokens conditioned on a latent variable (call it s) weighted by the prior over s. These are claims about possible factorizations of a distribution, and about which factorization the language model uses.

What are you claiming beyond that?

  • Are you claiming something about the internal structure of the language model?
  • Are you claiming something about the structure of the true distribution over tokens?
  • Are you claiming something about the structure of the generative process that produces the true distribution over tokens?
  • Are you claiming something about the structure of the world more broadly?
  • Are you claiming something about correspondences between the above?
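To make the two factorizations in question concrete, here's a toy numeric sketch of my own (made-up distributions, nothing claimed about any actual model): the same joint over a pair of binary tokens can be written as a latent-variable mixture or autoregressively, and the two agree exactly:

```python
import itertools

# Latent-variable factorization: p(x) = sum_s p(s) * p(x | s),
# with a binary latent s and tokens conditionally independent given s.
p_s = {0: 0.3, 1: 0.7}
p_x_given_s = {
    0: {(a, b): [0.9, 0.1][a] * [0.8, 0.2][b] for a in (0, 1) for b in (0, 1)},
    1: {(a, b): [0.2, 0.8][a] * [0.4, 0.6][b] for a in (0, 1) for b in (0, 1)},
}
joint = {
    x: sum(p_s[s] * p_x_given_s[s][x] for s in p_s)
    for x in itertools.product((0, 1), repeat=2)
}

# Autoregressive factorization of the *same* joint:
# p(x) = p(x1) * p(x2 | x1), with s marginalized out.
p_x1 = {a: joint[(a, 0)] + joint[(a, 1)] for a in (0, 1)}
p_x2_given_x1 = {(a, b): joint[(a, b)] / p_x1[a] for a in (0, 1) for b in (0, 1)}

for x in joint:
    assert abs(joint[x] - p_x1[x[0]] * p_x2_given_x1[x]) < 1e-12
print("both factorizations recover the same joint")
```

Since the factorizations are interchangeable at the level of the distribution, any substantive claim has to be about something else — hence the questions above.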

Comment by cfoster0 on The algorithm isn't doing X, it's just doing Y. · 2023-03-17T00:46:39.842Z · LW · GW

AFAICT the bit where there's substantive disagreement is always in the middle regime, not the super-close extreme or the super-far extreme. This is definitely where I feel like debates over the use of frames like simulator theory are.

For example, is the Godot game engine a light transport simulator? In certain respects Godot captures the typical overall appearance of a scene, in a subset of situations. But it actually makes a bunch of weird simplifications and shortcuts under the hood that don't correspond to any real dynamics. That's because it isn't trying to simulate the underlying dynamics of light, it's trying to reproduce certain broad-strokes visible patterns that light produces.

That difference really matters! If you wanna make reliable and high-fidelity predictions about light transport, or if you wanna know what a scene that has a bunch of weird reflective and translucent materials looks like, you may get more predictive mileage thinking about the actual generating equations (or using a physically-based renderer, which does so for you), rather than treating Godot as a "light transport simulator" in this context. Otherwise you've gotta maintain a bunch of special-casing in your reasoning to keep maintaining the illusion.

Comment by cfoster0 on The algorithm isn't doing X, it's just doing Y. · 2023-03-16T23:49:46.404Z · LW · GW

If two tasks reduce to one another, then it is meaningless to ask if a machine is 'really doing' one task versus the other.

It is rare that two tasks exactly reduce to one another. When there's only a partial reduction between two tasks X and Y, it can be genuinely helpful to distinguish "doing X" from "doing Y", because this lossy mapping causes the tails to come apart, such that one mental model extrapolates correctly and the other fails to do so. To the extent that we care about making high-confidence predictions in situations that are significantly out of distribution, or where the stakes are high, this can matter a whole lot.

Comment by cfoster0 on To determine alignment difficulty, we need to know the absolute difficulty of alignment generalization · 2023-03-14T04:36:00.396Z · LW · GW

Definitely worth thinking about.

A core insight from Eliezer is that AI "capabilities generalize further than alignment once capabilities start to generalize far".

That doesn't seem clear to me, but I agree that capabilities generalize by default in the way we'd want them to in the limit, whereas alignment does not do so by default in the limit. But I also think there's a good case to be made that an agent will aim its capabilities towards its current goals, including by reshaping itself and its context to make itself better-targeted at those goals, creating a virtuous cycle wherein increased capabilities lock in & robustify initial alignment, so long as that initial alignment was in a "basin of attraction", so to speak. (Of course, otherwise this is a vicious cycle.)

So the question is, how hard is it build systems that are so aligned they want, in a robust way, to stay aligned with you as they get way more powerful?

Robust in what sense? If we've intent-aligned the AI thus far (it makes its decisions predominantly downstream of the right reasons, given its current understanding), and if the AI is capable, then the AI will want to keep itself aligned with its existing predominant motivations (goal-content integrity). So to the extent that it knows or learns about crucial robustness gaps in itself (even quite abstract knowledge like "I've been wrong about things like this before"), it will make decisions that attempt to fix / avoid / route around those gaps when possible, including by steering itself away from the sorts of situations that would require unusually high robustness levels (this may remind you of conservatism). So I'm not sure exactly how much robustness we will need to engineer to be actually successful here. Though it would certainly be nice to have as much robustness as we can, all else equal.

Comment by cfoster0 on Why do we assume there is a "real" shoggoth behind the LLM? Why not masks all the way down? · 2023-03-10T02:02:34.636Z · LW · GW

Possibly relevant aside:

There may be some confusion here about behavioral vs. mechanistic claims.

I think when some people talk about a model "having a goal" they have in mind something purely behavioral. So when they talk about there being something in GPT that "has a goal of predicting the next token", they mean it in this purely behavioral way. Like that there are some circuits in the network whose behavior has the effect of predicting the next token well, but whose behavior is not motivated by / steering on the basis of trying to predict the next token well.

But when I (and possibly you as well?) talk about a model "having a goal" I mean something much more specific and mechanistic: a goal is a certain kind of internal representation that the model maintains, such that it makes decisions downstream of comparisons between that representation and its perception. That's a very different thing! To claim that a model has such a goal is to make a substantive claim about its internal structure and how its cognition generalizes!
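A minimal sketch of the distinction (my own toy code, not anything claimed about GPT's internals): in the mechanistic sense, the system maintains an explicit internal target and acts downstream of the comparison between that target and its perception; in the behavioral sense, a mechanism with no such comparison can produce the same actions:

```python
def thermostat_policy(target_temp, sensed_temp, gain=0.5):
    """Mechanistic 'goal': an explicit internal target, with the action
    chosen downstream of the target-vs-perception comparison."""
    error = target_temp - sensed_temp
    return gain * error  # heating (+) or cooling (-) effort

# Behavioral 'goal' ascription, by contrast, needs no internal comparison:
# a lookup table can have the *effect* of regulating temperature without
# representing a target at all.
lookup_policy = {15: 2.5, 18: 1.0, 20: 0.0, 22: -1.0}

print(thermostat_policy(20, 15))  # → 2.5
print(lookup_policy[15])          # → 2.5 (same behavior, different mechanism)
```

The two agree on-distribution, but they generalize differently — e.g. only the first does anything sensible for an unseen sensed temperature — which is why the mechanistic claim is the substantive one.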

When people talk about the shoggoth, it sure sounds like they are making claims that there is in fact an agent behind the mask, an agent that has goals. But maybe not? Like, when Ronny talked of the shoggoth having a goal, I assumed he was making the latter, stronger claim about the model having hidden goal-directed cognitive gears, but maybe he was making the former, weaker claim about how we can describe the model's behaviors?