Posts

When and why did 'training' become 'pretraining'? 2024-03-08T14:29:37.948Z
Theories of Change for AI Auditing 2023-11-13T19:33:43.928Z
[Linkpost] Biden-Harris Executive Order on AI 2023-10-30T15:20:22.582Z
Preference Aggregation as Bayesian Inference 2023-07-27T17:59:36.270Z
Thoughts on Loss Landscapes and why Deep Learning works 2023-07-25T16:41:39.562Z
BCIs and the ecosystem of modular minds 2023-07-21T15:58:27.081Z
Hedonic Loops and Taming RL 2023-07-19T15:12:42.327Z
[Linkpost] Introducing Superalignment 2023-07-05T18:23:18.419Z
The case for removing alignment and ML research from the training dataset 2023-05-30T20:54:36.596Z
Announcing Apollo Research 2023-05-30T16:17:19.767Z
A small update to the Sparse Coding interim research report 2023-04-30T19:54:38.342Z
Deep learning models might be secretly (almost) linear 2023-04-24T18:43:28.188Z
Scaffolded LLMs as natural language computers 2023-04-12T10:47:42.904Z
The surprising parameter efficiency of vision models 2023-04-08T19:44:36.186Z
The Computational Anatomy of Human Values 2023-04-06T10:33:24.989Z
Orthogonality is expensive 2023-04-03T10:20:43.964Z
RLHF does not appear to differentially cause mode-collapse 2023-03-20T15:39:45.353Z
Against ubiquitous alignment taxes 2023-03-06T19:50:44.886Z
Addendum: basic facts about language models during training 2023-03-06T19:24:45.246Z
Basic facts about language models during training 2023-02-21T11:46:12.256Z
Validator models: A simple approach to detecting goodharting 2023-02-20T21:32:25.957Z
Empathy as a natural consequence of learnt reward models 2023-02-04T15:35:23.894Z
AGI will have learnt utility functions 2023-01-25T19:42:11.139Z
Gradient hacking is extremely difficult 2023-01-24T15:45:46.518Z
Scaling laws vs individual differences 2023-01-10T13:22:44.394Z
Basic Facts about Language Model Internals 2023-01-04T13:01:35.223Z
An ML interpretation of Shard Theory 2023-01-03T20:30:28.830Z
The ultimate limits of alignment will determine the shape of the long term future 2023-01-02T12:47:01.419Z
Evidence on recursive self-improvement from current ML 2022-12-30T20:53:22.462Z
Human sexuality as an interesting case study of alignment 2022-12-30T13:37:20.176Z
[Interim research report] Taking features out of superposition with sparse autoencoders 2022-12-13T15:41:48.685Z
Deconfusing Direct vs Amortised Optimization 2022-12-02T11:30:46.754Z
The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable 2022-11-28T12:54:52.399Z
Current themes in mechanistic interpretability research 2022-11-16T14:14:02.030Z
Interpreting Neural Networks through the Polytope Lens 2022-09-23T17:58:30.639Z

Comments

Comment by beren on When and why did 'training' become 'pretraining'? · 2024-03-12T11:29:40.188Z · LW · GW

Thanks for these points! I think I understand the history of what has happened here better now -- and the reasons for my misapprehension. Essentially, what I think happened is

a.) LLM/NLP research has (always?) used 'pretraining', going back at least to the 2017 era, for the general training of a model not yet specialised for a specific NLP task (such as NER, syntax parsing, etc.)

b.) the rest of ML mostly used 'training' because, by and large, they didn't do massive unsupervised training on unrelated tasks -- i.e. CV just had ImageNet or whatever

c.) In the 2020-2022 period, NLP with transformers went from a fairly niche subfield of ML to memetically dominant due to the massive success of transformer GPT models

d.) This meant both that the 'pretraining' terminology spread much more widely as other subfields took up similar methods, and that I got much more involved in reading NLP/LLM research than I had in the past, when I personally focused more on CV and RL -- leading to the term's sudden appearance in my personal experience (an impression which turned out to be wrong).

Comment by beren on AISC team report: Soft-optimization, Bayes and Goodhart · 2024-03-08T15:00:53.756Z · LW · GW

I like this post very much and in general I think research like this is on the right lines towards solving potential problems with Goodhart's law -- in general, Bayesian reasoning and getting some representation of the agent's uncertainty (including uncertainty over our values!) seems very important and naturally ameliorates a lot of potential problems. The correctness and realizability of the prior are very general problems with Bayesianism, but they often do not thwart its usefulness in practice, although they allow people to come up with various convoluted counterexamples of failure. The key is to have sufficiently conservative priors such that you can (ideally) prove bounds on the maximum degree of Goodharting that can occur under realistic circumstances, and then translate these into algorithms which are computationally efficient enough to be usable in practice. People have already done a fair bit of work on this in RL in terms of 'cautious' RL, which tries to take into account uncertainty in the world model to avoid accidentally falling into traps in the environment.
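
As a concrete toy illustration of the 'conservative rather than pure maximising' idea, here is a minimal sketch (the linear reward model, the numbers, and the 5% quantile are all hypothetical choices of mine, not anything from the post): instead of picking the action with the best posterior-mean reward, the agent picks the action that still looks good under pessimistic samples from its posterior over our values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 5 candidate actions described by 3 features each.
action_features = rng.normal(size=(5, 3))

# Posterior over the true reward weights (Gaussian samples standing in for
# whatever posterior the agent has inferred over our values).
posterior_weight_samples = rng.normal(loc=[1.0, 0.5, -0.2], scale=0.5, size=(1000, 3))

# Reward of each action under each posterior sample.
sampled_rewards = posterior_weight_samples @ action_features.T  # shape (1000, 5)

# Pure maximiser: optimise the posterior-mean reward (prone to Goodharting
# on actions whose value is highly uncertain).
greedy_action = sampled_rewards.mean(axis=0).argmax()

# Conservative policy: optimise a low quantile of the sampled rewards,
# i.e. prefer actions that still look good under pessimistic draws of our values.
conservative_action = np.quantile(sampled_rewards, 0.05, axis=0).argmax()

print(greedy_action, conservative_action)
```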

Comment by beren on Many arguments for AI x-risk are wrong · 2024-03-05T23:05:04.412Z · LW · GW

While I agree with a lot of the points in this post, I want to quibble with the 'RL does not maximise reward' point. I agree that model-free RL algorithms like DPO do not directly maximise reward but instead 'maximise reward' in the same way self-supervised models 'minimise cross-entropy' -- that is to say, the model is not explicitly reasoning about minimising cross-entropy but learns distilled heuristics that end up resulting in policies/predictions with a good reward/cross-entropy. However, it is also possible to produce architectures that do directly optimise for reward (or cross-entropy). AIXI is incomputable but it definitely does maximise reward. MCTS algorithms also directly maximise reward. AlphaGo-style agents contain direct reward-maximising components initialized and guided by amortised heuristics (and the heuristics are distilled from the outputs of the maximising MCTS process in a self-improving loop). I wrote about the distinction between these two kinds of approaches -- direct vs amortised optimisation -- here. I think it is important to recognise this because I think this is the way that AI systems will ultimately evolve, and it is also where most of the danger lies, vs simply scaling up pure generative models.
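
A toy sketch of the distinction (everything here is a made-up illustration, not a model of any particular system): an amortised policy is a cheap learned mapping from state to action, while a direct optimiser explicitly searches over action sequences under a model; AlphaGo-style systems use the former to guide and prune the latter.

```python
# Toy environment: integer state, we want to drive it to 0.
def step(state, action):            # known model of the dynamics
    return state + action           # actions are -1, 0, +1

def reward(state):
    return -abs(state)

# Amortised optimisation: a cheap heuristic policy (stand-in for a trained net).
def amortised_policy(state):
    return -1 if state > 0 else (1 if state < 0 else 0)

# Direct optimisation: exhaustive search over action sequences under the model.
def direct_plan(state, horizon=4):
    best_first, best_ret = 0, float("-inf")
    def rollout(s, depth, first):
        nonlocal best_first, best_ret
        if depth == horizon:
            if reward(s) > best_ret:
                best_first, best_ret = first, reward(s)
            return
        for a in (-1, 0, 1):
            rollout(step(s, a), depth + 1, a if first is None else first)
    rollout(state, 0, None)
    return best_first

print(amortised_policy(3), direct_plan(3))   # both say -1, by different routes
```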

Comment by beren on Counting arguments provide no evidence for AI doom · 2024-02-28T15:50:34.185Z · LW · GW

This monograph by Bertsekas on the interrelationship between offline RL and online MCTS/search might be interesting -- http://www.athenasc.com/Frontmatter_LESSONS.pdf -- since it argues that we can conceptualise the contribution of MCTS as essentially that of a single Newton step from the offline starting point towards the solution of the Bellman equation. If this is actually the case (I haven't worked through all the details yet), then it could provide some kind of bound on the improvement / divergence you can get once you add online planning to a model-free policy.

Comment by beren on A case for AI alignment being difficult · 2024-01-02T11:55:25.937Z · LW · GW

Thanks for writing this! Here are some of my rough thoughts and comments.

One of my big disagreements with this threat model is that it assumes it is hard to get an AGI to understand / successfully model 'human values'. I think this is obviously false. LLMs already have a very good understanding of 'human values' as they are expressed linguistically, and existing alignment techniques like RLHF/RLAIF seem to do a reasonably good job of making the models' output align with these values (specifically generic corporate wokeness for OpenAI/Anthropic), which does appear to generalise reasonably well to examples which are highly unlikely to have been seen in training (although it errs on the side of overzealousness of late in my experience). This isn't that surprising, because such values do not have to be specified from scratch by the fine-tuning but should already be extremely well represented as concepts in the base model's latent space, and merely have to be given primacy. Things would be different, of course, if we wanted to align the LLMs to some truly arbitrary blue-and-orange morality not represented in the human text corpus, but naturally we don't.

Of course, such values cannot easily be represented as some mathematical utility function, but I think this is an extremely hard problem in general, verging on impossible -- since this is not the natural type of human values in the first place, which are mostly linguistic constructs existing in the latent space and not in reality. This is not just a problem with human values but with almost any kind of abstract goal you might want to give the AGI -- including things like 'maximise paperclips'. This is why AGI will almost certainly not be a direct utility maximiser but will instead use a learnt utility function over latents from its own generative model; in that case it can represent human values, and indeed any goal expressible in natural language, which of course it will understand.

On a related note, this is also why I am not at all convinced by the supposed issues over indexicality. Having the requisite theory of mind to understand that different agents have different indexical needs should be table stakes for any serious AGI, and indeed hardly any humans have issues with this, except for people trying to formalise it into math.

There is still a danger of over-optimisation, which is essentially a kind of overfitting and can be dealt with in a number of ways which are pretty standard now. In general terms, you would want the AI to represent its uncertainty over outcomes and over its utility approximator, and use this to derive a conservative rather than purely maximising policy which can be adjusted over time.

I broadly agree with you about agency and consequentialism being broadly useful, and that ultimately we won't just be creating short-term myopic tool agents but fully long-term consequentialists. I think the key thing here is just to understand that long-term consequentialism has fundamental computational costs over short-term consequentialism and much more challenging credit-assignment dynamics, so it will only be used where it actually needs to be. Most systems will not be long-term consequentialist because it is unnecessary for them.

I also think that breeding animals to do tasks, or looking at humans subverting social institutions, is not necessarily a good analogy to AI agents performing deception and treacherous turns. Evolution endowed humans and other animals with intrinsic selfish drives for survival and reproduction, and arguably social deception, which do not have to exist in AGIs. Moreover, we have substantially more control over AI cognition than evolution does over our cognition, and gradient descent is fundamentally a more powerful optimiser, which makes it challenging to produce deceptive agents. There is basically no evidence for deception occurring with current myopic AI systems, and if it starts to occur with long-term consequentialist agents it will be due either to a breakdown of credit assignment over long horizons (potentially due to being forced to use worse optimisers such as REINFORCE variants rather than pure BPTT) or to the functional prior of such networks turning malign. Of course, if we directly design AI agents via survival in some evolutionary sim, or explicitly program in Omohundro drives, then we will run directly into these problems again.

Comment by beren on Intelligence Enhancement (Monthly Thread) 13 Oct 2023 · 2023-10-15T21:19:58.178Z · LW · GW

Thanks for the response! Very helpful and enlightening.

The reason for this is actually pretty simple: genes with linear effects have an easier time spreading throughout a population.

This is interesting -- I have never come across this. Can you expand the intuition behind this model a little more? Is the intuition something like: in the fitness landscape, genes with linear effects are like gentle slopes that are easy to traverse, vs extremely wiggly 'directions'?

Also, the way I am thinking about linearity is maybe slightly different from the normal ANOVA/factor-analysis way. I.e. let's suppose that we have some protein which is good, so that more of it is better, and we have 100 different genes which can either upregulate or downregulate it. However, at some large amount, say 80x the usual level, the benefit saturates. A normal person is very unlikely to have 80/100 positive variants, but if we go in and edit all 100 to be positive, we only get the saturated benefit, far below what an additive model would have predicted, since it maxes out at 80. I guess to detect this nonlinearity in a normal population you basically need an 80+th-order interaction of all of them interacting in just the right way, which is exceedingly unlikely. Is this your point about sample size?
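
To make my toy example concrete, here is a minimal simulation (the numbers are the hypothetical ones from the paragraph above, and the 0.5 allele frequency is an arbitrary assumption of mine): an additive fit on a normal population never 'sees' the ceiling at 80 and so overpredicts the effect of editing all 100 variants.

```python
import numpy as np

rng = np.random.default_rng(0)

n_genes, n_people = 100, 100_000
# Each person independently carries each positive variant with probability 0.5.
genotypes = rng.integers(0, 2, size=(n_people, n_genes))
n_positive = genotypes.sum(axis=1)

# True (nonlinear) trait: proportional to the number of positive variants,
# but the benefit saturates at 80 copies.
trait = np.minimum(n_positive, 80)

# An additive model fit on the normal population (mean ~50, sd ~5 positive
# variants) essentially never observes the saturation region.
slope, intercept = np.polyfit(n_positive, trait, 1)

print("additive prediction for 100 edits:", slope * 100 + intercept)  # ~100
print("actual value with saturation:     ", min(100, 80))             # 80
```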

I'll talk about this in more detail within the post, but yes we have examples of monogenic diseases and cancers being cured via gene therapy.

This is very cool. Are the cancer cures also monogenic? Has anybody done any large scale polygenic editing in mice or any other animal before humans? This seems the obvious place to explicitly test the causality and linearity directly. Are we bottlenecked on GWAS equivalents for other animals?

Comment by beren on Intelligence Enhancement (Monthly Thread) 13 Oct 2023 · 2023-10-15T12:54:47.531Z · LW · GW

This would be very exciting if true! Do we have a good (or any) sense of the mechanisms by which these genetic variants work -- how many are actually causal, how many are primarily active in development vs in adults, how much interference there is between different variants etc? 

I am also not an expert at all here -- do we have any other examples of traits being enhanced or diseases cured by genetic editing in adults (even in other animals) like this? It also seems like this would be easy to test in the lab -- i.e. in mice, which we can presumably sequence and edit more straightforwardly, and for which we can measure some analogues of IQ with reasonable accuracy and reliability. Looking forward to the longer post.

Comment by beren on Deep learning models might be secretly (almost) linear · 2023-10-04T20:13:02.509Z · LW · GW

This is an interesting idea. I feel this also has to be related to increasing linearity with scale and generalization ability -- i.e. if you have a memorised solution, then nonlinear representations are fine because you can easily tune the 'boundaries' of the nonlinear representation to precisely delineate the datapoints (in fact, the nonlinearity of the representation can be used to strongly reduce interference when memorising, as is done in the recent research on modern Hopfield networks). On the other hand, if you require a reasonably large-scale smoothness of the solution space, as you would expect from a generalising solution in a flat basin, then this cannot work and you need to accept interference between nearly orthogonal features as the cost of preserving generalisation of the behaviour across many different inputs which activate the same vector.

Comment by beren on Thoughts on Loss Landscapes and why Deep Learning works · 2023-07-26T23:27:44.732Z · LW · GW

Looks like I really need to study some SLT! I will say though that I haven't seen many cases in transformer language models where the eigenvalues of the Hessian are 90% zeros -- that seems extremely high.

Comment by beren on Hedonic Loops and Taming RL · 2023-07-21T15:19:16.235Z · LW · GW

I also think this is mostly a semantic issue. The same process can be described in terms of implicit prediction errors: e.g. there is some baseline level of leptin in the bloodstream that the NPY/AgRP neurons in the arcuate nucleus 'expect', and if there is less leptin this generates an implicit 'prediction error' in those neurons which causes them to increase firing; this stimulates various food-consuming reflexes and desires, which ultimately leads to more food and hence 'corrects' the prediction error. It isn't necessary that there are explicit 'prediction error neurons' anywhere encoding prediction errors, although for larger systems it is often helpful to modularize things this way.

 

Ultimately, though, I think it is more a conceptual question of how to think about control systems -- whether it is best to think in terms of implicit prediction errors or just in terms of the feedback-loop dynamics -- but it amounts to the same thing.
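
A minimal sketch of that equivalence (the setpoint, gain, and numbers are all made up for illustration): the same corrective signal can be written either as deviation-from-setpoint feedback or as an implicit prediction error, and with the expectation equal to the setpoint the two are numerically identical.

```python
# Two descriptions of the same homeostatic loop (hypothetical numbers).
setpoint_leptin = 10.0

def control_signal_feedback(measured_leptin, gain=1.0):
    # "Feedback loop" framing: act on the deviation from a setpoint.
    return gain * (setpoint_leptin - measured_leptin)

def control_signal_prediction_error(measured_leptin, expected_leptin=10.0, gain=1.0):
    # "Implicit prediction error" framing: act on expectation minus observation.
    return gain * (expected_leptin - measured_leptin)

# With the expectation equal to the setpoint, the two are the same computation,
# which is the sense in which it "amounts to the same thing".
assert control_signal_feedback(7.0) == control_signal_prediction_error(7.0)
```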

Comment by beren on Hedonic Loops and Taming RL · 2023-07-21T15:14:57.205Z · LW · GW

This is where I disagree! I don't think the Morrison and Berridge experiment demonstrates the model-based side. It is consistent with model-based RL, but it is also consistent with model-free algorithms that can flexibly adapt to changing reward functions, such as linear RL. Personally, I think the latter is more likely, since it is such a low-level response which can be modulated entirely by subcortical systems, and so seems unlikely to require model-based planning to work.
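
As an illustration of how a model-free system can adapt zero-shot to a changed reward, here is a minimal successor-feature-style sketch (the feature names, weights, and actions are all hypothetical, and this is just one way to cash out 'linear RL', not a claim about the actual circuit): the value of an action is a learned feature-occupancy vector dotted with a weight vector set by the current homeostatic state, so changing the weights re-ranks actions without any new learning.

```python
import numpy as np

# Toy successor-feature sketch: Q(s, a) = psi(s, a) . w, where psi encodes
# expected future feature occupancies (learned model-free) and w encodes the
# current homeostatic value of each feature.

# Hypothetical features: [sweet, salty]
psi = {
    "approach_sugar": np.array([1.0, 0.0]),
    "approach_salt":  np.array([0.0, 1.0]),
    "do_nothing":     np.array([0.0, 0.0]),
}

def best_action(w):
    return max(psi, key=lambda a: psi[a] @ w)

w_normal = np.array([1.0, -0.5])        # salt is aversive when sated
w_salt_deprived = np.array([1.0, 2.0])  # salt becomes valuable when deprived

print(best_action(w_normal))            # approach_sugar
print(best_action(w_salt_deprived))     # approach_salt -- zero-shot policy change
```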

Comment by beren on Hedonic Loops and Taming RL · 2023-07-20T14:13:52.633Z · LW · GW

Thanks for linking to your papers -- definitely interesting that you have been thinking along similar lines. The key reason I think studying this is important is that these hedonic loops demonstrate that a.) mammals, including humans, are actually exceptionally well aligned to basic homeostatic needs and basic hedonic loops in practice. It is extremely hard and rare for people to choose not to follow homeostatic drives. I think humans are mostly 'misaligned' about higher-level things like morality, empathy, etc. because we don't actually have direct drives for them hardcoded in the hypothalamus the way we do for primary rewards. Higher-level behaviours are either socio-culturally learned through unsupervised learning or derived from RL extrapolations from primary rewards. It is no surprise that alignment to these ideals is weaker. And b.) that relatively simple control loops are very effective at controlling vastly more complex unsupervised cognitive systems.

I also agree this is similar to Steven Byrnes' agenda and is maybe just my own way of arriving at it.

Comment by beren on Hedonic Loops and Taming RL · 2023-07-20T12:42:44.834Z · LW · GW

This is definitely possible, and is essentially augmenting the state variables with additional homeostatic variables and then learning policies on the joint state space. However, there are some clever experiments, such as the linked Morrison and Berridge one, demonstrating that this is not all that is going on -- specifically, many animals appear to be able to perform zero-shot changes in policy when rewards change, even if they have not experienced this specific homeostatic variable before -- i.e. mice suddenly chase after salt water, which they previously disliked, when put in a state of salt deprivation which they had never before experienced.

Comment by beren on [Linkpost] Introducing Superalignment · 2023-07-06T12:15:49.967Z · LW · GW

The 'four years' they explicitly mention does seem very short to me for ASI unless they know something we don't...

Comment by beren on Ways I Expect AI Regulation To Increase Extinction Risk · 2023-07-06T12:00:46.855Z · LW · GW

AI x-risk is not far off at all, it's something like 4 years away IMO

Can I ask where this four years number is coming from? It was also stated prominently in the new 'superalignment' announcement (https://openai.com/blog/introducing-superalignment). Is this some agreed upon median timelines at OAI? Is there an explicit plan to build AGI in four years? Is there strong evidence behind this view -- i.e. that you think you know how to build AGI explicitly and it will just take four years more compute/scaling?

Comment by beren on Deconfusing Direct vs Amortised Optimization · 2023-07-04T18:27:58.971Z · LW · GW

Hi there! Thanks for this comment.  Here are my thoughts:

  1. Where do highly capable proposals/amortised actions come from?
  • (handwave) lots of 'experience' and 'good generalisation'?

Pretty much this. We know empirically that deep learning generalizes pretty well from a lot of data as long as it is reasonably representative. I think that fundamentally this is due to the nature of our reality: there are generalizable patterns, which is ultimately due to the sparse underlying causal graph. It is very possible that there are realities where this isn't true, and in those cases this kind of 'intelligence' would not be possible.

r...? This seems to me to be where active learning and deliberate/creative exploration come in

  • It's a Bayes-adaptivity problem, i.e. planning for value-of-information
  • This is basically what 'science' and 'experimentalism' are in my ontology
    • 'Play' and 'practice' are the amortised equivalent (where explorative heuristics are baked in)

Again, I completely agree here. In practice, in large environments it is necessary to explore if you can't reach all useful states from a random policy. In these cases, it is very useful to a.) have an explicit world model, so you can learn from sensory information, which is usually much higher bandwidth than reward, generalizes further, and does so in an uncorrelated way, and b.) do some kind of active exploration. Exploring according to maximizing info-gain is probably close to optimal, although whether this is actually theoretically optimal is, I think, still an open question. The main issue is that info-gain is hard to compute/approximate tractably, since it requires keeping close track of your uncertainty, and DL models stay computationally tractable by explicitly throwing away all the uncertainty and only really maintaining point predictions.
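
As an example of one of the simple settings where info-gain is exactly tractable, here is a minimal Beta-Bernoulli bandit sketch (the arm counts are made-up numbers): the expected information gain from one pull is the current posterior entropy minus the expected posterior entropy after observing the outcome.

```python
from scipy.stats import beta

def expected_info_gain(a, b):
    # Expected reduction in posterior entropy from one pull of a Bernoulli arm
    # whose success probability has a Beta(a, b) posterior.
    p = a / (a + b)                       # posterior predictive P(success)
    h_now = beta(a, b).entropy()
    h_after = p * beta(a + 1, b).entropy() + (1 - p) * beta(a, b + 1).entropy()
    return h_now - h_after

# A barely-explored arm is worth far more information than a well-known one.
print(expected_info_gain(1, 1))      # uniform prior: relatively large gain
print(expected_info_gain(50, 50))    # well-characterised arm: near-zero gain
```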

animals are evidence that some amortised play heuristics are effective! Even humans only rarely 'actually do deliberate experimentalism'

  • but when we do, it's maybe the source of our massive technological dominance?

Like, I don't know to what extent there are 'play heuristics' at a behavioural level vs some kind of intrinsic drive for novelty / information gain, but yes, having these drives 'added to your reward function' is generally useful in RL settings, and we know this happens in the brain as well -- i.e. there are dopamine neurons responsive to proxies of information gain (and exactly equal to information gain in simple bandit-like settings where this is tractable).

  1. When is deliberation/direct planning tractable?
  • In any interestingly-large problem, you will never exhaustively evaluate
    • e.g. maybe no physically realisable computer in our world can ever evaluate all Go strategies, much less evaluating strategies for 'operate in the world itself'!
  • What properties of options/proposals lend themselves?
    • (handwave) 'Interestingly consequential' - the differences should actually matter enough to bother computing!
    • Temporally flexible
      • The 'temporal resolution' of the strategy-value landscape may vary by orders of magnitude
      • so the temporal resolution of the proposals (or proposal-atoms) should too, on pain of intractability/value-loss/both

So there are a number of circumstances where direct planning is valuable and useful. I agree with your conditions, and especially with the right action step-size, as well as discrete actions and known, not-too-stochastic dynamics. Another useful condition is when it's easy to evaluate branches of the tree without having gone all the way down to the leaves -- i.e. in games like Chess/Go it's often very easy to know that some move tree is intrinsically doomed without having explored all of it. This is a kind of convexity of the state space (not literally mathematically, but intuitively) which makes optimization much easier. Similarly, when good proposals can be made due to linearity / generalizability in the action space, it is easy to prune actions and trees.

  1. Where does strong control/optimisation come from?

Strong control comes from the same place strong learning in general comes from -- lots of compute and data -- and, for planning especially, compute. The optimal trade-off between amortized and direct optimization given a fixed compute budget is super interesting, and I don't think we have any good models of this yet.

Another thing that I think is fairly underestimated among people on LW, compared to people doing deep RL, is that open-loop planning is actually very hard and bad at dealing with long time horizons. This is basically due to stochasticity and chaos theory -- future prediction is hard. Small mistakes in either modelling or action propagate very rapidly and create massive uncertainties about the future, so your optimal posterior rapidly dwindles to a maximum-entropy distribution. The key thing in long-term planning is really adaptability and closed-loop control -- i.e. seeing feedback and adjusting your actions in response to it. This is how almost all practical control systems actually work, and in practice in deep RL with planning everybody actually uses MPC and so replans every step.
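
A minimal random-shooting MPC sketch of that replan-every-step loop (the dynamics, horizon, and noise level are all made up): plans are scored open-loop under the model, but only the first action of the best plan is ever executed before replanning against the real, noisy state.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy task: drive the state to zero. The real world is stochastic...
def true_step(x, u):
    return 0.9 * x + u + rng.normal(scale=0.3)

# ...but our (imperfect) model is not.
def model_step(x, u):
    return 0.9 * x + u

def mpc_action(x, horizon=5, n_candidates=256):
    # Random-shooting MPC: sample open-loop plans, score them under the model,
    # and return only the first action of the best plan.
    plans = rng.uniform(-1, 1, size=(n_candidates, horizon))
    costs = np.zeros(n_candidates)
    for i, plan in enumerate(plans):
        s = x
        for u in plan:
            s = model_step(s, u)
            costs[i] += s ** 2
    return plans[costs.argmin(), 0]

x = 5.0
for t in range(20):
    x = true_step(x, mpc_action(x))   # replan every step = closed-loop control
print(x)
```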

Comment by beren on What in your opinion is the biggest open problem in AI alignment? · 2023-07-04T09:32:29.805Z · LW · GW

The problem is not so much which one of 1, 2, 3 to pick but whether 'we' get a chance to pick it at all. If there is space, free energy, and diversity, there will be evolution going on among populations, and evolution will consistently push things towards more reproduction up until it hits a Malthusian limit, at which point it will push towards greater competition and economic/reproductive efficiency. The only way to avoid this is to remove the preconditions for evolution -- any of variation, selection, or heredity -- but these seem quite natural in a world of large AI populations, so in practice this will require some level of centralized control.

Comment by beren on Using (Uninterpretable) LLMs to Generate Interpretable AI Code · 2023-07-02T23:51:26.754Z · LW · GW

This is obviously true; any AI complete problem can be trivially reduced to the problem of writing an AI program that solves the problem. That isn't really a problem for the proposal here. The point isn't that we could avoid making AGI by doing this, the point is that we can do this in order to get AI systems that we can trust without having to solve interpretability.

Maybe I'm being silly but then I don't understand the safety properties of this approach. If we need an AGI based on uninterpretable DL to build this, then how do we first check if this AGI is safe?

Comment by beren on Horizontal and Vertical Integration · 2023-07-02T22:02:10.466Z · LW · GW

I moderately agree here, but I still think the primary factor is centralization of the value chain. The more of the value chain is centralized, the easier it is to control. My guess is that we can make this argument more formal by thinking of things in terms of a dependency graph -- if we imagine the economic process from sand + energy -> DL models, then the important measure is the centrality of the hubs in this graph. If we can control and/or cut these hubs, then the entire DL ecosystem falls apart. Conveniently/unfortunately, this is also where most of the economic profit is likely to accumulate by standard industrial-economics laws, and hence this is also where there will be the most resources resisting regulation.

Comment by beren on Using (Uninterpretable) LLMs to Generate Interpretable AI Code · 2023-07-02T21:58:50.182Z · LW · GW

As I see it, there are two fundamental problems here:

1.) Generating interpretable expert-system code for an AGI is probably already AGI-complete. It seems unlikely that a non-AGI DL model can output code for an AGI -- especially given that it is highly unlikely that there would be expert-system AGIs in its training set, or even things close to expert-system AGIs, if deep learning keeps far outpacing GOFAI techniques.

2.) Building an interpretable expert-system AGI is likely not just AGI-complete but a fundamentally much harder problem than building a DL AGI system. Intelligence is extremely detailed, messy, and highly heuristic. All our examples of intelligent behaviour come from large blobs of optimized compute -- both brains and DL systems -- and none from expert systems. The actual inner workings of intelligence might just be fundamentally uninterpretable in their complexity except at a high level -- i.e. 'this optimized blob is the output of approximate Bayesian inference over this extremely massive dataset'.

Comment by beren on resolving some neural network mysteries · 2023-06-19T16:11:01.384Z · LW · GW

Interesting post! Do you have papers for the claims on why mixed activation functions perform worse? This is something I have thought about a little bit but not looked into deeply, so I would appreciate links here. My naive thinking is that it mostly doesn't work due to the difficulty of conditioning and of keeping the loss landscape smooth and low-curvature with different activation functions in a layer. With a single activation function, it is relatively straightforward to design an initialization that doesn't blow up -- with mixed ones, your space of potential numerical difficulties seems to increase massively.

Comment by beren on The ants and the grasshopper · 2023-06-06T21:31:29.540Z · LW · GW

Exactly this. This is the relationship in RL between the discount factor and the probability of transitioning into an absorbing state (death).
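
For concreteness, the standard correspondence (a sketch assuming a constant per-step death probability and rewards independent of survival) is

$$\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} r_t \,\mathbf{1}[\text{alive at } t]\right] = \sum_{t=0}^{\infty} \left(\gamma \, (1 - p_{\text{death}})\right)^{t} \mathbb{E}[r_t],$$

so surviving each step with probability $1 - p_{\text{death}}$ acts exactly like an effective discount factor $\gamma_{\text{eff}} = \gamma\,(1 - p_{\text{death}})$.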

Comment by beren on Optimization happens inside the mind, not in the world · 2023-06-04T13:06:57.208Z · LW · GW

I think this is a really good post. You might be interested in these two posts which explore very similar arguments on the interactions between search in the world model and more general 'intuitive policies' as well as the fact that we are always optimizing for our world/reward model and not reality and how this affects how agents act.

Comment by beren on The case for removing alignment and ML research from the training dataset · 2023-06-01T15:37:18.190Z · LW · GW

Yes! This would be valuable. Generally, getting a sense of the 'self-awareness' of a model in terms of how much it knows about itself would be a valuable thing to start testing for.

Comment by beren on The case for removing alignment and ML research from the training dataset · 2023-06-01T15:36:45.049Z · LW · GW

I don't think models currently have this ability by default anyway. But we should definitely think very hard before letting them do this!

Comment by beren on The case for removing alignment and ML research from the training dataset · 2023-06-01T15:35:34.144Z · LW · GW

Yes, I think what I proposed here is the broadest and crudest thing that will work. It can of course be much more targeted to specific proposals or posts that we think are potentially most dangerous. Using existing language models to rank these is an interesting idea.

Comment by beren on AI Will Not Want to Self-Improve · 2023-05-16T22:43:33.131Z · LW · GW

I'm very glad you wrote this. I have had similar musings previously as well, but it is really nice to see this properly written up and analyzed in a more formal manner.

Comment by beren on How Many Bits Of Optimization Can One Bit Of Observation Unlock? · 2023-04-26T09:12:47.501Z · LW · GW

Interesting thoughts! By the way, are you familiar with Hugo Touchette's work on this? It looks very related and I think has a lot of cool insights about these sorts of questions.

Comment by beren on Deep learning models might be secretly (almost) linear · 2023-04-25T13:09:06.172Z · LW · GW

I think this is a good intuition. I think this comes down to the natural structure of the graph and the fact that information disappears at larger distances. This means that for dense graphs such as lattices, regions only implicitly interact through much lower-dimensional max-ent variables, which are then additive, while for other causal graph structures, such as the power-law small-world graphs that are probably sensible for many real-world datasets, you get a similar thing where each cluster can be modelled mostly independently apart from a few long-range interactions, which can be modelled as interacting with some general 'cluster sum'. Interestingly, this is what many approximate Bayesian inference algorithms for factor graphs look like -- such as the region graph algorithm (http://pachecoj.com/courses/csc665-1/papers/Yedidia_GBP_InfoTheory05.pdf).

I definitely agree it would be really nice to have the math of this all properly worked out, as I think this -- as well as the reason why we see power-law spectra of features so often in natural datasets (which must have a max-ent explanation) -- is a super common and deep feature of the world.

Comment by beren on Deep learning models might be secretly (almost) linear · 2023-04-25T12:54:01.625Z · LW · GW

Maybe this linearity story would work better for generative models, where adding latent vector representations of two different objects would lead the network to generate an image with both objects included (an image that would have an ambiguous class label to a second network). It would need to be tested whether this sort of thing happens by default (e.g., with Stable Diffusion) or whether I'm just making stuff up here.
 

Yes, this is exactly right. This is precisely the kind of linearity that I am talking about, not the input->output mapping, which is clearly nonlinear. The idea is that hidden inside the network is a linear latent space where we can perform linear operations and they (mostly) work. In the points of evidence in the post there is discussion of exactly this kind of latent-space editing for Stable Diffusion. A nice example is this paper. Interestingly, this also works for fine-tuning weight diffs for e.g. style transfer.

Comment by beren on The Computational Anatomy of Human Values · 2023-04-25T12:50:47.727Z · LW · GW

Thanks for the typos! Fixed now.

Comment by beren on The Computational Anatomy of Human Values · 2023-04-25T12:49:04.127Z · LW · GW

Doesn't this imply that people with exceptionally weak autobiographical memory (e.g., Eliezer) have less self-understanding/sense of self? Or maybe you think this memory is largely implicit, not explicit? Or maybe it's enough to have just a bit of it and it doesn't "impair" unless you go very low?
 

This is an interesting question, and I would argue that it probably does lead to less self-understanding and sense of self, ceteris paribus. I think that the specific sense of self is mostly an emergent consequence of having autobiographical memories -- i.e. at each moment a lot of what we do is heavily informed by consistency and priors from our previous actions and experiences. If you completely switched your memories with somebody else's, then I would argue that this is not 'you' anymore. The other place a sense of self comes from is social roles, where the external environment plays a big role in creating and maintaining a coherent 'you'. You interact with people who remember and know you. You have specific roles, such as jobs, relationships, etc., which bring you back to a default state. This is a natural result of having a predictive unsupervised world model -- you are constantly predicting what to expect in the world, and the world has its own memory about you which alters its behaviour towards you.

I don't know if there is a direct linear relationship between sense of self and strength of autobiographical memory -- it might be some kind of nonlinear or threshold thing -- but I suspect it affects it.

One thing that your model of unsupervised learning of the world model(s) doesn't mention is that humans apparently have strong innate inductive biases for inferring the presence of norms and behaving based on perception of those norms (e.g., by punishing transgressors), even when they're not socially incentivized to do so (see this SEP entry).[1] I guess you would explain it as some hardcoded midbrain/brainstem circuit that encourages increased attention to socially salient information, driving norm inference and development of value concepts, which then get associatively saturated with valence and plugged into the same or some other socially relevant circuits for driving behavior?
 

I definitely think there is some of this. According to RL and basic drives you are encouraged to pay more attention to some things than others. Your explanation of it is pretty much exactly what I would say except that I would stress that many of the 'norms' you are paying attention to are learnt and socially constructed in the neocortex.
 

  1. I'm not sure. It's not obvious to me that more powerful models won't be able to model human behavior using abstractions very unlike human values, and possibly quite incomprehensible to us.

This is maybe the case, but it seems unlikely. Human concepts and abstractions emerge from precisely the kind of unsupervised learning of human behaviour that DL systems perform. Our concepts are also directly in the training data -- we discuss them among ourselves -- and so the DL system would be strongly encouraged to learn these as well. It might learn additional concepts which are very subtle and hard for us to understand, but it will probably also learn a pretty good approximation of our concepts (about as good, I would argue, as exists between humans, who usually have slightly different concepts of the same thing, which sometimes impedes communication but doesn't make it impossible).

Can you elaborate on what it means for concepts encoded in the cortex to exist in a ~linear vector space? How would a world where that wasn't the case look like?
 

I discuss this slightly more here (https://www.lesswrong.com/posts/JK9nxcBhQfzEgjjqe/deep-learning-models-might-be-secretly-almost-linear). Essentially, just that there is a semantic mapping between 'concepts' and directions in some high level vector space which permits linear operations -- i.e. we can do natural 'scaling' and linear combinations of these directions with the results that you would intuitively expect. There is a fair amount of evidence for this in both DL systems (including super basic ones like Word2Vec which is where it was originally found) and the brain. 

In a world where this wasn't the case, a lot of current neuroscience models which depend on linear decoding would not work. There would not be neurons or groups of neurons that encode specific recognisable concept features. Neither would lots of methods in DL, such as the latent-space addition results of word2vec -- i.e. the king - man + woman = queen style addition (which also largely works with transformer models) -- and editing methods like ROME or https://arxiv.org/abs/2212.03827.

Comment by beren on Deep learning models might be secretly (almost) linear · 2023-04-25T12:26:44.537Z · LW · GW

Yes. The idea is that the latent space of the neural network's 'features' is 'almost linear', which is reflected in both the linear-ish properties of the weights and the activations. Not that the literal I/O mapping of the NN is linear, which is clearly false.

 

More concretely, as an oversimplified version of what I am saying, it might be possible to think of neural networks as a combined encoder and decoder to a linear vector space. I.e. we have nonlinear functions f and g, where f encodes the input x to a latent space z and g decodes z to the output y -- i.e. f(x) = z and g(z) = y. We then hypothesise that the latent space z is approximately linear, such that we can perform addition and weighted sums of zs, as well as scaling of individual directions in z, and these get decoded to the appropriate outputs which correspond to sums or scalings of 'natural' semantic features we would expect in the input or output.
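
A minimal sketch of the type signatures involved (the encoder/decoder here are arbitrary toy functions, not a trained model, so this only illustrates the structure of the hypothesis rather than demonstrating it):

```python
import numpy as np

def f(x):                       # nonlinear encoder: input x -> latent z
    return np.tanh(x)

def g(z):                       # nonlinear decoder: latent z -> output y
    return np.arctanh(np.clip(z, -0.999, 0.999))

x_king, x_man, x_woman = np.array([0.9]), np.array([0.2]), np.array([0.3])

# The hypothesis: semantic operations are (approximately) linear in z-space,
# word2vec-style, even though f and g themselves are nonlinear.
z_query = f(x_king) - f(x_man) + f(x_woman)
y_query = g(z_query)            # decode the combined latent back to output space
print(y_query)
```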

Comment by beren on Deep learning models might be secretly (almost) linear · 2023-04-25T12:25:27.625Z · LW · GW

Thanks for these links! This is exactly what I was looking for, as per Cunningham's law. On mechanistic mode connectivity, I still need to read the paper, but there is definitely a more complex story relating to the symmetries rendering things non-connected by default; once you account for the symmetries and project things into an isometric space where all the symmetries are collapsed, things become connected and linear again. Is this different from that?

 

I agree about the NTK. I think this explanation is bad in its specifics although I think the NTK does give useful explanations at a very coarse level of granularity. In general, to put a completely uncalibrated number on it, I feel like NNs are probably '90% linear' in their feature representations. Of course they have to have somewhat nonlinear representations as well. But otoh if we could get 90% of the way to features that would be massive progress and might be relatively easy. 

Comment by beren on "Aligned" foundation models don't imply aligned systems · 2023-04-14T12:17:29.302Z · LW · GW

Interesting point, which I broadly agree with. I do think, however, that this post has in some sense over-updated on recent developments around agentic LLMs and the non-dangers of foundation models. Even 3-6 months ago, it was unclear in the intellectual zeitgeist whether AutoGPT-style agentic LLM wrappers were the main threat, and people were primarily worried about foundation models being directly dangerous. It now seems clearer that, at least at current capability levels, foundation models are not directly goal-seeking, although adding agency is relatively straightforward. This may also change in the future, for instance if we were to do direct goal-driven RL training of the base models to create agents that way -- this would make direct alignment and interpretability of base models still necessary for safety.

Comment by beren on Scaffolded LLMs as natural language computers · 2023-04-14T12:09:13.810Z · LW · GW

Thanks for these points! 

Equivalence token to bits

Why did you decide to go with the equivalence of 1 token = 1 bit? Since a token can usually take on the order of 10k to 100k possible values, wouldn't 1 token equal 13-17 bits a more accurate equivalence?

My thinking here is that the scaffolded LLM is a computer which operates directly in natural-language semantic space, so it makes more sense to define the units of its context in terms of its fundamental units, such as tokens. Of course each token has a lot more information-theoretic content than a single bit -- but this is why a single NLOP is much more powerful than a single FLOP. I agree that tokens are probably not the correct measure, since they are too object-level, and there is likely some kind of 'semantic bit' idealisation which needs to be worked out.
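
For reference, the 13-17 bit figure is just the uniform upper bound from the vocabulary size (the 50,257-token vocab below is the common GPT-2/3 BPE size, used here purely as an example); the effective information per token is lower because tokens are far from uniformly distributed:

$$\text{bits per token} \;\le\; \log_2 |V|, \qquad \log_2 50257 \approx 15.6 .$$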

Processor register as a better analog for the context window

One caveat I'd like to discuss: in the post, you describe the context window of NLPU as the analog for the RAM of computers. I think a more accurate analog could be processor registers

Similarly to the context window, they are the memory bits directly connected to the computing unit. Whereas, it takes an instruction to load information from RAM before it can be used by the CPU. The RAM sits in the middle of the memory hierarchy, while registers are at its top.

I think I discuss this in the memory hierarchy section of the post. I agree that it is unclear what the best conceptualisation of the context window is; it is not necessarily directly comparable to RAM and may be more like processor registers. I think the main point is that current scaffolded LLM systems have a two-level memory hierarchy, while computers have evolved a fairly complex and highly optimised multi-step system. It may be that we also eventually develop such a system, or its equivalent, for LLMs. I actually do not know how the memory hierarchy for the earliest computers worked -- did they already have a register -> RAM -> disk distinction?

I think this might be an additional factor -- on top of the increased power and reliability of LLM -- that made us wait for so long after GPT3 before beginning to design complicated chaining of LLM calls. A single LM can store enough data in its context window to do many useful tasks: as you describe, there are many NLPU primitives to discover and exploit. On the other hand, a CPU with no RAM is basically an over-engineered calculator. It becomes truly useful once embedded in a von-Neumann architecture.

This is an interesting hypothesis. My alternate hypothesis is essentially a combination of a.) reliability and instruction-following with GPT-3 just being too bad for this to work appreciably, with some kind of barrier broken through with GPT-4, and b.) there simply not having been that much time. The GPT-3 API only became widely usable in mid-2021 IIRC, so that is about a year and a bit between that and the ChatGPT release, which is hardly any time to start iterating on this stuff.

Multimodal models


If the natural type signature of a CPU is bits -> bits, the natural type of the natural language processing unit (NLPU) is strings -> strings.

With the rise of multimodal (image + text) models, NLPU could be required to deal with other data types than "string" like image embeddings, as images cannot be efficiently converted into natural text. 
 

Indeed. It should be interesting to see whether or not we converge to some canonical datatype. The reason strings are so nice is that they compose easily and are incredibly flexible. The alternative is having directly chained architectures which communicate in embeddings, which can then be arbitrarily multimodal. Whether this works or not depends on how 'internalised' the cognition of the system is. The current agentic LLM trend is to externalise, which is, imho, good from an interpretability and steerability perspective. It may reverse.

Comment by beren on Agentized LLMs will change the alignment landscape · 2023-04-10T20:51:27.888Z · LW · GW

I perceive a lot of inferential distance on my end as well. My model here is informed by a number of background conclusions that I'm fairly confident in, but which haven't actually propagated into the set of commonly-assumed background assumptions.

I have found this conversation very interesting. I would be very interested if you could do a quick summary or writeup of the background conclusions you are referring to. I have my own thoughts about the feasibility of massive agency gains from AutoGPT-like wrappers but would be interested to hear yours.

Comment by beren on The Computational Anatomy of Human Values · 2023-04-10T08:17:17.376Z · LW · GW

I think you're saying that this is evidence that artificial systems which have similar architectures to human brains will also be able to solve this pointers problems, and if so, I agree.

I'm skeptical that anyone will succeed in building such a system before others build other kinds of systems, but I do think it is a good thing to try, if done carefully.

I think our biggest crux is this. My idea here is that by default we get systems that look like this -- DL systems look like this! -- and my near-term prediction is that DL systems will scale all the way to AGI. Almost any near-term AGI will look 'human-like' in a way -- some combination of model-free and model-based RL wrapped around an unsupervised world model. In the even nearer term, you might even scale to AGI with pure AutoGPT-style agents which are just doing iterative planning by conditioning the LLM! Both potential AGI designs look way closer to human-like than to a pure EY-style utility maximiser. Now, EY might still be right in the limit of superintelligence and RSI, but that is not what near-term systems seem likely to look like.

 

Though, a world where such systems are easy to build is not one I'd call "benign", since if it's easy to "just ask for alignment", it's probably also pretty easy to ask for not-alignment. Put another way, in the world where CoEms are the first kind of strong AGIs to get built, I think p(s-risk) goes up dramatically, though p(good outcomes) also goes up, perhaps even more dramatically, and p(molecular squiggles) goes down. I mostly think we're not in that world, though.

Yeah, I completely agree with this point, and I think this is going to be almost inevitable for any alignment strategy. As a consequence of the orthogonality thesis, it is likely that if you can align a system at all, then you can choose to align it to something bad -- like making people suffer -- if you want to. I think this is true across almost all worlds -- and so we definitely get increasing p(s-risk) along with increased p(survival). This is not a problem technical alignment can solve, but instead needs to involve some level of societal agreement / governance.

Comment by beren on The Computational Anatomy of Human Values · 2023-04-09T20:08:26.274Z · LW · GW

Most of these claims seem plausibly true of average humans today, but false about smarter (and more reflective) humans now and in the future.

On the first point, most of the mundane things that humans do involve what looks to me like pretty strong optimization; it's just that the things they optimize for are nice-looking, normal (but often complicated) human things. Examples of people explicitly applying strong optimization in various domains: startup founders, professional athletes, AI capabilities researchers, AI alignment researchers, dating.

My claim is not that humans do not optimise for outcomes -- they clearly do and this is a crucial part of our intelligence. Instead, my claim is about the computational architecture of this optimisation process -- that humans are primarily (but not entirely) amortised optimisers who have learnt approximations of direct optimisation through meta-RL in the PFC. This does not mean we cannot exert optimisation power, just that we are not cognitively built as utility maximisers. 

Definitely, different people have different levels of optimization power they can exert and can optimise more or less strongly, but on the scale from average human to true utility maximiser, even the most agentic humans are probably closer to the average than to the utility maximiser.

Now, there are good computational reasons for this. Actually performing direct optimisation like this is extremely computationally costly in complex and unbounded environments, so we use computational shortcuts, as by and large do our DL systems. This does not necessarily hold in the limit but seems to be the case at the moment.

My own view is that the best Future of humanity involves pretty drastic re-arrangements of most of the atoms in the lightcone. Maybe you think I'm personally not likely to succeed or work very hard at actually doing this, but if I only knew more, thought faster, had more time and energy... I think it becomes apparent pretty quickly where that ends up.

Yeah, so I am not claiming that this is necessarily reflectively stable or the optimal thing to do with infinite resources. The point is that humans (and also AI systems) do not have these infinite resources in practice, and hence take computational shortcuts which move them away from being pure utility maximisers (if that is actually the reflective endpoint for humanity, which I am unsure of). The goal of this post isn't to describe hypothetical strong AIs but to describe how humans form values, as well as how more human-like near-term AGIs are likely to function. Success at aligning these AGIs only gets us to the first step, and we will ultimately have to solve the aligning-superintelligence problem as well, but later.

I think the idea of Coherent Extrapolated Volition captures pretty crisply what it is that I (and many others), are optimizing for. My CEV is complicated, and there might be contradictions and unknown parts of it within me, but it sure doesn't feel situationally dependent or unknown on a meta-level.

This is the point of the post! CEV is not a base-level concept. You don't have primary reward sensors hooked up to the CEV. Nor is it a function of sensory observations. CEV is an entity that only exists in a highly abstract and linguistic/verbal latent space of your world model, and yet you claim to be aligned to it -- even though it might be contradictory and have unknown parts. You value it even though the base RL in your brain does not have direct 'hooks' into it. Somehow, your brain has solved a complex pointers problem to get you to intrinsically care about a concept that is very far from primary rewards.

Comment by beren on The surprising parameter efficiency of vision models · 2023-04-09T19:17:58.468Z · LW · GW

The OP is mistaken about vision transformers; they can also exploit parameter sharing, just in a different way.

Can you expand on this? How do vision transformers exploit parameter sharing in a way that is not available to standard LLMs?

Comment by beren on The Computational Anatomy of Human Values · 2023-04-08T11:11:55.727Z · LW · GW

Nice. My main issue is that just because humans have values a certain way, doesn't mean we want to build an AI that way, and so I'd draw pretty different implications for alignment. I'm pessimistic about anything that even resembles "make an AI that's like a human child," and more interested in "use a model of a human child to help an inhuman AI understand humans in the way we want."
 

I pretty much agree with this sentiment. I don't literally think we should build AGI like a human and expect it to be aligned. Humans themselves are far from aligned enough for my taste! However, trying to understand how human values and their value learning system works is extremely important and undoubtedly has lessons for how to align brain-like AGI systems which I think are what we will end up with in the near-term.

Comment by beren on The Computational Anatomy of Human Values · 2023-04-08T11:09:23.200Z · LW · GW

So, I agree and I think we are getting at the same thing (though not completely sure what you are pointing at). The way to have a model-y critic and actor is to have the actor and critic perform model-free RL over the latent space of your unsupervised world model. This is the key point of my post and why humans can have 'values' and desires for highly abstract linguistic concepts such as 'justice' as opposed to pure sensory states or primary rewards.

Comment by beren on The Computational Anatomy of Human Values · 2023-04-08T11:07:42.278Z · LW · GW

afaict, a big fraction of evolution's instructions for humans (which made sense in the ancestral environment) are encoded as what you pay attention to. Babies fixate on faces, not because they have a practical need to track faces at 1 week old, but because having a detailed model of other humans will be valuable later. Young children being curious about animals is a human universal. Etc.
 

This is true but I don't think is super important for this argument. Evolution definitely encodes inductive biases into learning about relevant things which ML architectures do not, but this is primarily to speed up learning and handle limited initial data. Most of the things evolution focuses on such as faces are natural abstractions anyway and would be learnt by pure unsupervised learning systems.

Patterns of behavior (some of which I'd include in my goals) encoded in my model can act in a way that's somewhere between unconscious and too obvious to question - you might end up doing things not because you have visceral feelings about the different options, but simply because your model is so much better at some of the options that the other options never even get considered. 

Yes, there are also a number of ways to short-circuit model evaluation entirely. The classic one is having a habit policy which is effectively your action prior. There are also cases where you just follow the default model-free policy and only in cases where you are even more uncertain do you actually deploy the full model-based evaluation capacities that you have.

Comment by beren on The Computational Anatomy of Human Values · 2023-04-07T22:30:21.078Z · LW · GW

I always say that the whole brain (including not only the basal ganglia but also the thalamocortical system, medulla, etc.) operates as a model-based RL system. You’re saying that the BG by itself operates as a model-free RL system. So I don’t think we’re disagreeing, because “the cortex is the model”?? (Well, we definitely have some disagreements about the BG, but we don’t have to get into them, I don’t think they’re very important for present purposes.)
 

I think there is some disagreement here, at least in the way I am using model-based / model-free RL (I'm not sure exactly how you are using it). Model-based RL, at least to me, is not just about explicitly having some kind of model, which I think we both agree exists in cortex, but about the action-selection system actually using that model to do some kind of explicit rollouts for planning. I do not think the basal ganglia does this, while I think the PFC has some meta-learned ability to do so. In this sense, the BG is 'model-free' while the cortex is 'model-based'.

I don’t really find “meta-RL” as a great way to think about dlPFC (or whatever the exact region-in-question is). See Rohin’s critique of that DeepMind paper here. I might instead say that “dlPFC can learn good ideas / habits that are defined at a higher level of abstraction” or something like that. For example, if I learn through experience (or hearsay) that it’s a good idea to use Anki flashcards, you can call that Meta-RL (“I am learning how to learn”). But you can equally well describe it as “I am learning to take good actions that will eventually lead to good consequences”. Likewise, I’d say “learning through experience that I should suck up to vain powerful people” is probably in the same category as “learning through experience that I should use Anki flashcards”—I suspect they’re learned in the same way by the same part of PFC—but “learning to suck up” really isn’t the kind of thing that one would call “meta-RL”, I think. There’s no “meta”—it’s just a good (abstract) type of action that I have learned by RL.
 

This is an interesting point. At some level of abstraction, I don't think there is a huge amount of difference between meta-RL and 'learning highly abstract actions/habits'. What I am mostly pointing towards is that the PFC learns high-level actions, including how to optimise and perform RL effectively over long horizons, and high-level cognitive habits like how to do planning, which is not an intrinsic ability but rather has to be learned. My understanding of what exactly the dlPFC does and how exactly it works is the place where I am most uncertain at present.

I agree in the sense of “it’s hard to look at the brainstem and figure out what a developed-world adult is trying to do at any given moment, or more generally in life”. I kinda disagree in the sense of “a person who is not hungry or cold will still be motivated by social status and so on”. I don’t think it’s right to put “eating when hungry” in the category of “primary reward” but say that “impressing one’s friends” is in a different, lesser category (if that’s what you’re saying). I think they’re both in the same category.

I agree that even when not immediately hungry or cold etc. we still get primary rewards from increasing social status etc. I don't completely agree with Robin Hanson that almost all human behaviour can be explained by this drive directly, though. I think we act on more complex linguistic values, or at least our behaviour in pursuit of these primary rewards of social status is mediated through them.

I don’t particularly buy the importance of words-in-particular here. For example, some words have two or more definitions, but we have no trouble at all valuing one of those definitions but not the other. And some people sometimes have difficulty articulating their values. From what I understand, internal monologue plays a bigger or smaller role in the mental life of different people. So anyway, I don’t see any particular reason to privilege words per se over non-linguistic concepts, at least if the goal is a descriptive theory of humans. If we’re talking about aligning LLMs, I’m open to the idea that linguistic concepts are sufficient to point at the right things.

So for words literally, I agree with this. By 'linguistic' I am more pointing at abstract high-level cortical representations. I think that for the most part these line up pretty well with, and are shaped by, our linguistic representations, and that the ability of language to compress and communicate complex latent states is one of the big reasons for humanity's success.

I think I would have made the weaker statement “There is no particular reason to expect this project to be possible at all.” I don’t see a positive case that the project will definitely fail. Maybe the philosophers will get very lucky, or whatever. I’m just nitpicking here, feel free to ignore.

This is fair. I personally have very low odds on success but it is not a logical impossibility. 

I think (?) you’re imagining a different AGI development model than me, one based on LLMs, in which more layers + RLHF scales to AGI. Whereas I’m assuming (or at least, “taking actions conditional on the assumption”) that LLM+RLHF will plateau at some point before x-risk, and then future AI researchers will pivot to architectures more obviously & deeply centered around RL, e.g. AIs for which TD learning is happening not only throughout training but also online during deployment (as it is in humans).

I am not sure our imagined AGI designs actually differ that much. Specifically, my near-term AGI model is essentially a multi-modal DL-trained world model, likely with an LLM as a centrepiece but potentially with vision and other modalities included, which is then trained with RL, either end to end or as some kind of wrapper, on a very large range of tasks. I think, given that we already have extremely powerful LLMs in existence, almost any future AGI design will use them at least as part of the general world model. In that case, there will be a very general and highly accessible linguistic latent space which will serve as the basis of policy and reward model inputs.
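
A minimal sketch of the kind of design I mean (everything here is hypothetical and illustrative, not a claim about any particular system): the pretrained backbone provides the shared linguistic latent space, and both the policy and the reward model read from it.

```python
import torch
import torch.nn as nn

class SketchAGIArchitecture(nn.Module):
    """Illustrative sketch only: a pretrained LLM (plus, optionally, other
    modality encoders) provides a shared linguistic latent space, and both the
    policy head and the reward model head take that latent space as input."""
    def __init__(self, llm_backbone: nn.Module, latent_dim: int, n_actions: int):
        super().__init__()
        self.world_model = llm_backbone            # e.g. a frozen or finetuned LLM
        self.policy_head = nn.Linear(latent_dim, n_actions)
        self.reward_head = nn.Linear(latent_dim, 1)

    def forward(self, tokens: torch.Tensor):
        z = self.world_model(tokens)               # shared linguistic latent state
        return self.policy_head(z), self.reward_head(z)
```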

Comment by beren on The Computational Anatomy of Human Values · 2023-04-07T21:58:15.887Z · LW · GW

1. Evolution needed to encode not only drives for food or shelter, but also drives for evolutionarily desirable states like reproduction; this likely leads to drives which are present and quite active, such as "seek social status" => as a consequence I don't think the evolutionarily older drives are out of play and the landscape is flat as you assume, and dominated by language-model-based values
 

Yes, I think drives like this are important on two levels. At the first level, we experience them as primary rewards -- i.e. social status gives direct dopamine hits. Secondly, they shape the memetic selection environment which creates and evolves linguistic memes of values. However, it's important to note that almost all of these drives, such as the drive for social status, are mediated through linguistic cortical abstractions. I.e. people will try to get social status by fulfilling whatever the values of their environment are, which can lead to very different behaviours being shown and rewarded in different environments, even though they are powered by the same basic drive.

3. The world model isn't a value-independent goal-orthogonal model; the stuff it learned is implicitly goal-oriented by being steered by the reward model

The world model is learnt mostly by unsupervised predictive learning and so is somewhat orthogonal to the specific goal. Of course, in practice, in a continual learning setting, what you do and pay attention to (which is affected by your goal) will affect the data input to the unsupervised learning process.


Also, in my impression, these 'verbal' values sometimes seem to basically hijack some deeper drive and channel it to meme-replicating efforts. ("So you do care? And have compassion? That's great - here is language-based analytical framework which maps your caring onto this set of symbols, and as a consequence, the best way how to care is to do effective altruism community building")
 

This is definitely true for humans, but it is unclear that it is necessarily bad. It is at least somewhat aligned, and it is how any kind of intrinsic motivation towards external goals has to work -- i.e. the external goal gets supported by, and channels, an intrinsic motivation.


5. I don't think that "when asked, many humans want to try to reduce the influence of their ‘instinctual’ and habitual behaviours and instead subordinate more of their behaviours to explicit planning" is much evidence of anything. My guess is actually many humans would enjoy more of the opposite - being more embodied, spontaneous, instinctive, and this is also true for some of the smartest people around. 

Yeah, in the post I say I am unclear as to whether this is stable under reflection. I see the alignment techniques that would follow from this as being only really applicable to near-term systems, not to systems undergoing strong RSI.


6. Broadly, I don't think the broad conclusion human values are primarily linguistic concepts encoded via webs of association and valence in the cortex learnt through unsupervised (primarily linguistic) learning is stable upon reflection. 

Similarly. 

Comment by beren on The Computational Anatomy of Human Values · 2023-04-07T21:40:27.496Z · LW · GW

Thanks for your comment. 

The most substantive disagreement in relation to alignment is on how much of our values is determined by the basic reward system, and how much is essentially arbitrary from there. I tend to side with you, but I'm not sure, and I do think that adult human values and behavior is still shaped in important ways by our innate reward signals. But the important question is whether we could do without those, or perhaps with a rough emulation of them, in an AGI that's loosely brainlike.

I am not sure how much we actually disagree here. I definitely agree that our adult behaviours and values are shaped significantly by our innate reward signals; it is a continuum and clearly not all or nothing. In this post I was mostly trying to emphasise the social and linguistic aspects, since I think they are more underappreciated. I also feel that most of the 'dangerous' aspects of humans come from our evolutionarily innate drives -- i.e. status- and power-seeking as well as survival etc. -- and it would be ideal if we don't encode these into our AI systems where it is not necessary.

I'm currently working on a post to be titled something like "we're likely to get loosely brainlike AGI", on the theory that most people in alignment don't care much how the brain does things, because they see ANNs and LLMs in particular to be wildly unlike the brain. I think there's a vector space of similarities, and I agree with you that AGI research appears to be converging on something with important similarities to brain function. And that this could provide a real advantage in alignment efforts.

I also pretty strongly agree with the take that current ML models are already very brain-like, are likely to become more brain-like closer to AGI, and that this is very helpful for our chances of alignment. Funnily enough, I also have a bunch of draft posts about this.

Comment by beren on Orthogonality is expensive · 2023-04-04T09:19:12.650Z · LW · GW

Fair point. I need to come up with a better name than 'orthogonality' for what I am thinking about here -- 'well-factoredness'?

Will move the footnote into the main text.

Comment by beren on Orthogonality is expensive · 2023-04-03T12:17:52.102Z · LW · GW

No worries! I'm happy you went to the effort of summarising it. I was pretty slow in crossposting anyhow. 

Comment by beren on Basic facts about language models during training · 2023-02-22T16:46:53.850Z · LW · GW

Yes, I guess I am overstating the possible speedup if I call it 'much much faster', but there ought to at least be a noticeable speedup by cutting out the early steps if it's basically just wasting time/data/compute to fix the distributions. It might also converge to a better and different optimum.

I think we agree here. Testing whether it converges to a better optimum would also be interesting. 

Perhaps more interestingly is the consequences for the training and arch: a lot of stuff with Transformers, like special burnin schedules or heavy (ab)use of normalization has long struck me as potentially just hacks around bad initializations that are trying to cause divergence

Yes. I feel that this might help especially with warmup, which could plausibly exist just because at the start there are very large and mostly non-informative gradients towards simply matching the output distribution, which would be removed if you start out at the right distribution.
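
As a concrete example of 'starting out at the right distribution', one could initialise the unembedding layer so that the model's initial predictions already match the corpus unigram distribution (a sketch under that assumption, not something tested here; the zeroed weights and placeholder counts are illustrative choices):

```python
import torch
import torch.nn as nn

def init_output_to_unigram(lm_head: nn.Linear, token_counts: torch.Tensor):
    """Sketch: initialise the final (unembedding) layer so the model's initial
    output matches the corpus unigram distribution rather than uniform. This
    should remove the large, uninformative early gradients that only push the
    outputs towards the marginal token distribution."""
    probs = token_counts.float() / token_counts.sum()
    with torch.no_grad():
        lm_head.weight.zero_()                        # no input-dependence at init
        lm_head.bias.copy_(torch.log(probs + 1e-12))  # initial logits = log unigram probs

# Hypothetical usage with placeholder corpus statistics:
vocab_size, d_model = 50_000, 512
lm_head = nn.Linear(d_model, vocab_size, bias=True)
token_counts = torch.randint(1, 1000, (vocab_size,))  # stand-in for real corpus counts
init_output_to_unigram(lm_head, token_counts)
```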

Comment by beren on Basic facts about language models during training · 2023-02-22T16:43:46.023Z · LW · GW

Good idea -- will run this experiment!