AGI-Automated Interpretability is Suicide 2023-05-10T14:20:14.419Z


Comment by __RicG__ (TRW) on AGI-Automated Interpretability is Suicide · 2023-08-29T21:47:55.268Z · LW · GW

Sorry for taking long to get back to you.

So I take this to be a minor, not a major, concern for alignment, relative to others.

Oh sure, this was more a "look at this cool thing intelligent machines could do that should shut up people from saying things like 'foom is impossible because training runs are expensive'".

  1. learning is at least as important as runtime speed. Refining networks to algorithms helps with one but destroys the other
  2. Writing poems, and most cognitive activity, will very likely not resolve to a more efficient algorithm like arithmetic does. Arithmetic is a special case; perception and planning in varied environments require broad semantic connections. Networks excel at those. Algorithms do not.

Please don't read this as me being hostile, but... why? How sure can we be of this? How sure are you that things-better-than-neural-networks are not out there?

Do we have any (non-trivial) equivalent algorithm that works best inside a NN rather than code?

Btw I am no neuroscientist, so I could be missing a lot of the intuitions you have.

At the end of the day you seem to think that it can be possible to fully interpret and reverse engineer neural networks, but you just don't believe that Good Old Fashioned AGI can exist and/or be better than training NN weights?

Comment by __RicG__ (TRW) on AGI-Automated Interpretability is Suicide · 2023-08-24T13:31:13.683Z · LW · GW

Thanks for coming back to me.

"OK good point, but it's hardly "suicide" to provide just one more route to self-improvement"

I admit the title is a little bit clickbaity, but given my list of assumptions (which do include that NNs can be made more efficient by interpreting them) it does elucidate a path to foom (which does look like suicide without alignment).

Unless there's an equally efficient way to do that in closed form algorithms, they have a massive disadvantage in any area where more learning is likely to be useful.

I'd like to point out that in this instance I was talking about the learned algorithm not the learning algorithm. Learning to learn is a can of worms I am not opening rn, even though it's probably the area that you are referring to, but, still, I don't really see a reason that there could not be more efficient undiscovered learning algorithms (and NN+GD was not learned, it was intelligently designed by us humans. Is NN+GD the best there is?).

Maybe I should clarify how I imagined the NN-AGI in this post: a single huge inscrutable NN like GPT. Maybe a different architecture, maybe a bunch of NNs in a trench coat, but still mostly NN. If that is true then there are a lot of things that can be upgraded by writing them in code rather than keeping them in NNs (arithmetic is the easy example, MC tree search is another...). Whatever MC tree search the giant inscrutable matrices have implemented, it is probably really bad compared to sturdy old fashioned code.


Even if NNs are the best way to learn algorithms, they may not be the best way to design them. I am talking about the difference between evolvable and designable.

NNs allow us to evolve algorithms, code allows us to intelligently design them: if there is no easy evolvable path to an algorithm, neural networks will fail.

The parallel to evolution is: evolution cannot make bones out of steel (even though they would be much better) because there is no shallow gradient to get steel (no way to have the recipe for steel-bones be in a way that if the recipe is slightly changed you still get something steel-like and useful). Evolution needs a smooth path from not-working to working while design doesn't.

With intelligence the computations don't need to be evolved (or learned): they can be designed, shaped with intent.

Are you really that confident that the steel equivalent of algorithms doesn't exist? Even though as humans we have barely explored that area (nothing hard-coded comes close to even GPT-2)?

Do we have any (non-trivial) equivalent algorithm that works best inside a NN rather than code? I guess those might be the hardest to design/interpret so we won't know for certain for a long time...


Arithmetic is a closed cognitive function; we know exactly how it works and don't need to learn more. 

If we knew exactly how to make poems of math theorems (like GPT-4 does) that would make it a "closed cognitive function" too, right? Can that learned algorithm be reverse engineered from GPT-4? My answer is yes => foom => we ded.

Comment by __RicG__ (TRW) on AGI-Automated Interpretability is Suicide · 2023-08-20T18:57:55.818Z · LW · GW

Uhm, by interpretability I mean things like this, where the algorithm that the NN implements is reverse engineered and written down as code or whatever, which would allow for easier recursive self improvement (by improving just the code and getting rid of the spaghetti NN).

Also by the looks of things (induction heads and circuits in general) there does seem to be a sort of modularity in how NNs learn, so it does seem likely that you can interpret piece by piece. If this wasn't true I don't think mechanistic interpretability as a field would even exist.

Comment by __RicG__ (TRW) on Jailbreaking GPT-4's code interpreter · 2023-07-13T20:41:37.953Z · LW · GW

BTW, if anyone is interested the virtual machine has these specs:

System: Linux 4.4.0 #1 SMP Sun Jan 10 15:06:54 PST 2016 x86_64 x86_64 x86_64 GNU/Linux

CPU: Intel Xeon CPU E5-2673 v4, 16 cores @ 2.30GHz

RAM: 54.93 GB

Comment by __RicG__ (TRW) on Why I am not an AI extinction cautionista · 2023-06-20T15:39:34.599Z · LW · GW

I did listen to that post, and while I don't remember all the points, I do remember that it didn't convince me that alignment is easy and, like Christiano's post "Where I agree and disagree with Eliezer", it just seems to be like "p(doom) of 95%+ is too much, it's probably something like 10-50%" which is still incredibly unacceptably high to continue "business as usual". I have faith that something will be done: regulation and breakthroughs will happen, but it seems likely that it won't be enough.

It comes down to safety mindset. There are very few and sketchy reasons to expect that by default an ASI will care about humans enough, so it is not safe to build one until shown otherwise (preferably without actually creating one). And if I had to point out a single cause for my own high p(doom), it is the fact that we humans iterate all of our engineering to iron out all of the kinks, while with a technology that is itself adversarial, iteration might not be available (we have to get it right the first time we deploy powerful AI).


Who do you think are the two or three smartest people to be skeptical of AI killing all humans? I think maybe Yann LeCunn and Andrew Ng.

Sure, those two. I don't know about Ng (he recently had a private discussion with Hinton, but I don't know what he thinks now), but I know LeCun hasn't really engaged with the ideas and just relies on the concept that "it's an extreme idea". But as I said, having the position "AI doesn't pose an existential threat" seems to be fringe nowadays.

If I dumb the argument down enough I get stuff like "intelligence/cognition/optimization is dangerous, and, whatever the reasons, we currently have zero reliable ideas on how to make a powerful general intelligence safe (eg. RLHF doesn't work well enough as GPT-4 still lies/hallucinates and is jailbroken way too easily)" which is evidence based, not weird and not extreme.

Comment by __RicG__ (TRW) on Why I am not an AI extinction cautionista · 2023-06-20T01:38:25.665Z · LW · GW

I don’t get you. You are upset about people saying that we should scale back capabilities research, while at the same time holding the opinion that we are not doomed because we won’t get to ASI? You are worried that people might try to stop the technology that in your opinion may not happen?? The technology about which, if it does indeed happen, you agree that “If [ASI] wants us gone, we would be gone”?!?

That said, maybe you are misunderstanding the people that are calling for a stop. I don’t think anyone is proposing to stop narrow AI capabilities. Just the dangerous kind of general intelligence “larger than GPT-4”. Self-driving cars good, automated general decision-making bad.

I’d also still like to hear your opinion on my counter arguments on the object level.

Comment by __RicG__ (TRW) on Why I am not an AI extinction cautionista · 2023-06-20T00:55:44.000Z · LW · GW

Thanks for the list, I've already read a lot of those posts, but I still remain unconvinced. Are you convinced by any of those arguments? Do you suggest I take a closer look at some posts?


But honestly, with the AI risk statement signed by so many prominent scientists and engineers, debating that AI risks somehow don't exist seems to be just a fringe anti-climate-change-like opinion held by a few stubborn people (or people just not properly introduced to the arguments). I find it funny that we are in a position where "angels might save us" appears among the possible counter arguments, thanks for the chuckle.

To be fair I think this post argues about how overconfident Yudkowsky is at placing doom at 95%+, and sure, why not... But, as a person that doesn't want to personally die, I cannot say that "it will be fine" unless I have good arguments as to why the p(doom) should be less than 0.1% and not "only 20%"!

Comment by __RicG__ (TRW) on Why I am not an AI extinction cautionista · 2023-06-19T21:23:51.127Z · LW · GW

You might object that OP is not producing the best arguments against AI-doom.  In which case I ask, what are the best arguments against AI-doom?

I am honestly looking for them too.

The best I, myself, can come up with are brief glimmers of "maybe the ASI will be really myopic and the local maximum for its utility is a world where humans are happy long enough to figure out alignment properly, and maybe the AI will be myopic enough that we can trust its alignment proposals", but then I think that the takeoff is going to be really fast and the AI would just self-improve until it is able to see where the global maximum lies (also because we want to know what the best world for humans looks like, we don't really want a myopic AI), except that that maximum will not be aligned.

I guess a weird counter argument to AI-doom is "humans will just not build the Torment Nexus™ because they realize alignment is a real thing and they have too high a chance (>0.1%) of screwing up", but I doubt that.

Comment by __RicG__ (TRW) on Why I am not an AI extinction cautionista · 2023-06-19T19:35:37.937Z · LW · GW

Well, I apologized for the aggressiveness/rudeness, but I am interested if I am mischaracterizing your position or if you really disagree with any particular "counter-argument" I have made.

Comment by __RicG__ (TRW) on Why I am not an AI extinction cautionista · 2023-06-19T13:55:09.957Z · LW · GW

I feel like briefly discussing every point on the object level (even though you don't offer object level discussion: you don't argue why the things you list are possible, just that they could be):

...Recursive self-improvement is an open research problem, is apparently needed for a superintelligence to emerge, and maybe the problem is really hard.

It is not necessary. If the problem is easy we are fucked and should spend time thinking about alignment, if it's hard we are just wasting some time thinking about alignment (it is not a Pascal mugging). This is just safety mindset and the argument works for almost every point to justify alignment research, but I think you are addressing doom rather than the need for alignment.

The short version of RSI is: SI seems to be a cognitive process, so if something is better at cognition it can SI better. Rinse and repeat. The long version.
I personally think that just the step from neural nets to algorithms (which is what perfectly successful interpretability would imply) might be enough to have a dramatic improvement in speed and cost. Enough to be dangerous, probably even starting from GPT-3.

...Pushing ML toward and especially past the top 0.1% of human intelligence level (IQ of 160 or something?) may require some secret sauce we have not discovered or have no clue that it would need to be discovered. 

...An example of this might be a missing enabling technology, like internal combustion for heavier-than-air flight (steam engines were not efficient enough, though very close). Or like needing the Algebraic Number Theory to prove the Fermat's last theorem. Or similar advances in other areas.

...Improvement AI beyond human level requires "uplifting" humans along the way, through brain augmentation or some other means.

This has been claimed time and time again: people who thought this just 3 years ago would have predicted GPT-4 to be impossible without many breakthroughs. ML hasn't hit a wall yet, but maybe soon?

Without it, we would be stuck with ML emulating humans, but not really discovering new math, physics, chemistry, CS algorithms or whatever. 

What are you actually arguing?  You seem to imply that humans don't discover new math, physics, chemistry, CS algorithms...? 🤔
AGIs (not ASIs) are still plenty dangerous because they are in silicon. Compared to bio-humans they don't sleep, don't get tired, and have a speed advantage, ease of communication between each other, ease of self-modification (sure, maybe not foom-style RSI, but self-mod is on the table), and self-replication that is not constrained by willingness to have kids, the need for a lot of physical space, food, health, random IQ variance or random interests, and that doesn't need the slow 20-30 years of growth humans need to become productive. GPT-4 might not write genius-level code, but it does write code faster than anyone else.

...Agency and goal-seeking beyond emulating what humans mean by it informally might be hard, or not being a thing at all, but just a limited-applicability emergent concept, sort of like the Newtonian concept of force (as in F=ma).

Why do you need something that goal-seeks beyond what humans informally mean?? Have you seen AutoGPT? What happens with AutoGPT when GPT gets smarter? Why would GPT-6+AutoGPT not be a potentially dangerous goal-seeking agent?

...We may be fundamentally misunderstanding what "intelligence" means, if anything at all. It might be the modern equivalent of the phlogiston.

Do you really need to fundamentally understand fire to understand that it burns your house down and you should avoid letting it loose?? If we are wrong about intelligence... what? The superintelligence might not be smart?? Are you again arguing that we might not create an ASI soon?
I feel like the answer is just: "I think that probably some of the vast quantities of money being blindly and helplessly piled into here are going to end up actually accomplishing something"

People, very smart people, are really trying to build superintelligence. Are you really betting against human ingenuity?


I'm sorry if I sounded aggressive in some of these points, but from where I stand these arguments don't seem to be well thought out, and I don't want to spend more time on this comment six people will see and two read.

Comment by __RicG__ (TRW) on Why I am not an AI extinction cautionista · 2023-06-19T13:32:26.520Z · LW · GW

"Despite all the reasons we should believe that we are fucked, we might just be missing some reasons we don't yet know for why everything will go alright" is a really poor argument IMO.


...AI that is smart enough to discover new physics may also discover separate and efficient physical resources for what it needs, instead of grabby-alien-style lightconing it through the Universe.

This especially feels A LOT like you are starting from hopes and rationalizing them. We have veeeeery little reason to believe that might be true... and also you just want to abandon that resource-rich physics to the AI instead of having it be used by humans to live nicely?

I think Yudkowsky put it nicely in this tweet while arguing with Ajeya Cotra:

Look, from where I stand, it's obvious from my perspective that people are starting from hopes and rationalizing them, rather than neutrally extrapolating forward without hope or fear, and the reason you can't already tell me what value was maxed out by keeping humans alive, and what condition was implied by that, is that you started from the conclusion that we were being kept alive, and didn't ask what condition we were being kept alive in, and now that a new required conclusion has been added - of being kept alive in good condition - you've got to backtrack and rationalize some reason for that too, instead of just checking your forward prediction to find what it said about that.

Comment by __RicG__ (TRW) on "LLMs Don't Have a Coherent Model of the World" - What it Means, Why it Matters · 2023-06-01T16:29:28.109Z · LW · GW

I am quite confused. It is not clear to me if at the end you are saying that LLMs do or don't have a world model. Can you clearly say which "side" you stand on? Are you even arguing for a particular side? Are you arguing that the idea of "having a world model" doesn't apply well to an LLM/is just not well defined?

That said, you do seem to be claiming that LLMs do not have a coherent model of the world (again, am I misunderstanding you?), and then use humans as an example of what having a coherent world model looks like. This sentence is particularly bugging me:

For example, an LLM that can answer a question about the kinetic energy of a bludger probably doesn't have a clear boundary between models of fantasy and models of reality. But switching seamlessly between emulating different people is implicit in what they are attempting to do - predict what happens in a conversation.

In the screenshots you provided GPT3.5 does indeed answer the question, but it seems to recognize that the subject is not real (it says "...bludgers in Harry Potter are depicted as...", " the Harry Potter universe...") and indeed it says it doesn't have specific information about their magical properties. I too, despite being a physicist who knows that HP isn't real, would have gladly tried to answer that question kinda like GPT did. What are you arguing? Do LLMs seem to have the distinction, at least between reality and HP, or not?

And large language models, like humans, do the switching so contextually, without explicit warning that the model being used is changing. They also do so in ways that are incoherent.

What's incoherent about the response it gave? Was the screenshot not meant to be evidence?


The simulator theory (which you seem to rely on) is, IMO, a good human-level explanation of what GPT is doing, but it is not a fundamental-level theory. You cannot reduce every interaction with an LLM to a "simulation"; some things are just weirder. Think of pathological examples of the input being "££££..." repeated 1000s of times: the output will be some random, possibly incoherent, babbling (funny incoherent output I got from the API inputting "£"*2000 and asking it how many "£" there were: 'There are 10 total occurrences of "£" in the word Thanksgiving (not including spaces).'). Notice also the random title it gives to the conversations. Simulator theory fails here.


In the framework of simulator theory and lack of world model, how do you explain that it is actually really hard to make GPT overtly racist? Or how the instruct finetuning is basically never broken?

If I leave a sentence incomplete, why doesn't the LLM complete my sentence instead of saying "You have been cut off, can you please repeat?"? Why doesn't the "playful" roleplaying take over, while (as you seem to claim) it takes over when you ask for factual things? Do they have a model of what "following instructions" means and of "racism", but not of what "reality" is?


To state my belief: I think hallucinations, non-factuality and a lot of the problems are better explained by the failure of RLHF than by a lack of a coherent world model. RLHF apparently isn't that good at making sure that GPT-4 answers factually. Especially since it is really hard to make it overtly racist. And especially since they reward it for "giving it a shot" instead of answering "idk" (because that would make it answer always "idk"). I explain it as: in training the reward model a lot of non-factual things might appear, and some non-factual things are actually the preferred responses that humans like.

Or it might just be the autoregressive paradigm: once it makes a mistake (just by randomly sampling the "wrong" token) the model "thinks": *Yoda voice* 'mhmm, a mistake in the answer I see, mistaken the continuation of the answer should then be'.

And the weirdness of the outputs after a long repetition of a single token is explained by the non-zero repetition penalty in ChatGPT and so the output will kinda resemble the output of a glitch token.
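
A toy sketch of that last mechanism, under loud assumptions: the penalty here is a simple frequency-style one (subtract an amount proportional to how often a token already appeared), and the vocabulary, logits and penalty value are made up for illustration. Whether ChatGPT applies exactly this rule is not something I know.

```python
from collections import Counter


def penalize(logits, context_tokens, penalty):
    """Subtract penalty * (times the token already appeared) from each logit.

    Assumption: a frequency-style penalty; real deployments may differ.
    """
    counts = Counter(context_tokens)
    return {tok: logit - penalty * counts[tok] for tok, logit in logits.items()}


def greedy_pick(logits):
    """Return the token with the highest logit."""
    return max(logits, key=logits.get)


# Hypothetical numbers: after 2000 repetitions the model "wants" another "£".
context = ["£"] * 2000
raw_logits = {"£": 10.0, "There": 2.0, "Thanksgiving": 0.5}

print(greedy_pick(raw_logits))                                   # £
print(greedy_pick(penalize(raw_logits, context, penalty=0.01)))  # There
```

Even a tiny per-occurrence penalty, multiplied by thousands of repetitions, swamps the model's preferred token, so the continuation gets pushed toward tokens it would otherwise consider unlikely.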

Comment by __RicG__ (TRW) on Is behavioral safety "solved" in non-adversarial conditions? · 2023-05-27T05:46:37.653Z · LW · GW

The article and my examples were meant to show that there is a gap between what GPT knows and what it says. It knows something, but sometimes says that it doesn’t, or it just makes it up. I haven’t addressed your “GPT generator/critic” framework or the calibration issues as I don’t see them as very relevant here. GPT is just GPT. Being a critic/verifier is basically always easier. IIRC the GPT-4 paper didn’t really go into much detail about how they tested the calibration, but that’s irrelevant here as I am claiming that sometimes it knows the “right probability” but it generates a made up one.

I don’t see how “say true things when you are asked and you know the true thing” is such a high standard, just because we have already internalised that it’s ok that GPT sometimes says made-up things.

Comment by __RicG__ (TRW) on Is behavioral safety "solved" in non-adversarial conditions? · 2023-05-27T00:47:59.912Z · LW · GW

Offering a confused answer is in a sense bad, but with lying there’s an obviously better policy (don’t) while it’s not the case that a confused answer is always the result of a suboptimal policy.

Sure, but the “lying” probably stems from the fact that to get the thumbs up from RLHF you just have to make up a believable answer (because the process AFAIK didn’t involve actual experts in various fields fact checking every tiny bit). If just a handful of “wrong but believable” examples sneak in the reward modelling phase you get a model that thinks that sometimes lying is what humans want (and without getting too edgy, this is totally true for politically charged questions!). "Lying" could well be the better policy! I am not claiming that GPT is maliciously lying, but in AI safety, malice is never really needed or even considered (ok, maybe deception is malicious by definition).

AFAIK there’s no evidence of a gap between what GPT knows and what it says when it’s running in pure generative mode

I am unsure if this article will satisfy you, but nonetheless I have repeatedly corrected GPT-3/4 and it goes “oh, yeah, right, you’re right, my bad, [elaborates, clearly showing that it had the knowledge all along]”. Or even:
Me: "[question about thing]"
GPT: "As of my knowledge cut-off of 2021 I have absolutely no idea what you mean by thing"
Me: "yeah, you know, the thing"
GPT: "Ah, yeah the thing [writes four paragraphs about the thing]"
Fresh example of this: Link (it says the model is the default, but it's not, it's a bug, I am using GPT-4)

Maybe it is just perpetuating the bad training data full of misconceptions or maybe when I correct it I am the one who's wrong and it’s just a sycophant (very common in GPT-3.5 back in February).

But I think the point is that you could justify the behaviour in a million ways. It doesn’t change the fact that it says untrue things when asked for true things.

Is it safe to hallucinate sometimes? Idk, that could be discussed, but sure as hell it isn’t aligned with what RLHF was meant to align it to.

I’d also like to add that it doesn’t consistently hallucinate. I think sometimes it just gets unlucky and it samples the wrong token and then, by being autoregressive, keeps the factually wrong narrative going. So maybe being autoregressive is the real demon here and not RLHF. ¯\_(ツ)_/¯

It's still not factual.

Comment by __RicG__ (TRW) on how humans are aligned · 2023-05-26T16:28:36.142Z · LW · GW

To me it isn't clear what alignment you are talking about.

You say that the list is about "alignment towards genetically-specified goals", which I read as "humans are aligned with inclusive genetic fitness", but then you talk about what I would describe as "humans aligned with each other" as in "humans want humans to be happy and have fun". Are you confusing the two?

South Korea isn't having kids anymore. Sometimes you get serial killers or Dick Cheney.

Here the first one shows misalignment towards IGF, while the second shows misalignment towards other humans, no?

Comment by __RicG__ (TRW) on Is behavioral safety "solved" in non-adversarial conditions? · 2023-05-26T16:03:03.784Z · LW · GW

I'd actually argue the answer is "obviously no".

RLHF wasn't just meant to address "don't answer how to make a bomb" or "don't say the n-word", it was meant to make GPT say factual things. GPT fails at that so often that this "lying" behaviour has its own term: hallucinations. It doesn't "work as intended" because it was intended to make it say true things.

Do many people really forget that RLHF was meant to make GPT say true things?

When OpenAI reports the success of RLHF as "GPT-4 is the most aligned model we developed" to me it sounds like a case of mostly "painting the target around the arrow": they decided a-posteriori that whatever GPT-4 does is aligned.

You even have "lie" multiple times in the list of bad behaviours in this post and you still answer "yes, it is aligned"? Maybe you just have a different experience? Do you check what it says? If I ask it about my own expertise it is full of hallucinations.

Comment by __RicG__ (TRW) on AGI-Automated Interpretability is Suicide · 2023-05-23T12:14:31.899Z · LW · GW

Hell, neural networks, in physics, are often regarded as just fitting, with many parameters, a really complex function we don't have the mathematical form of (so the reverse of what I explained in this paragraph).

Basically I expect the neural networks to be a crude approximation of a hard-coded cognition algorithm. Not the other way around.

Comment by __RicG__ (TRW) on AGI-Automated Interpretability is Suicide · 2023-05-23T10:44:22.841Z · LW · GW

What NNs do can't be turned into an algorithm by any known route.

NN -> algorithms was one of my assumptions. Maybe I can relay my intuitions for why it is a good assumption:

For example in the paper they explore grokking by making a transformer learn to do modular addition, and then they reverse engineer what algorithm the training "came up with". Furthermore, supporting my point in this post, the learned algorithm is also very far from being the most efficient, due to "living" inside a transformer. And so, in this example, if you imagine that we didn't know what the network was doing, and someone was just trying to do the same thing that the NN did, but faster and more efficiently, they would study the network, look at the bonkers algo that it learned, realize what it does, and then write the three lines of assembly code to actually do the modular addition so much faster (and more precisely!) without wasting resources and time by using the big matrices in the transformer.
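
To make the contrast concrete, here is a toy sketch. It is not the network's weights or the paper's code: it assumes the learned algorithm is of the trigonometric kind reported in reverse-engineering work on grokked modular addition (score every candidate c by a sum of cos(2πk(a+b-c)/p) terms and take the best one). The modulus and the frequencies below are hypothetical choices of mine.

```python
import math

P = 113             # hypothetical modulus
FREQS = (1, 5, 17)  # hypothetical "key frequencies"


def roundabout_mod_add(a, b, p=P, freqs=FREQS):
    """Trig-style modular addition: pick the c that maximises
    sum_k cos(2*pi*k*(a + b - c) / p). The sum peaks only when
    c == (a + b) mod p, as long as one frequency is not a multiple of p."""
    best_c, best_score = 0, -math.inf
    for c in range(p):
        score = sum(math.cos(2 * math.pi * k * (a + b - c) / p) for k in freqs)
        if score > best_score:
            best_c, best_score = c, score
    return best_c


def direct_mod_add(a, b, p=P):
    """What a programmer would write once they understand the task."""
    return (a + b) % p


print(roundabout_mod_add(100, 50), direct_mod_add(100, 50))  # 37 37
```

Both functions agree, but the first does on the order of p trig evaluations per frequency in floating point, while the second is a single exact integer operation. That gap is the kind of saving the comment is pointing at.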

I can also tackle the problem from the other side: I assume (is it non-obvious?) that predicting-the-next-token can also be done with algorithms and not only neural networks. I assume that intelligence can also be made with algorithms rather than only NNs. And so there is very probably a correspondence: I can do the same thing in two different ways. And so NN -> algorithms is possible. Maybe this correspondence isn't always in favour of simpler algos and NNs are sometimes actually less complex, but it feels like a bolder claim for that to be true in general.

To support my claim more we could just look at the math. Transformers, RNNs, etc. are just linear algebra and non-linear activation functions. You can write that down or even, just as an example, fit the multi-dimensional curve with a nonlinear function, maybe just a polynomial: do a Taylor expansion and maybe discard the terms that contribute less, or something else entirely... I am reluctant to even give ideas on how to do it because of the dangers, but the NNs can most definitely be written down as a multivariate non-linear function. Hell, neural networks, in physics, are often regarded as just fitting, with many parameters, a really complex function we don't have the mathematical form of (so the reverse of what I explained in this paragraph).
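
As a purely illustrative sketch of "writing the network down as a function": below is a tiny two-input, two-hidden-unit tanh network with made-up weights, next to a polynomial version where tanh is replaced by its truncated Taylor series z - z³/3. All names and numbers are hypothetical, and this says nothing about whether the same trick scales to real models.

```python
import math

# Hypothetical weights of a tiny network: 2 inputs -> 2 tanh units -> 1 output.
W1 = [[0.5, -0.3], [0.2, 0.4]]
B1 = [0.1, -0.2]
W2 = [0.7, -0.6]
B2 = 0.05


def forward(x, activation):
    """Explicit closed form: W2 . activation(W1 x + B1) + B2."""
    hidden = [
        activation(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(W1, B1)
    ]
    return sum(w * h for w, h in zip(W2, hidden)) + B2


def tanh_taylor(z):
    """Third-order Taylor expansion of tanh around 0."""
    return z - z ** 3 / 3


def mlp(x):
    return forward(x, math.tanh)


def mlp_poly(x):
    return forward(x, tanh_taylor)


small, large = [0.1, 0.2], [3.0, 3.0]
print(abs(mlp(small) - mlp_poly(small)))  # tiny: pre-activations are near 0
print(abs(mlp(large) - mlp_poly(large)))  # large: the truncation breaks down
```

The polynomial matches the network closely only where the pre-activations stay small, which is one concrete reason why "discard the terms that contribute less" is a judgment call and not a free lunch.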

And neural networks can be evolved, which is their biggest strength. I do expect that predicting-the-next-token algorithms can be actually much better than GPT-4, by using the same analogy that Yudkowsky uses for why designed nanotech is probably much better than natural nanotech: the learned algorithms must be evolvable and so they sit in a much shallower "loss potential well" than designed algorithms could.

And it seems to me that this reverse engineering process is what interpretability is all about. Or at least what the Holy Grail of interpretability is.

Now, as I've written down in my assumptions, I don't know if any of the learned cognition algorithms can be written down efficiently enough to have an edge on NNs:

[I assume that] algorithms interpreted from matrix multiplications are efficient enough on available hardware. This is maybe my shakiest hypothesis: matrix multiplication in GPUs is actually pretty damn well optimized


Maybe I should write a sequel to this post showing all of these intuitions and motivations on how NN->Algo is a possibility.

I hope I made some sense, and I didn't just ramble nonsense 😁.

Comment by __RicG__ (TRW) on AGI-Automated Interpretability is Suicide · 2023-05-22T22:32:36.840Z · LW · GW

Well, tools like Pythia help us peer inside the NN and help us reason about how things work. The same tools can help the AGI reason about itself. Or the AGI develops its own better tools. What I am talking about is an AGI doing what the interpretability researchers are doing now (or what OpenAI is trying to do with GPT-4 interpreting GPT-2).

It doesn't matter how, I don't know how, I just wanted to point out the simple path to algorithmic foom even if we start with a NN.

Comment by __RicG__ (TRW) on AGI-Automated Interpretability is Suicide · 2023-05-11T14:37:19.413Z · LW · GW

Disclaimer: These are all hard questions and points whose true answers I don't know; these are just my views, what I have understood up to now. I haven't studied expected utility maximisers, exactly because I don't expect the abstraction to be useful for the kind of AGI we are going to be making.

There's a huge gulf between agentic systems and "zombie-agentic" systems (that act like agents with goals, but have no explicit internal representation of those goals)

I feel the same, but I would say that it's the “real-agentic” system (or a close approximation of it) that needs God-level knowledge of cognitive systems (which is why orthodox alignment by building the whole mind from theory is really hard). An evolved system like us or like GPT, IMO, seems closer to a “zombie-agentic” system.
I feel the key thing to understand each other might be coherence, and how coherence can vary from introspection, but I am not knowledgeable enough to delve into this right now.

How do you define the goal (or utility function) of an agent? Is it something that actually happens when universe containing the agent evolves in its usual physical fashion? Or is it something that was somehow intended to happen when the agent is run (but may not actually happen due to circumstances and agent's shortcomings)?

The view that makes sense in my mind is that a utility function is an abstraction that you can put on top of basically anything if you wish. It's a hat to describe a system that does things in the most general way. The framework is borrowed from economics, where human behaviour is modelled with more or less complicated utility functions, but whether or not there is an internal representation is mostly irrelevant. And, again, I don't expect a DL system to display anything remotely close to a "goal circuit", but we can still describe such systems as having a utility function and as being maximisers (with finite cognitive power) of that UF. But the UF, on our part, would be just a guess. I don't expect us to crack that with interpretability of neural networks learned by gradient descent.
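
A toy sketch of that point in Python, under loud assumptions: the thermostat, the `HEURISTICS` table, the 20-degree target and `guessed_utility` are all invented here for illustration and are not from the post. The "agent" is just a table of reflexes with no goal stored anywhere; the utility function lives only in the observer's head, and it turns out to be a guess that fits some behaviour and not the rest.

```python
# The "agent": a bundle of reflexes. Nothing in here stores or evaluates a goal.
HEURISTICS = {"cold": "heat", "hot": "cool", "ok": "idle"}

def sense(temp: float) -> str:
    """Crude perception: bucket the temperature into three labels."""
    if temp < 20.0:
        return "cold"
    if temp > 20.0:
        return "hot"
    return "ok"

def act(temp: float) -> str:
    """The agent's whole 'mind': a table lookup."""
    return HEURISTICS[sense(temp)]

# The observer's side: a hypothetical world model and a guessed utility function.
EFFECT = {"heat": 1.0, "cool": -1.0, "idle": 0.0}  # assumed effect of each action

def guessed_utility(temp: float) -> float:
    """Observer's guess: 'it wants the temperature to be 20'."""
    return -abs(temp - 20.0)

def best_action_under_guess(temp: float) -> str:
    """What a one-step maximiser of the guessed utility would do."""
    return max(EFFECT, key=lambda a: guessed_utility(temp + EFFECT[a]))

def guess_explains_behaviour(temps) -> bool:
    """True if the reflex table acts like a maximiser of the guess on these inputs."""
    return all(act(t) == best_action_under_guess(t) for t in temps)

if __name__ == "__main__":
    # On whole-degree inputs the guess describes the agent perfectly...
    print(guess_explains_behaviour(range(10, 31)))
    # ...but at 19.6 the reflexes say "heat" while the guessed maximiser would idle.
    print(guess_explains_behaviour([19.6]))
```

On the whole-degree inputs the description "it maximises closeness to 20" matches every action, even though no such objective exists inside `act`. At 19.6 the description breaks, which is the sense in which the UF is only our guess about the system.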

Comment by __RicG__ (TRW) on AGI-Automated Interpretability is Suicide · 2023-05-11T00:00:00.598Z · LW · GW

What I meant to articulate was: the utility function and expected utility maximiser are a great framework for thinking about intelligent agents, but it's a theory put on top of the system; it doesn't need to be internal. In fact, such a system is incomputable (you need a hypercomputer to make the right decision).

Comment by __RicG__ (TRW) on AGI-Automated Interpretability is Suicide · 2023-05-10T23:57:13.372Z · LW · GW

I feel the exact opposite! Creating something that seems to maximise something, without it having a clear idea of what its goal is, seems really natural IMO. You said it yourself: GPT "wants" to predict the correct probability distribution of the next token, but there is probably not a thing inside actively maximising for that. Instead, it's very likely to be a bunch of weird heuristics that were selected by the training method because they work.

If you instead meant that GPT is "just an algorithm" I feel we disagree here as I am pretty sure that I am just an algorithm myself.

Look at us! We can clearly model a single human as having a utility function (ok, maybe given their limited intelligence it's actually hard), but we don't know what our utility actually is. I think Rob Miles made a video about that iirc.

My understanding is that the utility function and expected utility maximiser is basically the theoretical pinnacle of intelligence! Not your standard human or GPT or near-future AGI. We are also quite myopic (and whatever near-future AGI we make will also be myopic at first).

Is it possible to have a system that just "actively try to make paperclips no matter what" when it runs, but it doesn't reflect it in its reasoning and planning?

I'd say that it can reflect on its reasoning and planning, but it just plasters the universe with tiny molecular spirals because it just likes that more than keeping humans alive.

I think this tweet by EY shows what I mean. We don't know what the ultimate dog is, and we don't know what we would have created if we had had the capabilities to make a dog-like thing from scratch. We didn't create ice cream because it maximises our utility function. We just stumbled on its invention and found that it is really yummy.



But I really don't want to venture into this. I am writing something along these lines in order to deconfuse myself; the divide between agents in the theoretical sense and real systems is not exactly clear to me.

So to keep the discussion on-topic, what I think is:

  • interpretability to "correct" the system: good, but be careful pls
  • interpretability for capabilities: bad

Comment by __RicG__ (TRW) on AGI-Automated Interpretability is Suicide · 2023-05-10T22:32:50.123Z · LW · GW

You are basically discussing these two assumptions I made (under "Algorithmic foom (k>1) is possible"), right?

  • The intelligence ceiling is much higher than what we can achieve with just DL
  • The ceiling of hard-coded intelligence that runs on near-future hardware isn’t particularly limited by the hardware itself: algorithms interpreted from matrix multiplications are efficient enough on available hardware. This is maybe my shakiest hypothesis: matrix multiplication in GPUs is actually pretty damn well optimized
  • Algorithms are easier to reason about than staring at NN weights

But maybe the third assumption is the non-obvious one?

For the sake of discourse:

I still question [...] "AGI RSIs toward non-DL architecture that results in it maximizing some utility function in a pre-DL manner" being more dangerous than simply "AGI RSIs"

My initial motive to write "Foom by change of paradigm" was to show another previously unstated way RSI could happen. Just to show how RSI could happen, because if your frame of mind is "only compute can create intelligence" foom is indeed unfeasible... but if it is possible to make the paradigm jump then you might just be blind to this path and fuck up royally, as the French say.

One key thing that I also find interesting is that this paradigm shift does circumvent the "AIs not creating other AIs because of alignment difficulties".

I think that this argument is at odds with the universal learning hypothesis...

I am afraid I am not familiar with this hypothesis, and Google (or ChatGPT) isn't helpful. What do you mean by this and modularity?


P.S. I have now realized that the opposite of a black-box is indeed a glass-box and not a white-box lol. You can't see inside a box of any colour unless it is clear, like glass!

Comment by __RicG__ (TRW) on AGI-Automated Interpretability is Suicide · 2023-05-10T21:18:11.149Z · LW · GW

I hope we can prevent the AGI from just training a twin (or just copying itself and calling that a twin) and studying that. In my scenario I took as a given that we do have the AGI under some level of control:

If no alignment scheme is in place, this type of foom is probably a problem we would be too dead to worry about.

I guess when I say "No lab should be allowed to have the AI reflect on itself" I do not mean only the running copy of the AGI, but any copy of the AGI.

Comment by __RicG__ (TRW) on AGI-Automated Interpretability is Suicide · 2023-05-10T20:41:12.616Z · LW · GW

Wouldn't it at least solve corrigibility by making it possible to detect formation of undesirable end-goals? I think even GPT-4 can classify textual interpretation of an end-goal on a basis of its general desirability for humans.

I really don't expect "goals" to be explicitly written down in the network. There will very likely not be a thing that says "I want to predict the next token" or "I want to make paperclips" or even a utility function of that. My mental image of goals is that they are put "on top" of the model/mind/agent/person. Whatever they seem to pursue, independently of their explicit reasoning.

Anyway, detecting goals, detecting deceit, and detecting hidden knowledge of the system are good things to have. Interpretability of those things is needed. But interpretability cuts both ways, and with a fully interpretable AGI, foom seems to be a great danger. That's what I wanted to point out. With a fast intelligence explosion (one that doesn't need slow retraining or multiple algorithmic breakthroughs) the capabilities will explode alongside, while alignment won't.

It seem to need another assumption, namely that the AGI have sufficient control of its internal state and knowledge of the detection network to be able to bring itself into a state that produces interpretation that trips detection network, while also allowing the AGI to advance its agenda.

It is not clear to me what you are referring to here. Do you think we will have detection networks? Detection for what? Deceit? We might literally have the AGI look inside for a purpose (like in the new OpenAI paper). I hope we have something that tells us if it wants to self-modify, but if nobody points out the danger of foom, we likely won't have that.

Comment by __RicG__ (TRW) on AGI-Automated Interpretability is Suicide · 2023-05-10T15:48:29.018Z · LW · GW

I do feel that just having humans in the loop is not a complete solution, though. Even if humans look at the process, algorithmic foom could be really, really fast. Especially if it is purposely being used to augment the AGI's abilities.

Without a strong reason to believe our alignment scheme will be strong enough to support the ability gain (or that the AGI won't recklessly and arbitrarily improve itself), I would avoid letting the AGI look at itself altogether. Just make it illegal for AGI labs to use AGIs to look at themselves. Just don't do it.

Not today. But probably soon enough. We still need the interpretability for safety, but we don't know how much of that work will generalize to capabilities.

I would have loved it if the paper weren't using GPT but something narrower to automate interpretability, but alas. To make sure I am not misunderstood: I think it's good work that we need, but it does point in a dangerous direction.

Comment by __RicG__ (TRW) on AGI-Automated Interpretability is Suicide · 2023-05-10T15:08:04.924Z · LW · GW

Cheers. Your comments actually allowed me to fully realize where the danger lies and expand a little on the consequences.
Thanks again for the feedback.