Where are the people building AGI in the non-dumb way?

post by Johannes C. Mayer (johannes-c-mayer) · 2023-07-09T11:39:12.692Z · LW · GW · 8 comments

This is a question post.

Contents

  Answers
    28 Steven Byrnes
    1 iwis
    1 rvnnt
    -3 Valerio
None
8 comments

I am somewhat baffled by the fact that I have never run into somebody who is actively working on developing a paradigm of AGI targeted at creating a system that is just inherently transparent to the operators.

If you have a list-sorting algorithm like QuickSort, you can just look at the code and get lots of intuitions about what kinds of properties the code has. An AGI would of course be much, much more complex than QuickSort, but I am pretty sure that there is a program that you can write down that has the same structural property of being interpretable in this way, where the algorithm also happens to define an AGI.
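For concreteness, here is a minimal toy sketch of the kind of code-level reading I mean, using a textbook first-element-pivot QuickSort in Python:

```python
def quicksort(xs):
    # Properties you can read directly off the structure:
    # - The output has the same length as the input: every element lands
    #   in exactly one of `smaller`, `[pivot]`, or `larger`.
    # - The output is a permutation of the input: nothing is invented,
    #   dropped, or duplicated.
    # - It terminates: both recursive calls get strictly fewer elements.
    if len(xs) <= 1:
        return xs
    pivot, rest = xs[0], xs[1:]
    smaller = [x for x in rest if x <= pivot]
    larger = [x for x in rest if x > pivot]
    return quicksort(smaller) + [pivot] + quicksort(larger)
```

None of these properties require running the program or doing statistics on its behavior; they are visible in the text of the algorithm itself.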

And this seems especially plausible when you consider that we can build the system out of many components, where these components have sub-components, such that in the end we have some pretty small set of instructions that does one specific, understandable task. And if you understand such a component, you can probably understand how it behaves in the larger module that uses it.

Everything interpretability tries to do, we would just get for free in this kind of paradigm. Moreover, we could design the system to have additional good properties. Instead of using SGD to find just some set of weights that performs well, which we then interpret after the fact, we could constrain the kinds of algorithms we design to be as interpretable as possible, so that we are not subjected so strongly to the will of SGD and whatever algorithms it happens to find.

Maybe these people exist (if you are one, please say hello), but I have talked to probably between 20 and 40 people who would describe themselves as doing AI alignment research, and nothing like this has ever come up, even remotely.

Basically, this is my current research agenda. I'm not necessarily saying this is definitely the best thing that will save everyone and that everybody should do it, but if zero people do it, that seems pretty strange to me. So I'm wondering if there are some standard arguments I have not come across yet showing that this kind of thing is actually really stupid to do.

There are two counter-arguments to this that I'm aware of, that I don't think in themselves justify not working on this.

  1. This seems like a really hard research program and might take way too long; we might all already be dead by the time we would have built AGI in this way.
  2. This kind of paradigm comes with the inherent problem that, because the code is interpretable, it probably becomes easy to tell once you have a really capable algorithm that is basically an AGI. In that case, any person on the team who understands the code well enough can just take the code and do some unilateral madness. So you need to find a lot of people who are aligned enough to work on this, which might be extremely difficult.

Though I'm not even sure how much of a problem point 2 is, because that seems to be a problem in any paradigm: no matter what we do, we probably end up being able to build unaligned AGI before we know how to align it. But maybe it is especially pronounced in this kind of approach. Though consider how much effort we need to invest, in any paradigm, to bridge the gap from being able to build an unaligned AGI to being able to build an aligned AGI. I think that gap might be especially short in this paradigm.

I feel like what MIRI is doing doesn't quite count. At least from my limited understanding, they are trying to identify problems that are likely to come up in highly intelligent systems and solve these problems in advance, but not necessarily advancing <interpretable/alignable> capabilities in the way that I am imagining. Though I do, of course, have no idea about what they're doing in terms of the research that they do not make public.

Answers

answer by Steven Byrnes · 2023-07-09T15:26:06.356Z · LW(p) · GW(p)

At some point you have to deal with the fact that understanding the world entails knowing lots and lots of stuff—things like “tires are usually black”, or “it’s gauche to wear white after Labor Day”, etc.

There seem to be only two options:

  • Humans manually type in “tires are usually black” and zillions more things like that. This is very labor-intensive, if it’s possible at all. Cyc is the famous example along these lines. Davidad’s recent proposal [LW · GW] is that we should try to do this.
  • A learning algorithm infers zillions of regularities in the world, like the fact that tires are usually black. That’s the deep learning approach, but there are also many non-deep-learning approaches in this category. I think conventional wisdom (which I happen to share) is that this category is the only category that might actually get to powerful AGI. And I don’t see how this category can be compatible with “creating a system that is just inherently transparent to the operators”, because the AGI will do different things depending on its “knowledge”, i.e. the giant collection of regularities that it has discovered, which are (presumably) unlabeled-by-default and probably a giant mess of things vaguely like “PATTERN 87462: IF BOTH PATTERN 24953 AND PATTERN 758463 ARE SIMULTANEOUSLY ACTIVE RIGHT NOW THEN IT’S MARGINALLY MORE LIKELY THAT PATTERN 217364 WILL BE ACTIVE SOON”, or whatever. And then the AGI does something, and humans have their work cut out figuring out why.

There might be a middle way between these—I think the probabilistic programming people might describe their roadmap-to-AGI that way?—but I don’t understand those kinds of plans, or if I do, then I don’t believe them.

comment by NicholasKross · 2023-07-10T01:43:26.621Z · LW(p) · GW(p)

I think the second setup still allows for powerful AGI that's more explainable than current AI, in the same way that humans can kind of explain decisions to each other, but not very well at the level of neuroscience.

If something like natural abstractions [LW · GW] are real, then this would get easier. I have a hard time not believing a weak version of this (e.g. human and AGI neuron structures could be totally different, but they'd both end up with some basic things like "the concept of 1").

answer by iwis · 2023-09-26T15:08:51.420Z · LW(p) · GW(p)

On https://consensusknowledge.com, I described the idea of building a knowledge database that is understandable to both people and computers, that is, to all intelligent agents. It would be a component responsible for memory and for interactions with other agents. Using this component, agents could increase their intelligence much faster, which could lead to the emergence of a collective human superintelligence, AGI, and, more generally, a collective superintelligence of all intelligent agents. At the same time, due to the interpretability of the database of knowledge and information, such intelligence would be much safer. The thinking performed by AI would also be much more interpretable.

Please let me know what you think about this.

comment by Johannes C. Mayer (johannes-c-mayer) · 2023-09-26T21:27:27.455Z · LW(p) · GW(p)

I haven't read it in detail.

The hard part of the problem is that we need a system that can build up a good world model on its own. There is too much stuff, such that it takes way way too long for a human to enter everything. Also, I think we need to be able to process basically arbitrary input streams with our algorithm, e.g. build a model of the world just from a camera feed and a microphone input.

And then we want to figure out how to constrain the world model, such that if we run some planning algorithm (which we also designed) on this world model, we know it won't kill us because of weird stuff in the world model, the way there is weird stuff in Solomonoff induction, whose hypotheses are just arbitrary programs.

Also, a hard part is to make a world model that is general enough to represent the complexity of the real world while remaining interpretable.

If you have a database where you just enter facts about the world, like "laptop X has resolution Y", that seems nowhere near powerful enough. Your world model only seems to be rich and to be about the real world because you use natural-language words as descriptors. To a human brain these things have meaning, but not to a computer by default. That is how you can get a false sense of how good your world model is.
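As a toy sketch of what I mean (with made-up facts): a database like this behaves exactly the same after you replace every human-readable symbol with an opaque identifier, which is the sense in which the meaning lives in the reader rather than in the system.

```python
import itertools

# A tiny hand-entered "world model". The natural-language words make it
# look meaningful to a human reader.
facts = {
    ("laptop_x", "resolution", "1920x1080"),
    ("tire", "color", "black"),
}

def query(facts, subject, relation):
    """Return all values asserted for (subject, relation)."""
    return {v for (s, r, v) in facts if s == subject and r == relation}

def rename_to_gensyms(facts):
    """Replace every symbol with an opaque identifier."""
    counter = itertools.count()
    names = {}
    def gensym(x):
        if x not in names:
            names[x] = f"sym{next(counter)}"
        return names[x]
    return {tuple(gensym(x) for x in triple) for triple in facts}, names

# The renamed database supports exactly the same queries with exactly the
# same structure; nothing the computer does ever depended on the words.
opaque_facts, names = rename_to_gensyms(facts)
print(query(facts, "tire", "color"))                       # {'black'}
print(query(opaque_facts, names["tire"], names["color"]))  # e.g. {'sym3'}
```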

Replies from: iwis
comment by iwis · 2023-09-29T10:50:03.378Z · LW(p) · GW(p)

There is too much stuff, such that it takes way way too long for a human to enter everything.

Is it also true for a large group of people? If yes then why?

Replies from: johannes-c-mayer
comment by Johannes C. Mayer (johannes-c-mayer) · 2023-09-29T11:14:47.531Z · LW(p) · GW(p)

Cyc does not work. At least not yet. I haven't really looked into it a lot, but I expect that it will also not work in the near future for anything like doing a pivotal act. And they have put a lot of man-hours into it. In principle, it could probably succeed with enough data input, but it is not practical. Also, it would not succeed if you don't have the right inference algorithms, and I guess that would be hard to notice when you are distracted entering all the data. Because you can just never stop entering the data, as there is so much of it to enter.

Replies from: iwis
comment by iwis · 2023-09-29T11:33:42.295Z · LW(p) · GW(p)

> Cyc does not work.
What if the group of users adding knowledge was significantly larger than the Cyc team?

Edit: I ask because Cyc is built by a group of its employees; it is not crowdsourced. Crowdsourcing often involves a much larger group of people, as with Wikipedia.

> In principle, it could probably succeed with enough data input, but it is not practical.
Why is it not practical?

> that would be hard to notice
What do you mean by "to notice" here?

Replies from: johannes-c-mayer
comment by Johannes C. Mayer (johannes-c-mayer) · 2023-10-03T18:15:26.545Z · LW(p) · GW(p)

Cyc does not seem like the kind of thing I would expect to work very well compared to a system that can build the world model from scratch, because even if it is crowdsourced it would take too much effort.

I mean noticing that the inference algorithms are too weak to make the system capable enough. You can still increase the capability of the system very slowly by just adding more data. So it seems easy to focus on adding more data instead of fixing the inference, which is the wrong move in that situation.

answer by rvnnt · 2023-07-12T15:56:08.629Z · LW(p) · GW(p)

Tamsin Leake's project [LW · GW] might match what you're looking for.

comment by Johannes C. Mayer (johannes-c-mayer) · 2023-07-12T21:40:29.440Z · LW(p) · GW(p)

I feel like the thing that I'm hinting at is not directly related to QACI. I'm talking about a specific way to construct an AGI where we write down all of the algorithms explicitly, whereas the QACI part of QACI is about specifying an objective that is aligned when optimized very hard. It seems like, in the thing that I'm describing, you would get the alignment properties from a different place: you get them because you understand the algorithm of intelligence that you have written down very well. Whereas in QACI, you get the alignment properties by successfully pointing to the causal process that is the human in the world that you want to "simulate" in order to determine the "actual objective".

Just to clarify, when I say the non-DUMB way, I mainly mean that using giant neural networks and just making them more capable in order to get intelligent systems is the DUMB way. And Tamsin's thing seems to be one of the least DUMB things I have heard recently. I can't see how this obviously fails (yet), though, of course, this doesn't necessarily imply that it will succeed (though it is, of course, possible).

answer by Valerio · 2023-07-11T16:20:11.819Z · LW(p) · GW(p)

I am also interested in interpretable ML. I am developing artificial semiosis, a human-like AI training process which can achieve aligned (transparency-based, interpretability-based) cognition. You can find an example of the algorithms I am making here [LW · GW]: the AI runs a non-deep-learning algorithm, does some reflection and forms a meaning for someone “saying” something, a meaning different from the usual meaning for humans, but perfectly interpretable.

I support, then, the case for differential technological development:

There are two counter-arguments to this that I'm aware of, that I don't think in themselves justify not working on this.

Regarding 1, it may take several years for interpretable ML to reach capabilities equivalent to LLMs, but the future may offer surprises, either in terms of coordination to pause the development of "opaque" advanced AI or of deep learning hitting a wall... at killing everyone. Let's also have a plan for the case where we are still alive.

Regarding 2, interpretable ML would need to have programmed control mechanisms to be aligned. There is currently no such field of AI safety, as we do not yet have interpretable ML, but I imagine computer engineers being able to make progress on these control mechanisms (more progress than on mechanistic interpretability of LLMs). While it is true that control mechanisms can be disabled, you can always advocate for the highest security (like in Ian Hogarth's Island idea). You can then also reject this counterargument.

mishka noted that this paradigm of AI is more foomable [LW(p) · GW(p)]. Self-modification is a huge problem. I have an intuition that interpretable ML will exhibit a form of scaffolding, in that control mechanisms for robustness (i.e. for achieving capabilities) can advantageously double as alignment mechanisms. Thanks to interpretable ML, engineers may be able to study self-modification already in systems with limited capabilities and learn the right constraints.

8 comments

Comments sorted by top scores.

comment by Dagon · 2023-07-09T15:43:59.981Z · LW(p) · GW(p)

Umm, I don't know how deep you've gotten into even simple non-AI machine learning.  But this is based on simply wrong assumptions.  Even your simplification is misleading - 

If you have a list-sorting algorithm like QuickSort, you can just look at the code and get lots of intuitions about what kinds of properties the code has

I've talked with and interviewed a lot of software developers, and it's probably fewer than 5% that really understand QuickSort including the variance in performance on pathological lists.  This is trivially simple compared to large models, but not actually easy or self-explaining.

I am pretty sure that there is a program that you can write down that has the same structural property of being interpretable in this way, where the algorithm also happens to define an AGI.

I am pretty sure that this is not possible.  

Replies from: johannes-c-mayer
comment by Johannes C. Mayer (johannes-c-mayer) · 2023-07-10T08:12:31.837Z · LW(p) · GW(p)

I've talked with and interviewed a lot of software developers, and it's probably fewer than 5% that really understand QuickSort including the variance in performance on pathological lists. This is trivially simple compared to large models, but not actually easy or self-explaining.

Well, these programmers probably didn't try to understand QuickSort. I think you can see simple dynamics such as "oh, this will always return a list that is the same size as the list that I input" and "all the elements in that list will be elements from the original list, in a bijective mapping; there won't be different elements and there won't be duplicated elements or anything like that". That part is pretty easy to see. And there are some pathological cases for QuickSort, though I don't understand the mechanics of why they arise. However, I'm pretty sure that I could, within one hour, understand very well what these pathological cases are, why they arise, and how I might change a QuickSort algorithm to handle a particular pathological case well. That is, I'm not saying I would look at Wikipedia and just read up on the pathological cases; I would look at the algorithm alone and then derive the pathological cases. Maybe an hour is not enough, I'm not sure. That seems like an interesting experiment to test my claim.
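To gesture at the kind of analysis I expect to be possible, here is a sketch of the standard worst case, assuming a first-element-pivot implementation (not a claim about any particular library's QuickSort):

```python
import sys

def quicksort_first_pivot(xs, depth=0):
    # Textbook QuickSort with the first element as pivot; also reports
    # the maximum recursion depth reached.
    if len(xs) <= 1:
        return xs, depth
    pivot, rest = xs[0], xs[1:]
    smaller = [x for x in rest if x <= pivot]
    larger = [x for x in rest if x > pivot]
    left, d1 = quicksort_first_pivot(smaller, depth + 1)
    right, d2 = quicksort_first_pivot(larger, depth + 1)
    return left + [pivot] + right, max(d1, d2)

sys.setrecursionlimit(10_000)
# On an already-sorted input the pivot is always the minimum, so the
# "smaller" partition is always empty and the recursion depth grows
# linearly with input size (and the total work quadratically).
_, depth = quicksort_first_pivot(list(range(500)))
print(depth)  # 499
```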

I am pretty sure that there is a program that you can write down that has the same structural property of being interpretable in this way, where the algorithm also happens to define an AGI.

I am pretty sure that this is not possible.

Could you explain why you think this is not possible? Do you really think there isn't an explicit Python program I could write down, i.e. step-by-step instructions that, when followed, end up building an accurate model of the world, and such that the program does not use any layered optimization like SGD or something similar? Do you think these kinds of instructions don't exist? Well, if they don't exist, how does a neural network learn things like constructing a world model? How does the human brain do it?

Once you write down the algorithm explicitly like that, I expect it to have the structural property I'm talking about: you can analyze it and build intuitions about it.
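As a toy sketch of the flavor of thing I mean (nowhere near an AGI; every detail here is made up for illustration): an explicit program that builds a small "world model" from an observation stream by counting transition statistics. There is no SGD anywhere, and the learned state is a table you can print and read directly.

```python
from collections import Counter, defaultdict

def build_world_model(observations):
    """Count which observation tends to follow which. Every step is a
    hand-written instruction you can read; nothing is learned by
    gradient descent."""
    transitions = defaultdict(Counter)
    for prev, nxt in zip(observations, observations[1:]):
        transitions[prev][nxt] += 1
    return transitions

def predict_next(model, current):
    """Predict the most frequently observed successor of `current`."""
    if current not in model:
        return None
    return model[current].most_common(1)[0][0]

# A made-up observation stream (think: discretized sensor readings).
stream = ["dark", "dawn", "light", "dusk", "dark", "dawn", "light", "dusk", "dark"]
model = build_world_model(stream)

# The entire "world model" is a readable table, not a weight matrix.
for state, counts in model.items():
    print(state, dict(counts))
print(predict_next(model, "dawn"))  # light
```

The interesting question is how far this structural property (explicit, readable state and explicit, readable update rules) can be preserved as the algorithm becomes capable enough to matter.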

comment by rime · 2023-09-21T17:57:59.425Z · LW(p) · GW(p)

but I am pretty sure that there is a program that you can write down that has the same structural property of being interpretable in this way, where the algorithm also happens to define an AGI.

Interesting. I have semi-strong intuitions in the other direction. These intuitions are mainly from thinking about what I call the Q-gap, inspired by Q Home's post [LW(p) · GW(p)] and this quote:

…for simple mechanisms, it is often easier to describe how they work than what they do, while for more complicated mechanisms, it is usually the other way around.

Intelligent processes are anabranching rivers of causality: they start and end at highly concentrated points, but the route between is incredibly hard to map. If you find an intelligent process in the wild, and you have yet to statistically ascertain which concentrated points its many actions converge on (aka its intentionality), then this anabranch will appear as a river delta to you.

Whereas simple processes that have no intentionality just are river deltas. E.g., you may know everything about the simple fundamental laws of the universe, yet be unable to compute whether it will rain tomorrow.

Replies from: johannes-c-mayer
comment by Johannes C. Mayer (johannes-c-mayer) · 2023-09-23T14:35:46.114Z · LW(p) · GW(p)

That is an interesting analogy.

So if I have a simple AGI algorithm, then if I can predict where it will move to, and understand the final state it will move to, I am probably good, as long as I can be sure of some high-level properties of the plan. I.e., the plan should not take over the world, let's say. That seems to be a property you might be able to predict of a plan, because taking over the world would make the plan so much longer than just doing the obvious thing. This isn't easy, of course, but I don't think having a more complex system would help with this. Having a simple system makes it simpler to analyze in all regards, all else equal (assuming you don't make it short by code-golfing; you still want to follow good design practices and lay out the program in the most obviously understandable way).

As a sidenote before I get into why I think the Q-gap is probably wrong: that I can't predict whether it will rain tomorrow even if I have a perfect model of the universe's low-level dynamics has more to do with how much compute I have available. I might be able to predict whether it will rain tomorrow if I knew the initial conditions of the universe and had some very large but finite amount of compute, assuming the universe is not infinite.

I am not sure the Q-gap makes sense. Take a 2D double pendulum: it is very easy to describe and hard to predict. I can make a chaotic system more complex, and then it becomes a bit harder to predict, but not really by much. It is already not analytically solvable for two joints (according to Google).
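To make the point runnable with an even simpler system than the pendulum (the logistic map, whose whole update rule is one line; I use it here only because the pendulum's equations of motion are longer, not because it matters which chaotic system you pick):

```python
def logistic(x, r=4.0):
    # The entire "mechanism": one line you can hold in your head.
    return r * x * (1.0 - x)

x, y = 0.2, 0.2 + 1e-10  # two almost identical initial conditions
for _ in range(60):
    x, y = logistic(x), logistic(y)

# After 60 steps the initial difference of 1e-10 has typically grown by
# many orders of magnitude: trivially easy to describe, practically
# impossible to predict far ahead without knowing the state exactly.
print(abs(x - y))
```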

That describing the functioning of complex mechanisms seems harder than saying what they do might be an illusion. We as humans have a lot of abstractions in our heads for thinking about the real world. A lot of the things that we build mechanisms to do are expressible in these concepts. So they seem simple [LW · GW] to us. This is true for most mechanisms we build that produce some observable output.

If we ask "What does this game program running of a computer do?" We can say something like "It creates the world that I see on the screen." That is a simple explanation in terms of observed effects. We care about things in the world, and for those things we normally have concepts, and then machines that manipulate the world in ways we want have interpretable output.

There is also the factor that we need complex programs for things where we have not figured out a good general solution, which would then be simple. If we have a complex program in the world, it might be complex because the creators have not figured out how to do it the right way.

So I guess I am saying that there are two properties of a program: chaoticness and Kolmogorov complexity. Increasing one always makes the program less interpretable if the other stays fixed, assuming we only consider optimal algorithms and not a bunch of haphazard heuristics that we use because we have not figured out the best algorithm yet.

comment by mishka · 2023-07-09T14:42:46.293Z · LW(p) · GW(p)

I am writing it as a comment, not as an answer (the answers, I suspect, are more social; people are not doing this yet, because the methods which would work capability-wise are mostly still in their blind spots).

two counter-arguments to this

  1. Technically, it has been too difficult to do it this way. But it is becoming less and less difficult, and various versions of this route are becoming more and more feasible.

    Although the ability to predict behavior is still fundamentally limited, because systems like that become complex really easily (one can have very complex behavior with a really small number of parameters), and because they will interact with the complex world around them (so one really needs to reason about the world containing software systems like this; even if the software systems themselves are transparent and interpretable, if they are smart, the overall dynamics might be highly non-trivial).

  2. This kind of paradigm (if it works) makes it much easier to modify these systems, so it is much easier to have self-modifying AIs, or, more likely, self-modifying ecosystems of AIs producing changing populations of AI systems.

    Capability-wise, this is likely to give such systems a boost competing with current systems where self-modification is less fluent and so far rather sluggish.

    But this is even more foomable than the status quo. So one really needs to solve AI existential safety for a self-evolving, better and better self-modifying ecosystem of AIs; this is even more urgent with this approach than with the current mainstream.

    Might this problem be easier to solve here? Perhaps... At least, with (self-)modification being this fluent and powerful, one can direct it this way and that way more easily than with more sluggish and resistant methods. But, on the other hand, it is very easy to end up with a situation where things are changing even faster and are even more difficult to understand...


I do like looking at this topic, but the safety-related issues in this approach are, if anything, even more acute (faster timelines + very fluently reconfigurable machines)...

Replies from: johannes-c-mayer
comment by Johannes C. Mayer (johannes-c-mayer) · 2023-07-10T09:05:39.823Z · LW(p) · GW(p)

I expect it is much more likely that most people are looking at the current state of the art, don't even know or think about other possible systems, and just narrowly focus on aligning the state of the art, not considering creating a "new paradigm" because they think that would just take too long.

I would be surprised if there were a lot of people who carefully thought about the topic and used the following reasoning procedure:

"Well, we could build AGI in an understandable way, where we just discover the algorithms of intelligence. But this would be bad because then we would understand intelligence very well, which means that the system is very capable. So because we understand it so well now, it makes it easier for us to figure out how to do lots of more capability stuff with the system, like making it recursively self-improving. Also, if the system is inherently more understandable, then it would also be easier for the AI to self-modify because understanding itself would be easier. So all of this seems bad, so instead we shouldn't try to understand our systems. Instead, we should use neural networks, which we don't understand at all, and use SGD in order to optimize the parameters of the neural network such that they correspond to the algorithms of intelligence, but are represented in such a format that we have no idea what's going on at all. That is much safer because now it will be harder to understand the algorithms of intelligence, making it harder to improve and use. Also if an AI would look at itself as a neural network, it would be at least a bit harder for it to figure out how to recursively self-improve."

Obviously, alignment is a really hard problem, and it is actually very helpful to understand what is going on in your system at the algorithmic level in order to figure out what's wrong with that specific algorithm: how is it not aligned, and how would we need to change it to make it aligned? At least, that's what I expect. I think not using an approach where the system is interpretable hurts alignment more than capabilities. People have been steadily making progress at making systems more capable, and not understanding what algorithms they run inside doesn't seem to be much of an issue there; for alignment, however, it's a huge issue.

comment by benjamin-asdf · 2023-07-09T16:43:16.341Z · LW(p) · GW(p)

I share your intuition. Turing already conjectured how much computing power an AGI would need, and the number he gave was small. I think the hardest part was getting to computers; AGI is just making a program that is a bit more dynamic.


I can recommend all of Marvin Minsky's work. The Society of Mind is very accessible and has an online version. In short, the mind is made of smaller sub-pieces. The important aspects are the orchestration and the architecture of these resources. And Minsky also has some stuff on how you put that into programs.


The most concrete stuff I know of is Push Singh's thesis, "EM-ONE: An Architecture for Reflective Commonsense Thinking". It is very concrete, with a code implementation, implementing some of the layers of critics that Minsky described in "The Emotion Machine" as a hypothesis of how common sense could be built.

It was read by Aaron Sloman and Gerald Sussman. Isn't this super cool?
 

It is useful to first think of the concepts before programming something. We might be thinking of slightly different things with the word "algorithm"; it sounds very low-level to me, while the important thing is the architecture of a program, not the bricks it is made out of.

Replies from: johannes-c-mayer
comment by Johannes C. Mayer (johannes-c-mayer) · 2023-07-10T09:15:02.111Z · LW(p) · GW(p)

I think the problem with the things you mention is that they are just super vague, to the point where you don't even know what the thing you are talking about is. What does it mean that:

Most important of all, perhaps, is making such machines learn from their own experience.

Finally, we'll get machines that think about themselves and make up theories, good or bad, of how they, themselves might work.

Also, all of this seems to be some sort of vague stuff about imagining how AI systems could be. I'm actually interested in just making the AI systems and making them in a very specific way such that they have good alignment properties and not vaguely philosophizing about what could happen. The whole point of writing down algorithms explicitly, which is one non-dumb way to build AGI, is that you can just see what's going on in the algorithm and understand it and design the algorithm in such a way that it would think in a very particular way.

So it's not like "oh yes, these machines will think some stuff for themselves and it will be good or bad". It's more like: I make these machines think. How do I make them think? What is the actual algorithm that makes them think? How can I make this algorithm such that it will actually be aligned? I am controlling what they are thinking; I am controlling whether it's good or bad; I am controlling whether they are going to build a model of themselves. Maybe that's dangerous for alignment purposes in some contexts, and then I would design the algorithm such that the system does not build a model of itself.

For, at that point, they'll probably object to being called machines.

I think it's pretty accurate to say that I am a machine.

(Also, as a meta note, it would be very good, I think, if you do not break the lines as you did in this big text block because that's pretty annoying to block quote.)