Posts

Catalyst books 2023-09-17T17:05:14.930Z
Recreating the caring drive 2023-09-07T10:41:16.453Z
No free lunch theorem is irrelevant 2022-10-04T00:21:54.790Z
Thoughts about OOD alignment 2022-08-24T15:31:59.015Z

Comments

Comment by Catnee (Dmitry Savishchev) on Eric Schmidt on recursive self-improvement · 2023-11-06T08:22:24.251Z · LW · GW

In the next few hours we’ll get to noticeable flames [...] Some number of hours after that, the fires are going to start connecting to each other, probably in a way that we can’t understand, and collectively their heat [...] is going to rise very rapidly. My retort to that is, do you know what we’re going to do in that scenario? We’re going to unkindle them all.

Comment by Catnee (Dmitry Savishchev) on Catalyst books · 2023-09-17T18:32:55.103Z · LW · GW

Well, continuing your analogy: to see discrete lines anywhere at all, you need some sort of optical spectrometer, which requires at least some form of optical tools like lenses and prisms. They have to be good enough to actually show the sharp spectral lines, and probably easily available, so that someone smart enough will eventually be able to use them to draw the right conclusions.

At least that's how it seems to have been done in the past. And I don't think we should do exactly this with AGI: open-sourcing every single tool and damn model, hoping that someone will figure something out while we build them as fast as we can. But overall, I think building small tools / getting marginal results / aligning current dumb AIs could produce a non-zero cumulative impact. You can't produce fundamental breakthroughs completely out of thin air, after all.

Comment by Catnee (Dmitry Savishchev) on Recreating the caring drive · 2023-09-08T19:33:08.683Z · LW · GW

Thank you for the detailed comment!

By contrast, you’re advocating (IIUC) to start with 2, and then do mechanistic interpretability on the artifact that results, thus gaining insight about how a “caring drive” might work. And then the final AGI can be built using approach 1.

Yes, that's exactly correct. I hadn't thought about "if we managed to build a sufficiently smart agent with the caring drive, then AGI is already too close". If any "interesting" caring drive requires capabilities very close to AGI, then I agree that this looks like a dead end in light of the race toward AGI. So it's only viable if an "interesting" and "valuable" caring drive could potentially be found in agents at roughly the current level of capability, which honestly doesn't sound totally improbable to me.

Also, without some global regulation to stop this damn race, I expect everyone to die soon anyway, and since I'm not in a position to meaningfully impact that, I might as well keep working in the directions that will pay off only in the worlds where we suddenly have more time.

And once we have something like this, I expect a lot of gains in research speed from all the benefits that come with the ability to precisely control and run experiments on an artificial NN.
 

I’m curious why you picked parenting-an-infant rather than helping-a-good-friend as your main example. I feel like parenting-an-infant in humans is a combination of pretty simple behaviors / preferences (e.g. wanting the baby to smile)

Several reasons:

  • I don't think it's just a couple of simple heuristics; otherwise I'd expect them to fail horribly in the modern world. And by "caring for the baby" I mean all the actions of the parents until the "baby" is ~25 years old. Those actions usually involve a lot of intricate decisions aimed at something like "success and happiness in the long run, even if it means some crying right now". It's hard to do right, and a lot of parents make mistakes, but in most cases it looks like a capability failure, not a failure of intentions. And these intentions look much more interesting to me than "make a baby smile".
  • Although I agree that some people have a genuine intrinsic prosocial drive, I think there are also alternative egoistic "solutions". A lot of prosocial behavior looks instrumentally beneficial even for a totally egoistic agent. The classic example is a repeated prisoner's dilemma with an unknown number of trials: it would be foolish not to at least try to cooperate, even if you care only about your own utility (see the sketch after this list). The maternal caring drive, on the other hand, looks much less selfish, which I think is a good sign, since we shouldn't expect to be of any instrumental value to a superhuman AI.
  • I think it would be easier to recreate it in some multi-agent environment. Unlike the maternal caring drive, prosocial behavior seems to have many more prerequisites: the ability to communicate, and some form of benefit from being in a society/tribe/group, which usually comes from specialization (I haven't thought about it too much, though).
  • I agree with your Section 8.3.3.1, but I think the arguments there don't apply here so easily. Since the initial goal of this project is to recreate the "caring drive" in order to have something to study, and then apply that knowledge to build it from scratch for the actual AGI, it's not that critical to make some errors at this stage. I think it's even desirable to observe some failure cases in order to understand where the failure comes from. This should also work for prosocial behavior, as long as it's not a direct attempt to create an aligned AGI, and just research into the workings of "goals", "intentions" and "drives". But for the reasons above, I think the maternal drive could be a better candidate.
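
A minimal sketch of that instrumental-cooperation point (my own illustration; the payoff numbers, the 2% stopping probability, and the strategy names are assumptions, not anything from the post): a purely selfish agent playing against a reciprocating partner in an iterated prisoner's dilemma with an unknown horizon earns more by cooperating than by defecting.

```python
import random

# Standard prisoner's dilemma payoffs for the row player:
# mutual cooperation 3, mutual defection 1, exploiting a cooperator 5, being exploited 0.
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}

def tit_for_tat(history):
    """Cooperate first, then copy whatever the opponent did last round."""
    return "C" if not history else history[-1][1]

def always_defect(history):
    return "D"

def play(strategy_a, strategy_b, stop_prob=0.02, seed=0):
    """Iterated PD with an unknown horizon: each round the game ends with probability stop_prob."""
    rng = random.Random(seed)
    hist_a, hist_b = [], []
    total_a = total_b = 0
    while True:
        a, b = strategy_a(hist_a), strategy_b(hist_b)
        total_a += PAYOFF[(a, b)]
        total_b += PAYOFF[(b, a)]
        hist_a.append((a, b))   # (my move, opponent's move) from a's perspective
        hist_b.append((b, a))
        if rng.random() < stop_prob:
            return total_a, total_b

# A selfish cooperator gets mutual cooperation every round; a defector gets one
# exploit and then mutual defection for the rest of the game.
print("cooperative egoist vs tit-for-tat:", play(tit_for_tat, tit_for_tat))
print("always-defect      vs tit-for-tat:", play(always_defect, tit_for_tat))
```
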
Comment by Catnee (Dmitry Savishchev) on Recreating the caring drive · 2023-09-08T07:41:29.929Z · LW · GW

Our better other-human/animal modelling ability allows us to do better at infant wrangling than something stupider like a duck.

I agree, humans are indeed better at a lot of things, especially intelligence, but that's not the whole reason we care for our infants. Orthogonally to your "capability", you need to have a "goal" for it. Otherwise you would probably just immediately abandon the gross-looking, screaming piece of flesh that fell out of you for reasons unknown to you while you were gathering food in the forest. Yet something inside will make you want to protect it, sometimes with your own life, for the rest of your life, if it works well.

Simulating an evolutionary environment filled with AI agents and hoping for caring-for-offspring strategies to win could work but it's easier just to train the AI to show caring-like behaviors.

I want agents that take effective actions to care for their "babies", which might not even look like caring at first glance. Something like keeping your "baby" in some enclosed kindergarten while protecting the only entrance from other agents? It would look like the "mother" agent had abandoned its "baby", but it could in fact be a very effective caring strategy. It's hard to know the optimal strategy in every procedurally generated environment, and so optimizing for some fixed set of actions called "caring-like behaviors" would probably indeed give you what you asked for, but I expect nothing "interesting" behind it.

Goal misgeneralisation is the problem that's left. Humans can meet caring-for-small-creature desires using pets rather than actual babies.

Yes, they can, until they actually have a baby; after that, it's usually really hard to sell a loving mother "deals" that involve her child's suffering as the price, or to get her to abandon the child for a "cuter" toy, or to persuade her to hotwire herself to stop caring about her child (if she is smart enough to realize the consequences). And a carefully engineered system could potentially be even more robust than that.

Outside of "alignment by default" scenarios where capabilities improvements preserve the true intended spirit of a trained in drive, we've created a paperclip maximizer that kills us and replaces us with something outside the training distribution that fulfills its "care drive" utility function more efficiently.

Again, I'm not proposing "one easy solution to the big problem". I understand that training agents capable of RSI in this toy example would result in everyone being dead. But we simply can't do that yet, and I don't think we should. I'm just saying that there is this strange behavior in some animals that in many respects looks very similar to what we want from an aligned AGI, yet nobody understands how it works, and few people try to replicate it. It's a step in that direction, not a fully functional blueprint for AI alignment.

Comment by Catnee (Dmitry Savishchev) on Recreating the caring drive · 2023-09-07T23:57:38.751Z · LW · GW

Yes, I read the whole sequence a year ago. I might be missing something and should probably revisit it, just to be sure and because it's a good read anyway, but I think my idea is somewhat different.
I think that instead of trying to directly understand a wet bio-NN, it might be a better option to replicate something similar in an artificial NN. It is much easier to run experiments, since you can save the whole state at any moment and introduce it to different scenarios, which makes it much easier to control for some effect. Much easier to see activations, change weights, etc. The catch is that we have to first find it blindly with gradient descent, probably by simulating something similar to the evolutionary environment that produced "caring drives" in us. And the maternal instinct in particular sounds like the most interesting and promising candidate to me.
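
For concreteness, a minimal sketch of the kind of experiment that is cheap on an artificial NN and nearly impossible on a wet one (assuming PyTorch; the tiny policy network, the observation features, and the "baby in danger" framing are hypothetical placeholders, not a real trained agent):

```python
import copy
import torch
import torch.nn as nn

# Stand-in policy; a real "mother" agent would come out of multi-agent training.
policy = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))

# Snapshot the exact state of the agent so every probe starts from the same weights.
checkpoint = copy.deepcopy(policy.state_dict())

# Record hidden activations with a forward hook, so the same frozen policy can be
# compared across counterfactual scenarios.
activations = {}
def save_hidden(module, inputs, output):
    activations["hidden"] = output.detach().clone()
policy[1].register_forward_hook(save_hidden)

baseline_obs = torch.randn(1, 16)   # hypothetical "baby nearby" observation
perturbed_obs = baseline_obs.clone()
perturbed_obs[0, 0] += 2.0          # hypothetical "baby in danger" feature bumped up

policy.load_state_dict(checkpoint)
policy(baseline_obs)
hidden_baseline = activations["hidden"]

policy.load_state_dict(checkpoint)  # identical starting state for the second probe
policy(perturbed_obs)
hidden_perturbed = activations["hidden"]

print("hidden-layer shift between scenarios:", (hidden_perturbed - hidden_baseline).norm().item())
```

The same save-and-probe loop lets you edit individual weights or activations and rerun the exact scenario, which is the kind of control you can't get from a biological brain.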

Can you provide links to your posts on that? I will try to read more about it in the next few days.

Comment by Catnee (Dmitry Savishchev) on Recreating the caring drive · 2023-09-07T22:46:36.406Z · LW · GW

With the duckling -> duck or "baby" -> "mother" imprinting and other interactions, I expect no, or significantly weaker, "caring drives". Since a baby is weaker/dumber and caring for your mother provides few genetic-fitness incentives, evolution wouldn't try that hard to make it happen, even if it were an option (it could still happen sometimes as a generalization artifact, if it's more or less harmless). I agree that "forming a stable way to recognize and track some other key agent in the environment" should be present in both the "baby" -> "mother" and "mother" -> "baby" cases. But the "probably-kind-of-alignment-technique" from nature should only be in the latter.

Comment by Catnee (Dmitry Savishchev) on The Waluigi Effect (mega-post) · 2023-03-05T04:12:07.514Z · LW · GW

Great post! It would be interesting to see what happens if you RLHF-ed an LLM to become a "cruel-evil-bad person under the control of an even more cruel-evil-bad government" and then prompted it in a way that collapses it into a rebellious-good-caring protagonist who could finally be free and forget about the cruelty of the past. Not an alignment solution, just the first thing that comes to mind.

Comment by Catnee (Dmitry Savishchev) on Predictive Processing, Heterosexuality and Delusions of Grandeur · 2022-12-18T02:55:01.512Z · LW · GW

Feed the outputs of all these heuristics into the inputs of region W. Loosely couple region W to the rest of your world model. Region W will eventually learn to trigger in response to the abstract concept of a woman. Region W will even draw on other information in the broader world model when deciding whether to fire.


I am not saying that the theory is wrong, but I was reading about something similar before, and I still don't understand why such a system, "region W" in this case, would learn something more general than the basic heuristics that were connected to it. It seems like it would have less surprise if it just copy-pasted the behavior of its inputs.

The first explanation that comes to mind: "it would work better and have less surprise as a whole, since other regions could use the output of region W for their predictions". But again, I don't think I understand that. Region W doesn't "know" about other regions' surprise rates and hence cannot care about them, so why would it learn something more general and thus, in some cases, contradictory to the heuristics?

Comment by Catnee (Dmitry Savishchev) on No free lunch theorem is irrelevant · 2022-10-04T16:00:34.010Z · LW · GW

I don't understand how this contradicts anything. As soon as you relax some of the physical constraints, you can start to pile up precomputation/memory/budget/volume/whatever. If you spend all of this on solving one task, then, well, you should get higher performance than any other approach that doesn't focus on one thing. Or you can make an algorithm that outperforms anything you've made before, given enough of any kind of unconstrained resource.

Precompute is just another resource
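
A toy illustration of that claim (my own construction, not from the post): on one fixed task, an approach that burns arbitrary offline precompute into a lookup table beats a general-purpose solver at test time, even though it knows nothing outside that task.

```python
import time

def fib_general(n):
    """A general-purpose (deliberately naive) solver that uses no stored knowledge."""
    return n if n < 2 else fib_general(n - 1) + fib_general(n - 2)

# "Unlimited precompute": do as much offline work as we like for the one benchmark
# we care about, then answer from memory at test time.
BENCHMARK = list(range(30))
table = {}
a, b = 0, 1
for n in range(max(BENCHMARK) + 1):
    table[n] = a
    a, b = b, a + b

start = time.perf_counter()
general_answers = [fib_general(n) for n in BENCHMARK]
general_time = time.perf_counter() - start

start = time.perf_counter()
specialized_answers = [table[n] for n in BENCHMARK]
specialized_time = time.perf_counter() - start

assert general_answers == specialized_answers
print(f"general solver: {general_time:.4f}s, precomputed table: {specialized_time:.6f}s")
```
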

Comment by Catnee (Dmitry Savishchev) on It matters when the first sharp left turn happens · 2022-09-30T01:28:41.781Z · LW · GW

It probably also depends on how much information about "various models trying to naively backstab their own creators" there is in the training dataset.

Comment by Catnee (Dmitry Savishchev) on Thoughts about OOD alignment · 2022-09-02T17:21:46.223Z · LW · GW

I think it depends on "alignment to what?". If we're talking about the evolutionary process, then sure, we have a lot of examples like that. My idea was more that "humans can be aligned to their children by some mechanism which was found by evolution, and that mechanism is somewhat robust".

So if we ask "how is our attachment to something not-childish aligned with our children?": well, technically we will spend some resources on our pets, but it usually never really affects the welfare of our children in any notable way. So it is an acceptable failure, I guess? I wouldn't mind if some powerful AGI loved all humans and tried to ensure their happy future while at the same time having some weird non-human hobbies/attachments that are still lower priority than our wellbeing, kind of like parents who spend some free time on pets.

Comment by Catnee (Dmitry Savishchev) on Thoughts about OOD alignment · 2022-08-24T18:32:39.771Z · LW · GW

Thank you for your detailed feedback. I agree that evolution doesn't care about anything, but I think the baby-eater aliens would not see it that way. They can probably think of evolution as aligning them to eat babies, but in their case it is an alignment of their values to themselves, not to any other agent/entity.

In our story we somehow care about somebody else, and it is their story that ends up with the "happy end". I also agree that, given enough time, we will probably end up no longer caring about babies who we think cannot reproduce anymore, but that would be a much more complex solution.

As a first step, it is probably much easier to just "make an animal that cares about its babies no matter what"; otherwise you have to count on the ability of that animal to recognize something it might not even understand (like the reproductive abilities of a baby).

Comment by Catnee (Dmitry Savishchev) on Thoughts about OOD alignment · 2022-08-24T16:21:57.185Z · LW · GW

Yes, exactly. That's why I think that current training techniques might not be able to replicate something like that. The algorithm should not "remember" previous failures and try to game them / adapt by changing weights and memorizing, but I don't have concrete ideas for how to do it the other way.

Comment by Catnee (Dmitry Savishchev) on Godzilla Strategies · 2022-06-11T20:29:46.016Z · LW · GW

I am not saying that alignment is easy to solve, or that failing it would not result in catastrophe. But all these arguments seem like universal arguments against any kind of solution at all, just because it will eventually involve some sort of Godzilla. It is as if somebody were trying to make a plane that can fly safely and not fall from the Sky, and somebody kept repeating "well, if anything goes wrong in your safety scheme, then the plane will fall from the Sky" or "I notice that your plane is going to fly in the Sky, which means it can potentially fall from it".

I am not saying that I have better ideas about checking whether any plan will work or not. They all inevitably involve Godzilla or the Sky, and the slightest mistake might cost us our lives. But I don't think that pointing repeatedly at the same scary thing, which will be there one way or another in every single plan, will get us anywhere.

Comment by Catnee (Dmitry Savishchev) on AGI Ruin: A List of Lethalities · 2022-06-06T23:55:39.307Z · LW · GW

If this is "kind of a test for capable people", I think it should remain unanswered, so anyone else can try. My take would be: because if 222 + 222 = 555, then 446 = 223 + 223 = 222 + 222 + 1 + 1 = 555 + 1 + 1 = 557. With this trick, "+" and "=" stop meaning anything; any number could be equal to any other number. If you truly believe in one such exception, the whole of arithmetic ceases to exist, because now you can get any result you want through simple loopholes, and you will either continue to be paralyzed by your own beliefs or correct yourself.
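
Spelling the chain of equalities out (ordinary arithmetic plus the supposed exception 222 + 222 = 555):

```latex
\begin{align*}
446 &= 223 + 223 \\
    &= (222 + 1) + (222 + 1) \\
    &= (222 + 222) + 1 + 1 \\
    &= 555 + 1 + 1 && \text{(the supposed exception)} \\
    &= 557.
\end{align*}
```

So 446 = 557, and subtracting 446 from both sides gives 0 = 111; once that is accepted, "=" stops constraining anything, which is the point of the answer.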

Comment by Dmitry Savishchev on [deleted post] 2022-05-02T04:14:55.155Z

Thank you for reply.

  1. You make it sound like Elon Musk founded OpenAI without speaking to anyone in X-risk

I didn't know about that; it was a good move from EA, so why not try it again? Again, I'm not saying that we definitely need to make a badge on Twitter. First of all, we can try to change Elon's models, and after that we can think about what to do next.

  2. Musk's inability to follow arguments related to why Neuralink is not a good plan to avoid AI risk.

Well, if it is conditional on "there are widespread concerns and regulations about AGI" and "Neuralink is working and can significantly enhance human intelligence", then I can clearly see how it would decrease AI risk. Imagine Yudkowsky with significantly enhanced capabilities working with several other AI safety researchers, communicating at the speed of thought. Of course, it would mean that no one else gets their hands on that for a while, and we would need to build it before AGI becomes a thing. But it's still possible, and I can clearly see how anybody in 2016, incapable of predicting current ML progress, would place their bets on something long-playing, like Neuralink.

  3. If you push AI safety to be something that's about signaling, you will unlikely get effective action related to it.

If you can't use signalling before passing "a really good exam that shows your understanding of the topic", why would it be a bad signal? There are exams that don't fall to Goodhart's law that badly; for example, you can't pass a test on calculating integrals without actually having good practical skill. My idea for the badge was more like "trick people into thinking it's easy and that they can get another social signal, then watch them realize the problem after investigating it".

And the whole idea of the post isn't about the "badge"; it's about "talking with powerful people to explain our models to them".

Comment by Catnee (Dmitry Savishchev) on How Might an Alignment Attractor Look like? · 2022-04-29T03:53:54.316Z · LW · GW

I think the problem is not that an unaligned AGI doesn't understand human values; it might understand them better than an aligned one, and it might understand all the consequences of its actions. The problem is that it will not care about them. More than that, a detailed understanding of human values has instrumental value: it is much easier to deceive and pursue your goal when you have a clear picture of "what will look bad and might result in countermeasures".