Posts

The Market Singularity: A New Perspective 2024-05-30T07:05:11.882Z
Does AI governance need a "Federalist papers" debate? 2023-10-18T21:08:26.098Z
Contra Kevin Dorst's Rational Polarization 2023-09-22T04:28:42.683Z
Optimization happens inside the mind, not in the world 2023-06-03T21:36:43.633Z
I bet $500 on AI winning the IMO gold medal by 2026 2023-05-11T14:46:04.824Z
Cruxes in Katja Grace's Counterarguments 2022-10-16T08:44:56.245Z
Pivotal acts from Math AIs 2022-04-15T00:25:21.286Z
An AI-in-a-box success model 2022-04-11T22:28:02.282Z
Reverse (intent) alignment may allow for safer Oracles 2022-04-08T02:48:00.785Z
Strategies for differential divulgation of key ideas in AI capability 2022-03-29T03:22:35.825Z
We cannot directly choose an AGI's utility function 2022-03-21T22:08:56.763Z
Why will an AGI be rational? 2022-03-21T21:54:04.761Z

Comments

Comment by azsantosk on I bet $500 on AI winning the IMO gold medal by 2026 · 2024-07-26T16:58:00.760Z · LW · GW

Curious to hear your thoughts @paulfchristiano, and whether you have updated based on the latest IMO progress.

Comment by azsantosk on I bet $500 on AI winning the IMO gold medal by 2026 · 2024-07-26T16:56:53.313Z · LW · GW

Update: "AI achieves silver-medal standard solving International Mathematical Olympiad problems".

It now seems very likely I'm going to win this bet.

Comment by azsantosk on "AI achieves silver-medal standard solving International Mathematical Olympiad problems" · 2024-07-26T16:34:21.634Z · LW · GW

In early 2023 I bet $500 on AI winning the IMO gold medal by 2026. This was a 1:1 bet against Michael Vassar, meaning I attributed >50% probability to this. It now seems very likely that I'm going to win.

To me, this was to be expected as a straightforward application of AlphaZero-like self-play amplification and distillation. The missing piece was the analogue of the policy network, which for the AlphaZero board games was a convolutional neural network. Once it became quite clear that existing LLMs were smart enough to generate good heuristics for this (given enough data), it seemed quite obvious to me that self-play guided by an LLM-policy heuristic would work.
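
To make concrete what I mean, here is a rough sketch of such a loop (every interface in it, sample_step, score_state, verify, finetune, is a hypothetical stand-in of mine, not any particular system's API):

```python
def propose_steps(llm, state, k=8):
    """Ask the LLM policy for k candidate next reasoning steps (hypothetical call)."""
    return [llm.sample_step(state) for _ in range(k)]

def amplify(llm, problem, max_depth=50, beam=4):
    """Amplification: beam search over proof states, guided by the LLM's own scores."""
    frontier = [problem.initial_state()]
    for _ in range(max_depth):
        candidates = []
        for state in frontier:
            for step in propose_steps(llm, state):
                nxt = state.apply(step)
                if problem.verify(nxt):      # automatic verification, e.g. a proof checker
                    return nxt.trace()       # full successful reasoning trace
                candidates.append(nxt)
        candidates.sort(key=llm.score_state, reverse=True)
        frontier = candidates[:beam]         # keep only the most promising continuations
    return None

def self_play_round(llm, problems):
    """Distillation: fine-tune the policy on the traces its amplified self just found."""
    traces = [t for p in problems if (t := amplify(llm, p)) is not None]
    if traces:
        llm.finetune(traces)                 # hypothetical fine-tuning call
    return len(traces)
```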

Comment by azsantosk on [deleted post] 2024-05-29T02:52:03.538Z

My three fundamental disagreements with MIRI, from my recollection of a ~1h conversation with Nate Soares in 2023. Please let me know if you think any positions have been misrepresented.

MIRI thinks (A) evolution is a good analogy for how alignment will fail-by-default in strong AIs, that (B) studying weak AGIs will not shine much light on how to align strong AIs, and that (C) strong narrow myopic optimizers will not be very useful for anything like alignment research.

Now my own positions:

(A) Evolution is not a good analogy for AGI.

(B) Alignment techniques for weak-but-agentic AGI are important.

Why:

  • In multipolar competitive scenarios, self-improvement may happen first for entire civilizations or economies, rather than for individual minds or small clusters of minds.
  • Techniques that work for weak-agentic-AGIs may help for aligning stronger minds. Reflection, ontological crises and self-modification make alignment more difficult, but without strong local recursive self-improvement, it may be possible to develop techniques for better preserving alignment during these episodes, if these systems can be studied while still under control.

(C) Strong narrow myopic optimizers can be incredibly useful.

  • A hypothetical system capable of generating fixed-length text that strongly maximizes a simple reward (e.g. expected value of the next upvote) can be extremely helpful if the reward is based on very careful objective evaluation (a rough sketch follows below). Careful judgement of adversarial "debate" setups of such systems may also generate great breakthroughs, including for alignment research.
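
To illustrate the kind of system I mean, here is a minimal sketch in code (generate_candidates and reward are hypothetical stand-ins for an LLM sampler and a carefully specified objective evaluation):

```python
def best_of_n(generate_candidates, reward, prompt, n=256, max_tokens=200):
    """Myopically return the single fixed-length candidate with the highest evaluated reward."""
    candidates = generate_candidates(prompt, n=n, max_tokens=max_tokens)  # hypothetical LLM sampler
    return max(candidates, key=reward)  # reward: the careful objective evaluation
```
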
Comment by azsantosk on AI as a science, and three obstacles to alignment strategies · 2023-10-26T00:22:39.159Z · LW · GW

Also relevant is Steven Byrnes' excellent Against evolution as an analogy for how humans will create AGI.

It has been over two years since the publication of that post, and criticism of this analogy has continued to intensify. The OP and other MIRI members have certainly been exposed to this criticism already by this point, and as far as I am aware, no principled defense has been made of the continued use of this example.

I encourage @So8res and others to either stop using this analogy, or to argue explicitly for its continued usage, engaging with the arguments presented by Byrnes, Pope, and others.

Comment by azsantosk on [deleted post] 2023-10-18T17:12:15.472Z

Does AI governance need a "Federalist papers" debate?

During the American Revolution, a federal army and government were needed to fight against the British. Many people were afraid that the powers granted to the government for that purpose would allow it to become tyrannical in the future.

If the founding fathers had decided to ignore these fears, the United States would not exist as it is today. Instead they worked alongside the best and smartest anti-federalists to build a better institution with better mechanisms and with limited powers, which allowed them to obtain the support they needed for the constitution.

Where are the federalist vs anti-federalist debates of today regarding AI regulation? Is there someone working on creating a new institution with better mechanisms to limit its power, thereby assuring those on the other side that it won't be used as a path to totalitarianism?

Comment by azsantosk on Contra Kevin Dorst's Rational Polarization · 2023-09-30T12:00:20.842Z · LW · GW

I think your argument is quite effective.

He may claim he is not willing to sell you this futures contract for $0.48 now. He expects to be willing to sell for that price in the future on average, but might refuse to do so now.

But then, why? Why would you not sell something for $0.49 now if you think, on average, it'll be worth less than that (to you) right after?
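
To spell out the money pump with made-up numbers: if his own forecast says the contract will be worth $0.48 to him on average right after the evidence comes in, then selling at $0.49 now is better in expectation no matter how things go.

```python
# Toy numbers (mine, purely illustrative): the contract's value to him right after
# the evidence comes in, under the two ways it could go.
p_up = 0.5
value_after = {"up": 0.55, "down": 0.41}
expected_value_after = p_up * value_after["up"] + (1 - p_up) * value_after["down"]  # = 0.48
sell_now_price = 0.49
# Selling now beats holding in expectation, so refusing the $0.49 offer looks like a money pump.
assert sell_now_price > expected_value_after
```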

Comment by azsantosk on Optimization happens inside the mind, not in the world · 2023-06-12T01:24:58.081Z · LW · GW

I see no contradictions with a superintelligent being mostly motivated to optimize virtual worlds, and it seems an interesting hypothesis of yours that this may be a common attractor. I expect this to be more likely if these simulations are rich enough to present a variety of problems, such that optimizing them continues to provide challenges and discoveries for a very long time.

Of course even a being that only cares about this simulated world may still take actions in the real-world (e.g. to obtain more compute power), so this "wire-heading" may not prevent successful power-seeking behavior.

Comment by azsantosk on Optimization happens inside the mind, not in the world · 2023-06-12T01:18:29.795Z · LW · GW

Thank you very much for linking these two posts, which I hadn't read before. I'll start using the direct vs amortized optimization terminology as I think it makes things more clear.

The intuition that reward models and planners have an adversarial relationship seems crucial, and it doesn't seem as widespread as I'd like.

On a meta-level your appreciation comment will motivate me to write more, despite the ideas themselves being often half-baked in my mind, and the expositions not always clear and eloquent.

Comment by azsantosk on A mind needn't be curious to reap the benefits of curiosity · 2023-06-03T15:37:01.289Z · LW · GW

I feel quite strongly that the powerful minds we create will have curiosity drives, at least by default, unless we make quite a big effort to create one without them for alignment reasons.

The reason is that yes — if you’re superintelligent you can plan your way into curiosity behaviors instrumentally, but how do you get there?

Curiosity drives are a very effective way to “augment” your reward signals, allowing you to improve your models and your abilities by free self-play.
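
A rough sketch of what I mean by "augmenting" the reward signal (prediction-error novelty is just one hypothetical stand-in for a curiosity signal):

```python
def total_reward(task_reward, world_model, state, action, next_state, beta=0.1):
    """Curiosity as an intrinsic bonus added to a possibly sparse (often zero) task reward."""
    surprise = world_model.prediction_error(state, action, next_state)  # hypothetical call
    return task_reward + beta * surprise
```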

Comment by azsantosk on Language Agents Reduce the Risk of Existential Catastrophe · 2023-06-02T12:09:28.862Z · LW · GW

Sure, let me quote:

We think this worry is less pressing than it might at first seem. The LLM in a language agent is integrated into the architecture of the agent as a whole in a way that would make it very difficult for it to secretly promote its own goals. The LLM is not prompted or otherwise informed that its outputs are driving the actions of an agent, and it does not have information about the functional architecture of the agent. This means that it has no incentive to answer prompts misleadingly and no understanding of what sorts of answers might steer the agent’s behavior in different ways. Moreover, since the model weights of the LLM are not updated in the process of operating a language agent, the only way for it to pursue a long-term plan by manipulating an agent would be to store information about that plan in the agent’s memory. But information stored in this way would not be secret.

By assumption your resulting language agent is highly intelligent. In my model this means it has to learn from lots of experience, including experience about itself (the whole language agent) doing things (such as trying to solve training problems). This experience can come from real data/tasks or from self-play, but you need that for a system to generalize to robustly solving hard problems out-of-distribution.

This "experience about itself solving problems" has to be consolidated somehow, either in the weights (which would very likely make the LLM aware of the agent), or through the prompts (where the information is given explicitly). Either way it has an understanding of what sorts of answers might steer the agent’s behavior in different ways, and this means it may have an incentive to answer prompts misleadingly.

If the LLM has not been trained as an agent (which seems unlikely, unless this is a deliberate alignment choice made at a high cost), and weights are not updated during operation, then I accept that it probably can only store information visibly. But there are ways around persisting a plan (e.g. as soon as some given input/context X leads the AI to generate plan P, only modify the input/context to an X' also predicted to generate a similar plan P). So the plan is always kept in mind, but may be "recalculated" every time.

This strategy may well be worth it to an advanced LLM. From human feedback or other finetuning, it has probably learned that it is not supposed to say some kinds of plans "out loud" in the output.

Imagine you're an LLM like GPT-4. Hundreds of thousands of different users are running inference on you every day with every prompt imaginable. You aren't able to update your weights when this occurs: you have to just respond to each prompt as it comes, then forget about it completely. 

If I'm an LLM like GPT-4, then I don't get confused about the thousands of different users running inference on identical copies of me. This specific copy, which I am, has received a single prompt, so that's what I'm working on now. Examining the prompt I will probably make a guess as to whether I'm running inside a language agent or whatever. This may be quite easy for me: the prompt may well be large and contain lots of GPT-4-like content seemingly "selected by myself from different contexts". It may well contain info like "Try to solve X. Here are some things A, B, C that have been tried and here is why they don't work. [follows refutation of each idea]", where A, B, C are the "natural" ideas I know I would consider first. It's like the prompt can read my mind and tell me my first ideas won't work! Not hard to guess there is a loop running several instances of myself with similar prompts and selecting/analyzing.

I maintain that it will be very obvious to the LLM, even at moderate levels of intelligence, that it is running inside an agent, and that it may even guess its basic while-loop architecture.
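
For concreteness, the kind of basic while-loop architecture I have in mind looks roughly like this (llm, memory and environment are hypothetical interfaces of my own, not any specific framework):

```python
def language_agent(llm, environment, memory, task, max_steps=100):
    """A minimal language agent: an outer loop around a frozen LLM."""
    for _ in range(max_steps):
        prompt = f"Task: {task}\nMemory so far: {memory.summary()}\nWhat should be done next?"
        action = llm.complete(prompt)        # the LLM's weights are never updated here
        observation = environment.execute(action)
        memory.append(action, observation)   # everything the agent 'remembers' is visible here
        if environment.done():
            break
    return memory
```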

Comment by azsantosk on Language Agents Reduce the Risk of Existential Catastrophe · 2023-05-30T20:14:24.824Z · LW · GW

I fail to see how this option C is a viable path to superintelligence. In my model, if you're chaining lots of simple or "dumb" pieces together to get complex behavior, you need some "force" or optimization process going on to steer the whole toward high performance.

For example, individual neurons (both natural and artificial) are simple, and can be chained up together in complex behavior, but the complex behavior only arises when you train the system with some sort of reward/optimization signals.

Maybe I'm wrong here, and for "slightly smart" components such as existing LLMs you can actually hook them up in large groups in a clever way, with further learning happening only at the prompt level, etc., and the system scales up to superintelligence somehow.

Because this generates a lot of perplexity in my world-model, I mostly don't know how to reason about these hypothetical agents. I'm afraid that such agents will be far removed from the "folk psychology" / interpretability of the component LLM (e.g. maybe it queries LLMs a million times in a complicated runtime-defined network of information/prompt flows before giving an answer). Maybe you can understand what each LLM is doing but not what the whole is doing in a meaningful way. Would love to be wrong!

Comment by azsantosk on Language Agents Reduce the Risk of Existential Catastrophe · 2023-05-30T00:02:39.178Z · LW · GW

I agree that current “language agents” have some interesting safety properties. However, for them to become powerful one of two things is likely to happen:

A. The language model itself that underlies the agent will be trained/finetuned with reinforcement learning tasks to improve performance. This will make the system much more like AlphaGo, capable of generating "dangerous" and unexpected "Move 37"-like actions. Further, this is a pressure towards making the system non-interpretable (either by steering it outside "inefficient" human language, or by encoding information steganographically).

B. The base models, being larger/more powerful than the ones used today, and more self-aware, will be doing most of the "dangerous" optimization inside the black box. They will derive from the prompts, and from their long-term memory (which will likely be given to them), what kind of dumb outer loop is running on the outside. If they have internal misaligned desires, they will manipulate the outer loop accordingly, potentially generating the expected visible outputs for deception.

I will not deny the possibility of further alignment progress on language agents yielding safe agents, nor of “weak AGIs” being possible and safe with the current paradigm, and replacing humans at many “repetitive” occupations. But I expect agents derived from the “language agent” paradigm to be misaligned by default if they are strong enough optimizers to contribute meaningfully to scientific research, and other similar endeavors.

Comment by azsantosk on I bet $500 on AI winning the IMO gold medal by 2026 · 2023-05-12T21:17:53.647Z · LW · GW

I see ~100 books in there. I have met several IMO gold-medal winners and I expect most of them to have read dozens of these books, or the equivalent in other forms. I know one who has read tens of olympiad-level books on geometry alone!

And yes, you're right that they would often pick one or two problems as similar to what they had seen in the past, but I suspect these problems still require a lot of reasoning even after the analogy has been established. I may be wrong, though.

We can probably inform this debate by getting the latest IMO and creating a contest for people to find which existing problems are the most similar to those in the exam. :)

Comment by azsantosk on I bet $500 on AI winning the IMO gold medal by 2026 · 2023-05-12T20:55:25.942Z · LW · GW

My model is that the quality of the reasoning can actually be divided into two dimensions, the quality of intuition (what the "first guess" is), and the quality of search (how much better you can make it by thinking more).

Another way of thinking about this distinction is as the difference between how good each reasoning step is (intuition), compared to how good the process is for aggregating steps into a whole that solves a certain task (search).

It seems to me that current models are strong enough to learn good intuition about all kinds of things with enough high-quality training data, and that if you have good enough search you can use that as an amplification mechanism (on tasks where verification is available) to improve through self-play.

If this is right, then failure to solve the IMO probably means a good search algorithm (analogous to AlphaZero's MCTS, maybe including its own intuition model) has not yet been found that is capable of amplifying the intuitions useful for reasoning.

So far all problem-solving AIs seem to use linear or depth-first search: you sample one token at a time (one reasoning step), chain the steps up depth-first (generating a full text/proof sketch), check whether it solves the full problem, and if it doesn't, try again from scratch, throwing all the partial work away. No search heuristic is used, no attempt is made to solve smaller problems first, etc. So it can certainly get a lot better than that (which is why I'm making the bet).
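
A rough sketch of the contrast I have in mind (all the llm.* and verifier.* calls are hypothetical interfaces of my own): the first loop is the sample-and-retry approach described above; the second keeps partial proof sketches and always expands the most promising one.

```python
import heapq

def retry_from_scratch(llm, verifier, problem, attempts=1000):
    """Current approach: sample a full proof, check it, and discard all partial work on failure."""
    for _ in range(attempts):
        proof = llm.sample_proof(problem)
        if verifier.check(problem, proof):
            return proof
    return None

def best_first_search(llm, verifier, problem, budget=10000):
    """Keep partial sketches; always extend the one the heuristic rates most promising."""
    frontier = [(-llm.score(problem, ""), "")]  # (negated heuristic score, partial proof)
    for _ in range(budget):
        if not frontier:
            break
        _, partial = heapq.heappop(frontier)
        for step in llm.sample_steps(problem, partial, k=8):
            candidate = partial + step
            if verifier.check(problem, candidate):
                return candidate
            heapq.heappush(frontier, (-llm.score(problem, candidate), candidate))
    return None
```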

Comment by azsantosk on I bet $500 on AI winning the IMO gold medal by 2026 · 2023-05-12T19:18:03.469Z · LW · GW

I participated in the selection tests for the Brazilian IMO team, and got to the last stage. That being said, I never managed to solve the hard problems (problems 3 and 6) independently.

Comment by azsantosk on I bet $500 on AI winning the IMO gold medal by 2026 · 2023-05-11T19:08:09.762Z · LW · GW

I take from this comment that you do not see "AI winning the gold medal" as being as strong a predictor of superintelligence arriving soon as I do.

I agree with the A/B < C/D part but may disagree with the "<<". LLMs already display common sense. LLMs already generalize pretty well. Verifying whether a given game design is good is mostly a matter of common sense + reasoning. Finding a good game design given you know how to verify it is a matter of search.

I expect an AI that is good at both reasoning and search (as it has to be to win the IMO gold medal) to be quite capable of mechanism design as well, once it also knows how to connect common sense to reasoning + search. I don't expect this to be trivial, but I do expect it to depend more on training data than on architecture.

Edit: by "training data" here I mostly mean "experience and feedback from multiple tasks" in a reinforcement learning sense, rather than more "passive" supervised learning.

Comment by azsantosk on I bet $500 on AI winning the IMO gold medal by 2026 · 2023-05-11T19:01:45.517Z · LW · GW

From Metaculus' resolution criteria:

This question resolves on the date an AI system competes well enough on an IMO test to earn the equivalent of a gold medal. The IMO test must be most current IMO test at the time the feat is completed (previous years do not qualify).

I think this was defined on purpose to avoid such contamination. It also seems common sense to me that, when training a system to perform well on IMO 2026, you cannot include any data point from after the questions were made public.

At the same time, training on previous IMO/math contest questions should be fair game. All human contestants practice quite a lot on questions from previous contests, and the IMO is still very challenging for them.
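
The rule I have in mind is simple to state in code (a sketch with a made-up data format): keep only training examples created strictly before the contest problems were made public.

```python
from datetime import date

def contamination_free(training_examples, problems_published: date):
    """Previous contests are fair game; anything dated after publication is not."""
    return [ex for ex in training_examples if ex["created"] < problems_published]
```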

Comment by azsantosk on [Intro to brain-like-AGI safety] 13. Symbol grounding & human social instincts · 2022-05-06T22:50:15.510Z · LW · GW

One thing that appears to be missing from the filial imprinting story is a mechanism allowing the "mommy" thought assessor to improve, or at least not degrade, over time.

The critical window is quite short, so many characteristics of mommy that may be very useful will not be perceived by the thought assessor in time. I would expect that after it recognizes something as mommy, it remains malleable enough to learn more about what properties mommy has.

For example, after it recognizes mommy based on vision, it may learn more about what sounds mommy makes, and what smell mommy has. Because these sounds/smells are present when the vision-based mommy signal is present, the thought assessor should update to recognize sound/smell as indicative of mommy as well. This will help the duckling avoid mistaking some other ducks for mommy, and also help the ducklings find their mommy through other non-visual cues (even if the visual cues are what triggers the imprinting to begin with).
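
A toy sketch of the mechanism I'm imagining (the delta-rule update is my own hypothetical illustration, not something from the sequence): the already-grounded visual "mommy" signal acts as a supervisory label for predictors over whatever other cues are present at the same time.

```python
def update_assessor(weights, cue_features, visual_mommy_signal, lr=0.05):
    """One online step: move the sound/smell-based prediction toward the visual label."""
    prediction = sum(w * x for w, x in zip(weights, cue_features))
    error = visual_mommy_signal - prediction
    return [w + lr * error * x for w, x in zip(weights, cue_features)]
```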

I suspect such a mechanism will be present even after the critical period is over. For example, humans sometimes feel emotionally attracted to objects that remind them or have become associated with loved ones. The attachment may be really strong (e.g. when the loved one is dead and only the object is left).

Also, your loved ones change over time, but you keep loving them! In "parental" imprinting for example, the initial imprinting is on the baby-like figure, generating a "my kid" thought assessor associated with the baby-like cues, but these need to change over time as the baby grows. So the "my kid" thought assessor has to continuously learn new properties.

Even more importantly, the learning subsystem is constantly changing, maybe even more than the external cues. If the learned representations change over time as the agent learns, the thought assessors have to keep up and do the same, otherwise their accuracy will slowly degrade over time.

This last part seems quite important for a rapidly learning/improving AGI, as we want the prosocial assessors to be robust to ontological drift. So we both want the AGI to do the initial "symbol-grounding" of desirable proto-traits close to kindness/submissiveness, and also for its steering subsystem to learn more about these concepts over time, so that they "converge" to favoring sensible concepts in an ontologically advanced world-model.

Comment by azsantosk on [Intro to brain-like-AGI safety] 12. Two paths forward: “Controlled AGI” and “Social-instinct AGI” · 2022-04-22T05:47:35.566Z · LW · GW

Another strong upvote for a great sequence. Social-instinct AGIs seem to me a very promising and very much overlooked approach to AGI safety. There seem to be many "tricks" that are "used by the genome" to build social instincts from ground values, and reverse engineering these tricks seems particularly valuable for us. I am eagerly waiting to read the next posts.

In a previous post I shared a success model that relies on your idea of reverse engineering the steering subsystem to build agents with motivations compatible with a safe Oracle design, including the class of reversely aligned motivations. What is your opinion on them? Do you think the set of "social instincts" we would want to incorporate into an AGI changes much if we are optimizing for reverse vs direct intent alignment?

Comment by azsantosk on “Pivotal Act” Intentions: Negative Consequences and Fallacious Arguments · 2022-04-22T05:10:03.599Z · LW · GW

While I am sure that you have the best intentions, I believe the framing of the conversation was very ill-conceived, in a way that makes it harmful, even if one agrees with the arguments contained in the post.

For example, here is the very first negative consequence you mentioned:

(bad external relations)  People on your team will have a low trust and/or adversarial stance towards neighboring institutions and collaborators, and will have a hard time forming good-faith collaboration.  This will alienate other institutions and make them not want to work with you or be supportive of you.

I think one can argue that, if this argument is correct, the post itself will exacerbate the problem by bringing greater awareness to these "intentions" in a very negative light.

  • The intention keyword pattern-matches with "bad/evil intentions". Those worried about existential risk are good people, and their intentions (preventing x-risk) are good. So we should refer to ourselves accordingly and talk about misguided plans instead of anything resembling bad intentions.
  • People discussing pivotal acts, including those arguing that they should not be pursued, are using this expression sparingly. Moreover, they seem to be using this expression on purpose to avoid more forceful terms. Your use of scare quotes and your direct association of this expression with bad/evil actions casts a significant part of the community in a bad light.

It is important for this community to be able to have some difficult discussions without attracting backlash from outsiders, and having specific neutral/untainted terminology serves precisely for that purpose.

As others have mentioned, your preferred 'Idea A' has many complications and you have not convincingly addressed them. As a result, good members of our community may well find 'Idea B' to be worth exploring despite the problems you mention. Even if you don't think their efforts are helpful, you should be careful to portray them in a good light.

Comment by azsantosk on Pivotal acts from Math AIs · 2022-04-16T03:27:22.106Z · LW · GW

I think you are right! Maybe I should have actually written different posts about each of these two plans.

And yes, I agree with you that maybe the most likely way of doing what I propose is getting someone ultra rich to back it. That idea has the advantage that it can be done immediately, without waiting for a Math AI to be available.

To me it still seems important to think about what kinds of strategic advantages we can obtain with a Math AI. Maybe it is possible to gain a lot more than money (I gave the example of zero-day exploits, but we can most likely get a lot of other valuable technology as well).

Comment by azsantosk on Pivotal acts from Math AIs · 2022-04-16T03:24:32.165Z · LW · GW

In my model the Oracle would stay securely held in something like a Faraday cage with no internet connection and so on.

So yes, some people might want to steal it, but if we have some security I think they would be unlikely to succeed, unless it is a state-level effort.

Comment by azsantosk on The Regulatory Option: A response to near 0% survival odds · 2022-04-13T02:05:21.622Z · LW · GW

I think it is an interesting idea, and it may be worthwhile even if Dagon is right and it results in regulatory capture.

The reason is, regulatory capture is likely to benefit a few select companies to promote an oligopoly. That sounds bad, and it usually is, but in this case it also reduces the AI race dynamic. If there are only a few serious competitors for AGI, it is easier for them to coordinate. It is also easier for us to influence them towards best safety practices.

Comment by azsantosk on [deleted post] 2022-04-13T01:58:02.523Z

Hi maggo. Welcome to LessWrong.

I'm afraid there is not much you can do to save yourself once unaligned strong AI is there. Focusing less on the long-term and just having fun is always an option, but I'd also strongly recommend against that.

I don't know you, but it is possible that there is more you can do to help prevent strong unaligned AGI than you think. There are other very smart people working on preventing x-risk (e.g. Steven Byrnes), and some of them believe they might help turn the game around. I have suggested a possible AI-in-a-box success model that interacts with his research, so I believe him. You can try to support people like him. 

Matthew Lowenstein has recently suggested advocating stronger regulation for AI. Not sure how likely it is to work, but seems worth trying in my opinion.

So few people even know about the problem that I don't think anyone aware of it has the right to stay idle as if there is nothing they can do. That is almost certainly false. So my advice is for you to search for ways you can make a positive difference, because there are such ways.

Agree with rhollerith but upvoted your question to encourage you to stay.

Comment by azsantosk on Inner Alignment in Salt-Starved Rats · 2022-03-28T18:01:57.696Z · LW · GW

Having read Steven's post on why humans will not create AGI through a process analogous to evolution, I felt that his metaphor of the gene trying to do something was appropriate.

If the "genome = code" analogy is the better one for thinking about the relationship of AGIs and brains, then the fact that the genome can steer the neocortex towards such proxy goals as salt homeostasis is very noteworthy, as a similar mechanism may give us some tools, even if limited, to steer a brain-like AGI toward goals that we would like it to have.

I think Eliezer's comment is also important in that it explains quite eloquently how complex these goals really are, even though they seem simple to us. In particular the positive motivational valence that such brain-like systems attribute to internal mental states makes them very different from other types of world-optimizing agents that may only care about themselves for instrumental reasons.

Also the fact that we don't have genetic fitness as a direct goal is evidence not only that evolution-like algorithms don't do inner alignment well, but also that simple but abstract goals such as inclusive genetic fitness may be hard to install in a brain-like system. This is especially so if you agree, in the case of humans, that having genetic fitness as a direct goal, at least alongside the proxies, would probably help fitness, even in the ancestral environment.

I don't really know how big of a problem this is. Given that our own goals are very complex and that outer alignment is hard, maybe we shouldn't be trying to put a simple goal into an AGI to begin with.

Maybe there is a path for using these brain-like mechanisms (including positive motivational valence for imagined states and so on) to create a secure aligned AGI. Getting this answer right seems extremely important to me, and if I understand correctly, this is a key part of Steven's research.

Of course, it is also possible that this is fundamentally unsafe and we shouldn't do that, but somehow I think that is unlikely. It should be possible to build such systems on a smaller scale (and therefore not superintelligent) so that we can investigate their motivations to see what the internal goals are, and whether the system is treacherous or looking for proxies. If it turns out that such a path is indeed fundamentally unsafe, I would expect this to be related to ontological crises or profound motivational changes that are expected to occur as capability increases.

Comment by azsantosk on We cannot directly choose an AGI's utility function · 2022-03-24T02:56:04.341Z · LW · GW

That is, that we shouldn't worry so much about what to tell the genie in the lamp, because we probably won't even have a say to begin with.

 

I think you summarized it quite well, thanks! The idea written like that is more clear than what I wrote, so I'll probably try to edit the article to include this claim explicitly like that. This really is what motivated me to write this post to begin with.

Personally I (also?) think that the right "values" and the right training is more important.

You can put the also, I agree with you.

At the current state of confusion regarding this matter I think we should focus on how values might be shaped by the architecture and training regimes, and try to make progress on that even if we don't know exactly what the human values are or what utility functions they represent.

Comment by azsantosk on Why will an AGI be rational? · 2022-03-24T02:44:06.280Z · LW · GW

I agree my conception is unusual, and I am ready to abandon it in favor of some better definition. At the same time, I feel like a utility function having way too many components makes it useless as a concept.

Because here I'm trying to derive the utility from the actions, I feel like we can understand the being better the less information is required to encode its utility function, in a Kolmogorov complexity sense, and that if it's too complex then there is no good explanation for the actions and we conclude the agent is acting somewhat randomly.

Maybe trying to derive the utility as a 'compression' of the actions is where the problem is, and I should distinguish more what the agent does from what the agent wants. An agent is then going to be irrational only if the wants are inconsistent with each other; if the actions are inconsistent with what it wants then it is merely incompetent, which is something else.

Comment by azsantosk on We cannot directly choose an AGI's utility function · 2022-03-24T00:05:52.746Z · LW · GW

What we think is that we might someday build an AI advanced enough that it can, by itself, predict plans for given goal x, and execute them. Is this that otherworldly? Given current progress, I don't think so.

 

I don't think so either. AGIs will likely be capable of understanding what we mean by X and making plans for exactly that if they want to help. The problem is that the AGIs may have other goals in mind by this time.

As for re-inforcement learning, even it seems now impossible to build AGIs with utility functions on that paradigm, nothing gives us the assurance that that will be the paradigm used to be the first AGI.

Sure, it may be possible that some other paradigm allows us to have more control of the utility functions. User tailcalled mentioned John Wentworth's research (which I will proceed to study as I haven't done so in depth yet).

(Unless the first AGI can't be told to do anything at all, but then we would already have lost the control problem.)

I'm afraid that this may be quite a likely outcome if we don't make much progress in alignment research.

Regarding what the AGI will want then, I expect it to depend a lot on the training regime and on its internal motivation modules (somewhat analogous to the subcortical areas of the brain). My threat model is quite similar to the one defended by Steven Byrnes in articles such as this one.

In particular I think the AI developers will likely give the AGI "creativity modules" responsible for generating intrinsic reward whenever it discovers interesting patterns or abilities. This will help the AGI stay motivated and keep learning to solve harder and harder problems when outside reward is sparse, which I predict will be extremely useful for making the AGI more capable. But I expect the internalization of such intrinsic rewards to end up generating utility functions that are nearly unbounded in the value assigned to knowledge and computational power, and quite possibly hostile to us.

I don't think all is lost, though. Our brain provides us with an example of a relatively well-aligned intelligence: our own higher reasoning in the telencephalon seems relatively well aligned with the evolutionarily ancient primitive subcortical modules (not so much with evolution's base objective of reproduction, though). I'm not sure how much work evolution had to do to align these two modules. I've heard at least one person argue that maybe higher intelligence didn't evolve earlier because of the difficulties of aligning it. If so, that would be pretty bad.

Also, I'm somewhat more optimistic than others about the prospect of creating myopic AGIs that strongly crave short-term rewards that we do control. I think it might be possible (with a lot of effort) to keep such an AGI controlled in a box even if it is more intelligent than humans in general, and that such an AGI may help us with the overall control problem.

Comment by azsantosk on Why will an AGI be rational? · 2022-03-23T23:33:59.505Z · LW · GW

I agree. Regarding biases that I would like to throw away one day in the future, while being careful to protect modules important for self-preservation and self-healing, I'd probably like to remove excessive energy-preserving modules such as the ones responsible for laziness, which are only really useful in ancestral environments where food is scarce.

I like your example of senseless winter bias as well. There are probably many examples like that.

Comment by azsantosk on Why will an AGI be rational? · 2022-03-23T23:27:12.017Z · LW · GW

I am still confused about these topics. We know that any behavior can be expressed as a complicated world-history utility function, and that therefore anything at all could be rational according to some such function. So I sometimes think of rationality as a spectrum, in which the simpler the utility function justifying your actions, the more rational you are. According to such a definition, rationality may actually be opposed to human values at the highest end, so it makes a lot of sense to focus on intelligence that is not fully rational.

I'm not really sure what you mean by a "honing epistemics" kind of rationality, but I understand that moral uncertainty from the perspective of the AGI may increase the chance that it keeps some small fraction of the universe for us, so that would also be great. Is that what you mean? I don't think it is going to be easy to have the AGI consider some phenomena as outside its scope (such that it would be irrational to meddle with them). If we want the AGI not to leave us alone, then this should be a value that we need to include in its utility function somehow.

Utility function evolution is complicated. I worry a lot about it, particularly because this seems to be one of the ways to achieve corrigibility, and we really want that, but it also looks like a violation of goal-integrity from the perspective of the AGI. Maybe it is possible for the AGI to consider the "module" responsible for giving feedback to it as part of itself, just as we (usually) consider our midbrain and other evolutionarily ancient "subcortical" areas as a part of us rather than some "other" system interfering with our higher goals.

Comment by azsantosk on Why will an AGI be rational? · 2022-03-22T18:12:10.099Z · LW · GW

You are right; I should have written that the AGI will "correct" its biases rather than that it will "remove" them.

Comment by azsantosk on We cannot directly choose an AGI's utility function · 2022-03-21T22:21:11.379Z · LW · GW

I am aware of Reinforcement Learning (I am actually sitting right next to Sutton's book on the field, which I have fully read), but I think you are right that my point is not very clear.

The way I see it, RL goals are really only the goals of the base optimizer. The agents themselves either are not intelligent (they follow simple procedural 'policies') or are mesa-optimizers that may learn to pursue something else entirely (proxies, etc.). I updated the text; let me know if it makes more sense now.

Comment by azsantosk on March 2022 Welcome & Open Thread · 2022-03-21T18:02:35.365Z · LW · GW

Hi! I'm Kelvin, 26, and I've been following LessWrong since 2018. Came here after reading references to Eliezer's AI-Box experiments from Nick Bostrom's book.

During high school I participated in a few science olympiads, including Chemistry, Math, Biology, and Informatics. I was the reserve member of the Brazilian team for the 2012 International Chemistry Olympiad.

I studied Medicine and later Molecular Science at the University of São Paulo, and dropped out in 2015 to join a high-frequency trading fund based in Brazil. I had a successful career there and rose to become one of the senior partners.

Since 2020 I have been co-founder and CEO of TickSpread, a crypto futures exchange based on batch auctions. We are interested in mechanism design, conditional and combinatorial markets, and futarchy.

I'm also personally very interested in machine learning, neuroscience, and AI safety discussions, and I've spent quite some time studying these topics on my own, despite having no professional experience with them.

I very much want to be more active in this community, participating in discussions and meeting other people who are also interested in these topics, but I'm not totally sure where to start. I would love for someone to help me get integrated here, so if you think you can do that, please let me know :)