Posts

Infra-Bayesianism Unwrapped 2021-01-20T13:35:03.656Z
Against the Backward Approach to Goal-Directedness 2021-01-19T18:46:19.881Z
Literature Review on Goal-Directedness 2021-01-18T11:15:36.710Z
The Case for a Journal of AI Alignment 2021-01-09T18:13:27.653Z
Postmortem on my Comment Challenge 2020-12-04T14:15:41.679Z
[Linkpost] AlphaFold: a solution to a 50-year-old grand challenge in biology 2020-11-30T17:33:43.691Z
Small Habits Shape Identity: How I became someone who exercises 2020-11-26T14:55:57.622Z
What are Examples of Great Distillers? 2020-11-12T14:09:59.128Z
The (Unofficial) Less Wrong Comment Challenge 2020-11-11T14:18:48.340Z
Why You Should Care About Goal-Directedness 2020-11-09T12:48:34.601Z
The "Backchaining to Local Search" Technique in AI Alignment 2020-09-18T15:05:02.944Z
Universality Unwrapped 2020-08-21T18:53:25.876Z
Goal-Directedness: What Success Looks Like 2020-08-16T18:33:28.714Z
Mapping Out Alignment 2020-08-15T01:02:31.489Z
Will OpenAI's work unintentionally increase existential risks related to AI? 2020-08-11T18:16:56.414Z
Analyzing the Problem GPT-3 is Trying to Solve 2020-08-06T21:58:56.163Z
What are the most important papers/post/resources to read to understand more of GPT-3? 2020-08-02T20:53:30.913Z
What are you looking for in a Less Wrong post? 2020-08-01T18:00:04.738Z
Dealing with Curiosity-Stoppers 2020-07-30T22:05:02.668Z
adamShimi's Shortform 2020-07-22T19:19:27.622Z
The 8 Techniques to Tolerify the Dark World 2020-07-20T00:58:04.621Z
Locality of goals 2020-06-22T21:56:01.428Z
Goal-directedness is behavioral, not structural 2020-06-08T23:05:30.422Z
Focus: you are allowed to be bad at accomplishing your goals 2020-06-03T21:04:29.151Z
Lessons from Isaac: Pitfalls of Reason 2020-05-08T20:44:35.902Z
My Functor is Rich! 2020-03-18T18:58:39.002Z
Welcome to the Haskell Jungle 2020-03-18T18:58:18.083Z
Lessons from Isaac: Poor Little Robbie 2020-03-14T17:14:56.438Z
Where's the Turing Machine? A step towards Ontology Identification 2020-02-26T17:10:53.054Z
Goal-directed = Model-based RL? 2020-02-20T19:13:51.342Z

Comments

Comment by adamshimi on Against the Backward Approach to Goal-Directedness · 2021-01-20T15:08:46.403Z · LW · GW

Yep, we seem to agree.

It might not be clear from the lit review, but I personally don't agree with all the intuitions, or at least not completely. And I definitely believe that a definition that throws away some of the intuitions but applies to AI risk arguments is totally fine. It's more that I believe the gist of these intuitions is pointing in the right direction, and so I want to keep them in mind.

Comment by adamshimi on Against the Backward Approach to Goal-Directedness · 2021-01-20T15:06:29.886Z · LW · GW

Good to know that my internal model of you is correct at least on this point.

For Daniel, given his comment on this post, I think we actually agree, but that he puts more explicit emphasis on that-which-makes-AI-risk-arguments-work, as you wrote.

Comment by adamshimi on Literature Review on Goal-Directedness · 2021-01-20T15:03:39.400Z · LW · GW

Another way to talk about this distinction is between definitions that allow you to predict the behaviour of agents which you haven't observed yet given how they were trained, versus definitions of goal-directedness which allow you to predict the future behaviour of an existing system given its previous behaviour.

I actually don't think we should make this distinction. It's true that Dennett's intentional stance falls in the first category, for example, but that's not the reason why I'm interested in it. Explainability seems to me like a way to find a definition of goal-directedness that we can check through interpretability and verification, and which tells us something about the behavior of the system with regards to AI risk. Yet that doesn't mean it only applies to the observed behavior of systems.

The biggest difference between your definition and the intuitions is that you focus on how goal-directedness appears through training. I agree that this is a fundamental problem; I just think that this is something we can only solve after having a definition of goal-directedness that we can check concretely in a system and that allows the prediction of behavior.

Firstly, we don't have any AGIs to study, and so when we ask the question of how likely it is that AGIs will be goal-directed, we need to talk about the way in which that trait might emerge.

As mentioned above, I think a definition of goal-directedness should allow us to predict what an AGI will broadly do based on its level of goal-directedness. Training, for me, is only relevant for understanding which levels of goal-directedness are possible/probable. That seems like the crux of the disagreement here.

Secondly, because of the possibility of deceptive alignment, it doesn't seem like focusing on observed behaviour is sufficient for analysing goal-directedness.

I agree, but I definitely don't think the intuitions are limiting themselves to the observed behavior. With a definition you can check through interpretability and verification, you might be able to steer clear of deception during training. That's a use of (low) goal-directedness similar to the one Evan has in mind for myopia.

Thirdly, suppose that we build a system that's goal-directed in a dangerous way. What do we do then? Well, we need to know why that goal-directedness emerges, and how to change the training regime so that it doesn't happen again.

For that one, understanding how goal-directedness emerges is definitely crucial.

Comment by adamshimi on Literature Review on Goal-Directedness · 2021-01-19T18:48:18.033Z · LW · GW

Glad my comment clarified some things.

About the methodology, I just published a post clarifying my thinking about it.

Comment by adamshimi on Literature Review on Goal-Directedness · 2021-01-19T16:19:10.077Z · LW · GW

Thanks for the proposed idea!

Yet I find myself lost when trying to find more information about this concept of care. It is mentioned in both the chapter on Heidegger in The History of Philosophy and the section on care in the SEP article on Heidegger, but I don't understand a single thing written there. I think the ideas of "thrownness" and "disposedness" are related?

Do you have specific pointers to deeper discussions of this concept? Specifically, I'm interested in new intuitions for how a goal is revealed by actions.

Comment by adamshimi on Literature Review on Goal-Directedness · 2021-01-18T17:49:06.836Z · LW · GW

Thanks!

Comment by adamshimi on Literature Review on Goal-Directedness · 2021-01-18T17:48:51.110Z · LW · GW

Glad they helped! That's the first time I've used this feature, and we debated whether to add more or remove them completely, so thanks for the feedback. :)

I think depending on what position you take, there are difference in how much one thinks there's "room for a lot of work in this sphere." The more you treat goal-directedness as important because it's a useful category in our map for predicting certain systems, the less important it is to be precise about it. On the other hand if you want to treat goal-directedness in a human-independent way or otherwise care about it "for its own sake" for some reason, then it's a different story.

If I get you correctly, you're arguing that there's less work to do on goal-directedness if we try to use it concretely (for discussing AI risk), compared to if we study it for its own sake? I think I agree with that, but I still believe that we need a pretty concrete definition to use goal-directedness in practice, and that we're far from there. There is less pressure to deal with all the philosophical nitpicks, but we should at least get the big intuitions (of the type mentioned in this lit review) right, or explain why they're wrong.

Comment by adamshimi on Literature Review on Goal-Directedness · 2021-01-18T17:44:19.286Z · LW · GW

Thanks for the feedback!

My only critique so far is that I'm not really on board yet with your methodology of making desiderata by looking at what people seem to be saying in the literature. I'd prefer a methodology like "We are looking for a definition of goal-directedness such that the standard arguments about AI risk that invoke goal-directedness make sense. If there is no such definition, great! Those arguments are wrong then."

I agree with you that the endgoal of this research is to make sense of the arguments about AI risk invoking goal-directedness, and of the proposed alternatives. The thing is, even if it's true, proving that there is no property making these arguments work looks extremely hard. I have very little hope that it is possible to show this one way or the other head-on.

On the other hand, when people invoke goal-directedness, they seem to reference a cluster of similar concepts. And if we manage to formalize this cluster in a manner satisfying to most people, then we can look at whether these (now formal) concepts make the arguments for AI risk work. If they do, then problem solved. If the arguments fail with this definition, I still believe that this is strong evidence for the arguments not working in general. You could say that I'm taking the bet that "the behavior of AI risk arguments with inputs in the cluster of intuitions from the literature is representative of the behavior of AI risk arguments with any definition of goal-directedness". Rohin, for one, seems less convinced by this bet (for example with regard to the importance of explainability).

My personal prediction is that the arguments for AI risk do work for a definition of goal-directedness close to this cluster of concepts. My big uncertainty is what constitutes a non-goal-directed (or less goal-directed) system, and whether such systems are viable against goal-directed ones.

(Note that I'm not saying that all the intuitions in the lit review should be part of a definition of goal-directedness. Just that they probably need to be addressed, and that most of them capture an important detail of the cluster.)

I also have a suggestion or naive question: Why isn't the obvious/naive definition discussed here? The obvious/naive definition, at least to me, is something like:

"The paradigmatic goal-directed system has within it some explicit representation of a way the world could be in the future -- the goal -- and then the system's behavior results from following some plan, which itself resulted from some internal reasoning process in which a range of plans are proposed and considered on the basis of how effective they seemed to be at achieving the goal. When we say a system is goal-directed, we mean it is relevantly similar to the paradigmatic goal-directed system."

I feel like this is how I (and probably everyone else?) thought about goal-directedness before attempting to theorize about it. Moreover I feel like it's a pretty good way to begin one's theorizing, on independent grounds: It puts the emphasis on relevantly similar and thus raises the question "Why do we care? For what purpose are we asking whether X is goal-directed?"

Your definition looks like Dennett's intentional stance to me. In the intentional stance, the "paradigmatic goal-directed system" is the purely rational system that tries to achieve its desires based on its beliefs, and being an intentional system/goal-directed depends on predictive similarity to this system.

On the other hand, for most internal structure based definitions (like Richard's or the mesa-optimizers), a goal-directed system is exactly a paradigmatic goal-directed system.

But I might have misunderstood your naive definition.

Comment by adamshimi on Transparency and AGI safety · 2021-01-17T19:18:13.255Z · LW · GW

Likewise, thanks for taking the time to write such a long comment! And hoping that's a typo in the second sentence :)

You're welcome. And yes, this was a typo that I corrected. ^^

Wrt the community though, I’d be especially curious to get more feedback on Motivation #2. Do people not agree that transparency is *necessary* for AI Safety? And if they do agree, then why aren’t more people working on it?

My take is that a lot of people around here agree that transparency is at least useful, and maybe necessary. And the main reason why people are not working on it is a mix of personal fit, and the fact that without research in AI Alignment proper, transparency doesn't seem that useful (if we don't know what to look for).

I agree, but think that transparency is doing most of the work there (i.e. what you say sounds more to me like an application of transparency than scaling up the way that verification is used in current models.) But this is just semantics.

Well, transparency is doing some work, but on its own it's unable to prove anything; that proof part is a big component of the approach I'm proposing. That being said, I agree that this doesn't look like scaling up the current way.

Hm, I want to disagree, but this may just come down to a difference in what we mean by deployment. In the paragraph that you quoted, I was imagining the usual train/deploy split from ML where deployment means that we’ve frozen the weights of our AI and prohibit further learning from taking place. In that case, I’d like to emphasize that there’s a difference between intelligence as a meta-ability to acquire new capabilities and a system’s actual capabilities at a given time. Even if an AI is superintelligent, i.e. able to write new information into its weights extremely efficiently, once those weights are fixed, it can only reason and plan using whatever object-level knowledge was encoded in them up to that point. So if there was nothing about bio weapons in the weights when we froze them, then we wouldn't expect the paperclip-maximizer to spontaneously make plans involving bio weapons when deployed.

You're right that I was thinking of a more online system that could update its weights during deployment. Yet even with frozen weights, I definitely expect the model to make plans involving things that never appeared during training. For example, it might not have a bio-weapon feature, but it might still combine the relevant subfeatures through quite local rules that don't look like a plan to build a bio-weapon.

Suppose an AI system was trained on a dataset of existing transparency papers to come up with new project ideas in transparency. Then its first outputs would probably use words like neurons and weights instead of some totally incomprehensible concepts, since those would be the very same concepts that would let it efficiently make sense of its training set. And new ideas about neurons and weights would then be things that we could independently reason about even if they’re very clever ideas that we didn’t think of ourselves, just like you and I can have a conversation about circuits even if we didn’t come up with it.

That seems reasonable.

Comment by adamshimi on Why I'm excited about Debate · 2021-01-16T21:57:25.013Z · LW · GW

To check if I understand correctly, you're arguing that the selection pressure to use arguments in order to win requires the ability to be swayed by arguments, and that the latter already requires explicit reasoning?

That seems convincing as a counter-argument to "explicit reasoning in humans primarily evolved not in order to help us find out about the world, but rather in order to win arguments.", but I'm not knowledgeable enough about the work quoted to check whether they have a more subtle position.

Comment by adamshimi on Why Productivity Systems Don't Stick · 2021-01-16T18:44:24.026Z · LW · GW

I'm biased against Twitter and Twitter threads, but I found this post readable and useful.

My own current take on resolving my inner conflict is to see my complex identity as the result of multiple simpler and focused identities, and to try to address the need of each while keeping my priorities. Something vaguely like IFS, but not exactly.

The pattern: We find a new coercive method that gets us to do things through "self-discipline". As we use this method, resentment builds. Until finally the resentment for the method is stronger than the coercion it provides.

And then there's a ridiculous binge period when you do nothing productive whatsoever.

Comment by adamshimi on Why I'm excited about Debate · 2021-01-16T17:37:43.082Z · LW · GW

I really like this kind of post! The only thing that I feel is missing is a discussion of what your preferred research direction was before this change of opinion. I assume it was RRM, given how you talk about it, but something like a comparison would be really useful, I think.

Still, you give me even more reason to eventually take some time to read all the published work and posts about Debate.

Comment by adamshimi on Transparency and AGI safety · 2021-01-13T18:26:14.165Z · LW · GW

Thanks a lot for all the effort you put into this post! I don't agree with everything, but reading and commenting on it was very stimulating, and probably useful for my own research.

In this post, I’ll argue that making AI systems more transparent could be very useful from a longtermist or AI safety point of view. I’ll then review recent progress on this centered around the circuits program being pursued by the Clarity team at OpenAI, and finally point out some directions for future work that could be interesting to pursue.

I'm quite curious about why you wrote this post. If it's for convincing researchers in AI Safety that transparency is useful and important for AI Alignment, my impression is that many researchers do agree, and those who don't tend to have thought about it for quite some time (Paul Christiano comes to mind, as someone who is less interested in transparency while knowing a decent amount about it). So if the goal was to convince people to care about transparency, I'm not sure this post was necessary. 

I'm not saying I don't find value in this post. As a big fan of the circuits research, I'm glad to have more in-depth discussion about and around it. I am simply trying to understand what you wanted to do with this post, to give you better feedback.

Artificial general intelligence (AGI) is usually more vaguely defined as an AI system that can do anything that humans can do. Here, I’ll operationally take it to mean “AI that can outperform humans at the task of generating qualitative insights into technical AI safety research,” for instance by coming up with new research agendas that turn out to be fruitful.

Nitpicking here, but I assume you mean coming up with a high enough proportion of new research agendas, instead of just coming up with some. That change removes stupid edge cases like programs writing all the permutations of some sentences about AGI, which would probably generate at least some useful ideas among the noise.

To summarize, even if one assumes the UAT in the current deep learning paradigm, the choice of model initialization + dataset that would let us get to AGI may be highly non-generic. To the extent that this is true, it pushes towards (indefinitely) longer timelines than forecasted in analyses based on compute, since practitioners might then have to understand an unknown number of qualitatively new things around setting initial conditions for the search problem.

I agree with the idea, with maybe the caveat that it doesn't apply to Ems à la Hanson. A similar argument could hold about neuroscience facts we would need to know to scan and simulate brains, though.

Motivation #1: Work on transparency could help to reduce this uncertainty

This leads to a first motivation for transparency research: that getting a better understanding of how today's AI systems work seems useful to let us make better-educated guesses about how they might scale up. 

For example, learning more about how GPT-3 works seems like it could help us to reason better about whether the task of text prediction by itself could ever lead to AGI, which competes with the hard paths hypothesis. (For examples of the type of insights that we might hope to gain from applying transparency tools, see Part 2 of this note below.)

This argument applies to every part of ML that studies how learned models work and why. So by itself, it's insufficient for privileging transparency over theoretical work on neural nets, for example.

From the machine learning ("lobotomized alien") point of view, a natural way to partition the alignment problem is as

  • an outer alignment (specification) problem of making sure that our systems are designed with utility / loss functions that would make them do what we intend for them to do in theory, and
  • an inner alignment (distribution shift) problem of making sure that a system trained on a "theoretically correct" objective with a finite-sized training set would keep on pursuing that objective when deployed in a somewhat different environment than the one that it trained on.

From the more anthropomorphic "alien in a box" point of view on the other hand, one might instead find it natural to slice up the alignment problem into

  • a competence (translation) problem of making the AI system learn to understand what we want, and
  • an intent alignment problem of making the AI system care to do what we want, assuming that it understands us perfectly well.

I really like the way you present the two points of view and how they partition the alignment problem. It's going to be quite useful for me. Notably, I almost always take the "lobotomized alien" perspective, but now I can remind myself that this is a choice and check whether the "alien in a box" perspective is more appropriate. Thanks!

Motivation #2: Transparency seems necessary to guard against emergent misbehavior

This leads to a second motivation for transparency research, which is that to defend against emergent misbehavior in all situations that an agentic AGI could encounter when deployed, it seems necessary to me that we understand something about the AI's internal cognition.

I completely agree with this motivation, and it is really well presented.

But we can't guarantee ahead of time that adversarial training will catch every failure mode, and verification requires that we characterize the space of possible inputs, which seems hard to scale up to future AI systems with arbitrarily large input/output spaces [^9]. So this is in no way a proof but is a failure of my imagination otherwise (and I'd be very excited to hear about other ideas!).

My take on why verification might scale is that we will move towards specifying properties of the program instead of its input/output relation: verifying whether the code satisfies some formal property that indicates myopia or low goal-directedness. Note that transparency is still really important here, because even with completely formal definitions of things like myopia and goal-directedness, I think transparency will be necessary to translate them into properties of the specific class of models studied (neural networks, for example).

A third motivation is that exact transparency would give us a mulligan: a chance to check if something could go catastrophically wrong with a system that we've built before we decide to deploy it in the real world. E.g. suppose that just by looking at the weights of a neural network, we could read off all of the knowledge encoded inside the network. Then you could imagine looking into the "mind" of a paperclip-making AI system, seeing that for some reason it had been learning things related to making biological weapons, and deciding against letting it run your paperclip-making factory.

I think this misses a very big part of what makes a paperclip-maximizer dangerous -- the fact that it can come up with catastrophic plans after it's been deployed. So it doesn't have to be explicitly deceptive and biding its time; it might just be really competent and focused on maximizing paperclips, and catching that requires more than exact transparency. It requires being able to check properties that ensure the catastrophic outcomes won't happen.

But I still think your motivation makes sense for a part of deceptive alignment. My more general caveat is that I don't believe in exact transparency, so I am more for a mixed transparency and verification approach (as mentioned above).

A minimal AI system that can write blog posts about AI safety, or otherwise do theoretical science research, doesn't seem to require a large output space. It plausibly just needs to be able to write text into an offline word processor. This suggests that the first AGI may be close to what people have historically called an “Oracle AI”.  

In my opinion, Oracle AIs already seem pretty safe by virtue of being well-boxed, without further qualification. If all they can do is write offline text, they would have to go through humans to cause an existential catastrophe. However, some might argue that a hypothetical Oracle AI that was very proficient at manipulating humans could trick its human handlers into taking dangerous actions on its behalf. So to strengthen the case, we should also appeal to selection pressure.

An AI that does AI Safety research is properly terrifying. I'm really stunned by this choice, as I think this is probably one of the most dangerous cases of oracle AI (and oracle AI is a pretty dangerous class by itself) that I can think of. I see two big problems with it:

  • It looks like exactly the kind of task where, if we haven't solved AI alignment in advance, Goodhart is upon us. What's the measure? What's the proxy? Best-case scenario: the AI is clearly optimizing something stupid, and nobody cares. Worst-case scenario, more probable because the AI is actually supposed to outperform humans: it pushes for something that looks like it makes sense but doesn't actually work, and we might use these insights to build more advanced AGIs and be fucked.
  • It's quite simple to imagine a Predict-o-matic type scenario: pushing simpler and easier models that appear to work but don't, so that its task becomes easier.

To finish this argument, we would need to characterize what's needed to do AI safety research and argue that there exists a limited curriculum to impart that knowledge that wouldn't lead to deceptive oracle AI. I don't have a totally satisfactory argument for this (hence the earlier caveat!), but one bit of intuition in this direction is that the transparency agenda in this rest of this document certainly doesn't require deep (or any) knowledge of humans. The same seems true of at least a subset of other safety agendas, and we need only argue that AI that could accelerate progress in some parts of technical AI safety or otherwise change how we do intellectual work will come before plausibly dangerous agent AIs to reconsider how much to invest in object-level AI safety work today (since then it might make sense to defer some of the work to future researchers). We don't need to prove that a safe AGI oracle would solve the entire problem of AGI safety in one go.

I don't think any of the intuitions given work, for a simple reason: even if the research agenda doesn't require in itself any real knowledge of humans, the outputs still have to be humanly understandable. I want the AI to write blog posts that I can understand. So it will have to master clear writing, which seems from experience to require a lot of modeling of other people (and as a human, I get a bunch of things for free unconsciously that an AI wouldn't have, like a model of emotions).

Another issue with this proposal is that you're saying on one side that the AI is superhuman at technical AI safety, and on the other that it can only do these specific proposals that don't use anything about humans. That's like saying that you have an AI that wins at any game, when in fact it only works for chess. Either the AI can do research on everything in AI Safety, and it will probably have to understand humans; or it is built specifically for one research proposal, but then I don't see why we wouldn't create other AIs for other research proposals. The technology would be available, and the incentives would be there (if only to be as productive as the other researchers who have an AI to help them).

Motivation #4: Work on transparency could still be instrumentally valuable in such a world

Even in such a world though, there are some non-AGI-safety reasons that transparency research could be well-motivated today.

I disagree with the previous argument, yet I find this motivation really useful, because if transparency is also useful when things go correctly, that's a good way to have people unconvinced by the risks work on it.

Notably, I believe in pushing new entrants who want to do AI (and are not interested in or ready to switch to AI Alignment) towards transparency, as it is a really useful subfield for alignment and one of the parts of AI that pushes capabilities the least.

Summary of the technical approach

Even as someone who has read most of the circuits papers, I found this summary really clear and insightful. You might actually be where I redirect people to get an idea of circuits!

One way to get some insight into these things might be to edit the network. To test the first thing, we could delete the "color detectors" and see how badly that degrades the performance of the black-and-white detector at finding black-and-white images, while to test the second one, we could delete the black-and-white circuit from InceptionV1, and see how badly that degrades the performance of InceptionV1 at transfer learning on the task of black-and-white vs. color image classification [^11]. It might be interesting to develop quantitative standards for checks along these lines. 

These look great! I hadn't thought about the issue you mention regarding modularity, but it seems really important to settle, and your proposals are ingenious ways to study the question.
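To make the first test concrete, here is a minimal sketch of the kind of ablation I imagine, assuming a PyTorch setup with torchvision's GoogLeNet as a stand-in for InceptionV1. The layer choice and channel indices are purely hypothetical placeholders (not the actual units identified in the circuits work), and bw_loader is an assumed DataLoader of labeled black-and-white images.

```python
import torch
from torchvision import models

# Hypothetical sketch: zero out the channels playing the role of "color
# detectors" and measure how much performance on black-and-white images drops.
model = models.googlenet(pretrained=True).eval()

TARGET_LAYER = model.inception3a           # hypothetical layer choice
COLOR_DETECTOR_CHANNELS = [12, 47, 63]     # hypothetical channel indices

def ablate_channels(module, channels):
    """Register a forward hook that zeroes the given output channels of `module`."""
    def hook(_module, _inputs, output):
        output[:, channels] = 0.0
        return output
    return module.register_forward_hook(hook)

@torch.no_grad()
def accuracy(model, loader):
    correct, total = 0, 0
    for images, labels in loader:
        preds = model(images).argmax(dim=-1)
        correct += (preds == labels).sum().item()
        total += labels.numel()
    return correct / total

# Usage sketch:
# base_acc = accuracy(model, bw_loader)
# handle = ablate_channels(TARGET_LAYER, COLOR_DETECTOR_CHANNELS)
# ablated_acc = accuracy(model, bw_loader)
# handle.remove()
# print(f"Accuracy drop from ablating color detectors: {base_acc - ablated_acc:.3f}")
```

The same hook trick would cover the second test too, by deleting the black-and-white circuit instead and re-running the transfer-learning evaluation.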

Compare circuits work to existing work on loss landscapes in deep learning. Another strategy might be to go through the literature of existing results from other perspectives and look for synergies with the circuits approach. As a semi-random example, the linked paper in the previous bullet-point suggests that some directions in the loss landscape are more important than others; it might be interesting to understand if such directions play an interesting role from the circuits POV.

I'm especially interested in this direction, as it seems highly relevant to my own research on gradient hacking.

Comment by adamshimi on Gradient hacking · 2021-01-13T14:33:57.918Z · LW · GW

I think the part in bold should instead be something like "failing hard if SGD would (not) update weights in such and such way". (SGD is a local search algorithm; it gradually improves a single network.)

Agreed. I said something similar in my comment.

As I already argued in another thread, the idea is not that SGD creates the gradient hacking logic specifically (in case this is what you had in mind here). As an analogy, consider a human that decides to 1-box in Newcomb's problem (which is related to the idea of gradient hacking, because the human decides to 1-box in order to have the property of "being a person that 1-boxs", because having that property is instrumentally useful). The specific strategy to 1-box is not selected for by human evolution, but rather general problem-solving capabilities were (and those capabilities resulted in the human coming up with the 1-box strategy).

Thanks for the concrete example, I think I understand better what you meant. What you describe looks like the hypothesis "Any sufficiently intelligent model will be able to gradient hack, and thus will do it". Which might be true. But I'm actually more interested in the question of how gradient hacking could emerge without having to pass that threshold of intelligence, because I believe such examples will be easier to interpret and study.

So in summary, I do think what you say makes sense for the general risk of gradient hacking, yet I don't believe it is really useful for studying gradient hacking with our current knowledge.

Comment by adamshimi on Gradient hacking · 2021-01-13T14:27:46.290Z · LW · GW

As I said elsewhere, I'm glad that my review captured points you deem important!

I think one of the weakest aspects of the post is that it starts with the assumption that an AI system has already given rise to an inner-optimizer that is now taking advantage of gradient hacking. I think while this is definitely a sufficient assumption, I don't think it's a necessary assumption and my current models suggest that we should find this behavior without the need for inner optimizers. This also makes me somewhat more optimistic about studying it.

I agree that gradient hacking isn't limited to inner optimizers; yet I don't think that defining it that way in the post was necessarily a bad idea. First, it's for coherence with Risks from Learned Optimization. Second, assuming some internal structure definitely helps with conceptualizing the kind of things that count as gradient hacking. With an inner optimizer, you can say relatively unambiguously "it tries to protect its mesa-objective", as there should be an explicit representation of it. That becomes harder without the inner optimization hypothesis.

That being said, I am definitely focusing on gradient hacking as an issue with learned goal-directed systems instead of learned optimizers. This is one case where I have argued that a definition of goal-directedness would allow us to remove the explicit optimization hypothesis without sacrificing the clarity it brought.

  • If we assume the lottery-ticket hypothesis of neural networks, we initialize our network with a large number of possible models of the world. In a sufficiently large networks, some of those models will be accurate models of not necessarily the world, but the training process of the very system that is currently being trained. This is pretty likely given that SGD isn't very complicated and it doesn't seem very hard to build a model of how it works.

Two thoughts about that:

  • Even if some subnetwork basically captures SGD (or the relevant training process), I'm unconvinced that it would be useful in the beginning, and so it might be "written over" by the updates.
  • Related to the previous point, it looks crucial to understand what is needed in addition to a model of SGD in order to gradient hack. Which brings me to your next point.
  • From an evolutionary perspective, we are going to be selecting for networks that get positively rewarded by the gradient descent learning algorithm. Some of the networks that have an accurate model of the training process will stumble upon the strategy of failing hard if SGD would reward any other competing network, creating a small ridge in the reward landscape that results in it itself getting most of the reward (This is currently very metaphorical and I feel fuzzy on whether this conceptualization makes sense). This strategy seems more complicated, so is less likely to randomly exist in a network, but it is very strongly selected for, since at least from an evolutionary perspective it appears like it would give the network a substantive advantage.

I'm confused about what you mean here. If the point is to make the network a local minimum, you probably just have to make it very brittle to any change. I'm also not sure what you mean by competing networks. I assumed it meant the neighboring models in model space, which are reachable by reasonable gradients. If that's the case, then I think my example is simpler and doesn't need the SGD modelling. If not, then I would appreciate a more detailed explanation.

  • By default, luckily, this will create something I might want to call a "benign gradient hacker" that might deteriorate the performance of the system, but not obviously give rise to anything like a full inner optimizer. It seems that this strategy is simple enough that you don't actually need anything close to a consequentialist optimizer to run into it, and instead it seems more synonymous to cancer, in that it's a way to hijack the natural selection mechanism of a system from the inside to get more resources, and like cancer seems more likely to just hurt the performance of the overall system, instead of taking systematic control over it.

Why is that supposed to be a good thing? Sure, inner optimizers with misaligned mesa-objectives suck, but so do gradient hackers without inner optimization. Anything that helps ensure that training cannot correct discrepancies and/or errors with regard to the base-objective sounds extremely dangerous to me.

I think gradient hacking should refer to something somewhat broader that also captures situations like the above where you don't have a deceptively aligned mesa-optimizer, but still have dynamics where you select for networks that adversarially use knowledge about the SGD algorithm for competitive advantage. Though it's plausible that Evan intends the term "deceptively aligned mesa-optimizer" to refer to something broader that would also capture the scenario above.

AFAIK, Evan really means inner optimizer in this context, with actual explicit internal search. Personally I agree about including situations where the learned model isn't an optimizer but is still in some sense goal-directed.

Separately, as an elaboration, I have gotten a lot of mileage out of generalizing the idea of gradient hacking in this post, to the more general idea that if you have a very simple training process whose output can often easily be predicted and controlled, you will run into similar problems. It seems valuable to try to generalize the theory proposed here to other training mechanisms and study more which training mechanisms are more easily hacked like this, and which one are not. 

Hum, I hadn't thought of this generalization. Thanks for the idea!

Comment by adamshimi on Group house norms really do seem toxic to many people. · 2021-01-12T00:06:58.472Z · LW · GW

One thing I'm confused about on the subject of rationalist group houses is whether there are specific failure modes compared to just group houses. Like I'm certain I don't want to live in a group house, just because I don't want to have to deal with that many people in the place I live, but the group house being rationalist or not is irrelevant for that.

Comment by adamshimi on The time I got really into poker · 2021-01-11T23:41:00.050Z · LW · GW

That seemed like a pretty wild experience. Have you tried doing poker in an app again, to see if you can recreate this experience at will, or were you too spooked to touch a poker app ever again?

Comment by adamshimi on adamShimi's Shortform · 2021-01-11T16:30:44.286Z · LW · GW

I can't find a reference for how to test if an inferred (or just given) reward function for a system can be used to predict decently enough what the system actually does.

What I can find are references about the usual IRL/preference learning setting, where there is a true reward function, known to us but unknown to the inference system; then the inferred reward function is evaluated by training a policy on it and seeing how much reward (or how much regret) it gets on the true reward function.

But that's a good setting for checking if the reward is good for learning to do the task, not for checking if the reward is good for predicting what this specific system will do.

The best thing I have in mind right now is to pick a bunch of different initial conditions, train on the inferred reward from each of these conditions, mix all the resulting policies together to get a distribution over actions at each state, and compare that with what the system actually does. It seems decent enough, but I would really like to know if someone has done something similar in the literature.

(Agents and Devices points in the right direction, but it's focused on prediction which of the agent mixture or the device mixture is more probable in the posterior, which is a different problem.)
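To be a bit more concrete, here is a rough sketch of the evaluation I have in mind. The names train_policy, action_probs, and observed_actions are stand-ins for whatever RL setup and logged behaviour one actually has, not an existing API.

```python
import numpy as np

def reward_prediction_score(inferred_reward, env, observed_actions,
                            train_policy, n_seeds=10):
    """Train policies on the inferred reward from several initial conditions,
    mix them into one action distribution per state, and score how well that
    mixture predicts the actions the real system actually took.

    `observed_actions` maps states to the (integer-indexed) action the system
    took there; `train_policy(env, reward_fn, seed)` returns a policy with an
    `action_probs(state)` method. Both are assumed, not standard APIs.
    """
    policies = [train_policy(env, inferred_reward, seed=s) for s in range(n_seeds)]

    log_likelihood = 0.0
    for state, action in observed_actions.items():
        # Average the action distributions of the trained policies at this state.
        probs = np.mean([p.action_probs(state) for p in policies], axis=0)
        log_likelihood += np.log(probs[action] + 1e-12)

    # Higher (closer to 0) means the inferred reward predicts behaviour better.
    return log_likelihood / len(observed_actions)
```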

Comment by adamshimi on The Pointers Problem: Clarifications/Variations · 2021-01-11T16:23:47.562Z · LW · GW

Glad that you find the connection interesting. That being said, I'm confused by what you're saying afterwards: why would logical inductors not be able to find propositions about worlds/plans which are outside PSPACE? I find no mention of PSPACE in the paper.

Comment by adamshimi on The Case for a Journal of AI Alignment · 2021-01-10T21:16:27.872Z · LW · GW

As I said in my answer to Kaj, the real problem I see is that I don't think we have the necessary perspective to write a useful textbook. Textbooks basically never touch research from the last ten years, unless that research is really easy to interpret and present, which is not the case here.

I'm open to being proven wrong, though.

Comment by adamshimi on The Case for a Journal of AI Alignment · 2021-01-10T21:13:24.251Z · LW · GW

Fair enough. I think my real issue with an AI Alignment textbook is that for me a textbook presents relatively foundational and well established ideas and theories (maybe multiple ones), whereas I feel that AI Alignment is basically only state-of-the-art exploration, and that we have very few things that should actually be put into a textbook right now.

But I could change my mind if you have an example of what should be included in such an AI Alignment textbook.

Comment by adamshimi on The Case for a Journal of AI Alignment · 2021-01-10T21:09:12.798Z · LW · GW

Thanks for the detailed feedback! David already linked the facebook conversation, but it's pretty useful that you summarize it in a comment like this.

I think that your position makes sense, and you do take into account most of my issues and criticisms about the current model. Do you think you could make really specific statements about what needs to change for a journal to be worth it, maybe detailing your last paragraph a bit more?

Also, to provide a first step without the issues that you pointed out, I proposed a review mechanism here on the AF in this comment.

Comment by adamshimi on The Case for a Journal of AI Alignment · 2021-01-10T21:05:28.031Z · LW · GW

Thanks. I'm curious about what you think of Ryan's position or Rohin's position?

Comment by adamshimi on The Case for a Journal of AI Alignment · 2021-01-10T21:03:58.107Z · LW · GW

Thanks! I don't plan on making it myself (as mentioned in the post), but I'll try to keep you posted if anything happens in this style.

Comment by adamshimi on The Case for a Journal of AI Alignment · 2021-01-10T21:03:08.619Z · LW · GW

It's a pretty nice idea. I thought about just giving people two weeks, which might be a bit hardcore.

Comment by adamshimi on The Case for a Journal of AI Alignment · 2021-01-10T21:02:08.098Z · LW · GW

Thanks for the feedback. I didn't know about the AGI journal, that's a good point.

If we only try to provide publicly available feedback to safety researchers, including new ones, do you think that this proposal makes sense as a first step?

Comment by adamshimi on The Case for a Journal of AI Alignment · 2021-01-10T20:57:19.410Z · LW · GW

An idea for having more AI Alignment peer review without compromising academic careers or reputation: create a review system in the Alignment Forum. What I had in mind is that people who are okay with doing a review can sign up somewhere. Then someone who posted something and wants a review can use a token (if they have some; I explain below how tokens are earned) to ask for one. Then some people (maybe AF admins, maybe some specific administrator of the thing) assign one of the reviewers to the post.

The review has to follow some guidelines, like summarizing the paper, explaining the good parts and the issues, and proposing new ideas. Once the review is posted and validated by the people in charge of the system, the reviewer gets a token she can use to ask for a review of her own posts.

How do you bootstrap? For long-time users of the AF, it makes sense to give them some initial tokens, maybe. And for newcomers (who really have a lot to gain from reviews), I was thinking of asking them to write a nice distillation post in exchange for a token.

While not as ambitious as a journal, I think a system like that might solve two problems at once:

  • The lack of public feedback and in-depth peer review in most posts here
  • The lack of feedback at all for newcomers who don't have private gdocs with a lot of researchers on them.

There's probably a way to be even better for the second point, by for example having personal mentorship for something like three tokens.

I also believe that the incentives would be such that people would participate, and not necessarily try to game the system (being limited to the AF which is a small community also helps).

What do you think?

Comment by adamshimi on [AN #132]: Complex and subtly incorrect arguments as an obstacle to debate · 2021-01-10T15:36:30.269Z · LW · GW

As always, thanks to everyone involved in the newsletter!

The Understanding Learned Reward Functions paper looks great, both in terms of studying inner alignment (the version with goal-directed/RL policies instead of mesa-optimizers) and for thinking about goal-directedness.

Comment by adamshimi on Eight claims about multi-agent AGI safety · 2021-01-10T15:24:01.100Z · LW · GW

Thanks for writing this post! I usually focus on single/single scenarios, so it's nice to have a clear split of the multi-agent safety issues.

All the claims make sense to me, with 1 being the one I'm least convinced about, and 5 depending on continuous takeoffs (which appear relatively likely to me as of now).

Comment by adamshimi on The Case for a Journal of AI Alignment · 2021-01-10T14:18:15.348Z · LW · GW

My issue with a textbook comes more from the lack of consensus. Like, the fundamentals (what you would put in the first few chapters) for embedded agency are different from those for preference learning, different from those for inner alignment, different from those for agent incentives (to only quote a handful of research directions). IMO, a textbook would either overlook big chunks of the field or look more like an enumeration of approaches than a unified resource.

Comment by adamshimi on The Case for a Journal of AI Alignment · 2021-01-10T13:50:41.831Z · LW · GW

Thanks for your pushback! I'll respond to both of you in this comment.

First, overall, I was convinced during earlier discussions that this is a bad idea - not because of costs, but because the idea lacks real benefits, and itself will not serve the necessary functions. Also see this earlier proposal (with no comments).

Thanks for the link. I'm reading through the facebook thread, and I'll come back here to discuss it after I finish.

There are already outlets that allow robust peer review, and the field is not well served by moving away from the current CS / ML dynamic of arXiv papers and presentations at conferences, which allow for more rapid iteration and collaboration / building on work than traditional journals - which are often a year or more out of date as of when they appear.

The only actual peer review I see for the type of research I'm talking about, by researchers knowledgeable in the subject, comes from private gdocs, as mentioned for example by Rohin here. Although it's better than nothing, it has the issue of being completely invisible to any reader without access to these gdocs. Maybe you could infer the "peer-reviewedness" of a post/paper from who is thanked in it, but that seems ridiculously roundabout.

When something is published on the AF, it rarely gets any feedback as deep as a peer review or the comments in private gdocs. When something is published at an ML conference, I assume that most if not all reviews don't really consider the broader safety and alignment questions, and focus on the short-term ML relevance. And there is some research that is not even possible to publish in big ML venues.

As for conference vs journal... I see what you mean, but I don't think it's really a big problem. In small subfields that actively use arXiv, papers are old news by the time the conference happens, so it's not a problem if they also are when the journal publishes them. I also wonder how much faster we could get a journal to run if we actively tried to ease the process. I'm thinking for example of not giving reviewers two months when they all do their review in the last week anyway. Lastly, you're not proposing to make a conference, but if you were, I still think a conference would require much more work to organize.

However, if this were done, I would strongly suggest doing it as an arXiv overlay journal, rather than a traditional structure.

I hadn't thought of overlay journals, that's a great idea! It might actually make this feasible without a full-time administrator.

One key drawback you didn't note is that allowing AI safety further insulation from mainstream AI work could further isolate it. It also likely makes it harder for AI-safety researchers to have mainstream academic careers, since narrow journals don't help on most of the academic prestige metrics.

I agree that this is a risk, which is yet another reason to favor a journal over a conference. At least in Computer Science, the publication process is generally preprint -> conference -> journal. That way, we can allow the submission of papers previously accepted at NeurIPS for example (maybe extended versions), which should mitigate the cost to academic careers. And if the journal curates enough great papers, it might end up decent enough on academic prestige metrics.

Two more minor disagreement are about first, the claim that  "If JAA existed, it would be a great place to send someone who wanted a general overview of the field." I would disagree - in field journals are rarely as good a source as textbooks or non-technical overview.

Agreed. Yet as I answer Daniel below, I don't think AI Alignment is mature enough, and clear enough on what matters, to write a satisfying textbook. Also, the state of the art is basically never in textbooks, and that's the sort of overview I was talking about.

Second, the idea that a journal would provide deeper, more specific, and better review than Alignment forum discussions and current informal discussions seems farfetched given my experience publishing in journals that are specific to a narrow area, like Health security, compared to my experience getting feedback on AI safety ideas.

Hmm, if you compare to private discussions and gdocs, I mostly agree that the review would be as good or a little worse (although you might get reviews from researchers to whom you wouldn't have sent your research). If you compare to the Alignment Forum, I definitely disagree that the comments you get here would be as useful as an actual peer review. The most useful feedback I saw here recently was this review of Alex Turner's paper by John, and that actually came from a peer-review process on LW.

So my point is that a journal with open peer review might be a way to make private gdocs discussions accessible while ensuring that most people (not only those in contact with other researchers) can get some feedback at all.

Onto Daniel's answer:

+1 to each of these. May I suggest, instead of creating a JAA, we create a textbook? Or maybe a "special compilation" book that simply aggregates stuff? Or maybe even an encyclopedia? It's like a journal, except that it doesn't prevent these things from being published in normal academic journals as well.

As I wrote above, I don't think we're at the point where a textbook is a viable (or even useful) endeavor. For the second point, journals are not really important for careers in computer science (with maybe some exceptions, but all the recruiting processes I know of basically only care about conferences, and maybe about the existence of at least one journal paper). And as long as we actually accept extended versions of papers published at conferences, there should be no problem with doing both.

Comment by adamshimi on 2019 Review Coworking Party [Wed 7pm PT] · 2021-01-07T13:00:27.342Z · LW · GW

I'll start at that time, because I also have stuff before ^^ (Making and eating dinner, a call with my girlfriend). As for the context, today is a distillation day for me anyway, so my afternoon is already focused on that. And I have a list of posts I would like to review.

Also, will the same link work?

Comment by adamshimi on The Pointers Problem: Clarifications/Variations · 2021-01-07T09:57:28.619Z · LW · GW

After thinking a bit more about it, the no-indescribable-hellworlds hypothesis seems somewhat related to logical uncertainty and the logical induction criterion. Because intuitively, indescribability comes from complexity issues about reasoning, that is, the lack of logical omniscience about the consequences of our values. The sort of description we would like is a polynomial proof, or at least a polynomial interactive protocol for verifying the indescribability (which means being in PSPACE, as in the original take on debate). And satisfying the logical induction criterion seems a good way to ensure that such a proof will eventually be found, because otherwise we could be exploited forever on our wrong assessment of the hellworld.

The obvious issue with this approach comes from the asymptotic nature of logical induction guarantees, which might mean it takes so long to convince us of (or check) indescribable hellworlds that they have already come to pass.

Comment by adamshimi on 2019 Review Coworking Party [Wed 7pm PT] · 2021-01-07T09:48:25.557Z · LW · GW

I can work for an hour starting 9pm my time (I have a work call at 10pm with another pacific guy).

Comment by adamshimi on Multi-dimensional rewards for AGI interpretability and control · 2021-01-06T21:27:17.581Z · LW · GW

Value comparisons require a scalar: In the brain, we roll out multiple possible thoughts / actions / plans, and then we need to do a comparison to decide which is better. You need a scalar to enable that comparison.

You probably need a total order (or at least a join in your partial order), and things like ℝ and ℕ have a natural one, but you could definitely find a total order extending a partial order in basically any space you want your values to live in. Not sure it's really important, but I wanted to point that out.
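For reference, the formal fact I'm leaning on is the order-extension principle (Szpilrajn's theorem): every partial order extends to a total order. A concrete sketch on $\mathbb{R}^2$: the componentwise partial order

$(a_1, a_2) \preceq (b_1, b_2) \iff a_1 \le b_1 \wedge a_2 \le b_2$

extends to the total lexicographic order

$(a_1, a_2) \le_{\mathrm{lex}} (b_1, b_2) \iff a_1 < b_1 \vee (a_1 = b_1 \wedge a_2 \le b_2).$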

Comment by adamshimi on 2019 Review Coworking Party [Wed 7pm PT] · 2021-01-06T21:01:49.521Z · LW · GW

I might have come if it wasn't scheduled at 4am my time.

Comment by adamshimi on The Pointers Problem: Clarifications/Variations · 2021-01-06T18:35:25.464Z · LW · GW

Really great post! I think I already got John's idea from his post, but putting everything in perspective and referencing previous and adjacent works really helps!

On that note, you have been mentioning Stuart's no-indescribable-hellworlds hypothesis for a few posts now, and I'm really interested in it. To take the even more meta argument for its importance, it looks particularly relevant to asking whether the human epistemic perspective is the right one to use when defining ascription universality (which basically abstracts most of Paul's and Paul-related approaches in terms of desiderata for a supervisor).

Do you know if there has been work on poking at this hypothesis, and trying to understand what it implies/requires? I doubt we can ever prove it, but there might be a way to do the "computational complexity approach", where we formally relate it to much more studied and plausible hypotheses.

Comment by adamshimi on Why Neural Networks Generalise, and Why They Are (Kind of) Bayesian · 2021-01-05T11:49:00.209Z · LW · GW

The second Towards Data Science post references this paper, which is also the main reference through which Levin's result is mentioned in the first paper you post. So I assume reading these references should be enough to get the gist.

Comment by adamshimi on Why Neural Networks Generalise, and Why They Are (Kind of) Bayesian · 2021-01-05T11:18:59.575Z · LW · GW

Do you have a good reference for the Levin bound? My attempts at finding a relevant paper all failed.

Comment by adamshimi on Selection vs Control · 2021-01-04T21:06:11.388Z · LW · GW

I'm a bit confused: why can't I just take the initial state of the program (or of the physical system representing the computer) as the initial point in configuration space for your example? The execution of your program is still a trajectory through the configuration space of your computer.

Personally, my biggest issue with optimizing systems is that I don't know what "smaller" really means for the target space. If the target space has only one state less than the total configuration space, is this still an optimizing system? Should we compute a ratio of measures between the target and total configuration spaces to get some sort of optimizing power?
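To sketch what such a ratio could look like (my own framing, loosely adapted from Yudkowsky's optimization power, not something from Alex's post): for a reference measure $\mu$ on the configuration space $C$ and target set $T \subseteq C$,

$\mathrm{OP} = -\log_2 \frac{\mu(T)}{\mu(C)}$

so a target space that is only one state smaller than $C$ gives an optimization power barely above zero, matching the intuition that such a system hardly optimizes at all.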

Comment by adamshimi on Selection vs Control · 2021-01-04T09:14:00.288Z · LW · GW

Thanks!

My take on internal optimization as a subset of external optimization probably works assuming convergence, because the system is reliably pushed, through the configuration space capturing the internal state of the program (and its variables), towards the configurations where the corresponding variable reaches a local minimum. See here.

Whether that's actually what we want is another question, but I think the point you're mentioning can be captured by whether the target subspace of the configuration space puts constraints on things outside the system (modulo good Cartesian boundaries and all the corresponding subtleties).

Comment by adamshimi on Selection vs Control · 2021-01-03T17:34:53.572Z · LW · GW

Selection vs Control is a distinction I always point to when discussing optimization. Yet these are not the two takes on optimization I generally use. My favored ones are internal optimization (which is basically search/selection) and external optimization (optimizing systems, from Alex Flint’s The ground of optimization). So I do without control, or at least without Abram’s exact definition of control.

Why? Simply because the internal structure vs behavior distinction mentioned in this post seems more important than the actual definitions (which seem constrained by going back to Yudkowsky’s optimization power). The big distinction is between doing internal search (like in optimization algorithms or mesa-optimizers) and acting so as to optimize something. It is intuitive that you can do the second without the first, but before Alex Flint’s definition, I couldn’t put into words my intuition that the first implies the second.

So my current picture of optimization is Internal Optimization (Internal Search/Selection) ⊂ External Optimization (Optimizing systems). This means that I think of this post as one of the first instances of grappling with this distinction, without agreeing completely with the way it ends up making that distinction.

Comment by adamshimi on Alignment Research Field Guide · 2021-01-03T17:32:41.995Z · LW · GW

How do you review a post that was not written for you? I’m already doing research in AI Alignment, and I don’t plan on creating a group of collaborators for the moment. Still, I found some parts of this useful.

Maybe that’s how you do it: by taking different profiles, and running through the most useful advice for each profile from the post. Let’s do that.

Full time researcher (no team or MIRIx chapter)

For this profile (which is mine, by the way), the most useful piece of advice from this post comes from the model of transmitters and receivers. I’m convinced that I’ve been using it intuitively for years, but having an explicit model is definitely a plus when trying to debug a specific situation, or to explain how it works to someone less used to thinking like that.

Full time researcher who wants to build a team/MIRIx chapter

Obviously, this profile benefits from the great advice on building a research group. I would expect someone with this profile to understand relatively well the social dynamics part, so the most useful advice is probably the detailed logistics of getting such a group off the ground.

I also believe that escalating asks and rewards is a less obvious social dynamic to take into account.

Aspiring researcher (no team or MIRIx chapter)

The section You and your research was probably written with this profile in mind. It tries to push towards exploration instead of exploitation, babble instead of prune. And for so many people that I know who feel obligated to understand everything before toying with a question, this is the prescribed medicine.

I want to push back just a little against the “follow your curiosity” vibe, as I believe that there are ways to check how promising one's current ideas are for AI Alignment. But I definitely understand that the audience is more “wannabe researchers stifled by their internal editor”, so pushing for curiosity and exploration makes sense.

Aspiring researcher who wants to build a team/MIRIx chapter

In addition to the You and your research section, this profile would benefit a lot from the logistics section (don’t forget the food!) and the social dynamics advice about keeping a group running (High standards for membership, Structure and elbow room, and Social norms).

Conclusion

There is something here for every profile interested in AI Alignment Research. That being said, each such profile has different needs, and the article is clearly most relevant for aspiring researchers who want to build a research group.

Comment by adamshimi on Paper-Reading for Gears · 2021-01-03T17:27:52.508Z · LW · GW

This post proposes 4 ideas to help build gears-level models from papers that have already passed the standard epistemic checks (statistics, incentives):

  • Look for papers which are very specific and technical, to limit the incentives to overemphasize results and present them in a “saving the world” light.
  • Focus on data instead of on interpretations.
  • Read papers on different aspects of the same question/gear.
  • Look for mediating variables/gears to explain multiple results at once.

(The second section, “Zombie Theories”, sounds more like an epistemic check than gears-level modeling to me.)

I didn’t read this post before today, so it’s hard to judge the influence it will have on me. Still, I can already say that the first idea (move away from the goal) is one I had never encountered, and by itself it probably helps a lot with literature search and paper reading. The other three ideas are more obvious to me, but I’m glad that they’re stated somewhere in detail. The examples drawn from biology also definitely help.

Comment by adamshimi on [AN #131]: Formalizing the argument of ignored attributes in a utility function · 2021-01-03T12:57:46.856Z · LW · GW

Thanks as always to everyone involved in the newsletter!

The model of the first paper sounds great for studying what happens after we're able to implement corrigibility and impact measures!

You might then reasonably ask what we should be doing instead. I see the goal of AI alignment as figuring out how, given a fuzzy but relatively well-specified task, to build an AI system that is reliably pursuing that task, in the way that we intended it to, but at a capability level beyond that of humans. This does not give you the ability to leave the future in the AI’s hands, but it would defuse the central (to me) argument for AI risk: that an AI system might be adversarially optimizing against you. (Though to be clear, there are still other risks (AN #50) to consider.)

To be more explicit, are the other risks to consider mostly about governance/who gets AGI/regulations? Because it seems that you're focusing on the technical problem of alignment, which is about doing what we want in a rather narrow sense.

On the model that AI risk is caused by utility maximizers pursuing the wrong reward function, I agree that non-obstruction is a useful goal to aim for, and the resulting approaches (mild optimization, low impact, corrigibility as defined here) make sense to pursue. I do not like this model much (AN #44), but that’s (probably?) a minority view.

It's weird, my take on your sequence was more that you want to push alternatives to goal-directedness/utility maximization, because maximizing the wrong utility function (or following the wrong goal) is a big AI risk. Maybe what you mean in the quote above is that your approach focuses on not building goal-directed systems, in which case the non-obstruction problem makes less sense?

Comment by adamshimi on AI Alignment, Philosophical Pluralism, and the Relevance of Non-Western Philosophy · 2021-01-03T12:41:19.768Z · LW · GW

This is a really cool post. Do you have book/blog recommendations for digging into non-western philosophies?

On the philosophical tendencies you see, I would like to point out some examples which don't follow these tendencies. But on the whole I agree with your assessment.

  • For symbolic AI, I would say a big part of the Agent Foundations researchers (which includes a lot of MIRI researchers and people like John S. Wentworth) definitely do consider symbolic AI in their work. I won't go as far as saying that they don't care about connectionism, but I definitely don't get a "connectionism or nothing" vibe from them.
  • For cognitivist AI, examples of people thinking in terms of internal behaviors are Evan Hubinger from MIRI and Richard Ngo, who worked at DeepMind and is now doing a PhD in Philosophy at Oxford.
  • For reasonableness/sense-making, researchers on Debate (like Beth Barnes, Paul Christiano, Joe Collman) and the people I mentioned in the symbolic AI point also seem to consider more argumentative and logical forms of rationality (in combination with decision-theoretic reasoning).

4. Pluralism as respect for the equality and autonomy of persons

This feels like something that a lot of current research focuses on. Most people trying to learn values and preferences focus on the individual preferences of people at a specific point in time, which seems pretty good for respecting differences in value. The way this wouldn't work would be if the specific formalism (like utility functions over histories) were really biased against some forms of value.

Furthermore, when it comes to human values, then at least in some domains (e.g. what is beautiful, racist, admirable, or just), we ought to identify what's valuable not with the revealed preference or even the reflective judgement of a single individual, but with the outcome of some evaluative social process that takes into account pre-existing standards of valuation, particular features of the entity under evaluation, and potentially competing reasons for applying, not applying, or revising those standards.

As it happens, this anti-individualist approach to valuation isn't particularly prominent in Western philosophical thought (but again, see Anderson). Perhaps then, by looking towards philosophical traditions like Confucianism, we can develop a better sense of how these normative social processes should be modeled.

Do you think this relates to ideas like computational social choice? I guess the difference with the latter comes from it taking individual preferences as building blocks, whereas you seem to want community norms as primitives.

I definitely don't know Confucianism well enough to discuss it in this context, but I'm really not convinced by the value of all social norms. For some (like those around language and morality), Abram's Learning Normativity agenda feels relevant.

I think this methodology is actually really promising way to deal with the question of ontological shifts. Rather than framing ontological shifts as quasi-exogenous occurrence that agents have to respond to, it frames them as meta-cognitive choices that we select with particular ends in mind.

My first reaction is horror at imagining how this approach could allow an AGI to take a decision with terrible consequences for humans, and then change its concepts to justify it to itself. Being more charitable to your proposal, I do think that this can be a good analysis perspective, especially for understanding reward tampering problems. But I want the algorithm/program dealing with ontological crises to keep some tethers to important things I want it aligned to. So in some sense, I want AGIs to be some form of realists about concepts like corrigibility and impact.

The worry here is that consciousness may have evolved in animals because it serves some function, and so, AI might only reach human-level usefulness if it is conscious. And if it is conscious, it could suffer. Most of us who care about sentient beings besides humans would want to make sure that AI doesn’t suffer — we don’t want to create a race of artificial slaves. So that’s why it might be really important to figure out whether agents can have functional consciousness without suffering.

I'm significantly more worried about AGI creating terrible suffering in humans than about AIs and AGIs themselves suffering. This is probably an issue with my moral circle, but I still stand by that priority. That being said, I'm not in favor of suffering for no reason whatsoever. So finding ways to limit this suffering without compromising alignment seems worthwhile. Thanks for pointing me to this question and this paper.

Comment by adamshimi on Reflections on Larks’ 2020 AI alignment literature review · 2021-01-02T21:04:49.966Z · LW · GW

You gave me food for thought. I hadn't thought about your objection about growth (or at least about pushing for growth). I think I disagree with the point about strategy research, since I believe that strategy research can help give a bird's-eye view of the field that is harder to get when exploring.

Comment by adamshimi on Gradient hacking · 2021-01-02T14:10:39.089Z · LW · GW

Hmm, I would say that your logic is probably redundant, and thus might end up being removed for simplicity reasons? Whereas I expect deceptive logic to include very useful things like knowing how the optimization process works, which would definitely help with getting better performance.

But to be honest, how SGD can create gradient hacking (if that's even possible) is a completely open research problem.

Comment by adamshimi on Gradient hacking · 2021-01-02T10:06:01.613Z · LW · GW

I agree with your intuition, but I want to point out again that, past some initial amount that is useless on its own, "deceptive logic" is probably a pretty useful thing in general for the model, because it helps improve performance as measured by the base objective.

SGD making the model more capable seems like the most obvious way to satisfy the conditions for deceptive alignment.

Comment by adamshimi on Gradient hacking · 2021-01-01T23:21:44.906Z · LW · GW

Actually, I did mean that SGD might stumble upon gradient hacking. Or, to be a bit more realistic, make the model slightly deceptive, at which point decreasing the deceptiveness a bit makes the model worse at the base objective but increasing it a bit makes the model better, so there is a push towards deceptiveness, until the model is deceptive enough to use gradient hacking in the way you mention.
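To make that push explicit, here is a toy numerical sketch (entirely made up, not a claim about real training dynamics): if the base loss happens to decrease locally as a "deceptiveness" parameter increases, plain gradient descent increases that parameter.

```python
# Toy illustration of the dynamic described above. The scalar `d` ("deceptiveness")
# and the shape of the loss are invented purely for the sake of the example.

def base_loss(d):
    # Assume, for illustration only, that base performance improves as the model
    # becomes more deceptive, up to d = 1.0.
    return (d - 1.0) ** 2

def grad(f, x, eps=1e-6):
    # Central-difference numerical gradient.
    return (f(x + eps) - f(x - eps)) / (2 * eps)

d = 0.1          # a slightly deceptive model
lr = 0.1
for step in range(50):
    d -= lr * grad(base_loss, d)
print(round(d, 3))  # ~1.0: the gradient pushed deceptiveness up, not down
```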

Does that make more sense to you?