Posts

Is it harder to become a MIRI mathematician in 2019 compared to in 2013? 2019-10-29T03:28:52.949Z · score: 60 (25 votes)
Deliberation as a method to find the "actual preferences" of humans 2019-10-22T09:23:30.700Z · score: 24 (9 votes)
What are the differences between all the iterative/recursive approaches to AI alignment? 2019-09-21T02:09:13.410Z · score: 30 (8 votes)
Inversion of theorems into definitions when generalizing 2019-08-04T17:44:07.044Z · score: 24 (8 votes)
Degree of duplication and coordination in projects that examine computing prices, AI progress, and related topics? 2019-04-23T12:27:18.314Z · score: 28 (10 votes)
Comparison of decision theories (with a focus on logical-counterfactual decision theories) 2019-03-16T21:15:28.768Z · score: 60 (18 votes)
GraphQL tutorial for LessWrong and Effective Altruism Forum 2018-12-08T19:51:59.514Z · score: 52 (11 votes)
Timeline of Future of Humanity Institute 2018-03-18T18:45:58.743Z · score: 17 (8 votes)
Timeline of Machine Intelligence Research Institute 2017-07-15T16:57:16.096Z · score: 5 (5 votes)
LessWrong analytics (February 2009 to January 2017) 2017-04-16T22:45:35.807Z · score: 22 (22 votes)
Wikipedia usage survey results 2016-07-15T00:49:34.596Z · score: 7 (8 votes)

Comment by riceissa on What I’ll be doing at MIRI · 2019-11-20T00:14:59.307Z · score: 2 (2 votes) · LW · GW

[Meta] At the moment, Oliver's comment has 15 karma across 1 vote (and 6 AF karma). If I'm understanding LW's voting system correctly, the only way this could have happened is if Oliver undid his default vote on the comment, and then Eliezer Yudkowsky did a strong-upvote on the comment (see here for a list of users by voting power). But my intuition says this scenario is implausible, so I'm curious what happened instead.
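For concreteness, here is the arithmetic behind that inference as a toy sketch (the vote strengths below are assumptions for illustration, not LW's actual values):

```python
# Toy model of LW-style karma, with hypothetical vote strengths chosen
# only to illustrate the scenario above: a comment's score is the sum of
# its voters' vote strengths, and the displayed count is the number of
# voters.

def tally(votes):
    """votes: dict mapping user -> signed vote strength."""
    return sum(votes.values()), len(votes)

# Default state: the author's automatic self-upvote (strength 2 assumed).
print(tally({"author": 2}))  # (2, 1)

# The puzzling state: the author has removed their self-vote (so they no
# longer count as a voter) and a single very-high-karma user has
# strong-upvoted (strength 15 assumed), giving 15 karma across 1 vote.
print(tally({"high_karma_user": 15}))  # (15, 1)
```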

(This isn't important, but I'm curious anyway.)

Comment by riceissa on [AN #62] Are adversarial examples caused by real but imperceptible features? · 2019-10-28T22:54:09.147Z · score: 6 (4 votes) · LW · GW

Based on the October 2019 update, it looks like Ought is now using "factored cognition" as an umbrella term that includes both factored generation (which used to be called factored cognition) and factored evaluation.

(Commenting here because as far as I know this post is one of the main places that discusses this distinction.)

Comment by riceissa on Jacy Reese (born Jacy Anthis)? · 2019-10-26T22:29:56.058Z · score: 5 (3 votes) · LW · GW

It’s interesting that this minor fact, which to several people (including me) has seemed like an obvious omission, doesn’t meet Wikipedia’s standards for inclusion. But if Wikipedia had less strict standards, it would be very hard to keep out false information.

Eliezer Yudkowsky has made similar distinctions when talking about scientific vs legal vs rational evidence (see this wiki page) and science vs probability theory.

I think there is an interesting question of "what ought to count as evidence, if we want to produce the best online encyclopedia we can, given the flawed humans we have to write it?" My own view is that Wikipedia's standards for evidence have become too strict in cases like this.

Comment by riceissa on Deliberation as a method to find the "actual preferences" of humans · 2019-10-26T01:54:29.363Z · score: 1 (1 votes) · LW · GW

Thanks, I think I agree (but want to think about this more). I might edit the post in the future to incorporate this change.

Comment by riceissa on Deliberation as a method to find the "actual preferences" of humans · 2019-10-26T01:51:15.590Z · score: 1 (1 votes) · LW · GW

I agree with this, and didn't mean to imply anything against it in the post.

Comment by riceissa on Two explanations for variation in human abilities · 2019-10-26T01:42:19.001Z · score: 8 (4 votes) · LW · GW

Regarding your footnote, literacy rates depend on the definition of literacy used. Under minimal definitions, "pretty close to 100 percent of the population is capable of reading" is true, but under stricter definitions, "maybe 20 or 30 percent" seems closer to the mark.

https://en.wikipedia.org/wiki/Literacy#United_States

https://en.wikipedia.org/wiki/Functional_illiteracy#Prevalence

https://en.wikipedia.org/wiki/Literacy_in_the_United_States

"Current literacy data are generally collected through population censuses or household surveys in which the respondent or head of the household declares whether they can read and write with understanding a short, simple statement about one's everyday life in any written language. Some surveys require respondents to take a quick test in which they are asked to read a simple passage or write a sentence, yet clearly literacy is a far more complex issue that requires more information." http://uis.unesco.org/en/topic/literacy

I'm not sure why you are so optimistic about people learning calculus.

Comment by riceissa on AI Alignment Open Thread October 2019 · 2019-10-24T00:07:07.354Z · score: 5 (3 votes) · LW · GW

Thanks!

I am more confused about posts than comments. For posts, only my comparison of decision theories post is currently cross-posted to AF, but I actually think my post about deliberation, question about iterative approaches to alignment (along with Rohin's answer), and question about coordination on AI progress projects are more relevant to AF (either because they make new claims or because they encourage others to do so). If I see that a particular post hasn't been cross-posted to AF, I'm wondering if I should be thinking more like "every single moderator has looked at the post, and believes it doesn't belong on AF" or more like "either the moderators are busy, or something about the post title caused them to not look at the post, and it sort of fell through the cracks".

Comment by riceissa on AI Alignment Open Thread October 2019 · 2019-10-23T22:39:02.752Z · score: 1 (1 votes) · LW · GW

[Meta] I'm not a full member on Alignment Forum, but I've had some of my LW content cross-posted to AF. However, this cross-posting seems haphazard, and does not correspond to my intuitive feeling of which of my posts/comments "should" end up on AF. I would like for one of the following to happen:

• More insight into the mechanism that decides what gets cross-posted, so I feel less annoyed at the arbitrary-seeming nature of it.
• More control over what gets cross-posted (if this requires applying for full membership, I would be willing to do that).
• Have all my AF cross-posting be undone so that readers don't get a misleading impression of my AI alignment content. (I would like to avoid people visiting my AF profile, reading content there, and concluding something about my AI alignment output based on that.)
Comment by riceissa on An1lam's Short Form Feed · 2019-10-23T21:40:42.374Z · score: 3 (2 votes) · LW · GW

• I like CheCheDaWaff's comments on r/Anki; see here for a decent place to start. In particular, for proofs, I've shifted toward adding "prove this theorem" cards rather than trying to break the proof into many small pieces. (The latter adheres more to the spaced repetition philosophy, but I found it just doesn't really work.)
• Richard Reitz has a Google doc with a bunch of stuff.
• I like this forum comment (as a data point, and as motivation to try to avoid similar failures).
• I like https://eshapard.github.io
• Master How To Learn also has some insights but most posts are low-quality.

One thing I should mention is that a lot of the above links aren't written well. See this Quora answer for a view I basically agree with.

I couldn’t stop thinking about it

I agree that thinking about this is pretty addicting. :) I think this kind of motivation helps me to find and read a bunch online and to make occasional comments (such as the grandparent) and brain dumps, but I find it's not quite enough to get me to invest the time to write a comprehensive post about everything I've learned.

Comment by riceissa on An1lam's Short Form Feed · 2019-10-23T04:10:21.194Z · score: 5 (2 votes) · LW · GW

I would be surprised if Gwern hasn’t already thought about the claim I’m going to make

I briefly looked at gwern's public database several months ago, and got the impression that he isn't using Anki in the incremental reading/learning way that you (and Michael Nielsen) describe. Instead, he seems to just add a bunch of random facts. This isn't to say gwern hasn't thought about this, but just that if he has, he doesn't seem to be making use of this insight.

In the Platonic graph of this domain’s knowledge ontology, how central is this node?

I feel like the center often shifts as I learn more about a topic (because I develop new interests within it). The questions I ask myself are more like "How embarrassed would I be if someone asked me this and I didn't know the answer?" and "How much does knowing this help me learn more about the topic or related topics?" (These aren't ideal phrasings of the questions my gut is asking.)

knowing that I’ll remember at least the stuff I’ve Anki-ized has a surprisingly strong motivational impact on me on a gut level

In my experience, I often still forget things I've entered into Anki either because the card was poorly made or because I didn't add enough "surrounding cards" to cement the knowledge. So I've shifted away from this to thinking something more like "at least Anki will make it very obvious if I didn't internalize something well, and will give me an opportunity in the future to come back to this topic to understand it better instead of just having it fade without detection".

there’s O(5) actual blog posts about it

I'm confused about what you mean by this. (One guess I have is big-O notation, but big-O notation is not sensitive to constants, so I'm not sure what the 5 is doing, and big-O notation is also about asymptotic behavior of a function and I'm not sure what input you're considering.)
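For reference, the constant-insensitivity being appealed to: by the definition of big-O, a constant factor is absorbed into the witness constant, so O(5) and O(1) denote the same class:

```latex
f(n) = O(5) \iff \exists\, C > 0,\ n_0 \text{ such that } |f(n)| \le 5C \text{ for all } n \ge n_0 \iff f(n) = O(1)
```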

I think there are few well-researched and comprehensive blog posts, but I've found that there is a lot of additional wisdom the spaced repetition community has accumulated, which is mostly written down in random Reddit comments and smaller blog posts. I feel like I've benefited somewhat from reading this wisdom (but have benefited more from just trying a bunch of things myself). For myself, I've considered writing up what I've learned about using Anki, but it hasn't been a priority because (1) other topics seem more important to work on and write about; (2) most newcomers cannot distinguish between good and bad advice, so I anticipate having low impact by writing about Anki; (3) I've only been experimenting informally and personally, and it's difficult to tell how well my lessons generalize to others.

Comment by riceissa on Rationality Exercises Prize of September 2019 (\$1,000) · 2019-10-22T04:29:46.336Z · score: 9 (5 votes) · LW · GW

Were the winners ever announced? If I'm counting correctly, it has now been over four weeks since September 20, so the winners should have been announced around two weeks ago. (I checked for new posts by Ben, this post, and the comments on this post.)

Comment by riceissa on AI Safety "Success Stories" · 2019-10-21T20:28:36.772Z · score: 1 (1 votes) · LW · GW

I think I was imagining that the pivotal tool AI is developed by highly competent and safety-conscious humans who use it to perform a pivotal act (or series of pivotal acts) that effectively precludes the kind of issues mentioned in Wei's quote there.

Even if you make this assumption, it seems like the reliance on human safety does not go down. I think you're thinking about something more like "how likely it is that lack of human safety becomes a problem" rather than "reliance on human safety".

Comment by riceissa on We tend to forget complicated things · 2019-10-21T01:22:46.804Z · score: 7 (5 votes) · LW · GW

I think you are describing overlearning and chunking (once concepts become chunked they "feel easy", and one reliable way to chunk ideas is to overlearn them).

Comment by riceissa on Humans can be assigned any values whatsoever… · 2019-10-19T02:17:02.249Z · score: 1 (1 votes) · LW · GW

I'm curious what you think of my comment here, which suggests that Kolmogorov complexity might be enough after all, as long as we are willing to change our notion of compatibility.

(I'm also curious what you think of Daniel's post, although to a lesser extent.)

Comment by riceissa on AI Safety "Success Stories" · 2019-10-18T01:32:23.485Z · score: 4 (2 votes) · LW · GW

I think the pivotal tool story has low reliance on human safety (although I’m confused by that row in general).

From the Task-directed AGI page on Arbital:

The obvious disadvantage of a Task AGI is moral hazard - it may tempt the users in ways that a Sovereign would not. A Sovereign has moral hazard chiefly during the development phase, when the programmers and users are perhaps not yet in a position of special relative power. A Task AGI has ongoing moral hazard as it is used.

(My understanding is that task AGI = genie = Pivotal Tool.)

Wei Dai gives some examples of what could go wrong in this post:

For example, such AIs could give humans so much power so quickly or put them in such novel situations that their moral development can’t keep up, and their value systems no longer apply or give essentially random answers. AIs could give us new options that are irresistible to some parts of our motivational systems, like more powerful versions of video game and social media addiction. In the course of trying to figure out what we most want or like, they could in effect be searching for adversarial examples on our value functions. At our own request or in a sincere attempt to help us, they could generate philosophical or moral arguments that are wrong but extremely persuasive.

The underlying problem seems to be that when humans are in control over long-term outcomes, we are relying more on the humans to have good judgment, and this becomes increasingly a problem the more task-shaped the AI becomes.

I'm curious what your own thinking is (e.g. how would you fill out that row?).

Comment by riceissa on AI Safety "Success Stories" · 2019-10-18T01:11:28.309Z · score: 2 (2 votes) · LW · GW

Or does it also include a story about how AI is deployed (and by who, etc.)?

The "Controlled access" row seems to imply that at least part of how the AI is deployed is part of each success story (with some other parts left to be filled in later). I agree that having more details for each story would be nice.

Somewhat related to this is that I've found it slightly confusing that each success story is named after the kind of AI that is present in that story. So when one says "Sovereign Singleton", this could mean either the AI itself or the AI together with all the other assumptions (e.g. hard takeoff) for how having that kind of AI leads to a "win".

Comment by riceissa on Occam's Razor May Be Sufficient to Infer the Preferences of Irrational Agents: A reply to Armstrong & Mindermann · 2019-10-09T21:14:56.949Z · score: 8 (4 votes) · LW · GW

I still think A&M's No Free Lunch theorem goes through, but now I think A&M are proving the wrong theorem. A&M try to find the simplest (planner, reward) decomposition that is compatible with the human policy, but it seems like we instead additionally want compatibility with all the evidence we have observed, including sensory data of humans saying things like "if I was more rational, I would be exercising right now instead of watching TV" and "no really, my reward function is not empty". The important point is that such sensory data gives us information not just about the human policy, but also about the decomposition. Forcing compatibility with this sensory data seems to rule out degenerate pairs. This makes me feel like Occam's Razor would work for inferring preferences up to a certain point (i.e. as long as the situations are all "in-distribution").
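As a minimal illustration of why compatibility with the policy alone underdetermines the decomposition (the states, actions, and reward values below are invented for illustration):

```python
# Toy illustration of A&M-style degenerate (planner, reward) pairs: two
# very different decompositions yield exactly the same policy, so policy
# compatibility alone cannot distinguish them.

states = ["hungry", "full"]
actions = ["eat", "rest"]

# The observed human policy we are trying to decompose.
def human_policy(state):
    return "eat" if state == "hungry" else "rest"

# "Intended" pair: a rational planner maximizing a sensible reward.
reward_intended = {("hungry", "eat"): 1, ("hungry", "rest"): 0,
                   ("full", "eat"): 0, ("full", "rest"): 1}

def planner_rational(reward):
    return lambda s: max(actions, key=lambda a: reward[(s, a)])

# Degenerate pair: an empty (all-zero) reward, with the policy hard-coded
# into the planner, so all the complexity hides in the planner instead.
reward_empty = {k: 0 for k in reward_intended}

def planner_degenerate(reward):
    return lambda s: human_policy(s)

# Both pairs are exactly compatible with the observed policy.
for s in states:
    assert planner_rational(reward_intended)(s) == human_policy(s)
    assert planner_degenerate(reward_empty)(s) == human_policy(s)
print("both decompositions reproduce the policy")
```

Sensory data of the human saying "no really, my reward function is not empty" is evidence about which of these two pairs is the right one, even though both fit the policy perfectly.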

If we are trying to find the (planner, reward) decomposition of non-human minds: I think if we were randomly handed a mind from all of mind design space, then A&M's No Free Lunch theorem would apply, because the simplest explanation really is that the mind has a degenerate decomposition. But if we were randomly handed an alien mind from our universe, then we would be able to use all the facts we have learned about our universe, including how the aliens likely evolved, any statements they seem to be making about what they value, and so on.

Does this line of thinking also apply to the case of science? I think not, because we wouldn't be able to use our observations to get information about the decomposition. Unlike the case of values, the natural world isn't making statements like "actually, the laws are empty and all the complexity is in the initial conditions". I still don't think the No Free Lunch theorem works for science either, because of my previous comments.

Comment by riceissa on List of resolved confusions about IDA · 2019-10-09T07:50:35.297Z · score: 5 (3 votes) · LW · GW

Seems odd to have the idealistic goal get to be the standard name, and the dime-a-dozen failure mode be a longer name that is more confusing.

I agree this is confusing.

Is there a reason why the standard terms are not being used to refer to the standard, short-term results?

As far as I know, Paul hasn't explained his choice in detail. One reason he does mention, in this comment, is that in the context of strategy-stealing, preferences like "help me stay in control and be well-informed" do not make sense when interpreted as preferences-as-elicited, since the current user has no way to know if they are in control or well-informed.

In the post Wei contrasts "current" and "actual" preferences. "Stated" vs "reflective" preferences seem like nice alternatives too.

I think current=elicited=stated, but actual≈reflective (because there is the possibility that undergoing reflection isn't a good way to find out our actual preferences, or as Paul says 'There’s a hypothesis that “what I’d say after some particular idealized process of reflection” is a reasonable way to capture “actual preferences,” but I think that’s up for debate—e.g. it could fail if me-on-reflection is selfish and has values opposed to current-me, and certainly it could fail for any particular process of reflection and so it might just happen to be the case that there is no process of reflection that satisfies it.')

Comment by riceissa on List of resolved confusions about IDA · 2019-10-09T06:41:11.501Z · score: 1 (1 votes) · LW · GW

I think Paul calls that "preferences-as-elicited", so if we're talking about act-based agents, it would be "short-term preferences-as-elicited" (see this comment).

Comment by riceissa on List of resolved confusions about IDA · 2019-10-09T05:16:31.296Z · score: 4 (3 votes) · LW · GW

My understanding is that Paul never meant to introduce the term "narrow preferences" (i.e. "narrow" is not an adjective that applies to preferences), and the fact that he talked about narrow preferences in the act-based agents post was an accident/something he no longer endorses.

Instead, when Paul says "narrow", he's talking not about preferences but about narrow vs ambitious value learning. This is what Paul means when he says "I've only ever used [the term "narrow"] in the context of value learning, in order to make this particular distinction between two different goals you might have when doing value learning."

See also this comment and the ambitious vs narrow value learning post.

Comment by riceissa on Occam's Razor May Be Sufficient to Infer the Preferences of Irrational Agents: A reply to Armstrong & Mindermann · 2019-10-09T04:40:47.423Z · score: 1 (1 votes) · LW · GW

Thanks for the explanation, I think I understand this better now.

My response to your second point: I wasn't sure how the sequence prediction approach to induction (like Solomonoff induction) deals with counterfactuals, so I looked it up, and it looks like we can convert the counterfactual question into a sequence prediction question by appending the counterfactual to all the data we have seen so far. So in the nuclear launch codes example, we would feed the sequence predictor with a video of the launch codes being posted to the internet, and then ask it to predict what sequence it expects to see next. (See the top of page 9 of this PDF and also example 5.2.2 in Li and Vitanyi for more details and further examples.) This doesn't require a decomposition into laws and conditions; rather it seems to require that the events E be a function that can take in bits and print out more bits (or a probability distribution over bits). But this doesn't seem like a problem, since in the values case the policy π is also a function. (Maybe my real point is that I don't understand why you are assuming E has to be a sequence of events?) [ETA: actually, maybe E can be just a sequence of events, but if we're talking about complexity, there would be some program that generates E, so I am suggesting we use that program instead of L and C for counterfactual reasoning.]
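Here is a minimal sketch of that "append the counterfactual and predict onward" move (the hypothesis class and simplicity-weighted prior are invented stand-ins; actual Solomonoff induction mixes over all programs):

```python
# Toy Bayesian sequence predictor in the spirit of Solomonoff induction.
# To answer a counterfactual, condition on (observed data + the
# counterfactual event) and predict what comes next.

from fractions import Fraction

# Each hypothesis maps a history (tuple of bits) to P(next bit = 1).
hypotheses = {
    "always_0": lambda h: Fraction(0),
    "copy_last": lambda h: Fraction(1) if h and h[-1] == 1 else Fraction(0),
    "uniform": lambda h: Fraction(1, 2),
}
# Simplicity-weighted prior (stand-in for 2^-length(program)).
prior = {"always_0": Fraction(1, 2), "copy_last": Fraction(1, 4),
         "uniform": Fraction(1, 4)}

def likelihood(name, seq):
    p = Fraction(1)
    for i, bit in enumerate(seq):
        q = hypotheses[name](tuple(seq[:i]))
        p *= q if bit == 1 else 1 - q
    return p

def predict_next(seq):
    """P(next bit = 1 | seq), mixing hypotheses by posterior weight."""
    weights = {n: prior[n] * likelihood(n, seq) for n in prior}
    total = sum(weights.values())
    return sum(w * hypotheses[n](tuple(seq)) for n, w in weights.items()) / total

observed = [0, 0, 0]
print(predict_next(observed))        # 1/50: a 1 is very unlikely next
# Counterfactual: "what if a 1 had just occurred?" -- append it, predict on.
print(predict_next(observed + [1]))  # 1/2: the prediction shifts
```

Conditioning on the appended counterfactual reweights the hypotheses (here, killing off "always_0"), with no decomposition into laws and conditions required.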

My response to your first point: I am far from an expert here, but my guess is that an Occam's Razor advocate would bite the bullet and say this is fine, since either (1) the degenerate predictors will have high complexity so will be dominated by simpler predictors, or (2) we are just as likely to be living in a "degenerate" world as we are to be living in the kind of "predictable" world that we think we are living in.

Comment by riceissa on Occam's Razor May Be Sufficient to Infer the Preferences of Irrational Agents: A reply to Armstrong & Mindermann · 2019-10-07T22:00:42.820Z · score: 5 (4 votes) · LW · GW

I'm not confident I've understood this post, but it seems to me that the difference between the values case and the empirical case is that in the values case, we want to do better than humans at achieving human values (this is the "ambitious" in "ambitious value learning") whereas in the empirical case, we are fine with just predicting what the universe does (we aren't trying to predict the universe even better than the universe itself). In the formalism, in π = P(R) we are after R (rather than π), but in E = L(C) we are after E (rather than L or C), so in the latter case it doesn't matter if we get a degenerate pair (because it will still predict the future events well). Similarly, in the values case, if all we wanted was to imitate humans, then it seems like getting a degenerate pair would be fine (it would act just as human as the "intended" pair).

If we use Occam’s Razor alone to find law-condition pairs that fit all the world’s events, we’ll settle on one of the degenerate ones (or something else entirely) rather than a reasonable one. This could be very dangerous if we are e.g. building an AI to do science for us and answer counterfactual questions like “If we had posted the nuclear launch codes on the Internet, would any nukes have been launched?”

I don't understand how this conclusion follows (unless it's about the malign prior, which seems not relevant here). Could you give more details on why answering counterfactual questions like this would be dangerous?

Comment by riceissa on What do the baby eaters tell us about ethics? · 2019-10-06T23:08:20.827Z · score: 4 (4 votes) · LW · GW

Eliezer has written a sequence on meta-ethics. I wonder if you're aware of it? (If you are, my next question is why you don't consider it an answer to your question.)

Another thought I've had since I read the story is that it seems like a lot of human-human interactions are really human-babyeater interactions.

I think Under-acknowledged Value Differences makes the same point.

Comment by riceissa on LW Team Updates - October 2019 · 2019-10-02T23:12:14.945Z · score: 1 (1 votes) · LW · GW

On LessWrong's GraphiQL, I noticed that hovering over keywords no longer provides documentation help. (See here for what the hover-over used to look like.) Would it be possible to turn this back on?

Comment by riceissa on LW Team Updates - October 2019 · 2019-10-02T23:08:06.249Z · score: 4 (2 votes) · LW · GW

On textboxes for comments/posts, I noticed that the documentation is for the rich text editor regardless of what option I have set in preferences (I have markdown). To be clear, markdown formatting works; it's just that the documentation tells me to e.g. use Cmd-4 for LaTeX rather than dollar signs. I tried switching to rich text and back to markdown, but the problem persists.

Comment by riceissa on Three ways that "Sufficiently optimized agents appear coherent" can be false · 2019-10-02T22:55:34.609Z · score: 1 (1 votes) · LW · GW

I believe there’s reason to think that Eliezer never intended “Sufficiently optimized agents appear coherent” to have an airtight argument and be universally true.

On the Arbital version of the page (but not the GreaterWrong version you linked to) [ETA: I just realized that you did link to the Arbital version, but I was viewing it on GW] one can see that Eliezer assigned 85% probability to the claim (though it's not clear if the uncertainty is more like "I tried to make an airtight universal argument, but it might be wrong" or more like "I tried to show that this will happen in most cases, but there are also cases where I don't think it will happen").

Comment by riceissa on World State is the Wrong Level of Abstraction for Impact · 2019-10-02T21:57:51.323Z · score: 2 (2 votes) · LW · GW

I appreciate this clarification, but when I wrote my comment, I hadn't read the original AUP post or the paper, since I assumed this sequence was supposed to explain AUP starting from scratch (so I didn't have the idea of auxiliary set when I wrote my comment).

Comment by riceissa on World State is the Wrong Level of Abstraction for Impact · 2019-10-02T00:21:33.631Z · score: 5 (3 votes) · LW · GW

It seems like one downside of impact in the AU sense is that in order to figure out whether an action has high impact, the AI needs to have a detailed understanding of human values and the ontology used by humans. (This is in contrast to the state-based measures of impact, where calculating the impact of a state change seems easier.) Without such an understanding, the AI seems to either do nothing (in order to prevent itself from causing bad kinds of high impact) or make a bunch of mistakes. So my feeling is that in order to actually implement an AI that does not cause bad kinds of high impact, we would need to make progress on value learning (but once we've made progress on value learning, it's not clear to me what AU theory adds in terms of increased safety).

Comment by riceissa on What funding sources exist for technical AI safety research? · 2019-10-01T20:26:35.566Z · score: 4 (3 votes) · LW · GW

Potential additions: Paul Christiano, Future of Life Institute, EA Grants.

Comment by riceissa on AI Safety "Success Stories" · 2019-10-01T09:31:02.484Z · score: 3 (2 votes) · LW · GW

Corrigible Contender

A semi-autonomous AGI that does not have long-term preferences of its own but acts according to (its understanding of) the short-term preferences of some human or group of humans

In light of recent discussion, it seems like this part should be clarified to say "actual preferences" or "short-term preferences-on-reflection".

Also in the table, should the reliance on human safety for Corrigible Contender be changed from "High" to "Medium"? (My feeling is that since the AI isn't relying on the current humans' elicited preferences, the reliance on human safety would be somewhere between that of Sovereign Singleton and Pivotal Tool.)

(I'm making these suggestions mainly because I expect people will continue to refer to this post in the future.)

Comment by riceissa on List of resolved confusions about IDA · 2019-10-01T06:23:36.348Z · score: 5 (3 votes) · LW · GW

I still feel confused about "distill ≈ RL". In RL+Imitation (which I assume is also talking about distillation, and which was written after Semi-supervised reinforcement learning), Paul says things like "In the same way that we can reason about AI control by taking as given a powerful RL system or powerful generative modeling, we could take as given a powerful solution to RL+imitation. I think that this is probably a better assumption to work with" and "Going forward, I’ll preferentially design AI control schemes using imitation+RL rather than imitation, episodic RL, or some other assumption".

Was there a later place where Paul went back to just RL? Or is RL+Imitation about something other than distillation? Or is the imitation part such a small contribution that writing "distill ≈ RL" is still accurate?

ETA: From the FAQ for Paul's agenda:

1.2.2: OK, so given this amplified aligned agent, how do you get the distilled agent?

Train a new agent via some combination of imitation learning (predicting the actions of the amplified aligned agent), semi-supervised reinforcement learning (where the amplified aligned agent helps specify the reward), and techniques for optimizing robustness (e.g. creating red teams that generate scenarios that incentivize subversion).

and:

The imitation learning is more about getting this new agent off the ground than about ensuring alignment. The bulk of the alignment guarantee comes from the semi-supervised reinforcement learning, where we train it to work on a wide range of tasks and answer questions about its cognition.

Comment by riceissa on List of resolved confusions about IDA · 2019-09-30T22:49:23.204Z · score: 9 (5 votes) · LW · GW

I used to think that after the initial distillation step, the AI would be basically human-level. Now I understand that after the initial distillation step, the AI will be superhuman in some respects and subhuman in others, but wouldn't be "basically human" in any sense. Source

Comment by riceissa on Utility ≠ Reward · 2019-09-29T23:22:54.456Z · score: 5 (3 votes) · LW · GW

To me, it seems like the two distinctions are different. There seem to be three levels to distinguish:

1. The reward (in the reinforcement learning sense) or the base objective (example: inclusive genetic fitness for humans)
2. A mechanism in the brain that dispenses pleasure or provides a proxy for the reward (example: pleasure in humans)
3. The actual goal/utility that the agent ends up pursuing (example: a reflective equilibrium for some human's values, which might have nothing to do with pleasure or inclusive genetic fitness)

The base objective vs mesa-objective distinction seems to be about (1) vs a combination of (2) and (3). The reward maximizer vs utility maximizer distinction seems to be about (2) vs (3), or maybe (1) vs (3).

Depending on the agent that is considered, only some of these levels may be present:

• A "dumb" RL-trained agent that engages in reward gaming. Only level (1), and there is no mesa-optimizer.
• A "dumb" RL-trained agent that engages in reward tampering. Only level (1), and there is no mesa-optimizer.
• A paperclip maximizer built from scratch. Only level (3), and there is no mesa-optimizer.
• A relatively "dumb" mesa-optimizer trained using RL might have just (1) (the base objective) and (2) (the mesa-objective). This kind of agent would be incentivized to tamper with its pleasure circuitry (in the sense of (2)), but wouldn't be incentivized to tamper with its RL-reward circuitry. (Example: rats wirehead to give themselves MAX_PLEASURE, but don't self-modify to delude themselves into thinking they have left many descendants.)
• If the training procedure somehow coughs up a mesa-optimizer that doesn't have a "pleasure center" in its brain (I don't know how this would happen, but it seems logically possible), there would just be (1) (the base objective) and (3) (the mesa-objective). This kind of agent wouldn't try to tamper with its utility function (in the sense of (3)), nor would it try to tamper with its RL-reward/base-objective to delude itself into thinking it has high rewards.

ETA: Here is a table that shows these distinctions varying independently:

|  | Utility maximizer | Reward maximizer |
| --- | --- | --- |
| Optimizes for base objective (i.e. mesa-optimizer absent) | Paperclip maximizer | "Dumb" RL-trained agent |
| Optimizes for mesa-objective (i.e. mesa-optimizer present) | Human in reflective equilibrium | Rats |
Comment by riceissa on The strategy-stealing assumption · 2019-09-28T00:38:34.220Z · score: 22 (5 votes) · LW · GW

Like Wei Dai, I am also finding this discussion pretty confusing. To summarize my state of confusion, I came up with the following list of ways in which preferences can be short or long:

1. time horizon and time discounting: how far in the future is the preference about? More generally, how much weight do we place on the present vs the future?
2. act-based ("short") vs goal-based ("long"): using the human's (or more generally, the human-plus-AI-assistants'; see (6) below) estimate of the value of the next action (act-based) or doing more open-ended optimization of the future based on some goal, e.g. using a utility function (goal-based)
3. amount of reflection the human has undergone: "short" would be the current human (I think this is what you call "preferences-as-elicited"), and this would get "longer" as we give the human more time to think, with something like CEV/Long Reflection/Great Deliberation being the "longest" in this sense (I think this is what you call "preference-on-idealized-reflection"). This sense further breaks down into whether the human itself is actually doing the reflection, or if the AI is instead predicting what the human would think after reflection.
4. how far the search happens: "short" would be a limited search (that lacks insight/doesn't see interesting consequences) and "long" would be a search that has insight/sees interesting consequences. This is a distinction you made in a discussion with Eliezer a while back. This distinction also isn't strictly about preferences, but rather about how one would achieve those preferences.
5. de dicto ("short") vs de re ("long"): This is a distinction you made in this post. I think this is the same distinction as (2) or (3), but I'm not sure which. (But if my interpretation of you below is correct, I guess this must be the same as (2) or else a completely different distinction.)
6. understandable ("short") vs evaluable ("long"): A course of action is understandable if the human (without any AI assistants) can understand the rationale behind it; a course of action is evaluable if there is some procedure the human can implement to evaluate the rationale using AI assistants. I guess there is also a "not even evaluable" option here that is even "longer". (Thanks to Wei Dai for bringing up this distinction, although I may have misunderstood the actual distinction.)

My interpretation is that when you say "short-term preferences-on-reflection", you mean short in sense (1), except when the AI needs to gather resources, in which case either the human or the AI will need to do more long-term planning; short in sense (2); long in sense (3), with the AI predicting what the human would think after reflection; long in sense (4); short in sense (5); long in sense (6). Does this sound right to you? If not, I think it would help me a lot if you could "fill in the list" with which of short or long you choose for each point.

Assuming my interpretation is correct, my confusion is that you say we shouldn't expect a situation where "the user-on-reflection might be happy with the level of corrigibility, but the user themselves might be unhappy" (I take you to be talking about sense (3) from above). It seems like the user-on-reflection and the current user would disagree about many things (that is the whole point of reflection), so if the AI acts in accordance with the intentions of the user-on-reflection, the current user is likely to end up unhappy.

Comment by riceissa on Two clarifications about "Strategic Background" · 2019-09-25T00:25:12.453Z · score: 4 (3 votes) · LW · GW

That post says "We plan to say more in the future about the criteria for strategically adequate projects in 7a" and also "A number of the points above require further explanation and motivation, and we’ll be providing more details on our view of the strategic landscape in the near future". As far as I can tell, MIRI hasn't published any further explanation of this strategic plan (I expected there to be something in the 2018 update but that post talks about other things). Is MIRI still planning to say more about its strategic plan in the near future, and if so, is there a concrete timeframe (e.g. "in a few months", "in a year", "in two years") for publishing such an explanation?

Comment by riceissa on What are the differences between all the iterative/recursive approaches to AI alignment? · 2019-09-24T09:27:30.158Z · score: 5 (3 votes) · LW · GW

Thanks. It looks like all the realistic examples I had of weak HCH are actually examples of strong HCH after all, so I'm looking for some examples of weak HCH to help my understanding. I can see how weak HCH would compute the answer to a "naturally linear recursive" problem (like computing factorials) but how would weak HCH answer a question like "Should I get laser eye surgery?" (to take an example from here). The natural way to decompose a problem like this seems to use branching.
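
To make the "naturally linear recursive" vs branching contrast concrete, here is a toy Python sketch; the `decompose`/`base_answer`/`combine` helpers are hypothetical stand-ins for the human's judgment, not anything from the HCH literature:

```python
# Toy sketch of the two recursion shapes discussed above. A linear
# decomposition (like factorial) asks exactly ONE subquestion per
# level; a branching decomposition spawns SEVERAL, giving a tree.

def factorial_linear(n):
    # Naturally linear recursive: each level asks one subquestion
    # ("what is (n-1)!?") and combines the answer.
    if n <= 1:
        return 1
    return n * factorial_linear(n - 1)

def answer_tree(question):
    # Branching decomposition: each level may spawn several
    # subquestions, like an infinite tree of humans-in-boxes.
    subquestions = decompose(question)
    if not subquestions:
        return base_answer(question)
    sub_answers = [answer_tree(q) for q in subquestions]
    return combine(question, sub_answers)

# Hypothetical stand-ins so the sketch runs:
def decompose(q):
    return ["cost?", "benefit?"] if q == "Should I get laser eye surgery?" else []

def base_answer(q):
    return f"(human's direct answer to {q!r})"

def combine(q, answers):
    return f"(human's synthesis of {answers})"
```

The laser-eye-surgery question seems to need the second shape: the natural first step is to ask about costs and benefits in parallel, which already means more than one subquestion per level.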

Also, I just looked again at Alex Zhu's FAQ for Paul's agenda, and Alex's explanation of weak HCH (in section 2.2.1) seems to imply that it is doing tree recursion (e.g. "I sometimes picture this as an infinite tree of humans-in-boxes, who can break down questions and pass them to other humans-in-boxes"). It seems like either you or Alex must be mistaken here, but I have no idea which.

Comment by riceissa on What are the differences between all the iterative/recursive approaches to AI alignment? · 2019-09-23T11:25:22.471Z · score: 4 (3 votes) · LW · GW

Thanks! I found this answer really useful.

I have some follow-up questions that I'm hoping you can answer:

1. I didn't realize that weak HCH uses linear recursion. On the original HCH post (which is talking about weak HCH), Paul talks in comments about "branching factor", and Vaniver says things like "So he asks HCH to separately solve A, B, and C". Are Paul/Vaniver talking about strong HCH here, or am I wrong to think that branching implies tree recursion? If Paul/Vaniver are talking about weak HCH, and branching does imply tree recursion, then it seems like weak HCH must be using tree recursion rather than linear recursion.
2. Your answer didn't confirm or deny whether the agents in HCH are human-level or superhuman. I'm guessing it's the former, in which case I'm confused about how IDA and recursive reward modeling are approximating strong HCH, since in these approaches the agents are eventually superhuman (so they could solve some problems in ways that HCH can't, or solve problems that HCH can't solve at all).
3. You write that meta-execution is "more a component of other approaches", but Paul says "Meta-execution is annotated functional programming + strong HCH + a level of indirection", which makes it sound like meta-execution is a specific implementation rather than a component that plugs into other approaches (whereas annotated functional programming does seem like a component that can plug into other approaches). Were you talking about annotated functional programming here? If not, how is meta-execution used in other approaches?
4. I'm confused that you say IDA is task-based rather than reward-based. My understanding was that IDA can be task-based or reward-based depending on the learning method used during the distillation process. This discussion thread seems to imply that recursive reward modeling is an instance of IDA. Am I missing something, or were you restricting attention to a specific kind of IDA (like imitation-based IDA)?
Comment by riceissa on Why Subagents? · 2019-09-17T23:04:01.388Z · score: 12 (3 votes) · LW · GW

When I initially read this post, I got the impression that "subagents = path-dependent/incomplete DAG". After working through more examples, it seems like all the work is being done by "committee requiring unanimous agreement" rather than by the "subagents" part.

Here are the examples I thought about:

1. Same as the mushroom/pepperoni situation, with the same two agents, but now each side can retaliate/hijack the rest of the mind if it doesn't get what it wants. For example, if it starts at pepperoni, the mushroom-preferring agent will hijack the rest of the mind to remove the pepperoni, ending up at cheese. But if the agent starts at the "both" node, it will stay there (because both agents are satisfied). The preference relation can be represented as the original diagram with an extra arrow from pepperoni to cheese. This is still a DAG, and it's still incomplete (in the sense that we can't compare pepperoni vs mushroom), but it's no longer path-dependent, because no matter where we start, we end up at cheese or "both" (I am assuming that toppings-removal can always be done, whereas acquiring new toppings can't).
2. Same as the previous example, except now only the mushroom-preferring agent can retaliate/hijack (because the pepperoni-preferring agent is weak or nice). Now the preferences are pepperoni → cheese → mushroom → both. This is still a DAG, but now the preferences are total, so we can also view it as a (somewhat weird) single agent. A realistic example of this is given by Andrew Critch, where pepperoni=work, cheese=burnout (i.e. neither work nor friendship), mushroom=friendship, and both=friendship-and-work.
3. A modified version of the Zyzzx Prime planet by Scott Alexander. Now whenever we start out at pepperoni, the pepperoni-preferring agent becomes stupid/weak, and loses dominance, so now there are edges from pepperoni to mushroom and "both". (And similarly, mushroom points to both pepperoni and "both".) Now we no longer have a DAG because of the cycle between pepperoni and mushroom.
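
The "still a DAG?" claims above can be checked mechanically by encoding each relation as a directed graph and running a standard cycle check. A quick sketch; the exact edge sets are my reading of the examples, not the original diagrams:

```python
# Encode each preference/transition relation as an adjacency dict
# (an edge X -> Y meaning the system moves from X toward Y) and
# check for cycles with a standard DFS three-coloring.

def has_cycle(graph):
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {node: WHITE for node in graph}

    def visit(node):
        color[node] = GRAY
        for nxt in graph.get(node, []):
            if color[nxt] == GRAY:
                return True  # back edge: cycle found
            if color[nxt] == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color[n] == WHITE and visit(n) for n in graph)

# Example 2 shape (my reading): a total chain, so no cycle.
example2 = {
    "pepperoni": ["cheese"],
    "cheese": ["mushroom"],
    "mushroom": ["both"],
    "both": [],
}

# Example 3 shape (my reading): pepperoni and mushroom point at
# each other, so the relation is no longer a DAG.
example3 = {
    "cheese": ["pepperoni", "mushroom"],
    "pepperoni": ["mushroom", "both"],
    "mushroom": ["pepperoni", "both"],
    "both": [],
}
```

Running `has_cycle` on these returns `False` for the chain and `True` for the Zyzzx-Prime-style relation, matching the claims in the list.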

It seems like when people talk about the human mind being composed of subagents, the deliberation process is not necessarily "committee requiring unanimous agreement", so the resulting preference relations cannot necessarily be represented using path-dependent DAGs.

It also seems like the general framework of viewing systems as subagents (i.e. not restricting to "committee requiring unanimous agreement") is broad enough that it can basically represent any kind of directed graph. On one hand, this is suspicious (if everything can be viewed as a bunch of subagents, then maybe the subagents framework isn't adding anything after all). On the other hand, this suggests that claims of subagents are not really about the resulting behavior/preference ordering of the system, but rather about the internal dynamics of the system.

Comment by riceissa on Conversation with Paul Christiano · 2019-09-12T03:24:50.506Z · score: 5 (3 votes) · LW · GW

like, my views aren’t that internally coherent. My suspicion is others’ views are even less internally coherent.

I would appreciate hearing more concretely what it means to be internally coherent/incoherent (e.g. are there examples of contradictory statements a single person is making?).

Comment by riceissa on Conversation with Paul Christiano · 2019-09-12T03:19:01.941Z · score: 6 (3 votes) · LW · GW

It’s lots of saving throws, you know? And you multiply the saving throws together and things look better. And they interact better than that because– well, in one way worse because it’s correlated: If you’re incompetent, you’re more likely to fail to solve the problem and more likely to fail to coordinate not to destroy the world. In some other sense, it’s better than interacting multiplicatively because weakness in one area compensates for strength in the other. I think there are a bunch of saving throws that could independently make things good, but then in reality you have to have a little bit here and a little bit here and a little bit here, if that makes sense.

I don't understand this part. Translating to math, I think it's saying something like: if p_i is the probability that saving throw i works, then the probability that at least one of them works is 1 − (1 − p_1)(1 − p_2)⋯(1 − p_n) (assuming the saving throws are independent), which is higher the more saving throws there are; but due to correlation, the saving throws are not independent, so we effectively have fewer saving throws. I don't understand what "weakness in one area compensates for strength in the other" or "a little bit here and a little bit here and a little bit here" mean.
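
The arithmetic of my reading can be checked with a few made-up numbers:

```python
# Toy check of the "saving throws" arithmetic: with independent
# saving throws of probabilities p_i, at least one succeeds with
# probability 1 - prod(1 - p_i). The probabilities are made up.
from functools import reduce

def at_least_one(ps):
    return 1 - reduce(lambda acc, p: acc * (1 - p), ps, 1.0)

# Three independent 30% saving throws beat a single 50% one:
print(round(at_least_one([0.3, 0.3, 0.3]), 3))  # 1 - 0.7**3 = 0.657
```

Correlation would push the true number below this independent-case value, which is (I think) the "in one way worse" part of the quote.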

Comment by riceissa on Conversation with Paul Christiano · 2019-09-12T02:26:44.833Z · score: 8 (5 votes) · LW · GW

Christiano cares more about making aligned AIs that are competitive with unaligned AIs, whereas MIRI is more willing to settle for an AI with very narrow capabilities.

Looking at the transcript, it seems like "AI with very narrow capabilities" is referring to the "copy-paste a strawberry" example. It seems to me that the point of the strawberry example (see Eliezer's posts 1, 2, and Dario Amodei's comment here) is that by creating an AGI that can copy and paste a strawberry, we necessarily solve most of the alignment problem. So it isn't the case that MIRI is aiming for an AI with very narrow capabilities (even task AGI is supposed to perform pivotal acts).

Comment by riceissa on Counterfactual Oracles = online supervised learning with random selection of training episodes · 2019-09-11T04:06:18.488Z · score: 1 (1 votes) · LW · GW

Thanks! I think I understand this now.

I will say some things that occurred to me while thinking more about this, and hope that someone will correct me if I get something wrong.

• "Human imitation" is sometimes used to refer to the outward behavior of the system (e.g. "imitation learning", and in posts like "Just Imitate Humans?"), and sometimes to refer to the model of the human inside the system (e.g. here when you say "the human imitation is telling the system what to do").
• A system that is more capable than a human can still be a "human imitation", because "human imitation" is being used in the sense of "modeling humans inside the system" instead of "has the outward behavior of a human".
• There is a distinction between the counterfactual training procedure vs the resulting system. "Counterfactual oracle" (singular) seems to be used to refer to the resulting system, and Paul calls this "the system" in his "Human-in-the-counterfactual-loop" post. "Counterfactual oracles" (plural) is used both as a plural version of the resulting system and also as a label for the general training procedure. "Human-in-the-counterfactual-loop", "counterfactual human oversight", and "counterfactual oversight" all refer to the training procedure (but only when the procedure uses a model of the human).
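
As I understand the training-procedure-vs-resulting-system distinction, the procedure looks roughly like the following sketch. The episode logic is the standard counterfactual oracle setup as I understand it; the function and parameter names are my own:

```python
import random

def run_episode(oracle, human, question, train_prob=0.1):
    # Counterfactual oracle setup: with small probability this is a
    # *training* episode, in which the oracle's answer is discarded
    # and the human's from-scratch answer is used as the supervised
    # target. Otherwise the oracle's answer is used directly and no
    # training occurs, so the oracle's reward never depends on how
    # its output influences the world.
    answer = oracle.predict(question)
    if random.random() < train_prob:
        target = human.answer(question)  # human never sees `answer`
        oracle.update(question, target)  # online supervised update
        return target
    return answer
```

On this reading, "counterfactual oversight" names the whole `run_episode` procedure (when `human` is a model of the human), while "counterfactual oracle" names the trained `oracle` that results.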
Comment by riceissa on Counterfactual Oracles = online supervised learning with random selection of training episodes · 2019-09-11T01:50:13.719Z · score: 1 (1 votes) · LW · GW

Paul Christiano does have a blog post titled Counterfactual oversight vs. training data, which talks about the same thing as this post except that he uses the term "counterfactual oversight", which is just Counterfactual Oracles applied to human imitation (which he proposes to use to "oversee" some larger AI system).

I am having trouble parsing/understanding this part.

• The linked post by Paul doesn't seem to talk about human imitation. Is there a separate post/comment somewhere that connects counterfactual oversight to human imitation, or is the connection to human imitation somehow implicit in the linked post?
• The linked post by Paul seems to be talking about counterfactual oversight as a way to train the counterfactual oracle, but I'm parsing your sentence as saying that there is then a further step where the counterfactual oracle is used to oversee a larger AI system (i.e. the "oversee" in your sentence is different from the "oversight" in "counterfactual oversight"). Is this right?
Comment by riceissa on Iterated Distillation and Amplification · 2019-09-11T01:20:01.080Z · score: 1 (1 votes) · LW · GW

I noticed that I have two distinct "mental pictures" for what the overseer is, depending on how the Distill procedure works (i.e. depending on the narrow technique used in the Distill procedure).

1. For imitation learning and narrow inverse reinforcement learning: a "passive" overseer that just gets used as a template/target for imitation.
2. For narrow reinforcement learning and in discussions about approval-directed agents: an "active" overseer that rates actions or provides rewards.

I wonder if this way of thinking about the overseer is okay/correct, or if I'm missing something (e.g. maybe even in case (1), the overseer has a more active role than I can make out). Assuming this way of thinking about the overseer is okay, it seems like for case (1), the term "overseer" has connotations that extend beyond the role played by the overseer (i.e. it doesn't really provide any oversight since it is passive).

Comment by riceissa on Iterated Distillation and Amplification · 2019-08-30T05:47:18.433Z · score: 4 (2 votes) · LW · GW

Based on discussion between Vladimir Slepnev and Paul in this thread, it seems like statements in this post ("we assume that A[0] can acquire nearly human-level capabilities through this process", "Given an aligned agent H we can use narrow safe learning techniques to train a much faster agent A which behaves as H would have behaved") that the first stage of IDA will produce nearly-human-level assistants are misleading. In the same thread, Paul says that he "will probably correct it", but as far as I can tell, neither the Medium post nor the version of the post in this sequence (which was published after the discussion) has been corrected.

Comment by riceissa on Iterated Distillation and Amplification · 2019-08-30T05:32:47.856Z · score: 2 (2 votes) · LW · GW

I had this same thought, but my understanding (which is not solid) is that in the first iteration, since A is random, H can just ignore A and go with its own output (if my assistants are unhelpful, I can just try to perform the task all on my own). So Amplify(H, A) becomes H, which means A <- Distill(Amplify(H, A)) is basically A <- Distill(H), exactly as you suggested.

Comment by riceissa on Paul's research agenda FAQ · 2019-08-30T05:28:42.345Z · score: 3 (3 votes) · LW · GW

I'm still confused about the difference between HCH and the amplification step of IDA. Initially I thought that the difference is that with HCH, the assistants are other copies of the human, whereas in IDA the assistants are the distilled agents from the previous step (whose capabilities will be sub-human in early stages of IDA and super-human in later stages). However, this FAQ says "HCHs should not be visualized as having humans in the box."

My next guess is that while HCH allows the recursion for spawning new assistants to be arbitrarily deep, the amplification step of IDA only allows a single level of spawning (i.e. the human can spawn new assistants, but the assistants themselves cannot make new assistants). Ajeya Cotra's post on IDA talks about the human making calls to the assistant, but not about the assistants making further calls to other assistants, so it seems plausible to me that the recursive nature of HCH is the difference. Can someone who understands HCH/IDA confirm that this is a difference and/or name other differences?
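
My "single level of spawning" guess can be phrased as a depth limit on the recursion. A toy sketch, with `decompose`/`direct_answer`/`combine` as hypothetical stand-ins (here questions are just integer "sizes" and answers count how many size-1 pieces got handled directly):

```python
def decompose(n):
    # Hypothetical stand-in: split a size-n question into two halves.
    return [n // 2, n - n // 2] if n > 1 else []

def direct_answer(n):
    return 1  # the human handles the question without assistants

def combine(question, answers):
    return sum(answers)

def hch(question, depth=None):
    # Unbounded HCH: assistants can spawn assistants arbitrarily
    # deep (depth=None means no limit).
    subqs = decompose(question)
    if not subqs or depth == 0:
        return direct_answer(question)
    next_depth = None if depth is None else depth - 1
    return combine(question, [hch(q, next_depth) for q in subqs])

def amplify_one_level(question):
    # My guess for one amplification step of IDA: the human may call
    # assistants, but those calls don't recurse further.
    return hch(question, depth=1)
```

On this toy model, unbounded HCH fully decomposes the question while the depth-1 version stops after one round of delegation, which is the difference my guess turns on.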

Comment by riceissa on GreaterWrong Arbital Viewer · 2019-08-19T22:22:53.400Z · score: 5 (3 votes) · LW · GW

The page https://arbital.greaterwrong.com/p/AI_safety_mindset/ is blank in the GreaterWrong version, but has content in the obormot.net version.

Comment by riceissa on AALWA: Ask any LessWronger anything · 2019-07-28T21:06:27.577Z · score: 12 (7 votes) · LW · GW

I was surprised to see, both on your website and the white paper, that you are part of Mercatoria/ICTP (although your level of involvement isn't clear based on public information). My surprise is mainly because you have a couple of comments on LessWrong that discuss why you have declined to join MIRI as a research associate. You have also (to my knowledge) never joined any other rationality-community or effective altruism-related organization in any capacity.

My questions are:

1. What are the reasons you decided to join or sign on as a co-author for Mercatoria/ICTP?
2. More generally, how do you decide which organizations to associate with? Have you considered joining other organizations, starting your own organization, or recruiting contract workers/volunteers to work on things you consider important?
Comment by riceissa on What's the most "stuck" you've been with an argument, that eventually got resolved? · 2019-07-01T05:26:21.804Z · score: 7 (4 votes) · LW · GW

"Sam Harris and the Is–Ought Gap" might be one example.