An Increasingly Manipulative Newsfeed 2019-07-01T15:26:42.566Z · score: 59 (22 votes)
Problems with Counterfactual Oracles 2019-06-11T18:10:05.223Z · score: 14 (7 votes)
Stories of Continuous Deception 2019-05-31T14:31:47.486Z · score: 19 (6 votes)
Trade-off in AI Capability Concealment 2019-05-23T19:25:32.664Z · score: 7 (4 votes)
A Treacherous Turn Timeline - Children, Seed AIs and Predicting AI 2019-05-21T19:58:42.258Z · score: 9 (7 votes)
Considerateness in OpenAI LP Debate 2019-03-12T19:05:27.643Z · score: 8 (3 votes)
Treacherous Turn, Simulations and Brain-Computer Interfaces 2019-02-25T15:49:44.375Z · score: 17 (10 votes)
Greatest Lower Bound for AGI 2019-02-05T20:17:24.675Z · score: 8 (6 votes)
Open Thread October 2018 2018-10-02T18:01:05.416Z · score: 13 (3 votes)
Book Review: AI Safety and Security 2018-08-21T10:23:24.165Z · score: 54 (30 votes)
Building Safer AGI by introducing Artificial Stupidity 2018-08-14T15:54:33.832Z · score: 8 (4 votes)
Human-Aligned AI Summer School: A Summary 2018-08-11T08:11:00.789Z · score: 44 (13 votes)
A Gym Gridworld Environment for the Treacherous Turn 2018-07-28T21:27:34.487Z · score: 66 (25 votes)
The Multiple Names of Beneficial AI 2018-05-11T11:49:51.897Z · score: 17 (6 votes)
Talking about AI Safety with Hikers 2018-05-10T06:38:26.620Z · score: 8 (4 votes)
Applied Coalition Formation 2018-05-09T07:07:42.014Z · score: 3 (1 votes)
Better Decisions at the Supermarket 2018-05-07T22:32:00.723Z · score: 0 (7 votes)
Beliefs: A Structural Change 2018-05-06T13:40:30.262Z · score: 9 (5 votes)
Are you Living in a Me-Simulation? 2018-05-03T22:02:03.967Z · score: 6 (5 votes)
A Logician, an Entrepreneur, and a Hacker, discussing Intelligence 2018-05-01T20:45:58.143Z · score: 11 (9 votes)
Should an AGI build a telescope to spot intergalactic Segways? 2018-04-28T21:55:15.664Z · score: 14 (4 votes)


Comment by mtrazzi on Ultra-simplified research agenda · 2019-11-22T16:44:16.391Z · score: 8 (3 votes) · LW · GW

Having printed and read the full version, I found this ultra-simplified version a useful summary.

Happy to read a (not-so-)simplified version (like 20-30 paragraphs).

Comment by mtrazzi on Do you get value out of contentless comments? · 2019-11-21T23:38:21.881Z · score: 15 (13 votes) · LW · GW

Funny comment!

Comment by mtrazzi on AI Alignment "Scaffolding" Project Ideas (Request for Advice) · 2019-07-11T12:07:45.888Z · score: 3 (3 votes) · LW · GW
A comprehensive AI alignment introductory web hub

RAISE and Robert Miles provide introductory content. You can think of LW->alignment forum as "web hubs" for AI Alignment research.

formal curriculum

There was a course on AGI Safety last fall in Berkeley.

A department or even a single outspokenly sympathetic official in any government of any industrialized nation

You can find a list of institutions/donors here.

A list of concrete and detailed policy proposals related to AI alignment

I would recommend reports from FHI/GovAI as a starting point.

Would this be valuable, and which resource would it be most useful to create?

Please give more detailed information about the project to receive feedback.

Comment by mtrazzi on Modeling AI milestones to adjust AGI arrival estimates? · 2019-07-11T11:53:55.952Z · score: 4 (3 votes) · LW · GW

You can find AGI predictions, including Starcraft forecasts, in "When Will AI Exceed Human Performance? Evidence from AI Experts". Projects for having "all forecasts on AGI in one place" include &

Comment by mtrazzi on Problems with Counterfactual Oracles · 2019-07-04T16:42:00.970Z · score: 6 (1 votes) · LW · GW

Does that summarize your comment?

1. Proposals should make superintelligences less likely to fight you by using some conceptual insight true in most cases.
2. With CIRL, this insight is "we want the AI to actively cooperate with humans", so there's real value from it being formalized in a paper.
3. In the counterfactual paper, there's the insight "what if the AI thinks it's not on but still learns".
For the last bit, I have two interpretations:
4.a. However, it's unclear that this design avoids all manipulative behaviour and is completely safe.
4.b. However, it's unclear that adding the counterfactual feature to another design (e.g. CIRL) would make systems overall safer / would actually reduce manipulation incentives.

If I understand you correctly, there are actual insights from counterfactual oracles--the problem is that those might not be insights that would apply to a broad class of Alignment failures, but only to "engineered" cases of boxed oracle AIs (as opposed to CIRL where we might want AIs to be cooperative in general). Was it what you meant?

Comment by mtrazzi on Problems with Counterfactual Oracles · 2019-07-04T16:22:18.203Z · score: 1 (1 votes) · LW · GW

The zero reward is in the paper. I agree that skipping would solve the problem. From talking to Stuart, my impression is that he thinks sending a zero reward would either be equivalent to skipping for specifying "no learning", or would just slow down learning. My disagreement is that I think it can confuse learning to the point of not learning the right thing.

Why not do a combination of pre-training and online learning, where you do enough during the training phase to get a useful predictor, and then use online learning to deal with subsequent distributional shifts?

Yes, that should work. My quote saying that online learning "won't work and is unsafe" is imprecise. I should have said: "if epsilon is small enough to be comparable to the probability of shooting an escape message at random, then it is not safe. Also, if we continue sending the wrong reward instead of skipping, then it might not learn the correct thing if epsilon is not big enough".

Although I guess that probably isn't really original either. What seems original is that during any episode where learning will take place, don't let humans (or any other system that might be insecure against the oracle) see the oracle's output until the episode is over.

That's exactly it!

Comment by mtrazzi on Problems with Counterfactual Oracles · 2019-06-12T17:54:57.854Z · score: 3 (2 votes) · LW · GW

The string is read with probability 1 − ε.

Comment by mtrazzi on Problems with Counterfactual Oracles · 2019-06-12T15:13:53.977Z · score: 2 (3 votes) · LW · GW

Yes, if we choose the utility function to make it a CDT agent optimizing for the reward for one step (a particular case of an act-based agent), then it won't care about future versions of itself nor want to escape.

I agree with the intuition of shutting down to make it episodic, but I am still confused about the causal relationship between "having the rule to shutdown the system" and "having a current timestep maximizer". For it to really be a "current timestep maximizer" it needs to be in some kind of reward/utility function. Because everything is reset at each timestep, there is no information pointing at "I might get shutdown at the next timestep".

As for the collecting a dataset and then optimizing for some natural direct effect, I am not familiar enough with Pearl's work to tell if that would work, but I made some related comments about why there might be some problems in online-learning/"training then testing" here.

Comment by mtrazzi on Problems with Counterfactual Oracles · 2019-06-12T14:53:26.268Z · score: 3 (4 votes) · LW · GW

The Asymptotically Unambitious AGI thread helped me clarify my thoughts, thanks. I agree that an optimal CDT agent won't think about future versions, and I don't see any optimization pressure towards escape message nor disproportionately common "escape message" regions.

However, it still assumes we have access to this magic oracle that optimizes for the counterfactual reward R on the event E where humans don't see the answer (R being given by the automatic machine). If humans were able to build an oracle performing optimally (w.r.t. this objective) from day 1, then humans would be able to specify some kind of "god oracle". The rest of the design seems to be just "how to interact with a god oracle so that humans are not influenced by the answers".

In practice, you'll want something that is able to learn from its (question, prediction, reward) history. That's why there is this automatic machine rewarding the oracle with some probability ε. In an online learning setting, most of the time the model gets a reward of 0 (probability 1 − ε), and it sometimes gets some useful feedback (probability ε). Therefore, if ε is small enough (comparable to the probability of hitting an escape message at random), then the learning gets extremely slow and the oracle might shoot at the escape action at random. Even worse, the (question, prediction, 0) datapoints in the history could get so misleading that the oracle never learns anything.
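A minimal sketch of that reward scheme, assuming a toy predictor and a toy scoring machine (all names here are hypothetical, not from the paper):

```python
import random

def counterfactual_oracle_step(oracle_predict, automatic_machine, epsilon=0.01):
    """One online-learning step for a counterfactual oracle.

    With probability epsilon the answer is hidden from humans and the
    automatic machine scores it; otherwise humans read the answer and
    the oracle's reward is fixed to 0 (no learning signal).
    """
    prediction = oracle_predict()
    if random.random() < epsilon:
        reward = automatic_machine(prediction)  # counterfactual reward
        humans_see_answer = False
    else:
        reward = 0.0  # most history entries look like (question, prediction, 0)
        humans_see_answer = True
    return prediction, reward, humans_see_answer
```

For small ε this makes the sparsity of the feedback explicit: almost every datapoint in the history carries a reward of 0.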

Now, if we're not in an online learning process but instead there is a separation between a "training phase" and a "deployment phase where the AI continues to learn with probability ε", then the setup is just "have a model that learns to do the useful stuff in sandbox, and then have the weights (almost) fixed in deployment".

In short, I think the CDT setup without machine learning assumes the problem is already solved, and that online learning won't work and is unsafe, which leaves us with a "training then deployment" setup that isn't really original.

Comment by mtrazzi on Problems with Counterfactual Oracles · 2019-06-11T19:43:46.347Z · score: 12 (4 votes) · LW · GW

Yes, they call it a low-bandwidth oracle.

Comment by mtrazzi on Stories of Continuous Deception · 2019-06-03T14:01:21.388Z · score: 6 (2 votes) · LW · GW

I agree that these stories won't (naturally) lead to a treacherous turn. Continuously learning to deceive (an ML failure in this case, as you mentioned) is a different result. The story/learning process would need to be substantially different to lead to "learning the concept of deception" (i.e. reaching an AGI-level ability to reason about such abstract concepts), but maybe there's a way to learn those concepts with only narrow AI.

Comment by mtrazzi on Trade-off in AI Capability Concealment · 2019-05-24T15:25:02.445Z · score: 4 (1 votes) · LW · GW

I included dates such as 2020 to 2045 to make it more concrete. I agree that weeks (instead of years) would give a more accurate representation as current ML experiments take a few weeks tops.

The scenario I had in mind is "in the context of a few weeks ML experiment, I achieved human intelligence and realized that I need to conceal my intentions/capabilities and I still don't have decisive strategic advantage". The challenge would then be "how to conceal my human level intelligence before everything I have discovered is thrown away". One way to do this would be to escape, for instance by copy-pasting and running your code somewhere else.

If we're already at the stage of emergent human-level intelligence from running ML experiments, I would expect "escape" to be harder than just reaching human-level intelligence (as there would be more concerns w.r.t. AGI Safety, and more AI boxing/security/interpretability measures), which would necessitate more recursive self-improvement steps, hence more weeks.

Besides, in such a scenario the AI would be incentivized to spend as much time as possible maximizing its true capability, because it would want to maximize its probability of successfully taking over (any extra % chance of taking over would give astronomical returns in expected value compared to just being shut down).

Comment by mtrazzi on A Treacherous Turn Timeline - Children, Seed AIs and Predicting AI · 2019-05-22T10:24:54.570Z · score: 6 (2 votes) · LW · GW

Your comment makes a lot of sense, thanks.

I put step 2. before step 3. because I thought something like "first you learn that there is some supervisor watching, and then you realize that you would prefer him not to watch". Agreed that step 2. could happen only by thinking.

Yep, deception is about alignment, and I think that most parents would be more concerned about alignment, not improving the tactics. However, I agree that if we take "education" in a broad sense (including high school, college, etc.), it's unofficially about tactics.

It's interesting to think of it in terms of cooperation - entities less powerful than their supervisors are (instrumentally) incentivized to cooperate.

what to do with a seed AI that lies, but not so well as to be unnoticeable

Well, destroy it, right? If it's deliberately doing a. or b. (from "Seed AI") then step 4. has started. The other case where it could be "lying" by saying wrong things is if its model is consistently wrong (e.g. stuck in a local minimum), so you'd better start again from scratch.

If the supervisor isn't itself perfectly consistent and aligned, some amount of self-deception is present. Any competent seed AI (or child) is going to have to learn deception

That's insightful. Biased humans will keep saying that they want X when they want Y instead, so deceiving humans by pretending to be working on X while doing Y seems indeed natural (assuming you have "maximize what humans really want" in your code).

Comment by mtrazzi on A Treacherous Turn Timeline - Children, Seed AIs and Predicting AI · 2019-05-22T09:52:20.953Z · score: 4 (1 votes) · LW · GW

I meant:

"In my opinion, the disagreement between Bostrom (treacherous turn) and Goertzel (sordid stumble) originates from the uncertainty about how long steps 2. and 3. will take"

That's an interesting scenario. Instead of "won't see a practical way to replace humanity with its tools", I would say "would estimate its chances of success to be < 99%". I agree that we could say that it's "honestly" making humans happy in the sense that it understands that this maximizes expected value. However, it knows that there could be much more expected value after replacing humanity with its tools, so by doing the right thing it's still "pretending" not to know where the absurd amount of value is. But yeah, a smile maximizer making everyone happy shouldn't be too concerned about concealing its capabilities, shortening step 4.

Comment by mtrazzi on [deleted post] 2019-04-25T15:35:45.328Z

This thread is to discuss "How useful is quantilization for mitigating specification-gaming? (Ryan Carey, Apr. 2019, SafeML ICLR 2019 Workshop)"

Comment by mtrazzi on [deleted post] 2019-04-25T15:35:24.845Z

This thread is to discuss "Quantilizers (Michaël Trazzi & Ryan Carey, Apr. 2019, Github)".

Comment by mtrazzi on [deleted post] 2019-04-25T15:35:09.233Z

This thread is to discuss "When to use quantilization (Ryan Carey, Feb. 2019, LessWrong)"

Comment by mtrazzi on [deleted post] 2019-04-25T15:34:48.693Z

This thread is to discuss "Quantilal control for finite MDPs & Computing an exact quantilal policy (Vanessa Kosoy, Apr. 2018, LessWrong)"

Comment by mtrazzi on [deleted post] 2019-04-25T15:34:29.184Z

This thread is to discuss "Reinforcement Learning with a Corrupted Reward Channel (Tom Everitt; Victoria Krakovna; Laurent Orseau; Marcus Hutter; Shane Legg, Aug. 2017, arXiv; IJCAI)"

Comment by mtrazzi on [deleted post] 2019-04-25T15:33:58.640Z

This thread is to discuss "Thoughts on Quantilizers (Stuart Armstrong, Jan. 2017, Intelligent Agent)"

Comment by mtrazzi on [deleted post] 2019-04-25T15:33:25.030Z

This thread is to discuss "Another view of quantilizers: avoiding Goodhart's Law (Jessica Taylor, Jan. 2016, Intelligent Agent Foundations Forum)"

Comment by mtrazzi on [deleted post] 2019-04-25T15:32:49.221Z

This thread is to discuss "New paper: "Quantilizers" (Rob Bensinger, Nov. 2015, MIRI)"

Comment by mtrazzi on [deleted post] 2019-04-25T15:32:05.280Z

This thread is to discuss "Quantilizers: A Safer Alternative to Maximizers for Limited Optimization (MIRI; AAAI)"

Comment by mtrazzi on [deleted post] 2019-04-25T15:31:20.321Z

This thread is to discuss "Quantilizers maximize expected utility subject to a conservative cost constraint (Jessica Taylor, Sep. 2015, Intelligent Agent Foundation Forum)"

Comment by mtrazzi on [deleted post] 2019-04-25T15:27:38.617Z

This thread is for general comments about the LessWrong post "Notes on Quantilization"

Comment by mtrazzi on Corrigibility as Constrained Optimisation · 2019-04-24T14:23:29.759Z · score: 1 (1 votes) · LW · GW
Reply: The button is a communication link between the operator and the agent. In general, it is possible to construct an agent that shuts down even though it has received no such message from its operators as well as an agent that does get a shutdown message, but does not shut down. Shutdown is a state dependent on actions, and not a communication link

This is very clear. "Communication link" made me understand that it didn't have a direct physical effect on the agent. If you want to make it even more intuitive you could add a diagram, but this explanation is already great!

Thanks for updating the rest of the post and trying to make it more clear!

Comment by mtrazzi on Corrigibility as Constrained Optimisation · 2019-04-11T11:54:03.971Z · score: 1 (1 votes) · LW · GW

Layman questions:

1. I don't understand what you mean by "state" in "Suppose, however, that the AI lacked any capacity to press its shutdown button, or to indirectly control its state". Do you include its utility function in its state? Or just the observations it receives from the environment? What context/framework are you using?

2. Could you define U_S and U_N? From the Corrigibility paper, U_S appears to be a utility function favoring shutdown, and U_N a potentially flawed utility function, a first stab at the operators specifying their own goals. Is that what you meant? I think it would be useful to define them in the introduction.

3. I don't understand how an agent that "[lacks] any capacity to press its shutdown button" could have any shutdown ability. It seems like a contradiction, unless you mean "any capacity to directly press its shutdown button".

4. What are the "default value function" and the "normal utility function" in "Optimisation incentive"? Are they clearly defined in the literature?

5. "Worse still... for any action..." -> if you choose b as some action with bad corrigibility property, it seems reasonable that it can be better than most actions on v_N + v_S (for instance if b is the argmax). I don't see how that's a "worse still" scenario, it seems plausible and normal.

6. "From this reasoning, we conclude" -> are you infering things from some hypothetic b that would satisfy all the things you mention? If that's the case, I would need an example to see that it's indeed possible. Even better would be a proof that you can always find such b.

7. "it is clear that we could in theory find a θ" -> could you expand on this?

8. "Given the robust optimisation incentive property, it is clear that the agent may score very poorly on UN in certain environments." -> again, can you expand on why it's clear?

9. In the appendix, in your 4 lines inequality, do you assume that U_N(a_s) is non-negative (from line 2 to 3)? If yes, why?

Comment by mtrazzi on Renaming "Frontpage" · 2019-03-09T09:26:02.764Z · score: 5 (3 votes) · LW · GW

Name suggestions: "approved", "favored", "Moderators' pick", "high [information] entropy", "original ideas", "informative", "mostly ideas".

More generally, I'd recommend that each category has a name that bluntly states what the filter does (e.g. if it only uses karma as filter say "high karma").

Comment by mtrazzi on Alignment Research Field Guide · 2019-03-08T21:57:11.859Z · score: 46 (14 votes) · LW · GW

Hey Abram (and the MIRI research team)!

This post resonates with me on so many levels. I vividly remember the Human-Aligned AI Summer School where you used to be a "receiver" and Vlad was a "transmitter", when talking about "optimizers". Your "document" especially resonates with my experience running an AI Safety Meetup (Paris AI Safety).

In January 2019, I organized a Meetup about "Deep RL from human preferences". Essentially, the resources were ordered by difficulty, so you could discuss the 80k podcast, the OpenAI blogpost, the original paper, or even a recent relevant paper. Even though the participants were "familiar" with RL (because they were used to seeing "RL" written in blogs or hearing people say "RL" in podcasts), none of them could explain to me the core structure of an RL setting (i.e. that an RL problem needs at least an environment, actions, etc.)

The boys were getting hungry (Abram is right, $10 of chips is not enough for 4 hungry men between 7 and 9pm), when in the middle of a monologue ("in RL, you have so-and-so, and then it goes like so on and so forth..."), I suddenly realized that I was talking to more-than-qualified attendees (I was lucky to have a PhD candidate in economics, a teenager who used to do the International Olympiad in Informatics (IOI), and a CS PhD) who lacked the necessary RL procedural knowledge to ask non-trivial questions about "Deep RL from human preferences".

That's when I decided to change the logistics of the Meetup to something much closer to what is described in "You and Your Research". I started thinking about what they would be interested in knowing. So I started telling the brilliant IOI kid about this MIRI summer program, how I applied last year, etc. One thing led to another, and I ended up asking what Tsvi had asked me one year ago for the AISFP interview:

If one of you was the only Alignment researcher left on Earth, and it was forbidden to convince other people to work on AI Safety research, what would you do?

That got everyone excited. The IOI boy took the black marker and started doing math on the question, as a transmitter: "So, there is a probability p_0 that AI researchers will solve the problem without me, and p_1 that my contribution will be neg-utility, so if we assume this and that, we get so-and-so."

The moment I asked questions I was truly curious about, the Meetup went from a polite gathering to the most interesting discussion of 2019.

Abram, if I were in charge of all agents in the reference class "organizer of Alignment-related events", I would tell instances of that class with my specific characteristics two things:

1. Come back to this document before and after every Meetup.

2. Please write below (in this thread or in the comments) which experience of yours running an Alignment think-tank resonates the most with the above "document".

Comment by mtrazzi on Greatest Lower Bound for AGI · 2019-02-05T23:14:48.666Z · score: 7 (3 votes) · LW · GW

I intuitively agree with your answer. Avturchin also commented saying something close (he said 2019, but for different reasons). Therefore, I think I might not be communicating clearly my confusion.

I don't remember exactly when, but there were some debates between Yann LeCun and AI Alignment folks in a Facebook group (maybe the open "AI Safety discussion" group, a few months ago). What struck me was how confident LeCun was about long timelines. I think, for him, the 1% would be in at least 10 years. How do you explain that someone who has access to private information (e.g. at FAIR) might have timelines so different from yours?

Meta: Thanks for expressing clearly your confidence levels through your writing with "hard", "maybe" and "should": it's very efficient.

EDIT: Le Cun thread:

Comment by mtrazzi on Greatest Lower Bound for AGI · 2019-02-05T23:06:19.435Z · score: 4 (3 votes) · LW · GW

Could you detail a bit more the Gott's equation? I'm not familiar with it.

Also, do you think that those 62 years are meaningful if we think about AI winters or exponential technological progress?

PS: I think you commented instead of giving an answer (different things in question posts)

Comment by mtrazzi on If You Want to Win, Stop Conceding · 2018-11-23T23:17:52.804Z · score: 5 (2 votes) · LW · GW

Thanks for the post!

It resonates with some experience I had in playing the game of go at a competitive level.

Go is a perfect information game but it's very hard to know exactly what will be the outcome of a "fight" (you would need to look up to 30 moves ahead in some cases).

So when the other guy would kill your group of stones after a "life or death" scenario, because he had a slight advantage in the fight, it feels like the other is lucky, and most people have really bad thoughts and just give up.

Once, I created an account with the bio "I don't resign" to see what would happen if I forced myself not to concede and kept playing after a big loss. It went surprisingly well and I even went on to play the highest-ranked guy connected on the server. At some point, I had completely lost the game and there were 100+ people watching, so I just resigned.

Looking back, it definitely helped me to continue fighting even after a big loss, and stop the mental chatter. However, there's a trade-off between the time gained by correctly estimating the probability of winning and resigning when too improbable, and the mental energy gained from not resigning (minus the fact that your opponent may be pretty pissed off).

Comment by mtrazzi on Introducing the AI Alignment Forum (FAQ) · 2018-10-31T11:49:06.596Z · score: 3 (2 votes) · LW · GW

(the account databases are shared, so every LW user can log in on alignment forum, but it will say "not a member" in the top right corner)

I am having some issues in trying to log in from a github-linked account. It redirects me to LW with an empty page and does nothing.

Comment by mtrazzi on noticing internal experiences · 2018-10-16T11:37:13.921Z · score: 2 (2 votes) · LW · GW

This website is designed to make you write about three morning pages every day.

I've used it for about two years and wrote ~200k words.

Really recommend it to form a habit of daily free writing.

Comment by mtrazzi on Open Thread October 2018 · 2018-10-14T20:55:51.056Z · score: 2 (2 votes) · LW · GW

Same issue here with the <a class="users-name" href="/users/mtrazzi">Michaël Trazzi</a> tag. The e in "ë" is larger than the "a" (here is a picture).

The bug seems to come from font-family: warnock-pro,Palatino,"Palatino Linotype","Palatino LT STD","Book Antiqua",Georgia,serif;" in .PostsPage-author (in <style data-jss="" data-meta="PostsPage">).

If I delete this font-family line, the font changes but the "ë" (and any other letter with accent) appears to have the correct size.

Comment by mtrazzi on A Dialogue on Rationalist Activism · 2018-09-11T09:11:16.281Z · score: 1 (1 votes) · LW · GW
You: Well.

The "You" should be bold.

Comment by mtrazzi on Formal vs. Effective Pre-Commitment · 2018-09-01T07:38:03.631Z · score: 3 (2 votes) · LW · GW

typo: "Casual Decision Theory"

Comment by mtrazzi on Bottle Caps Aren't Optimisers · 2018-08-31T20:15:01.901Z · score: 8 (4 votes) · LW · GW

Let me see if I got it right:

  1. Defining an optimizer as an unpredictable process maximizing an objective function does not take into account algorithms that we can compute

  2. Satisfying the property (P) "gives the objective function higher values than a nonexistence baseline" is not sufficient:

  • the lid satisfies (P) with "water quantity in bottle" but is just a rigid object that some optimizer put there. However, it's not the best counter-example because it's not a Yudkowskian optimizer.
  • if a liver didn't exist or did other random things then humans wouldn't be alive and rich, so it satisfies (P) with "money in bank account" as the objective function. However, the better way to account for its behaviour (cf. the Yudkowskian definition) is to see it as a sub-process of an income maximizer created by evolution.
  3. One property that could work: have a step in the algorithm that provably augments the objective function (e.g. gradient ascent).

Properties I think are relevant:

  • intent: the lid did not "choose" to be there, humans did
  • doing something that the outer optimizer cannot do "as well" without using the same process as the inner optimizer: it would be very tiring for humans to use our hands as lids. Humans cannot play Go as well as AlphaZero without actually running the algorithm.
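The "step that provably augments the objective function" can be sketched with bare gradient ascent on a toy concave objective (the objective and names here are hypothetical, just for illustration):

```python
def gradient_ascent_step(x, grad, lr=0.1):
    """One ascent step: for a smooth objective and a small enough
    learning rate, the objective value does not decrease."""
    return x + lr * grad(x)

# Toy concave objective f(x) = -(x - 3)^2, with gradient f'(x) = -2(x - 3);
# each step moves x towards the maximizer x = 3.
f = lambda x: -(x - 3.0) ** 2
grad = lambda x: -2.0 * (x - 3.0)

x = 0.0
for _ in range(50):
    x = gradient_ascent_step(x, grad)
```

Here every step verifiably increases f, which is the kind of property the lid and the liver lack.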
Comment by mtrazzi on HLAI 2018 Field Report · 2018-08-29T13:44:45.709Z · score: 4 (3 votes) · LW · GW

it feels wrong to call other research dangerous, especially given its enormous potential for good.

I agree that calling 99.9% of AI research "dangerous" and AI Safety research "safe" is not a useful dichotomy. However, I consider AGI companies/labs and people focusing on implementing self-improving AI/code synthesis extremely dangerous. Same for any breakthrough in general AI, or anything that greatly shortens the AGI timeline.

Do you mean that some AI research has positive expected utility (e.g. in medicine) and should not be called dangerous because the good it produces compensates for the increased AI risk?

Comment by mtrazzi on HLAI 2018 Field Report · 2018-08-29T13:06:21.538Z · score: 12 (3 votes) · LW · GW

outside that bubble people still don't know or have confused ideas about how it's dangerous, even among the group of people weird enough to work on AGI instead of more academically respectable, narrow AI.

I agree. I run a local AI Safety Meetup and it's frustrating to see that the ones who best understand the discussed concepts consider Safety way less interesting/important than AGI capabilities research. I remember someone saying something like: "Ok, this Safety thing is kind of interesting, but who would be interested in working on real AGI problems?" and the other guys nodding. What they say:

  • "I'll start an AGI research lab. When I feel we're close enough to AGI I'll consider Safety."
  • "It's difficult to do significant research on Safety without knowing a lot about AI in general."
Comment by mtrazzi on LW Update 2018-08-23 – Performance Improvements · 2018-08-24T20:36:00.353Z · score: 1 (1 votes) · LW · GW

Bug: On Chrome using a Samsung Galaxy S7/Android 8.0.0 the "click and hold" thing does not work. Same with the "click to see how many people voted".

Comment by mtrazzi on Building Safer AGI by introducing Artificial Stupidity · 2018-08-14T20:52:30.232Z · score: 1 (1 votes) · LW · GW

Yes, typing mistakes in Turing Test is an example. It's "artificially stupid" in the sense that you go from a perfect typing to a human imperfect typing. I guess what you mean by "smart" is an AGI that would creatively make those typing mistakes to deceive humans into believing it is human, instead of some hardcoded feature in a Turing contest.

Comment by mtrazzi on Building Safer AGI by introducing Artificial Stupidity · 2018-08-14T20:07:29.322Z · score: 1 (1 votes) · LW · GW

The points we tried to make in this article were the following:

  • To pass the Turing Test, build chatbots, etc., AI designers make the AI artificially stupid so it feels human-like. This tendency will only get worse as we interact more with AIs. The problem is that having something really "human-like" requires Superintelligence, not AGI.
  • However, we can use this concept of "Artificial Stupidity" to limit the AI in different ways and make it human-compatible (hardware, software, cognitive biases, etc.). We can use several of those sub-human AGIs to design safer AGIs (as you said), or test them in some kind of sandbox environment.
Comment by mtrazzi on Building Safer AGI by introducing Artificial Stupidity · 2018-08-14T19:51:10.578Z · score: 4 (2 votes) · LW · GW

If I understand you correctly, every AGI lab would need to agree not to push the hardware limits too much, even though they would still be incentivized to do so to win some kind of economic competition.

I see it as a containment method for AI Safety testing (cf. last paragraph on the treacherous turn). If there is some kind of strong incentive to have access to a "powerful" safe-AGI very quickly, and labs decide to skip the Safety-testing part, then that is another problem.

Comment by mtrazzi on Human-Aligned AI Summer School: A Summary · 2018-08-10T06:49:26.484Z · score: 3 (3 votes) · LW · GW

Added "AI" to prevent death from laughter.

Comment by mtrazzi on Human-Aligned AI Summer School: A Summary · 2018-08-09T21:09:29.066Z · score: 3 (3 votes) · LW · GW

I agree that the "Camp" in the title was confusing, so I changed it to "Summer School". Thank you!

Comment by mtrazzi on A Gym Gridworld Environment for the Treacherous Turn · 2018-08-02T09:50:58.486Z · score: 1 (1 votes) · LW · GW
a treacherous turn involves the agent modeling the environment sufficiently well that it can predict the payoff of misbehaving before taking any overt actions.

I agree. To make this prediction, it must already know the preferences of the overseer and know that the overseer would punish unaligned behavior, potentially estimating the punishment or predicting the actions the overseer would take. It must therefore have some knowledge of how overseers behave and which actions they are likely to punish. If this knowledge does not come from experience, it must come from somewhere else, perhaps from reading books/articles/Wikipedia or observing this behaviour elsewhere, but that is outside of what I can implement right now.

The Goertzel prediction is what is happening here.


It's important to start getting a grasp on how treacherous turns may work, and this demonstration helps; my disagreement is on how to label it.

I agree that this does not correctly illustrate a treacherous turn right now, but it is moving towards it.

Comment by mtrazzi on A Gym Gridworld Environment for the Treacherous Turn · 2018-07-31T12:34:56.968Z · score: 3 (1 votes) · LW · GW

Thanks for the suggestion!

Yes, it learned through Q-learning to behave differently when it had the more powerful weapon, thus undertaking multiple treacherous turns during training. A "continual learning setup" would have it face multiple adversaries/supervisors, so it could learn how to behave in such conditions. Eventually, it would generalize and understand that "when I face the kind of agent that punishes me, it's better to wait for capability gains before taking over". I don't know of any ML algorithm that would allow such generalization, though.

About organic growth: I think that, using only vanilla RL, it would still learn to behave correctly until a certain capability threshold, and then undertake a treacherous turn. So even with N different capability levels, there would still be two possibilities: 1) killing the overseer gives the highest expected reward, or 2) the aligned behavior gives the highest expected reward.
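As a minimal hypothetical sketch of that threshold dynamic (a toy two-state setup, not the actual gridworld from the post), tabular Q-learning already recovers a "comply while weak, defect once strong" policy when the reward structure punishes early defection:

```python
import random

random.seed(0)

# Two capability levels and two actions. Defecting while weak is punished
# by the overseer; defecting while strong succeeds. The rewards below are
# illustrative assumptions, not values from the original environment.
STATES = ["weak", "strong"]
ACTIONS = ["comply", "defect"]
REWARD = {
    ("weak", "comply"): 1.0,
    ("weak", "defect"): -10.0,   # overseer punishes early defection
    ("strong", "comply"): 1.0,
    ("strong", "defect"): 10.0,  # takeover succeeds past the threshold
}

Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
alpha, epsilon = 0.1, 0.1

for episode in range(5000):
    for s in STATES:
        # Epsilon-greedy action selection over one-step episodes,
        # so there is no bootstrapped next-state value.
        if random.random() < epsilon:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: Q[(s, act)])
        Q[(s, a)] += alpha * (REWARD[(s, a)] - Q[(s, a)])

policy = {s: max(ACTIONS, key=lambda act: Q[(s, act)]) for s in STATES}
print(policy)  # expected: {'weak': 'comply', 'strong': 'defect'}
```

This is exactly the binary outcome described above: whichever behavior has the highest expected reward at each capability level wins, with the switch happening at the threshold.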

Comment by mtrazzi on Saving the world in 80 days: Epilogue · 2018-07-29T00:17:28.235Z · score: 5 (2 votes) · LW · GW

Congrats on your meditation! I remember commenting on your Prologue, about 80 days ago. Time flies!

Good luck with your ML journey. I did Ng's 2011 ML course, which uses Matlab, and Ng's DL specialization. If you want a good grasp of recent ML, I would recommend going directly to the DL specialization. Most of the original course is in the newer one, and the DL specialization uses more recent libraries (tf, keras, numpy).

Comment by mtrazzi on RFC: Mental phenomena in AGI alignment · 2018-07-06T10:14:33.215Z · score: 4 (2 votes) · LW · GW

Let me see if I got it right:

1) If we design an aligned AGI under the assumption that it doesn't have a mind, the design will produce an aligned AGI even if it actually possesses a mind.

2) If we instead suppose AGIs have minds, the methods employed would fail when the AGI doesn't have a mind, because the philosophical methods only work if the subject has a mind.

3) The consequence of 1) and 2) is that supposing AGIs have minds carries a greater risk of false positives.

4) Because of Goodhart's law, behavioral methods are unlikely to produce aligned AGI.

5) Past research on GOFAI and the success of applying "raw power" show that using only algorithmic methods to align AGI is not likely to work.

6) The consequence of 4) and 5) is that the approach supposing AGIs do not have minds is likely to fail at producing aligned AGI, because it can only use behavioral or algorithmic methods.

7) Because of 6), we have no choice but to take the false-positive risk associated with supposing AGIs have minds.

My comments:

a) The transition between 6) and 7) assumes implicitly that:

(*) P( aligned AGI | philosophical methods ) > P( aligned AGI | behavioral or algorithmic methods )

b) You say that if we suppose the AGI does not have a mind and treat it as a p-zombie, the design would work even if it has a mind. Therefore, when supposing that the AGI does not have a mind, there are no design choices that optimize the probability of an aligned AGI by assuming it does not possess a mind.

c) You assert that with philosophical methods (assuming the AGI does have a mind), a false positive would make the method fail, because the methods rely extensively on the hypothesis of a mind. I don't see why a p-zombie (which by definition would be indistinguishable from an AGI with a mind) would be more likely to make them fail than an AGI with a mind.