Posts

A reductio ad absurdum for naive Functional/Computational Theory-of-Mind (FCToM). 2020-01-02T17:16:35.566Z · score: 4 (5 votes)
A list of good heuristics that the case for AI x-risk fails 2019-12-02T19:26:28.870Z · score: 23 (17 votes)
What I talk about when I talk about AI x-risk: 3 core claims I want machine learning researchers to address. 2019-12-02T18:20:47.530Z · score: 28 (16 votes)
A fun calibration game: "0-hit Google phrases" 2019-11-21T01:13:10.667Z · score: 7 (4 votes)
Can indifference methods redeem person-affecting views? 2019-11-12T04:23:10.011Z · score: 11 (4 votes)
What are the reasons to *not* consider reducing AI-Xrisk the highest priority cause? 2019-08-20T21:45:12.118Z · score: 30 (8 votes)
Project Proposal: Considerations for trading off capabilities and safety impacts of AI research 2019-08-06T22:22:20.928Z · score: 34 (17 votes)
False assumptions and leaky abstractions in machine learning and AI safety 2019-06-28T04:54:47.119Z · score: 23 (6 votes)
Let's talk about "Convergent Rationality" 2019-06-12T21:53:35.356Z · score: 28 (9 votes)
X-risks are tragedies of the commons 2019-02-07T02:48:25.825Z · score: 9 (5 votes)
My use of the phrase "Super-Human Feedback" 2019-02-06T19:11:11.734Z · score: 13 (8 votes)
Thoughts on Ben Garfinkel's "How sure are we about this AI stuff?" 2019-02-06T19:09:20.809Z · score: 25 (12 votes)
The role of epistemic vs. aleatory uncertainty in quantifying AI-Xrisk 2019-01-31T06:13:35.321Z · score: 14 (8 votes)
Imitation learning considered unsafe? 2019-01-06T15:48:36.078Z · score: 12 (6 votes)
Conceptual Analysis for AI Alignment 2018-12-30T00:46:38.014Z · score: 26 (9 votes)
Disambiguating "alignment" and related notions 2018-06-05T15:35:15.091Z · score: 43 (13 votes)
Problems with learning values from observation 2016-09-21T00:40:49.102Z · score: 0 (7 votes)
Risks from Approximate Value Learning 2016-08-27T19:34:06.178Z · score: 1 (4 votes)
Inefficient Games 2016-08-23T17:47:02.882Z · score: 14 (15 votes)
Should we enable public binding precommitments? 2016-07-31T19:47:05.588Z · score: 0 (1 votes)
A Basic Problem of Ethics: Panpsychism? 2015-01-27T06:27:20.028Z · score: -4 (11 votes)
A Somewhat Vague Proposal for Grounding Ethics in Physics 2015-01-27T05:45:52.991Z · score: -3 (16 votes)

Comments

Comment by capybaralet on What I'm doing to fight Coronavirus · 2020-03-11T07:59:17.910Z · score: 2 (2 votes) · LW · GW

Sold out on the website. Any ideas where else to get one?

Otherwise I guess masks are a decent substitute (they don't need to be P95 for this purpose...)

Comment by capybaralet on Towards a mechanistic understanding of corrigibility · 2020-03-02T21:17:04.432Z · score: 7 (2 votes) · LW · GW

OK, thanks.

The TL;DR seems to be: "We only need a lower bound on the catastrophe/reasonable impact ratio, and an idea about how much utility is available for reasonable plans."

This seems good... can you confirm my understanding below is correct?

2) RE: "How much utility is available": I guess we can just set a targeted level of utility gain, and it won't matter if there are plans we'd consider reasonable that would exceed that level? (e.g. "I'd be happy if we can make 50% more paperclips at the same cost in the next year.")

1) RE: "A lower bound": this seems good because we don't need to know how extreme catastrophes could be, we can just say: "If (e.g.) the earth or the human species ceased to exist as we know it within the year, that would be catastrophic".

Comment by capybaralet on Defining Myopia · 2020-03-01T22:20:24.469Z · score: 1 (1 votes) · LW · GW

I agree. While interesting, the contents and title of this post seem pretty mismatched.

Comment by capybaralet on Towards a mechanistic understanding of corrigibility · 2020-03-01T21:14:34.405Z · score: 1 (1 votes) · LW · GW

I generally don't read links when there's no context provided, and think it's almost always worth it (from a cooperative perspective) to provide a bit of context.

Can you give me a TL;DR of why this is relevant or what your point is in posting this link?

Comment by capybaralet on One Way to Think About ML Transparency · 2020-03-01T20:48:00.167Z · score: 1 (1 votes) · LW · GW

If you have access to the training data, then DNNs are basically theory-simulatable, since you can just describe the training algorithm and the initialization scheme. The use of random initialization seems like an obstacle, but we use pseudo-random numbers, and we can just learn the algorithms for generating those as well.
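
To illustrate the pseudo-randomness point, here's a minimal sketch (my own toy example, not from the linked post): given the data, the training procedure, and the seed, the trained model is a deterministic function of its description, so "random" initialization doesn't block simulating the procedure in principle.

```python
import numpy as np

def train(X, y, seed, steps=500, lr=0.1):
    # The "random" init is fully specified by the seed, so the whole
    # procedure can be written down and replayed exactly.
    rng = np.random.RandomState(seed)
    w = rng.randn(X.shape[1])
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y)  # plain least-squares gradient
        w -= lr * grad
    return w

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])

# Same data + same algorithm + same seed => exactly the same weights.
assert np.allclose(train(X, y, seed=42), train(X, y, seed=42))
```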

Comment by capybaralet on Towards a mechanistic understanding of corrigibility · 2020-03-01T20:36:41.638Z · score: 1 (1 votes) · LW · GW

I'm not sure it's the same thing as alignment... it seems there's at least 3 concepts here, and Hjalmar is talking about the 2nd, which is importantly different from the 1st:

  • "classic notion of alignment": The AI has the correct goal (represented internally, e.g. as a reward function)
  • "CIRL notion of alignment": AI has a pointer to the correct goal (but the goal is represented externally, e.g. in a human partner's mind)
  • "corrigibility": something else
Comment by capybaralet on Towards a mechanistic understanding of corrigibility · 2020-03-01T20:26:51.554Z · score: 1 (1 votes) · LW · GW

What do you mean "these things"?

Also, to clarify, when you say "not going to be useful for alignment", do you mean something like "...for alignment of arbitrarily capable systems"? i.e. do you think they could be useful for aligning systems that aren't too much smarter than humans?

Comment by capybaralet on Towards a mechanistic understanding of corrigibility · 2020-03-01T20:23:33.547Z · score: 1 (1 votes) · LW · GW

So IIUC, you're advocating trying to operate on beliefs rather than utility functions? But I don't understand why.

Comment by capybaralet on Towards a mechanistic understanding of corrigibility · 2020-03-01T20:11:57.132Z · score: 3 (2 votes) · LW · GW
We could instead verify that the model optimizes its objective while penalizing itself for becoming more able to optimize its objective.

As phrased, this sounds like it would require correctly (or at least conservatively) tuning the trade-off between these two goals, which might be difficult.

Comment by capybaralet on Towards a mechanistic understanding of corrigibility · 2020-03-01T20:08:31.062Z · score: 1 (1 votes) · LW · GW

One thing I found confusing about this post + Paul's post "Worst-case guarantees" (2nd link in the OP: https://ai-alignment.com/training-robust-corrigibility-ce0e0a3b9b4d) is that Paul says "This is the second guarantee from “Two guarantees,” and is basically corrigibility." But you say: "Corrigibility seems to be one of the most promising candidates for such an acceptability condition". So it seems like you guys might have somewhat different ideas about what corrigibility means.

Can you clarify what you think is the relationship?

Comment by capybaralet on Conceptual Analysis for AI Alignment · 2020-01-11T21:27:32.362Z · score: 1 (1 votes) · LW · GW

Here's a blog post arguing that conceptual analysis has been a complete failure, with a link to a paper saying the same thing: http://fakenous.net/?p=1130

Comment by capybaralet on Let's talk about "Convergent Rationality" · 2020-01-03T06:30:49.045Z · score: 1 (1 votes) · LW · GW
Sure, but within AI, intelligence is the main feature that we're trying very hard to increase in our systems that would plausibly let the systems we build outcompete us. We aren't trying to make AI systems that replicate as fast as possible. So it seems like the main thing to be worried about is intelligence.

Blaise Agüera y Arcas gave a keynote at this year's NeurIPS pushing ALife (motivated by specification problems, weirdly enough...: https://neurips.cc/Conferences/2019/Schedule?showEvent=15487).

The talk recording: https://slideslive.com/38921748/social-intelligence. I recommend it.


Comment by capybaralet on Let's talk about "Convergent Rationality" · 2020-01-03T06:19:49.063Z · score: 1 (1 votes) · LW · GW
With 0, the AI never does anything and so is basically a rock

I'm trying to point at "myopic RL", which does, in fact, do things.

You might object that all of these can be made state-dependent, but you can make your example state-dependent by including the current time in the state.

I do object, and still object, since I don't think we can realistically include the current time in the state. What we can include is: an impression of what the current time is, based on past and current observations. There's an epistemic/indexical problem here you're ignoring.

I'm not an expert on AIXI, but my impression from talking to AIXI researchers and looking at their papers is: finite-horizon variants of AIXI have this "problem" of time-inconsistent preferences, despite conditioning on the entire history (which basically provides an encoding of time). So I think the problem I'm referring to exists regardless.
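
As an aside, here's a minimal sketch of the kind of "myopic RL" I mean above (my own toy illustration, hypothetical numbers): with the discount set to zero, the agent ignores all future value, yet it still learns to prefer actions with higher immediate reward, i.e. it still does things.

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.0):
    # With gamma = 0 the bootstrap term vanishes: the agent is myopic,
    # but not inert -- it still learns which actions pay off right now.
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

# Toy setup: 2 states, 2 actions; action 1 always gives immediate reward 1.
Q = np.zeros((2, 2))
rng = np.random.default_rng(0)
for _ in range(1000):
    s, a = rng.integers(2), rng.integers(2)   # explore uniformly
    r = 1.0 if a == 1 else 0.0
    Q = q_update(Q, s, a, r, s_next=rng.integers(2))

print(Q.argmax(axis=1))  # [1 1]: the myopic agent still acts purposefully
```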


Comment by capybaralet on A reductio ad absurdum for naive Functional/Computational Theory-of-Mind (FCToM). · 2020-01-03T06:12:30.382Z · score: 1 (1 votes) · LW · GW
Can't I say that the emulation is me, and does morally matter (via FCToM), and also the many people enslaved in the system morally matter (via regular morality)?

Yes.

What I'm saying is that, to use the language of the debate I referenced, "what kind of paper the equation is written on DOES matter".

It seems like you're saying that FCToM implies that if a physical system implements a morally relevant mathematical function, then the physical system itself cannot include morally relevant bits, and I don't see why that has anything to do with FCToM.

I'm saying "naive FCToM", as I've characterized it, says that. I doubt "naive FCToM" is even coherent. That's sort of part of my broader point (which I didn't make yet in this post).


Comment by capybaralet on Let's talk about "Convergent Rationality" · 2020-01-02T16:44:01.939Z · score: 1 (1 votes) · LW · GW
Sure, but within AI, intelligence is the main feature that we're trying very hard to increase in our systems that would plausibly let the systems we build outcompete us. We aren't trying to make AI systems that replicate as fast as possible. So it seems like the main thing to be worried about is intelligence.

I think I was maybe trying to convey too much of my high-level views here. What's maybe more relevant and persuasive here is this line of thought:

  • Intelligence is very multi-faceted
  • An AI that is super-intelligent in a large number (but small fraction) of the facets of intelligence could strategically outmaneuver humans
  • Returning to the original point: such an AI could also be significantly less "rational" than humans

Also, nitpicking a bit: to a large extent, society is trying to make systems that are as competitive as possible at narrow, profitable tasks. There are incentives for excellence in many domains. FWIW, I'm somewhat concerned about replicators in practice, e.g. because I think open-ended AI systems operating in the real-world might create replicators accidentally/indifferently, and we might not notice fast enough.

My main opposition to this is that it's not actionable

I think the main take-away from these concerns is to realize that there are extra risk factors that are hard to anticipate and for which we might not have good detection mechanisms. This should increase pessimism/paranoia, especially (IMO) regarding "benign" systems.

Idk, if it's superintelligent, that system sounds both rational and competently goal-directed to me.

(non-hypothetical Q): What about if it has a horizon of 10^-8s? Or 0?

I'm leaning on "we're confused about what rationality means" here, and specifically, I believe time-inconsistent preferences are something that many would say seem irrational (prima facie). But


Comment by capybaralet on Let's talk about "Convergent Rationality" · 2020-01-02T03:46:51.648Z · score: 1 (1 votes) · LW · GW
Perhaps you mean grey-goo type scenarios where we wouldn't call the replicator "intelligent", but it's nonetheless a good replicator? Are you worried about AI systems of that form? Why?

Yes, I'm worried about systems of that form (in some sense). The reason is: I think intelligence is just one salient feature of what makes a life-form or individual able to out-compete others. I think intelligence, and fitness even more so, are multifaceted characteristics. And there are probably many possible AIs with different profiles of cognitive and physical capabilities that would pose an Xrisk for humans.

For instance, any appreciable quantity of a *hypothetical* grey goo that could use any matter on earth to replicate (i.e. duplicate itself) once per minute would almost certainly consume the earth in less than one day (I guess modulo some important problems around transportation and/or its initial distribution over the earth, but you probably get the point).
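
Back-of-the-envelope arithmetic behind that claim (my own rough numbers, assuming unconstrained doubling and Earth's mass of ~5.97e24 kg):

```python
import math

earth_mass_g = 5.97e24 * 1000        # Earth's mass in grams
doublings = math.log2(earth_mass_g)  # doublings needed starting from one gram
print(doublings)                     # ~92, i.e. ~92 minutes at one doubling per minute
```

So even from a one-gram seed, the goo would match Earth's mass in roughly an hour and a half, which makes "less than one day" a very conservative bound.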

More realistically, it seems likely that we will have AI systems that have some significant flaws but are highly competent at strategically relevant cognitive skills, able to think much faster than humans, and have very different (probably larger but a bit more limited) arrays of sensors and actuators than humans, which may pose some Xrisk.

The point is just that intelligence and rationality are important traits for Xrisk, but we should certainly not make the mistake of believing one/either/both are the only traits that matter. And we should also recognize that they are both abstractions and simplifications that we believe are often useful but rarely, if ever, sufficient for thorough and effective reasoning about AI-Xrisk.

Sure, I more meant competently goal-directed.

This is still, I think, not the important distinction. By "significantly restricted", I don't necessarily mean that it is limiting performance below a level of "competence". It could be highly competent, super-human, etc., but still be significantly restricted.

Maybe a good example (although maybe departing from the "restricted hypothesis space" type of example) would be an AI system that has a finite horizon of 1,000,000 years, but no other restrictions. There may be a sense in which this system is irrational (e.g. having time-inconsistent preferences), but it may still be extremely competently goal-directed.


Comment by capybaralet on Let's talk about "Convergent Rationality" · 2020-01-01T21:48:16.681Z · score: 3 (2 votes) · LW · GW

At a meta-level: this post might be a bit too under-developed to be worth trying to summarize in the newsletter; I'm not sure.

RE the summary:

  • I wouldn't say I'm introducing a single thesis here; I think there are probably a few versions that should be pulled apart, and I haven't done that work yet (nor has anyone else, FWICT).
  • I think the use of "must" in your summary is too strong. I would phrase it more like "unbounded increases in the capabilities of an AI system drive an unbounded increase in the agenty-ness or rationality of the system".
  • The purported failure of capability control I'm imagining isn't because the AI subverts capability controls; that would be putting the cart before the horse. The idea is that an AI that doesn't conceptualize itself as an agent would begin to do so, and that very event is a failure of a form of "capability control", specifically the "don't build an agent" form. (N.B.: some people have been confused by my calling that a form of capability control...)
  • My point is stronger than this: "we could still have AI systems that are far more 'rational' than us, even if they still have some biases that they do not seek to correct, and this could still lead to x-risk." I claim that a system doesn't need to be very "rational" at all in order to pose significant Xrisk. It can just be a very powerful replicator/optimizer.

RE the opinion:

  • See my edit to the comment about "convergent goal-directedness", we might have some misunderstanding... To clarify my position a bit:
    • I think goal-directedness seems like a likely component of rationality, but we're still working on deconfusing rationality itself, so it's hard to say for sure
    • I think it's only a component and not the same thing, since I would consider an RL agent that has a significantly restricted hypothesis space to be goal-directed, but probably not highly rational. CRT would predict that (given a sufficient amount of compute and interaction) such an agent would have a tendency to expand its (effective) hypothesis space to address inadequacies. This might happen via recruiting resources in the environment and eventually engaging in self-modification.
  • I think CRT is not well-formulated or specified enough (yet) to be something that one can agree/disagree with, without being a bit more specific.

Comment by capybaralet on Might humans not be the most intelligent animals? · 2020-01-01T18:38:51.849Z · score: 1 (1 votes) · LW · GW

OK I understand. JTBC, my original statement was that "the social environment that language and human culture created led to a massive increase in returns to intelligence", not that larger brains (/greater intelligence) suddenly became valuable.

Comment by capybaralet on Might humans not be the most intelligent animals? · 2019-12-29T01:16:39.709Z · score: 1 (1 votes) · LW · GW

TBC: I'm alluding to others' scholarly arguments, which I'm not very familiar with. I'm not sure to what extent these arguments have empirical vs. theoretical basis.

Comment by capybaralet on Might humans not be the most intelligent animals? · 2019-12-29T01:15:46.862Z · score: 2 (2 votes) · LW · GW

Basic question: how can this be a sufficient explanation? There needs to be *some* advantage to having the bigger brain, it being "cheap" isn't a good enough explanation...

Comment by capybaralet on Might humans not be the most intelligent animals? · 2019-12-24T01:10:00.255Z · score: 9 (6 votes) · LW · GW

I think one of the strongest arguments for humans being the smartest animals is that the social environment that language and human culture created led to a massive increase in returns to intelligence, and I think there's some evidence from what we know about evolution that the human neocortex ballooned around the same time that we're theorized to have developed language and culture.

Comment by capybaralet on Might humans not be the most intelligent animals? · 2019-12-24T00:10:40.144Z · score: 1 (1 votes) · LW · GW

" some animals that are more intelligent than animals" --> " some animals that are more intelligent than HUMANS "

Comment by capybaralet on What I talk about when I talk about AI x-risk: 3 core claims I want machine learning researchers to address. · 2019-12-05T17:39:48.360Z · score: 3 (2 votes) · LW · GW

TBC, I'm definitely NOT thinking of this as an argument for funding AI safety.

Comment by capybaralet on What I talk about when I talk about AI x-risk: 3 core claims I want machine learning researchers to address. · 2019-12-05T17:36:57.303Z · score: 3 (2 votes) · LW · GW

I'm definitely interested in hearing other ways of splitting it up! This is one of the points of making this post. I'm also interested in what you think of the ways I've done the breakdown! Since you proposed an alternative, I guess you might have some thoughts on why it could be better :)

I see your points as being directed more at increasing ML researchers' respect for AI x-risk work and their likelihood of doing relevant work. Maybe that should in fact be the goal. It seems to be a more common goal.

I would describe my goal (with this post, at least, and probably with most conversations I have with ML people about Xrisk) as something more like: "get them to understand the AI safety mindset, and where I'm coming from; get them to really think about the problem and engage with it". I expect a lot of people here would reason, in a narrowly and myopically consequentialist way, that this is not as good a goal, but I'm unconvinced.

Comment by capybaralet on What I talk about when I talk about AI x-risk: 3 core claims I want machine learning researchers to address. · 2019-12-05T07:51:47.772Z · score: 3 (2 votes) · LW · GW

TBC, it's an unconference, so it wasn't really a talk (although I did end up talking a lot :P).

How sure are you that the people who showed up were objecting out of deeply-held disagreements, and not out of a sense that objections are good?

Seems like a false dichotomy. I'd say people were mostly disagreeing out of not-very-deeply-held-at-all disagreements :)

Comment by capybaralet on A list of good heuristics that the case for AI x-risk fails · 2019-12-04T23:58:10.456Z · score: 12 (4 votes) · LW · GW

Another important improvement I should make: rephrase these to have the type signature of "heuristic"!

Comment by capybaralet on What I talk about when I talk about AI x-risk: 3 core claims I want machine learning researchers to address. · 2019-12-03T17:12:11.952Z · score: 3 (4 votes) · LW · GW

No, my goal is to:

  • Identify a small set of beliefs to focus discussions around.
  • Figure out how to make the case for these beliefs quickly, clearly, persuasively, and honestly.

And yes, I did mean >1%, but I just put that number there to give people a sense of what I mean, since "non-trivial" can mean very different things to different people.


Comment by capybaralet on A list of good heuristics that the case for AI x-risk fails · 2019-12-03T06:44:27.163Z · score: 3 (2 votes) · LW · GW

Oh sure, in some special cases. I don't think this experience was particularly representative.

Comment by capybaralet on A list of good heuristics that the case for AI x-risk fails · 2019-12-03T05:04:12.105Z · score: 5 (7 votes) · LW · GW

Yeah I've had conversations with people who shot down a long list of concerned experts, e.g.:

  • Stuart Russell is GOFAI ==> out-of-touch
  • Shane Legg doesn't do DL, does he even do research? ==> out-of-touch
  • Ilya Sutskever (and everyone at OpenAI) is crazy, they think AGI is 5 years away ==> out-of-touch
  • Anyone at DeepMind is just marketing their B.S. "AGI" story or drank the koolaid ==> out-of-touch

But then, even the big 5 of deep learning have all said things that can be used to support the case....

So it kind of seems like there should be a compendium of quotes somewhere, or something.

Comment by capybaralet on Clarifying some key hypotheses in AI alignment · 2019-12-02T20:52:38.052Z · score: 6 (4 votes) · LW · GW

Nice chart!

A few questions and comments:

  • Why the arrow from "agentive AI" to "humans are economically outcompeted"? The explanation makes it sound like it should point to "target loading fails"??
  • Suggestion: make the blue boxes without parents more apparent? e.g. a different shade of blue? Or all sitting above the other ones? (e.g. "broad basin of corrigibility" could be moved up and left).

Comment by capybaralet on A list of good heuristics that the case for AI x-risk fails · 2019-12-02T19:28:55.504Z · score: 8 (5 votes) · LW · GW

I pushed this post out since I think it's good to link to it in this other post. But there are at least 2 improvements I'd like to make and would appreciate help with:

Comment by capybaralet on LessWrong anti-kibitzer (hides comment authors and vote counts) · 2019-11-28T17:32:15.184Z · score: 1 (1 votes) · LW · GW

The link to user preferences is broken. Is this feature still built in? Or does the Firefox thing still work?

Comment by capybaralet on Can indifference methods redeem person-affecting views? · 2019-11-17T01:28:59.231Z · score: 1 (1 votes) · LW · GW

Can you give a concrete example for why the utility function should change?

Comment by capybaralet on AI Safety "Success Stories" · 2019-10-24T00:01:41.942Z · score: 1 (1 votes) · LW · GW

I couldn't say without knowing more what "human safety" means here.

But here's what I imagine an example pivotal command looking like: "Give me the ability to shut-down unsafe AI projects for the foreseeable future. Do this while minimizing disruption to the current world order / status quo. Interpret all of this in the way I intend."

Comment by capybaralet on AI Safety "Success Stories" · 2019-10-21T19:43:17.770Z · score: 1 (1 votes) · LW · GW

OK, I think that makes some sense.

I don't know how I'd fill out the row, since I don't understand what is covered by the phrase "human safety", or what assumptions are being made about the proliferation of the technology, or more specifically, the characteristics of the humans who do possess the tech.

I think I was imagining that the pivotal tool AI is developed by highly competent and safety-conscious humans who use it to perform a pivotal act (or series of pivotal acts) that effectively precludes the kind of issues mentioned in Wei's quote there.

Comment by capybaralet on TAISU 2019 Field Report · 2019-10-16T06:34:38.468Z · score: 3 (2 votes) · LW · GW
Linda organized it as two 2 day unconferences held back-to-back

Can you explain how that is different from a 4-day unconference, more concretely?

Comment by capybaralet on TAISU 2019 Field Report · 2019-10-16T06:33:51.818Z · score: 7 (5 votes) · LW · GW
I think the workshop would be a valuable use of three days for anyone actively working in AI safety, even if they consider themselves "senior" in the field: it offered a valuable space for reconsidering basic assumptions and rediscovering the reasons why we're doing what we're doing.

This read to me as a remarkably strong claim; I assumed you meant something slightly weaker. But then I realized you said "valuable" which might mean "not considering opportunity cost". Can you clarify that?

And if you do mean "considering opportunity cost", I think it would be worth giving your ~strongest argument(s) for it!

For context, I am a PhD candidate in ML working on safety, and I am interested in such events, but unsure if they would be a valuable use of my time, and OTTMH would expect most of the value to be in terms of helping others rather than benefitting my own understanding/research/career/ability-to-contribute (I realize this sounds a bit conceited, and I didn't try to avoid that except via this caveat, and I really do mean (just) OTTMH... I think the reality is a bit more that I'm mostly estimating value based on heuristics). If I had been in the UK when they happened, I would probably have attended at least one.

But I think I am a bit unusual in my level of enthusiasm. And FWICT, such initiatives are not receiving many resources (including money and involvement of senior safety researchers) and potentially should receive A LOT more (e.g. 1-2 orders of magnitude). So the case for them being valuable (in general or for more senior/experienced researchers) is an important one!

Comment by capybaralet on AI Safety "Success Stories" · 2019-10-14T19:48:00.035Z · score: 2 (2 votes) · LW · GW

Does an "AI safety success story" encapsulate just a certain trajectory in AI (safety) development?

Or does it also include a story about how AI is deployed (and by who, etc.)?

I like this post a lot, but I think it ends up being a bit unclear because I don't think everyone has the same use cases in mind for the different technologies underlying these scenarios, and/or I don't think everyone agrees with the way in which safety research is viewed as contributing to success in these different scenarios... Maybe fleshing out the success stories, or referencing some more in-depth elaborations of them would make this clearer?

Comment by capybaralet on AI Safety "Success Stories" · 2019-10-14T19:43:16.838Z · score: 1 (1 votes) · LW · GW

I'm going to dispute a few cells in your grid.

  • I think the pivotal tool story has low reliance on human safety (although I'm confused by that row in general).
  • Whether sovereigns would require restricted access is unclear. This is basically the question of whether single-agent, single-user alignment will likely produce a solution to multi-agent, multi-user alignment (in a timely manner).
  • ETA: the "interim quality of life improver" seems to roughly be talking about episodic RL, which I would classify as "medium" autonomy.
Comment by capybaralet on AI Safety "Success Stories" · 2019-10-14T19:38:37.175Z · score: 3 (2 votes) · LW · GW

I don't understand what you mean by "Reliance on human safety". Can you clarify/elaborate? Is this like... relying on humans' (meta-)philosophical competence? Relying on not having bad actors? etc...


Comment by capybaralet on AI Safety "Success Stories" · 2019-10-14T19:37:07.599Z · score: 1 (1 votes) · LW · GW

While that's true to some extent, a lot of research does seem to be motivated much more by some of these scenarios. For example, work on safe oracle designs seems primarily motivated by the pivotal tool success story.

Comment by capybaralet on Debate on Instrumental Convergence between LeCun, Russell, Bengio, Zador, and More · 2019-10-11T06:38:35.437Z · score: 1 (1 votes) · LW · GW

No idea why this is heavily downvoted; strong upvoted to compensate.

I'd say he's discouraging everyone from working on the problems, or at least from considering such work to be important, urgent, high status, etc.

Comment by capybaralet on Review: Selfish Reasons to Have More Kids · 2019-10-03T06:04:28.903Z · score: 4 (1 votes) · LW · GW
It seems to me that the elephant in the room here are the peer effects.

By regressing on household identity, you capture how parents' efforts to control kids' peer groups influence outcomes. This was discussed (briefly) in the book, towards the beginning (~pages 30-40?).


Comment by capybaralet on Just Imitate Humans? · 2019-09-19T10:30:50.864Z · score: 2 (2 votes) · LW · GW
A Bayesian predictor of the human's behavior will consider the hypothesis Hg that the human does the sort of planning described above in the service of goal g. It will have a corresponding hypothesis for each such goal g. It seems to me, though, that these hypotheses will be immediately eliminated. The human's observed behavior won't include taking over the world or any other existentially dangerous behavior, as would have been implied by hypotheses of the form Hg.

This is a very good argument, and I'm still trying to decide how decisive I think it is.

In the meantime, I'll mention that I'm imagining the learner as something closer to a DNN than a Bayesian predictor. One image of how DNN learning often proceeds is as a series of "aha" moments (generating/revising highly general explanations of the data) interspersed/intermingled with something more like memorization of data-points that don't fit the current general explanations. That view makes it seem plausible that "planning" would emerge as an "aha" moment before being refined as "oh wait, bounded planning... with these heuristics... and these restrictions...", creating a dangerous window of time between "I'm doing planning" and "I'm planning like a human, warts and all".

Comment by capybaralet on Just Imitate Humans? · 2019-09-18T04:00:30.538Z · score: 4 (2 votes) · LW · GW

RE: "Imitation learning considered unsafe?" (I'm the author):

The post can basically be read as arguing that human imitation seems especially likely to produce mesa-optimization.

I agree with your response; this is also why I said: "Mistakes in imitating the human may be relatively harmless; the approximation may be good enough".

I don't agree with your characterization, however. The concern is not that it would have roughly human-like planning, but rather super-human planning (since this is presumably simpler according to most reasonable priors).

Comment by capybaralet on Distance Functions are Hard · 2019-09-14T19:15:45.876Z · score: 1 (1 votes) · LW · GW
I see. How about doing active learning of computable functions? That solves all 3 problems

^ I don't see how?

I should elaborate... it sounds like you're thinking of active learning (where the AI can choose to make queries for information, e.g. labels), but I'm talking about *inter*active training, where a human supervisor is *also* actively monitoring the AI system, making queries of it, and intelligently selecting feedback for the AI. This might be simulated as well, using multiple AIs, and there might be a lot of room for good work there... but I think if we want to solve alignment, we want a deep and satisfying understanding of AI systems, which seems hard to come by without rich feedback loops between humans and AIs. Basically, by interactive training, I have in mind something where training AIs looks more like teaching other humans.

So at the very least, a superintelligent self-supervised learning system trained on loads of human data would have a lot of conceptual building blocks (developed in order to make predictions about its training data) which could be tweaked and combined to make predictions about human values (analogous to fine-tuning in the context of transfer learning).

I think it's a very open question how well we can expect advanced AI systems to understand or mirror human concepts by default. Adversarial examples suggest we should be worried that apparently similar concepts will actually be wildly different in non-obvious ways. I'm cautiously optimistic, since this could make things a lot easier. It's also unclear ATM how precisely AI concepts need to track human concepts in order for things to work out OK. The "basin of attraction" line of thought suggests that they don't need to be that great, because they can self-correct or learn to defer to humans appropriately. My problem with that argument is that it seems like we will have so many chances to fuck up that we would need 1) AI systems to be extremely reliable, or 2) for catastrophic mistakes to be rare, and minor mistakes to be transient or detectable. (2) seems plausible to me in many applications, but probably not all of the applications where people will want to use SOTA AI.

Re: gwern's article, RL does not seem to me like a good fit for most of the problems he describes. I agree active learning/interactive training protocols are powerful, but that's not the same as RL.

Yes ofc they are different.

I think the significant features of RL here are: 1) having the goal of understanding the world and how to influence it, and 2) doing (possibly implicit) planning. RL can also be pointed at narrow domains, but for a lot of problems, I think having general knowledge will be very valuable, and hard to replicate with a network of narrow systems.

I think the solution for autonomy is (1) solve calibration/distributional shift, so the system knows when it's safe to act autonomously (2) have the system adjust its own level of autonomy/need for clarification dynamically depending on the apparent urgency of its circumstances.

That seems great, but also likely to be very difficult, especially if we demand high reliability and performance.

Comment by capybaralet on Distance Functions are Hard · 2019-09-13T15:00:52.143Z · score: 1 (1 votes) · LW · GW

They're a pain because they involve a lot of human labor, slow down the experiment loop, make reproducing results harder, etc.

RE self-supervised learning: I don't see why we needed the rebranding (of unsupervised learning). I don't see why it would make alignment straightforward (ETA: except to the extent that you aren't necessarily, deliberately building something agenty). The boundaries between SSL and other ML are fuzzy; I don't think we'll get to AGI using just SSL and nothing like RL. SSL doesn't solve the exploration problem; if you start caring about exploration, I think you end up doing things that look more like RL.

I also tend to agree (e.g. with that gwern article) that AGI designs that aren't agenty are going to be at a significant competitive disadvantage, so probably aren't a satisfying solution to alignment, but could be a stop-gap.

Comment by capybaralet on Distance Functions are Hard · 2019-09-12T04:36:08.322Z · score: 1 (1 votes) · LW · GW

At the same time, the importance of having a good distance/divergence, the lack of appropriate ones, and the difficulty of learning them are widely acknowledged challenges in machine learning.

A distance function is fairly similar to a representation in my mind, and high-quality representation learning is considered a bit of a holy grail open problem.

Machine learning relies on formulating *some* sort of objective, which can be viewed as analogous to the choice of a good distance function, so I think the central point of the post (as I understood it from a quick glance) is correct: "specifying a good distance measure is not that much easier than specifying a good objective".

It's also an open question how much learning, (relatively) generic priors, and big data can actually solve the issue of weak learning signals and weak priors for us. A lot of people are betting pretty hard on that; I think it's plausible, but not very likely. I think it's more like a recipe for unaligned AI, and we need to get more bits of information about what we actually want into AI systems somehow. Highly interactive training protocols seem super valuable for that, but the ML community has a strong preference against such work because it is a massive pain compared to the non-interactive UL/SL/RL settings that are popular.


Comment by capybaralet on Two senses of “optimizer” · 2019-09-12T04:18:54.965Z · score: 1 (1 votes) · LW · GW

Yep. Good post. Important stuff. I think we're still struggling to understand all of this fully, and work on indifference seems like the most relevant stuff.

My current take is that as long as there is any "black-box" part of the algorithm which is optimizing for performance, then it may end up behaving like an optimizer_2, since the black box can pick up on arbitrary effective strategies.

(in partial RE to Rohin below): I wouldn't necessarily say that such an algorithm knows about its environment (i.e. has a good model), it may simply have stumbled upon an effective strategy for interacting with it (i.e. have a good policy).

Comment by capybaralet on Two senses of “optimizer” · 2019-09-12T04:12:54.291Z · score: 3 (2 votes) · LW · GW
It seems to me that the distinction is whether the optimizer has knowledge about the environment

Alternatively, you could say the distinction is whether the optimizer cares about the environment. I think there's a sense (or senses?) in which these things can be made/considered equivalent. I don't feel like I totally understand or am satisfied with either way of thinking about it, though.