Posts

Problems of evil 2021-04-19T08:06:42.895Z
The innocent gene 2021-04-05T03:31:29.782Z
The importance of how you weigh it 2021-03-29T04:59:41.327Z
On future people, looking back at 21st century longtermism 2021-03-22T08:23:06.743Z
Against neutrality about creating happy lives 2021-03-15T01:55:27.568Z
Care and demandingness 2021-03-08T07:03:42.755Z
Subjectivism and moral authority 2021-03-01T09:02:58.739Z
Two types of deference 2021-02-22T03:32:53.561Z
Contact with reality 2021-02-15T04:53:39.739Z
Killing the ants 2021-02-07T23:17:01.938Z
Believing in things you cannot see 2021-02-01T07:26:54.082Z
On clinging 2021-01-24T23:25:36.412Z
A ghost 2021-01-21T07:14:05.298Z
Actually possible: thoughts on Utopia 2021-01-18T08:27:39.428Z
Alienation and meta-ethics (or: is it possible you should maximize helium?) 2021-01-15T07:07:25.675Z
The impact merge 2021-01-13T07:26:42.605Z
Shouldn't it matter to the victim? 2021-01-11T07:16:28.453Z
Thoughts on personal identity 2021-01-08T04:19:19.637Z
Grokking illusionism 2021-01-06T05:50:57.598Z
The despair of normative realism bot 2021-01-03T23:07:08.767Z
Thoughts on being mortal 2021-01-01T19:17:17.697Z
Wholehearted choices and “morality as taxes” 2020-12-23T02:21:36.392Z

Comments

Comment by Joe Carlsmith (joekc) on Clarifying inner alignment terminology · 2021-02-19T21:33:00.566Z · LW · GW

Cool. (Though FWIW, if you're going to lean on the notion of policies being aligned with humans, I'd be inclined to define that as well, in addition to defining what it is for agents to be aligned with humans. But maybe the implied definition is clear enough: I'm assuming you have in mind something like "a policy is aligned with humans if an agent implementing that policy is aligned with humans.")

Regardless, sounds like your definition is pretty similar to: "An agent is intent aligned if its behavioral objective is such that an arbitrarily powerful and competent agent pursuing this objective to arbitrary extremes wouldn't act in ways that humans judge bad"? If you see it as importantly different from this, I'd be curious.

Comment by Joe Carlsmith (joekc) on Clarifying inner alignment terminology · 2021-02-19T18:43:57.003Z · LW · GW

Aren't they now defined in terms of each other? 

"Intent alignment: An agent is intent aligned if its behavioral objective is outer aligned.

Outer alignment: An objective function is outer aligned if all models that perform optimally on that objective function in the limit of perfect training and infinite data are intent aligned."

Comment by Joe Carlsmith (joekc) on Clarifying inner alignment terminology · 2021-02-19T07:48:27.938Z · LW · GW

Thanks for writing this up. Quick question re: "Intent alignment: An agent is intent aligned if its behavioral objective is aligned with humans." What does it mean for an objective to be aligned with humans, on your view? You define what it is for an agent to be aligned with humans, e.g.: "An agent is aligned (with humans) if it doesn't take actions that we would judge to be bad/problematic/dangerous/catastrophic." But you don't say explicitly what it is for an objective to be aligned: I'm curious if you have a preferred formulation.

Is it something like: “the behavioral objective is such that, when the agent does ‘well’ on this objective, the agent doesn’t act in a way we would view as bad/problematic/dangerous/catastrophic”? If so, it seems like a lot might depend on exactly how “well” the agent does, and what opportunities it has in a given context. That is, an “aligned” agent might not stay aligned if it becomes more powerful, but continues optimizing for the same objective (for example, a weak robot optimizing for beating me at chess might be "aligned" because it only focuses on making good chess moves, but a stronger one might not be, because it figures out how to drug my tea). Is that an implication you’d endorse?

Or is the thought something like: "the behavioral objective is such that, no matter how powerfully the agent optimizes for it, and no matter its opportunities for action, it doesn't take actions we would view as bad/problematic/dangerous/catastrophic"? My sense is that something like this is often the idea people have in mind, especially in the context of anticipating things like intelligence explosions. If this is what you have in mind, though, maybe worth saying so explicitly, since intent alignment in this sense seems like a different constraint than intent alignment in the sense of e.g. "the agent's pursuit of its behavioral objective does not in fact give rise to bad actions, given the abilities/contexts/constraints that will in fact be relevant to its behavior."
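
To make the capability-dependence worry concrete, here is a minimal toy sketch in Python. It is purely illustrative and not from the original post: the action names, win probabilities, and "judged bad" labels are hypothetical, echoing the chess example above. The point is just that the same behavioral objective, optimized over a larger action set, can go from behavior humans judge fine to behavior humans judge bad.

```python
# Toy illustration (hypothetical example): the same behavioral objective can
# yield acceptable behavior for a weak agent and bad behavior for a more
# capable one, because the capable agent has more actions available.

# Each action maps to (probability the agent wins, whether humans would judge it bad).
ACTIONS = {
    "play_decent_moves": (0.6, False),
    "play_strong_moves": (0.9, False),
    "drug_opponents_tea": (1.0, True),  # only available to the more capable agent
}

def objective(action):
    """Behavioral objective: maximize probability of winning at chess."""
    win_prob, _ = ACTIONS[action]
    return win_prob

def choose(available):
    """The agent optimizes its objective over whatever actions it can take."""
    return max(available, key=objective)

weak_agent = ["play_decent_moves", "play_strong_moves"]
capable_agent = weak_agent + ["drug_opponents_tea"]

for name, available in [("weak", weak_agent), ("capable", capable_agent)]:
    action = choose(available)
    _, judged_bad = ACTIONS[action]
    print(f"{name} agent picks {action!r}; humans judge it bad: {judged_bad}")

# weak agent picks 'play_strong_moves'; humans judge it bad: False
# capable agent picks 'drug_opponents_tea'; humans judge it bad: True
```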

Comment by Joe Carlsmith (joekc) on On clinging · 2021-01-28T08:47:43.580Z · LW · GW

Interesting; I hadn't really considered that angle. Seems like this could also apply to other mental phenomena that might seem self-recommending (pleasure? rationality?), but which plausibly have other, more generally adaptive functions as well, so I would continue to wonder about other functions regardless.

Comment by Joe Carlsmith (joekc) on Grokking illusionism · 2021-01-26T07:41:43.626Z · LW · GW

I meant mental states in something more like the #1 sense -- and so, I think, does Frankish.

Comment by Joe Carlsmith (joekc) on Grokking illusionism · 2021-01-26T07:27:11.934Z · LW · GW

My sense is that the possibility of dynamics of this kind would be on people's radar in the philosophy community, at least.

Comment by Joe Carlsmith (joekc) on On clinging · 2021-01-25T07:42:56.433Z · LW · GW

Thanks :). I do think clinging often functions as an unnoticed lens on the world; though noticing it, in my experience, is also quite distinct from it "releasing." I also would've thought that depression can be an unnoticed (or at least, unquestioned) lens as well: e.g., a depressed person who is convinced that everything in the world is bad, that they'll never feel better again, etc.

Comment by Joe Carlsmith (joekc) on The impact merge · 2021-01-14T07:18:03.069Z · LW · GW

Glad to hear you found it useful.

Comment by Joe Carlsmith (joekc) on The impact merge · 2021-01-14T07:04:54.870Z · LW · GW

Thanks :). Re blog name: it isn't. "Hands" comes from a Martin Buber quote, and "Cities" from a phrase I believe I heard from A.J. Julius. I chose them partly as a personal reminder about the blog's aims.

Comment by Joe Carlsmith (joekc) on The impact merge · 2021-01-14T06:51:06.886Z · LW · GW

That's the one :)

Comment by Joe Carlsmith (joekc) on Grokking illusionism · 2021-01-10T05:59:20.598Z · LW · GW

I do remember that conversation, though I'm a bit hazy on the details of the argument you presented. Let me know if there's a write-up/summary somewhere, or if you create one in future. 

Comment by Joe Carlsmith (joekc) on Grokking illusionism · 2021-01-10T05:57:11.688Z · LW · GW

Thanks for explaining where you're coming from. 

"Yet I experience that computation as the qualia of 'blueness.' How can that be? How can any computation of any kind create, or lead to qualia of any kind? You can say that it is just a story my brain is telling me that 'I am seeing blue.' I must not understand what is being claimed, because I agree with it and yet it doesn't remove the problem at all. Why does that story have any phenomenology to it? I can make no sense of the claim that it is an illusion."

As I understand it, the idea would be that, as weird as it may sound, there isn't any phenomenology to it. Rather: according to the story that your brain is telling, there is some phenomenology to it. But there isn't. That is, your brain's story doesn't create, lead to, or correlate with phenomenal blueness; rather, phenomenal blueness is something that the story describes, but which doesn't exist, in the same way that a story can describe unicorns without bringing them to life. 

Comment by Joe Carlsmith (joekc) on Grokking illusionism · 2021-01-07T08:35:07.392Z · LW · GW

I’m hopeful that if we actually had a worked out reductionist account of all the problematic intuitions, which we knew was right and which made illusionism true, then this would be at least somewhat helpful in making illusionism less mysterious. In particular, I’m hopeful that thoroughly and dutifully reconceptualizing our introspection and intuitions according to that theory — “when it seems to me like X, what’s going on is [insert actual gears level explanation, not just ‘neurons are firing’ or ‘my brain is representing its internal processing in a simplified and false way’]” — would make a difference.

Comment by Joe Carlsmith (joekc) on Grokking illusionism · 2021-01-07T08:25:28.860Z · LW · GW

Glad you found it helpful (or at least, as helpful as other work on the topic). So far in my engagement with Graziano (specifically, non-careful reads of his 2013 book and his 2019 “Toward a standard model of consciousness”), I don’t feel like I’ve taken away much more than the summary I gave above of Frankish’s view: namely, “introspective mechanisms ... track the processes involved in access consciousness and represent them using a simplified model” — something pretty similar to what Chalmers also says here on p. 34. I know Graziano focuses on attention in particular, and he talks more about e.g. sociality and cites some empirical work, but at a shallow glance I’m not sure I yet see really substantive and empirically grounded increases in specificity, beyond what seems like the general line amongst a variety of folks that “there’s some kind of global workspace-y thing, there’s some kind of modeling of that, this modeling involves simplifications/distortions/opacity of various kinds, these somehow explain whatever problem intuitions/reports need explaining." But I haven’t tried to look at Graziano closely. The “naive” vs. “sophisticated” descriptions in your blog post seem like a helpful way to frame his project.