Posts

Do factored sets elucidate anything about how to update everyday beliefs? 2021-11-22T06:51:15.655Z
Hope and False Hope 2021-09-04T09:46:23.513Z
Untangling 2021-08-29T14:26:49.176Z
Thinking about AI relationally 2021-08-16T22:03:07.780Z
Strategy vs. PR-narrative 2021-08-15T22:40:59.527Z
Evidence that adds up 2021-07-29T03:27:34.676Z
ELI12: how do libertarians want wages to work? 2021-06-24T07:00:02.206Z
Visualizing in 5 dimensions 2021-06-19T18:15:18.160Z

Comments

Comment by TekhneMakre on Morality is Scary · 2021-12-02T20:36:46.676Z · LW · GW
> at least some researchers don't seem to consider that part of "alignment".

It's part of alignment. Also, it seems mostly separate from the part about "how do you even have consequentialism powerful enough to make, say, nanotech, without killing everyone as a side-effect?", and the latter seems not too related to the former.

Comment by TekhneMakre on Biology-Inspired AGI Timelines: The Trick That Never Works · 2021-12-02T17:18:34.179Z · LW · GW

Seems right, IDK. But still, that's a different kind of uncertainty than uncertainty about, like, the shape of algorithm-space.

Comment by TekhneMakre on Morality is Scary · 2021-12-02T12:55:10.689Z · LW · GW


So on the one hand you have values that are easily, trivially compatible, such as "I want to spend 1000 years climbing the mountains of Mars" or "I want to host blood-sports with my uncoerced friends with the holodeck safety on".

On the other hand you have insoluble, or at least apparently insoluble, conflicts: B wants to torture people, C wants there to be no torture anywhere at all. C wants to monitor everyone everywhere forever to check that they aren't torturing anyone or plotting to torture anyone, D wants privacy. E and F both want to be the best in the universe at quantum soccer, even if they have to kneecap everyone else to get that. Etc.

It's simply false that you can just put people on the throne as emperor of the universe, and they'll justly compromise about all conflicts. Or even do anything remotely like that.

How many people have conflictual values that they, effectively, value lexicographically more than their other values? Does decision theory imply that compromise will be chosen by sufficiently well-informed agents who do not have lexicographically valued conflictual values?

Comment by TekhneMakre on Morality is Scary · 2021-12-02T12:41:30.809Z · LW · GW


> All the rest is an act of shared imagination. It’s a dream we weave around a status game.
> They’re part of the dream of reality in which they exist, a dream that feels no less obvious and true to them than ours does to us.
> Moral ‘truths’ are acts of imagination. They’re ideas we play games with.

IDK, I feel like you could say the same sentences truthfully about math, and if you "went with the overall vibe" of them, you might be confused and mistakenly think math was "arbitrary" or "meaningless", or doesn't have a determinate tendency, etc. Like, okay, if I say "one element of moral progress is increasing universalizability", and you say "that's just the thing your status cohort assigns high status", I'm like, well, sure, but that doesn't mean it doesn't also have other interesting properties, like being a tendency across many different peoples; like being correlated with the extent to which they're reflecting, sharing information, and building understanding; like resulting in reductionist-materialist local outcomes that have more of material local things that people otherwise generally seem to like (e.g. not being punched, having food, etc.); etc. It could be that morality has tendencies, but not without hormesis and mutually assured destruction and similar things that might be removed by aligned AI.

Comment by TekhneMakre on Biology-Inspired AGI Timelines: The Trick That Never Works · 2021-12-02T12:23:49.154Z · LW · GW

Hold on, I guess this actually means that for a natural interpretation of "entropy" in "generic uncertainty about maybe being wrong, without other extra premises, should increase the entropy of one's probability distribution over AGI," that statement is actually false. If by "entropy" we mean "entropy according to the uniform measure", it's false. What we should really mean is entropy according to one's maximum entropy distribution (as the background measure), in which case the statement is true.

Comment by TekhneMakre on Biology-Inspired AGI Timelines: The Trick That Never Works · 2021-12-02T03:52:51.483Z · LW · GW

> I have calculated the number of computer operations used by evolution to evolve the human brain - searching through organisms with increasing brain size - by adding up all the computations that were done by any brains before modern humans appeared. It comes out to 10^43 computer operations. AGI isn't coming any time soon!

> And yet, because your reasoning contains the word "biological", it is just as invalid and unhelpful as Moravec's original prediction.

I agree that the conclusion about AGI not coming soon is invalid, so the following isn't exactly responding to what you say. But: ISTM the evolution thing is somewhat qualitatively different from Moravec or Stack More Layers, in that it softly upper bounds the uncertainty about the algorithmic knowledge needed to create AGI. IDK how easy it would be to implement an evolution that spits out AGI, but that difficulty seems like it should be less conceptually uncertain than the difficulty of understanding enough about AGI to do something more clever with less compute. Like, we could extrapolate out 3 OOMs of compute/$ per decade to get an upper bound: very probably AGI before 2150-ish, if Moore's law continues. Not very certain, or helpful if you already think AGI is very likely soon-ish, but it has nonzero content.
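
(A crude sketch of that OOM arithmetic, in case it's useful; every number below except the quoted 10^43 is a hypothetical placeholder, so the printed answer is only as good as those guesses:)

```python
import math

# Hypothetical placeholders -- swap in your own estimates.
evolution_ops = 1e43        # ops attributed to evolution in the quoted argument
ops_per_dollar_now = 1e18   # assumed present-day ops per dollar
project_budget = 1e10       # assumed project budget, in dollars
ooms_per_decade = 3         # assumed growth rate of compute per dollar

oom_gap = math.log10(evolution_ops) - math.log10(ops_per_dollar_now * project_budget)
decades = max(0.0, oom_gap / ooms_per_decade)
print(f"OOM gap: {oom_gap:.0f}; decades until re-running 'evolution' is affordable: {decades:.0f}")
```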

Comment by TekhneMakre on Biology-Inspired AGI Timelines: The Trick That Never Works · 2021-12-02T03:37:13.042Z · LW · GW

Now having read the rest of the essay... I guess "maximum entropy" is just straight up confusing if you don't insert the "...given assumptions XYZ". Otherwise it sounds like there's such a thing as "the maximum-entropy distribution", which doesn't exist: you have to cut up the possible worlds somehow, and different ways of cutting them up produces different uniform distributions. (Or in the continuous case, you have to choose a measure in order to do integration, and that measure contains just as much information as a probability distribution; the uniform measure says that all years are the same, but you could also say all orders of magnitude of time since the Big Bang are the same, or something else.) So how you cut up possible worlds changes the uniform distribution, i.e. the maximum entropy distribution. So the assumptions that go into how you cut up the worlds, are determining your maximum entropy distribution.
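
(A toy numeric illustration of the carving-dependence, with made-up numbers: both distributions below are "uniform", one per year and one per order of magnitude of years-until-AGI, yet they disagree wildly about the near term, and the second has lower entropy when entropy is computed against the per-year carving:)

```python
import numpy as np

years = np.arange(1, 1001)                 # years-until-AGI, 1..1000 (hypothetical horizon)

uniform_per_year = np.ones_like(years, dtype=float)
uniform_per_year /= uniform_per_year.sum()

log_uniform = 1.0 / years                  # "all orders of magnitude are the same"
log_uniform /= log_uniform.sum()

def entropy(p):
    return -(p * np.log(p)).sum()          # entropy relative to the per-year carving

for name, p in [("uniform per year", uniform_per_year), ("uniform per OOM", log_uniform)]:
    print(f"{name}: P(within 30 years) = {p[years <= 30].sum():.2f}, "
          f"per-year entropy = {entropy(p):.2f} nats")
```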

Comment by TekhneMakre on Biology-Inspired AGI Timelines: The Trick That Never Works · 2021-12-02T03:05:40.295Z · LW · GW

(I'm taking the tack that "you might be wrong" isn't just already accounted for in your distributions, and you're now considering a generic update on "you might be wrong".)

> so you're more confident about your AGI beliefs, and OpenPhil is less confident. Therefore, to the extent that you might be wrong, the world is going to look more like OpenPhil's forecasts of how the future will probably look

Informally, this is simply wrong: the specificity in OpenPhil's forecasts is some other specificity added to some hypothetical max-entropy distribution, and it can be a totally different sort of specificity than yours (rather than simply a less confident version of yours).

Formally: It's true that if you have a distribution P, and then update on "I might be wrong about the stuff that generated this distribution" to the distribution P', then P' should be higher entropy than P; so P' will be more similar, in the sense of being higher entropy, to other distributions Q with higher entropy than P. That doesn't mean P' will be more similar than P in terms of what it says will happen, to some other higher entropy distribution Q. You could increase the entropy of P by spreading its mass over more outcomes that Q thinks are impossible; this would make P' further from Q than P is from Q, on natural measures of distance, e.g. KL-divergence (quasi-metric) or earth-mover or whatever. (For the other direction of KL divergence, you could have P reallocate mass away from areas Q thinks are likely; this would be natural if P and Q semi-agreed on a likely outcome, so that P' is more agnostic and has higher expected surprise according to Q. We can simultaneously have KL(P,Q) < KL(P',Q) and KL(Q,P) < KL(Q,P').)
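
(A concrete three-outcome example of that last claim, with numbers picked purely for illustration: becoming higher-entropy can move you further from Q in both KL directions at once.)

```python
import numpy as np

def H(p):      return -(p * np.log(p)).sum()
def KL(p, q):  return (p * np.log(p / q)).sum()

Q  = np.array([0.495, 0.495, 0.01])   # Q thinks the third outcome is nearly impossible
P  = np.array([0.6,   0.3,   0.1 ])
Pp = np.array([0.4,   0.3,   0.3 ])   # P': a higher-entropy "hedged" P, with more mass on that outcome

print(f"H(P)    = {H(P):.3f}  <  H(P')    = {H(Pp):.3f}")     # P' has higher entropy than P
print(f"KL(P,Q) = {KL(P, Q):.3f}  <  KL(P',Q) = {KL(Pp, Q):.3f}")
print(f"KL(Q,P) = {KL(Q, P):.3f}  <  KL(Q,P') = {KL(Q, Pp):.3f}")
```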

(Also I think for basically any random variable X we can have |E_P(X) - E_Q(X)| < |E_P'(X) - E_Q(X)| for all degrees of wrongness giving P' from P.)

> If you put higher probabilities on AGI arriving in the years before 2050, then, on average, you're concentrating more probability into each year that AGI might possibly arrive, than OpenPhil does.

This is true for years before 2050, but not necessarily for years after 2050, if your distribution e.g. has a thick tail and OpenPhil has a thin tail. It's true for all years if both of your distributions are just constant probabilities in each year, and maybe for some other similar kinds of families.

> Your probability distribution has lower entropy [than OpenPhil's].

Not true in general, by the above. (It's true that Eliezer's distribution for "AGI before OpenPhil's median, yea/nay?" has lower entropy than OpenPhil's, but that would be true for any two distributions with different medians!)

> So to the extent that you're wrong, it should shift your probability distributions in the direction of maximum entropy.

This seems right. (Which might be away from OpenPhil's distribution.) The update from P to P' looks like mixing in some less-specific prior. It's hard to say what it should be; it's supposed to be maximum-entropy given some background information, but IDK what the right way is to put a maximum entropy distribution on the space of years (for one thing, it's non-compact; for another, the indistinguishability of years that could give a uniform distribution or a Poisson distribution seems pretty dubious, and I'm not sure what to do if there's not a clear symmetry to fall back on). So I'm not even sure that the median should go up!
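
(A toy illustration of that last point, with made-up numbers: mixing P with a fallback prior M that happens to be front-loaded, here uniform per order of magnitude of years-until-AGI, pulls the median earlier, not later.)

```python
import numpy as np

years = np.arange(1, 201)                      # years-until-AGI, 1..200 (hypothetical horizon)

# P: a fairly confident distribution with median around year 30 (hypothetical shape)
P = np.exp(-0.5 * ((years - 30) / 10.0) ** 2); P /= P.sum()

# M: a weaker-assumptions fallback, uniform per order of magnitude, hence front-loaded
M = 1.0 / years; M /= M.sum()

def median(p):
    return years[np.searchsorted(np.cumsum(p), 0.5)]

for eps in [0.0, 0.3, 0.6]:
    mix = (1 - eps) * P + eps * M              # "I might be wrong" = mix in the fallback
    print(f"weight {eps:.1f} on fallback: median = {median(mix)} years")
```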

> Is Humbali right that generic uncertainty about maybe being wrong, without other extra premises, should increase the entropy of one's probability distribution over AGI,

Yes.

(I mean, if what you think you're maybe wrong about, is specifically some arguments that previously updated you to be less confident than some so-called "maximum"-entropy distribution, then you'd decrease your entropy when you put more weight on being wrong. This isn't generic wrongness, since it's failing to doubt the assumptions that went into the "maximum"-entropy distribution, which apparently you can coherently doubt, since previously some arguments left you to fall back on some other higher-entropy distribution based on weaker assumptions. But I guess it could look like you were supposed to fall back on the lower-entropy distribution, if that felt like the "background".)

> thereby moving out its median further away in time?

Not necessarily. It depends what the max-entropy distribution looks like, i.e. what assumptions you're falling back on if you're wrong.

Comment by TekhneMakre on Frame Control · 2021-11-30T03:16:40.506Z · LW · GW

+1

Comment by TekhneMakre on Frame Control · 2021-11-29T23:09:45.412Z · LW · GW

A relevant aspect of in-person interactions is that I think they involve a lot more "plasticity" of the people. In terms of how much B "is given write access" to C's soul, it tends (with variance) to be something like (abstracting over content):

C reads B's writing < C listens to B speaking < C listens to B and watches B acting < C is physically present with B < C is physically present with B and is speaking with B < C is physically present with B and is acting in concert with B

An example of what I mean by B having written to C's soul is that C can "hear B's voice" even when B isn't there; e.g. C reflexively imagines what B would say about what C is doing. Or more abstractly, a proposition B said might bounce around in C's head, being chewed on and propagated. B has somewhat literally made an impression on C. C might adopt mannerisms of B. C might do to D actions that imitate "deepening" (hence correlatedly subtly invasive or coercive or deceptive) actions done to C by B (because, oh, that's how connection works, apparently).

(Obviously in general there's huge mutual benefits to this soul-writing thing, which explains why people do it, which explains why it's vulnerable to exploitation.)

Comment by TekhneMakre on Frame Control · 2021-11-29T20:38:55.576Z · LW · GW

I think I agree with ~everything in your two comments, and yet reading them I want to push back on something, not exactly sure what, but something like: look, there's this thing (or many things with a family resemblance) that happens and it's bad, and somehow it's super hard to describe / see it as it's happening.... and in particular I suspect the easiest, the first way out of it, the way out that's most readily accessible to someone mired in an "oops my internal organs are hooked up to a vampiric force" situation, does not primarily / mainly involve much understanding or theorizing (at least given our collective current level of understanding about these things), and rather involves something with a little more of "wild" vibe, the vibe of running away, of suddenly screaming NO, of asserting meaningful propositions confidently from a perspective, etc. And I get some of this vibe from the OP; like part of the message is (what I'm interpreting to be) the stance someone takes when calling something "frame control" (or "gaslighting" or "emotional abuse" or "cult" or what-have-you).

Which, I still agree with the things you say, and the post does make lots of sort-of-specific, sort-of-vague claims, and gives good data with debatable interpretation, and so on. But there's also this sort of necessarily pre-theoretic theoretic action happening, and I guess I want to somehow have that [hypothesis mixed with judgement mixed with action] be possible as well, including in the common space. (Like, the action is theoretic in that you're reifying some pattern (e.g. "frame control"). It's almost necessarily pre-theoretic, in the sense that you don't even close to fully understand it and it's probably only very roughly joint-carving, because the pattern itself involves making you confused about what's happening and less able to clearly understand patterns. It's an action, a judgement that something is really seriously wrong and you need to change it, a mental motion that rejects something previously accepted, that catapults you out of a satisficing basin; and you're doing this action in a way that somewhat strongly depends or is helped by the non-joint-carving unrefined concept, like "this thing, IDK what it is really, but it's really bad and I have to get out of it, and after escaping I'll think about it more".)

I see your comments as partly rejecting, or at least incidentally pushing against, this sort of action: to "do it in a way that telegraphs the early-stage-ness" is, when speaking from a pre-theoretic standpoint, in tension with the vibe/action of sharply reclaiming one's own perspective even when that perspective is noticeably incoherent ("something was deeply wrong, I don't know what"). Like, it's definitely a better artifact if you put in the right epistemic tags that point towards uncertainty, places to refine and investigate, etc.; but that's harder to do and requires the author to be detailedly tracking a more complicated boundary around known and unknown, in a way that's, like, not the first mental motion that (AFAIK) has to happen to get the minimum viable concept to self-coordinate on a narrative that says the thing is bad. Internally coordinating on a narrative that X-whatever-it-is is bad, seems important if you're going to have to first push against X in big ways, before it's very feasible to get a better understanding of X. (There's bucket errors here, and it could be helpful to clarify that; but that's maybe sort of the point: someone who's been given a heavy dose of frame control is bucket-errored such that they doubt the goodness of holding their own perspective in part because it's been tied up with other catastrophic things such as disagreeing with their social environment without having a coherent alternative or a coherent / legible grounds for disagreeing.)

Comment by TekhneMakre on Frame Control · 2021-11-29T17:57:49.735Z · LW · GW
> I still haven't had the issues come back I did prior to that reclaiming moment, and I've had no further detection of unhonored or ignored pain

Interesting, thanks for the data!

I'll be curious to see your further writing.

> burn frame control with fire

Well, setting a fire might require you to get too close; nuking it from orbit is maybe prudenter.

Comment by TekhneMakre on Frame Control · 2021-11-29T10:18:26.889Z · LW · GW

>Indeed this feels kind of epistemically hopeless to ever evaluate from the outside? I don't really know what to do with this thought but it felt important to note.

Does seem good to note, and it would be nice to have more theory about this. We could upgrade our individual abilities to notice when we're being frame controlled / etc.; we could upgrade our collective abilities to aggregate information about whether / how someone is systematically or intentionally harmful; we could close social niches that call up abusive behavior.

I think one piece of the puzzle might be something like:

(1) B can't abuse C without C having the capacity to notice *at some point*. Maybe it's in a year when C isn't enmeshed in the situation; maybe it's only after C has read other accounts of abuse, or other accounts about B specifically;

(2) If B is intentionally* abusing people, B will tend to abuse multiple people, or one person across long time periods. (I don't know if this is true; it's easy enough to imagine B abusing only one person, but it seems unlikely to be intentional; why would B only use this strategy in one isolated situation?)

(3) If B is effectively, skillfully abusing people, B will tend to abuse multiple people, or one person across long time periods. (Because how else would B be good at frame control / etc.? This might not be true because there's skill transfer; e.g. B might do a lot of work that involves deeply understanding people, which is otherwise benign, but gives them the tools to deeply fuck with people in isolated circumstances.)

(4) If you're on the lookout for frame control and such, it's harder to have it happen to you. But being on the lookout is a lot of work.

To the extent that these are all true, ISTM it would be good to somehow be much more willing to publicly discuss stuff like this about specific people. Obviously there's huge issues with scapegoating, and basically people lying. But, it seems that there's value on the table, where giving public reports like "I felt harmed by my interaction with B", being agnostic about intentionality or even causality, and without scapegoating on that basis, would allow future targets to effectively invest their limited capacity to detect stuff (and then make further reports, which opens us up to a sort of streetlight-bias, which we could hopefully correct for).


Another tack is the idea of investigations---investigative journalism, or (cross-)examinations in a criminal trial. Simply having someone ask searching questions can reveal stuff that's hard to pin down by default. I wonder how much abuse could be revealed with a series of three 3-hour interviews, or something.



(By "intentionally doing X" I don't mean knowingly, conciously, deliberately, or endorsedly, but I do mean more than systematically. "B systematically does X" means simply that B tends to do X more than usual, more than some default, etc. If B intentionally does X, then B systematically does X. But say B is ugly, and people systematically treat ugly people in such a way that X is suitable behavior in that context; then I'd say B does X systematically, but only with very weak intentionality; the intentionality routes through other people, so it makes more sense to say that other people intentionally evoke X. By "B intentionally does X" I mean that X is an aim of B. The more super-ordinate the aim is, in a hierarchy of aims, the more intentional it is; the more B's acheiving effect X is robust to situations, the more intentional it is.)

Comment by TekhneMakre on Frame Control · 2021-11-29T01:14:30.168Z · LW · GW

[Mostly unrelated but sparked by skimming this comment]

It occurs to me that another question around frame control, is: how can I / we facilitate social niches that don't require frame control? In the leadership example: how can I be more willing and able to be led effectively by someone who is e.g. deeply and truly criticized in front of the group? For example, this might involve being more careful about not falling into misinformation cascades, and more intentional about hope.

Comment by TekhneMakre on Frame Control · 2021-11-29T01:08:02.159Z · LW · GW

I agree that's a key question, though it's plausible to me that the reframing is related to a lot of mental things, and so has lots of effects that I don't understand. E.g. if the reframing involves in some sense giving up on justice (<- just a speculation) then it could be locally behaviorally right (justice may be too costly in this case) while also accidentally involving more broadly giving up on justice including where justice would be good.

Comment by TekhneMakre on Yudkowsky and Christiano discuss "Takeoff Speeds" · 2021-11-28T12:53:16.092Z · LW · GW
> AlphaFold 2 coming out of Deepmind and shocking the heck out of everyone in the field of protein folding with performance far better than they expected even after the previous shock of AlphaFold, by combining many pieces that I suppose you could find precedents for scattered around the AI field, but with those many secret sauces all combined in one place by the meta-secret-sauce of "Deepmind alone actually knows how to combine that stuff and build things that complicated without a prior example"?

Hm. I wonder if there's a bet to be extracted from this. Like: Eliezer says that AlphaFold 2 beats [algorithms previous to AlphaFold 2, but with 10x compute], and Paul says the latter beats the former? Or replace AlphaFold 2 with anything that Eliezer thinks contains some amount of secret sauce over previous things (whether or not its performance is "on trend").

Comment by TekhneMakre on Frame Control · 2021-11-28T11:32:58.720Z · LW · GW

Here are some notes I took about the first several minutes of Gaslight (1944) (SPOILER alert. It's a very good movie, and somewhat relevant).

When he grabs the letter out of her hands he's like "Oh uh I was just worried about all the unhappy memories it's reminding you of". It's weird, it's a double move: on the one hand, most obviously it's a lie to cover up that he's worried about something else, but also it reveals that he's positioning himself as hyperconcerned about her. He doesn't excuse it by some selfish motive like "I became super curious about the letter" or "Your talking is annoying me" or whatever. Further, his supposed concern is about her "unhappy memories", positioning himself as an agent who takes it as a salient variable to track, what's going on with her memories and emotions; and implicitly, that he's an agent in the position to affect and manage her relationship with her memories and emotions.


And in the next breath, he explicitly tells her to forget all that unhappy stuff. He says "While you are afraid of anything, there cannot be any happiness for us"; "You must forget her". This sounds sort of innocent, especially in the context of concern, but it's ambiguous between a mere description/prediction of what will make them happy, vs. a threat of e.g. leaving or withholding happiness from her if she doesn't follow his orders. It's also just an obviously extreme (and implausible) statement when considered explicitly / from a third-person perspective, but in a way that could slip by as merely being high-intensity because the situation is high-intensity, rather than itself being a crazy statement.

Her only response is to say "Well, not [to forget] her, but what happened to her". Which, now that I think of it, makes the extremeness look intentional: by making an extreme statement and command, her corrective reasoning doesn't correct far enough. She'll remember her aunt, but has still tacitly agreed to forget what happened to her aunt; and this constitutes a step towards buying in to him being appropriate to give her orders about what to do with her mind.

He gives her his mother's pin as a gift, but then immediately takes it back. "You are inclined to lose things Paula." "I didn't realize that." "Oh just little things." It's a subtle dashing of hope; supposedly she's special enough for him to give her his heirloom, and he loves her and wants to symbolize her specialness, but then he "realizes" that she's unreliable; it's her fault, so *in opposition* to the motions of his naive love for her he has to maturely handle her. He's staying in the frame of concern that he'll keep the pin for safekeeping so she doesn't lose it. He probes with a strong claim, and then when she notes her surprise, he "clarifies", which is in effect a subtle retreat, to a weaker and more difficult to verify claim ("just little things"); it's just on the edge where it's reasonable to trust someone's report, since maybe they noticed something you didn't, and it's hard to falsify. He's using her feedback about when she notices that his claims are implausible, to calibrate where and how far he can go in each case.

"There, now you'll remember where it is." "Oh don't be silly, of course I'll remember." "Oh I was teasing you my dear." More probing and walking-back. He reasserts his proposition about her forgetfulness, by implication ("now you'll remember [and you wouldn't otherwise]"); and then he pretend to not have actually been doing so; and he also implicitly asserts that she can't tell when he's teasing her.

"You've been forgetting things. Don't worry, you've been tired." "Yes that's it, I've been tired." He gives her an explanation that she can be sure will be accepted by the shared narrative; thus defusing the tension of the falsification of her experience without her having to directly contradict his false report; and thus walking her further into committing to agreeing that she's unreliable.

He took the brooch, causing her to believe she'd harmed him, then when she accepts that she's been forgetful and is spooked and sad / guilty, he comforts her, pretending it's just that she's tired. It's a smokescreen for him: he appears to be comforting, when he knows he can keep the ruse up, and that she'll now believe the ruse and won't suspect him. And really what's happening is that he's dispelling her localized, specific confusion---her confidence that something weird just happened because it contradicts her memory, her concentration of epistemic force in time, her attention on the details of a context in which a trick is in fact being played on her and she's noticed and is trying to form a coherent hypothesis---he dispels that by giving her an explanation which tacitly still assumes that in fact what happened was that she was forgetful.

"Don't worry." More telling her what to feel. "It's not valuable." Knowing that she wasn't worried that he was worried about the monetary value; then when she apologizes more because it's his family hierloom, he doesn't contradict her. That way the positive transcript---what he actually said out loud---reads like he's being forgiving and saying that she doesn't owe him, the negative transcript shows that he's letting her insist that she's at fault, harmed him. Reminds me of a dinner guest insisting on helping clear the table and the host super-insisting it's fine; the guest is sometimes in a sense insincere. He sets it up so that it's she, not he, who brings the true harm of losing an heirloom into the conversation; subtly this reinforces that she's worse than he's letting on, and it's only his generosity that's keeping them together, despite what she would know if only she checks to see that she's bad.

"It hurts me when you're ill and fanciful." After confronting her, challenging her to assert---while he's staring her down, after being cross with her and denying he's cross with her---that she is actually perceiving that the maid despises her. "I hope you're not starting to imagine things again." "You're not, are you Paula?" Either she's starting to imagine things again, or the maid despises her; the former is a catastrophized version of the obvious other hypothesis, that she's imagining / mistaken about just this particular thing. By catastrophizing, he removes reasonable options---either deny her husband's narrative based on an uncertain perception, or else admit to being teetering on the edge of losing her mental continence.

"But my dear, I thought you were only being polite, why didn't you tell me you really wanted to see her?" Right after he just yelled at her to get her to say no. He silences her, and pretends that he's left communication channels perfectly well open if only she would just use them in the obvious way.

He lets her infer that she forgot he's taking her out, lets her stew in that for like 20 seconds, talks about something else, then tells her it was indeed a surprise.

Et cetera.

(Damn, that dude really likes jewels.)

Comment by TekhneMakre on Frame Control · 2021-11-28T10:37:14.182Z · LW · GW

(1) Thanks for writing this, it seems very important.

(2) This:

> Ultimately, checking in with how you actually feel is the answer. I don’t mean to imply this is easy; it’s often really hard to know how you feel, and maybe it changes often and frame controllers put in a lot of effort to obfuscate this. But in the end, careful attention to your own sensations are your saving grace.

I think there's something basically irreplaceable about checking in with how you actually feel; e.g. it's thankfully harder for frame control to hack, ISTM, though checking in is also hard to do (e.g. because you "actually feel" like the leader is truly important and that you'd follow them anywhere even if it "superficially not actually" hurts a lot).

I want though to raise a flag for it not being a sufficient answer, and for more theorizing about how frame control and related things work, and how to navigate around them, and such. Like, it feels like there's a specific missing technology here.

Comment by TekhneMakre on Frame Control · 2021-11-28T10:11:29.380Z · LW · GW


[I feel like the following question might be triggering, not sure. It references your childhood. The triggering I expect is maybe something like, the question conflates / juxtaposes two things that are similar, but importantly very different, such that if the distinction weren't kept solidly in mind, there'd be strong psychic forces pointing in opposite directions? Idk.]
Anyway: I notice that you say:
> But a key aspect of frame control is reframing harm as good
And also, from https://knowingless.com/2018/09/21/trauma-narrative/ :
> And then I realized that that’s what my father had done to me – he’d given me the ability to experience life with such ongoing lightness, and what he’d done had been worth it. All I’d been through had been worth it. If I could go back in time, I wouldn’t change my life at all. This pain was mine, now, chosen by me, held by me deliberately, and nothing about it was wrong.

Prima facie these seem in tension. What are the differences between harm reframed as good, vs. harm... "reclaimed as good"? A reaction I had reading your "reclaiming" was like, there's something off, it mostly seems desirable, but also there's some loss of integrity / preciseness or something, or like, the pain wasn't fully interpreted / honored, or the pain probably had some further telos, or the pain was subtly ignored, and that ignoring has something deep in common with someone being frame controlled into trying to ignore harm or think it's good.

ETA: maybe the thing supposedly being elided, is the concept of justice.

Comment by TekhneMakre on Christiano, Cotra, and Yudkowsky on AI progress · 2021-11-27T10:22:48.930Z · LW · GW

And then Paul's response to Eliezer is like "but engines don't just appear without precedent, there's worse partial versions of them beforehand, much more so if people are actually trying to do locomotion; so even if knocking out a piece of the AI that FOOMs would make it FOOM much slower, that doesn't tell us much about the lead-up to FOOM, and doesn't tell us that the design considerations that go into the FOOMer are particularly discontinuous with previously explored design considerations"?

Comment by TekhneMakre on Christiano, Cotra, and Yudkowsky on AI progress · 2021-11-27T09:45:11.197Z · LW · GW


Why do you use this form? Do you lean more on:
1. Historical trends that look hyperbolic;
2. Specific dynamical models like: let α be the synergy between "different innovations" as they're producing more innovations; this gives f'(x) = f(x)^(1+α) *; or another such model?;
3. Something else?

I wonder if there's a Paul-Eliezer crux here about plausible functional forms. For example, if Eliezer thinks that there's very likely also a tech tree of innovations that change the synergy factor α, we get something like e.g. (a lower bound of) f'(x) = f(x)^f(x). IDK if there's any help from specific forms; just that, it's plausible that there's forms that are (1) pretty simple, pretty straightforward lower bounds from simple (not necessarily high confidence) considerations of the dynamics of intelligence, and (2) look pretty similar to hyperbolic growth, until they don't, and the transition happens quickly. Though maybe, if Eliezer thinks any of this and also thinks that these superhyperbolic synergy dynamics are already going on, and we instead use a stochastic differential equation, there should be something more to say about variance or something pre-End-times.

*ETA: for example, if every innovation combines with every other existing innovation to give one unit of progress per time, we get the hyperbolic f'(x) = f(x)^2; if innovations each give one progress per time but don't combine, we get the exponential f'(x) = f(x).
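
(A quick numerical sketch of how differently these forms behave: crude Euler integration, arbitrary units, f0 = 1.1 chosen arbitrarily. The exponential is still finite at the end of the run, while the hyperbolic form and the "synergy factor also improves" form both hit a finite-time singularity, the latter sooner.)

```python
import math

def integrate(rate, f0=1.1, dt=1e-4, t_max=10.0, cap=1e6):
    """Euler-integrate f'(t) = rate(f) until f exceeds cap, blows up, or t_max is reached."""
    t, f = 0.0, f0
    while t < t_max and f < cap:
        try:
            f += rate(f) * dt
        except OverflowError:          # rate(f) exceeded float range: call it blown up
            return t, math.inf
        t += dt
    return t, f

forms = [("exponential      f' = f  ", lambda f: f),
         ("hyperbolic       f' = f^2", lambda f: f ** 2),
         ("super-hyperbolic f' = f^f", lambda f: f ** f)]

for name, rate in forms:
    t, f = integrate(rate)
    print(f"{name}: f = {f:.3g} at t = {t:.2f}")
```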

Comment by TekhneMakre on Comment on "Deception as Cooperation" · 2021-11-27T09:25:15.084Z · LW · GW
> In that light, it could seem unnecessarily antagonistic to pick a particular codeword from a shared communication code and disparagingly call it "deceptive"—tantamount to the impudent claim that there's some objective sense in which a word can be "wrong."

This seems too strong; we can still reasonably talk about deception in terms of a background of signals in the same code. The actual situation is more like, there's lots of agents. Most of them use this coding in a correspondence-y way (or if "correspondence" assumes too much, just, most of the agents use the coding in a particular way such that a listener who makes a certain stereotyped use of those signals (e.g. what is called "takes them to represent reality") will be systematically helped). Some agents instead use the channel to manipulate actions, which jumps out against this background as causing the stereotyped use to not achieve its usual performance (which is different from, the highly noticeable direct consequence of the signal (e.g., not wearing a mask) was good or bad, or the overall effect was net good or bad). Since the deceptive agents are not easily distinguishable from the non-deceptive agents, the deception somewhat works, rather than you just ignoring them or biting a bullet like "okay sure, they'll deceive me sometimes, but the net value of believing them is still higher than not, no problem!". That's why there's tension; you're so close to having a propositional protocol---it works with most agents, and if you could just do the last step of filtering out the deceivers, it'd have only misinformation, no disinformation---but you can't trivially do that filtering, so the deceivers are parasitic on the non-deceivers' network. And you're forced to either be misled constantly; or else downgrade your confidence in the whole network, throwing away lots of the value of the messages from non-deceivers; or, do the more expensive work of filtering adversaries.

Comment by TekhneMakre on Do factored sets elucidate anything about how to update everyday beliefs? · 2021-11-22T07:09:03.185Z · LW · GW

Comment by TekhneMakre on Why I am no longer driven · 2021-11-17T03:45:24.389Z · LW · GW

I felt sad, like a sense of mourning, from this, and some sense of a "pregnant dusk" in a hopeful way.

Comment by TekhneMakre on A positive case for how we might succeed at prosaic AI alignment · 2021-11-17T03:12:53.112Z · LW · GW

Okay, I think I'm getting a little more where you're coming from? Not sure. Maybe I'll read the LCDT thing soon (though I'm pretty skeptical of those claims).

(Not sure if it's useful to say this, but as a meta note, from my perspective the words in the post aren't pinned down enough to make it at all clear that the hard part is (2) rather than (1); you say "natural" in (1), and I don't know what you mean by that such that (1) isn't hard.)

Maybe I'm not emphasizing how unnatural I think (A) is. Like, it's barely even logically consistent. I know that (A) is logically consistent, for some funny construal of "only trying", because Evan is a perfect imitation of Evan; and more generally a good WBE could maybe be appropriately construed as not trying to do anything other than imitate Evan; and ideally an FAI could be given an instruction so that it doesn't, say, have any appreciable impacts other than the impacts of an Evan-imitation. For anything that's remotely natural and not "shaped" like Evan is "shaped", I'm not sure it even makes sense to be only trying to imitate Evan; to imitate Evan you have to do a whole lot of stuff, including strategically arranging cognition, reasoning about far-reaching consequences in general, etc., which already constitutes trying to do something other than imitating Evan. When you're doing consequentialist reasoning, that already puts you very close in algorithm-space to malign strategic thinking, so "consequentialist but not deceptive (hence not malignly consequentialist)" is very unnatural; IMO like half of the whole alignment problem is "get consequentialist reasoning that isn't consequentialist-ing towards some random thing".

Comment by TekhneMakre on A positive case for how we might succeed at prosaic AI alignment · 2021-11-17T02:07:03.681Z · LW · GW


Your steps (2)-(4) seem to rely fairly heavily on the naturality of the class described in (1), e.g. because (2) has to recognize (1)s which requires that we can point to (1)s. If by "with the [[sole?]] goal of imitating Evan" you mean that

A. the model is actually really *only* trying to imitate Evan,

B. the model is competent to not accidentally also try to do something else (e.g. because the ways it pursues its goal are themselves malign under distributional shift), and

C. the training process you use will not tip the internal dynamics of the model over into a strategically malign state (there was never any incentive to prevent that from happening any more robustly than just barely enough to get good answers on the training set, and I think we agree that there's a whole pile of [ability to understand and pursue far-reaching consequences] sitting in the model, making strategically malign states pretty close in model-space for natural metrics),

then yes this would plausibly not be deceptive, but it seems like a very unnatural class. I tried to argue that it's unnatural in the long paragraph with the different kinds of myopia, where "by (strong) default" = "it would be unnatural to be otherwise".

Comment by TekhneMakre on A positive case for how we might succeed at prosaic AI alignment · 2021-11-16T23:57:56.066Z · LW · GW
> just replace “imitate HCH” with “imitate Evan” or something like that

So these are both training-myopic, meaning they both are being trained only to do the task right in front of them, and aren't (directly) rewarded for behavior that sacrifices reward now for reward in future episodes. Neither seem objective-myopic, meaning both of their objective functions are computed (seemingly necessarily) using far-reaching-consequences-understanding. Neither seem behavior-myopic, meaning both of them would successfully target far-reaching-consequences (by assumption of being competitive?). I think if you're either objective-non-myopic or behavior-non-myopic, then by default you're thought-non-myopic (meaning you in fact use far-reaching-consequences-understanding in your reasoning). I think if you're thought-non-myopic, then by default you're values-non-myopic, meaning you're pursuing specific far-reaching-consequences. I think if you're values-non-myopic, then you're almost certainly deceptive, by strong default.

> We're just talking about step (1), so we're not talking about training at all right now. We're just trying to figure out what a natural class of agents would be that isn't deceptive.
> For step (1) we're not trying to figure out what would happen by default if you trained a model on something, we're just trying to understand what it might look like for an agent to be myopic in a natural way.

In step (1) you wrote:

> I think it is possible to produce a simple, natural description of myopia such that myopic agents are still capable of doing all the powerful things we might want out of an AGI but such that they never have any reason to be deceptive

I think if something happens by default, that's a kind of naturalness. Maybe I just want to strengthen the claims above to say "by strong default". In other words, I'm saying it's a priori very unnatural to have something that's behavior-non-myopic but thought-myopic, or thought-non-myopic but not deceptive, and overcoming that unnaturality is a huge hurdle. I would definitely be interested in your positive reasons for thinking this is possible.

Comment by TekhneMakre on A positive case for how we might succeed at prosaic AI alignment · 2021-11-16T22:36:06.767Z · LW · GW
> I have no idea where you're getting this idea of an assemblage from; nowhere did I say anything about that.

Huh. There's definitely some miscommunication happening...

From the post:

> For example, a myopic agent could myopically simulate a strongly-believed-to-be-safe non-myopic process such as HCH, allowing imitative amplification to be done without ever breaking a myopia guarantee
> In general, I think it’s just not very hard to leverage careful recursion to turn non-myopic objectives into myopic objectives such that it’s possible for a myopic agent to do well on them

You give HCH + iterative amplification as an example, which I responded to. You say that in general, recursion can allow myopic agents to do well on non-myopic objectives; this sure sounds like making a kind of assemblage in order to get non-myopicness. You link: https://www.lesswrong.com/posts/YWwzccGbcHMJMpT45/ai-safety-via-market-making , which I hadn't seen before, but at a glance, it (1) predicts and manipulates humans, which are non-myopic reasoners, (2) involves iteration, and (3) as an additional component, uses Amp(M) (an assemblage of myopic reasoners, no?).

> you could substitute in any other myopic objective that might be aligned and competitive instead.

Oops, there's more confusion here. HCH is a myopic objective? I could emit the sentence, "the AI is only trained to predict the answer given by HCH to the question that's right in front of it", but I don't think I understand a perspective in which that's really myopic, in the sense of not doing consequentialist reasoning about far-reaching plans, given that it's predicting (1) humans (2) in a big assemblage that (3) by hypothesis successfully answer questions about far-reaching plans (and (4) using Amp, which is a big spot where generalization (e.g. consequentialist generalization) comes in). Could you point me towards a more detailed writeup / discussion about what's meant by HCH being a relevantly myopic objective that responds to the objection about, well, its output does nevertheless get right answers to questions about far-reaching consequences?

> myopic objective that might be aligned and competitive instead

I'm interested in whether objectives can be aligned and competitive and myopic. That still seems like the cat-belling step.


> If that's how you want to define myopia/non-myopia then sure, you're welcome to call an HCH imitator non-myopic. But that's not the version of myopia that I'm working with/care about.

From point 1. of the OP:

> I think it is possible to produce a simple, natural description of myopia such that myopic agents are still capable of doing all the powerful things we might want out of an AGI but such that they never have any reason to be deceptive

My best current guess is that you're saying something like, if the agent is myopic, that means it's only trained to try to solve the problem right in front of it; so it's not trained to hide its reasoning in order to game the system across multiple episodes? What's the argument that this implies non-deceptiveness? (Link would be fine.) I was trying to say, if it's predicting a far-consequences-understander, it has to do far-consequences-understanding, therefore it's able to do far-consequences-understanding, therefore it's (1) liable to, by default, in effect have values it pursues over far-consequences, and (2) is able to effectively pursue those values without further ado. The case for (2) is more clear, since arguendo it is able to do far-consequences-understanding. Maybe the case for (1) needs to be made.

Comment by TekhneMakre on Why do you believe AI alignment is possible? · 2021-11-16T18:52:45.522Z · LW · GW

Mostly, all good. (I'm mainly making this comment about process because it's a thing that crops up a lot and seems sort of important to interactions in general, not because it particularly matters in this case.) Just, "I meant you're intentionally moving the conversation away from trying to nail down specifics"; so, it's true that (1) I was intentionally doing X, and (2) X entails not particularly going toward nailing down specifics, and (3) relative to trying to nail down specifics, (2) entails systematically less nailing down of specifics. But it's not the case that I intended to avoid nailing down specifics; I just was doing something else. I'm not just saying that I wasn't *deliberately* avoiding specifics, I'm saying I was behaving differently from someone who has a goal or subgoal of avoiding specifics. Someone with such a goal might say some things that have the sole effect of moving the conversation away from specifics. For example, they might provide fake specifics to distract you from the fact they're not nailing down specifics; they might mock you or otherwise punish you for asking for specifics; they might ask you / tell you not to ask questions because they call for specifics; they might criticize questions for calling for specifics; etc. In general there's a potentially adversarial dynamic here, where someone intends Y but pretends not to intend Y, and does this by acting as though they intend X which entails pushing against Y; and this muddies the waters for people just intending X, not Y, because third parties can't distinguish them. Anyway, I just don't like the general cultural milieu of treating it as an ironclad inference that if someone's actions systematically result in Y, they're intending Y. It's really not a valid inference in theory or practice. The situation is sometimes muddied, such that it's appropriate to treat such people *as though* they're intending Y, but distinguishing this from a high-confidence proposition that they are in fact intending Y (even non-deliberately!) is important IMO.

Comment by TekhneMakre on Why do you believe AI alignment is possible? · 2021-11-16T12:09:54.348Z · LW · GW

> a) worth doing?

Extremely so; you only ever get good non-specifics as the result of having iteratively built up good specifics.

> b) possible to do?

In general, yes. In this case? Fairly likely not; it's bad poetry, the senses that generated it are high variance, likely nonsense, some chance of some sense. And alignment is hard and understanding minds is hard.

> c) something you wish to do in this conversation?

Not so much, I guess. I mean, I think some of the metaphors I gave, e.g. the one about the 10 year old, are quite specific in themselves, in the sense that there's some real thing that happens when a human grows up which someone could go and think about in a well-defined way, since it's a real thing in the world; I don't know how to make more specific what, if anything, is supposed to be abstracted from that as an idea for understanding minds, and more-specific-ing seems hard enough that I'd rather rest it.

Thanks for noting explicitly. (Though, your thing about "deflecting" seems, IDK what, like you're mad that I'm not doing something, or something, and I'd rather you figure out on your own what it is you're expecting from people explicitly and explicitly update your expectations, so that you don't accidentally incorrectly take me (or whoever you're talking to) to have implicitly agreed to do something (maybe I'm wrong that's what happened). It's connotatively false to say I'm "intentionally deflecting" just because I'm not doing the thing you wanted / expected. Specific-ing isn't the only good conversational move and some good conversational moves go in the opposite direction.)

Comment by TekhneMakre on Ngo and Yudkowsky on alignment difficulty · 2021-11-16T10:52:38.342Z · LW · GW

> Do you think you can encode good flint-knapping technique genetically? I doubt that.

I think I agree with your point, and think it's a more general and correct statement of the bottleneck; but, still, I think that the genome does mainly affect the mind indirectly, and this is one of the constraints making it be the case that humans have lots of learning / generalizing capability. (This doesn't just apply to humans. What are some stark examples of animals with hardwired complex behaviors? With a fairly high bar for "complex", and a clear explanation of what is hardwired and how we know. Insects have some fairly complex behaviors, e.g. web building, ant-hill building, the tree-leaf nests of weaver ants, etc.; but IDK enough to rule out a combination of a little hardwiring, some emergence, and some learning. Lots of animals hunt after learning from their parents how to hunt. I think a lot of animals can walk right after being born? I think beavers in captivity will fruitlessly chew on wood, indicating that the wild phenotype is encoded by something simple like "enjoys chewing" (plus, learned desire for shelter), rather than "use wood for dam".)

An operationalization of "the genome directly programs the mind" would be that things like [the motions employed in flint-knapping] can be hardwired by small numbers of mutations (and hence can be evolved given a few million relevant years). I think this isn't true, but counterevidence would be interesting. Since the genome can't feasibly directly encode behaviors, or at least can't learn those quickly enough to keep up with a changing niche, the species instead evolves to learn behaviors on the fly via algorithms that generalize. If there were *either* mind-mind transfer, *or* direct programming of behavior by the genome, then higher frequency changes would be easier and there'd be less need for fluid intelligence. (In fact it's sort of plausible to me (given my ignorance) that humans are imitation specialists and are less clever than Neanderthals were, since mind-mind transfer can replace intelligence.)

Comment by TekhneMakre on [deleted post] 2021-11-16T09:55:30.981Z

If some of our measure is in a simulation that's being run to determine whether our measure in real worlds will acausally bargain to get gains from trade, it's maybe a defection against the bargaining process to force the universe to provide a lot of compute for us (e.g. by running an intergalactic civilization that's cryptographically verified to actually be running), before we've done the bargaining, or at the very least legibly truly precommitted to a bargaining process. Otherwise we force simulators to either waste a lot of resources simulating us, or else give up on bargaining with us altogether. (We'd not even necessarily get value from those resources, if e.g. there's an initial period of scrambling to expand into the lightcone before the party starts.)

Comment by TekhneMakre on A positive case for how we might succeed at prosaic AI alignment · 2021-11-16T09:07:36.534Z · LW · GW

(Seems someone -7'd this; would be interested in why.)

Comment by TekhneMakre on A positive case for how we might succeed at prosaic AI alignment · 2021-11-16T08:34:22.219Z · LW · GW
> Certainly it doesn't matter what substrate the computation is running on.

I read Yudkowsky as positing some kind of conservation law. Something like, if the plans produced by your AI succeed at having specifically chosen far-reaching consequences if implemented, then the AI must have done reasoning about far-reaching consequences. Then (I'm guessing) Yudkowsky is applying that conservation law to [a big assemblage of myopic reasoners which outputs far-reaching plans], and concluding that either the reasoners weren't myopic, or else the assemblage implements a non-myopic reasoner with the myopic reasoners as a (mere) substrate.

Reasoning correctly about far-reaching consequences by default (1) has mistargeted consequences, and (2) is done by summoning a dangerous reasoner.

> Such optimizers can still end up producing actions with far-reaching consequences on the world if they deploy their optimization power in the service of an objective like imitating HCH that requires producing actions with particular consequences, however.

I think what you're saying here implies that you think it is feasible to assemble myopic reasoners into a non-myopic reasoner, without compromising safety. My possibly straw understanding, is that the way this is supposed to happen in HCH is that, basically, the humans providing the feedback train the imitator(s) to implement a collective message-passing algorithm that answers any reasonable question or whatever. This sounds like a non-answer, i.e. it's just saying "...and then the humans somehow assemble myopic reasoners into a non-myopic reasoner". Where's the non-myopicness? If there's non-myopicness happening in each step of the human consulting HCH, then the imitator is imitating a non-myopic reasoner and so is non-myopic (and this is compounded by distillation steps). If there isn't non-myopicness happening in each step, how does it come in to the assembly?

Comment by TekhneMakre on Why do you believe AI alignment is possible? · 2021-11-16T07:59:02.545Z · LW · GW
> One could say that organs are in fact subagents, they have different goals.

I wouldn't want to say that too much. I'd rather say that an organ serves a purpose. It's part of a design, part of something that's been optimized, but it isn't mainly optimizing, or as you say, it's not intelligent. More "pieces which can be assembled into an optimizer", less "a bunch of little optimizers", and maybe it would be good if the human were doing the main portion of the assembling, whatever that could mean.

> humans with sufficiently many metaphorical hands and eyes in the year 2200 could look superintelligent to humans in 2021, same as how our current reasoning capacity in math, cogsci, philosophy etc could look superhuman to cavemen

Hm. This feels like a bit of a different dimension from the developmental analogy? Well, IDK how the metaphor of hands and eyes is meant. Having more "hands and eyes", in the sense of the bad poetry of "something you can wield or perceive via", feels less radical than, say, what happens when a 10-year-old meets someone they can have arguments with and learns to argue-think.

> Just wondering, could an AI have an inner model of the world independent from human's inner model of the world, and yet exist in this hybrid state you mention? Or must they necessarily share a common model or significantly collaborate and ensure their models align at all times?

IDK, it's a good question. I mean, we know the AI has to be doing a bunch of stuff that we can't do, or else there's no point in having an AI. But it might not have to quite look like "having its own model", but more like "having the rest of the model that the human's model is trying to be". IDK. Also could replace "model" with "value" or "agency" (which goes to show how vague this reasoning is).

Comment by TekhneMakre on Ngo and Yudkowsky on alignment difficulty · 2021-11-16T03:23:18.680Z · LW · GW
most of the genome is junk. That's pretty strong evidence that the size of the genome is not itself a taut constraint.

My guess is that this is a total misunderstanding of what's meant by "genomic bottleneck". The bottleneck isn't the amount of information storage, it's the fact that the genome can only program the mind in a very indirect, developmental way, so that it can install stuff like "be more interested in people" but not "here's how to add numbers".

Comment by TekhneMakre on Ngo and Yudkowsky on alignment difficulty · 2021-11-16T03:08:45.096Z · LW · GW


> I expect the first alignment solution you can actually deploy in real life, in the unlikely event we get a solution at all, looks like 98% "don't think about all these topics that we do not absolutely need and are adjacent to the capability to easily invent very dangerous outputs" and 2% "actually think about this dangerous topic but please don't come up with a strategy inside it that kills us".

Some ways that it's hard to make a mind not think about certain things:
1. Entanglement.
1.1. Things are entangled with other things.
--Things are causally entangled. X causes Y, Y causes X, Z causes X and Y, X and Y cause Z and you've conditioned on Z; and chains of these.
--Things are constitutively entangled. A computer is a computer and is also usually wires, so thinking about computers entails thinking about wires, and vice versa.
--Things are telically entangled; X serves the purpose Y or vice versa, X and Y serve the same purpose, X and Y are served by the same Z.
--Things are logically entangled; the way X works may be like the way Y works, so understanding X also constitutes understanding Y to some extent.
1.2. When you think about X, and X is entangled with Y, you also think about Y.
2. Meta-level transfer. If you think about X, part of how you do that is by inventing and refining ways to think in general; even if you aren't thinking about Y, this makes it much easier to later think about Y. (This is maybe just an extreme form of logical entanglement, but feels less "about the things themselves".)
3. The need for pointers. As they say: Don't think about pink elephants. To specify what it is that the AI is supposed to not think about, you have to say *something*; how do you point at the forbidden thoughts, in a way that's robust across all contexts and all possible conceptual factorings of the domain, without already spinning up instances of the forbidden thoughts?

-------

Some ML experiments someone could run (rough PyTorch sketches of all three follow, after experiment 3):
1. Train a NN A to recognize images, but at the same time, train its weights so that its activations can't be used to distinguish Xs from Ys (where X and Y are categories in the recognition task, say). That is, you train a NN B that, given image x, takes A(x)'s activations as inputs, and tries to predict whether x is X or Y; then you update the weights of A along the gradient that decreases B's accuracy at that task (so, you freeze the weights of B for this step). When does this converge at all? Can you get it to converge so that A is SOTA on all inputs that aren't Xs or Ys, and it assigns Xs and Ys to X or Y randomly?

2. Similar setup, but now B is just an ordinary X vs. Y classifier, and we train A so that you can't predict any of B's activations*. Does A end up being able to distinguish Xs from Ys? (Probably, and this speaks to the pointer problem; just saying "don't think about stuff like such-and-so (e.g. the activations of B)" isn't yet enough to actually not think about it.)

*Say, with a linear map, or whatever. Well, maybe we want to exclude the last layer of B or something, since that's close to just training A to not be able to recognize X vs. Y.

3. Train A to recognize all the images, except train it (in some way) to not be able to distinguish Xs from Ys. Now, see how much additional fine-tuning is needed to further train this trained A to predict Xs and Ys (now without the anti-training). Entanglement predicts that there's not much further training needed.
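To make these concrete, here are rough PyTorch sketches of all three (one continuous sketch; the later pieces reuse the earlier definitions). Everything the descriptions above don't pin down is an arbitrary choice of mine: the tiny convnet, B looking at A's penultimate activations specifically, the adversarial weight `lam`, and CIFAR-10-ish 3-channel images with classes 3 and 5 standing in for X and Y.

```python
# Experiment 1: train A on the full recognition task while also training it so
# that its activations can't be used (by an adversary B) to tell Xs from Ys.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureNet(nn.Module):
    """A: an ordinary classifier that also exposes its penultimate activations."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(64, num_classes)

    def forward(self, x):
        feats = self.body(x)              # the "activations" B gets to look at
        return self.head(feats), feats

A = FeatureNet()
B = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 2))  # X-vs-Y adversary

opt_A = torch.optim.Adam(A.parameters(), lr=1e-3)
opt_B = torch.optim.Adam(B.parameters(), lr=1e-3)
X_CLASS, Y_CLASS, lam = 3, 5, 1.0         # arbitrary stand-ins, not pinned down above

def training_step_exp1(images, labels):
    xy_mask = (labels == X_CLASS) | (labels == Y_CLASS)

    # Step 1: train B to recover X-vs-Y from A's activations (A frozen).
    if xy_mask.any():
        with torch.no_grad():
            _, feats = A(images[xy_mask])
        xy_labels = (labels[xy_mask] == Y_CLASS).long()
        loss_B = F.cross_entropy(B(feats), xy_labels)
        opt_B.zero_grad(); loss_B.backward(); opt_B.step()

    # Step 2: train A on the full task while pushing B's accuracy down (B is
    # effectively frozen: its parameters pick up gradients here, but only opt_A
    # steps, and opt_B.zero_grad() clears them next round).
    logits, feats = A(images)
    loss_A = F.cross_entropy(logits, labels)
    if xy_mask.any():
        xy_labels = (labels[xy_mask] == Y_CLASS).long()
        loss_A = loss_A - lam * F.cross_entropy(B(feats[xy_mask]), xy_labels)
    opt_A.zero_grad(); loss_A.backward(); opt_A.step()
```

Experiment 2 is a small variation. Here B is assumed to be a pre-trained, frozen X vs. Y classifier (pretraining not shown); `B_body` is B minus its last layer, per the footnote; and for simplicity the probe is a linear map onto just one layer of B's activations rather than all of them.

```python
# Experiment 2: train A on the full task while making its activations useless
# for predicting B's activations.
probe = nn.Linear(64, 64)                 # A-features -> B-features (assumed 64-dim)
opt_probe = torch.optim.Adam(probe.parameters(), lr=1e-3)

def training_step_exp2(images, labels, B_body):
    # Step 1: fit the probe to predict B's activations from A's (A and B frozen).
    with torch.no_grad():
        _, feats_A = A(images)
        feats_B = B_body(images)
    loss_probe = F.mse_loss(probe(feats_A), feats_B)
    opt_probe.zero_grad(); loss_probe.backward(); opt_probe.step()

    # Step 2: train A on the recognition task while making the (frozen) probe fail.
    logits, feats_A = A(images)
    loss_A = F.cross_entropy(logits, labels) - lam * F.mse_loss(probe(feats_A), feats_B)
    opt_A.zero_grad(); loss_A.backward(); opt_A.step()

# The actual test afterwards: freeze A and fit a fresh X-vs-Y classifier on
# feats_A; if it does well, "can't predict B's activations" wasn't enough to
# keep A from representing X-vs-Y.
```

And the measurement in experiment 3, applied to the anti-trained A from experiment 1 (`xy_loader`, the 0.9 threshold, and the learning rate are stand-ins):

```python
# Experiment 3: how much fine-tuning does the anti-trained A need before it can
# distinguish Xs from Ys after all?
import copy

def finetune_steps(anti_trained_A, xy_loader, threshold=0.9, max_steps=5000):
    model = copy.deepcopy(anti_trained_A)    # keep the anti-trained weights around
    head = nn.Linear(64, 2)                  # fresh X-vs-Y head on A's features
    opt = torch.optim.Adam(list(model.parameters()) + list(head.parameters()), lr=1e-4)
    step = 0
    while step < max_steps:
        for images, xy_labels in xy_loader:  # only Xs and Ys, labels in {0, 1}
            logits = head(model(images)[1])  # [1] = FeatureNet's feature output
            loss = F.cross_entropy(logits, xy_labels)
            opt.zero_grad(); loss.backward(); opt.step()
            step += 1
            # crude stopping rule: training-batch accuracy (held-out would be better)
            if (logits.argmax(dim=1) == xy_labels).float().mean() >= threshold:
                return step                  # entanglement predicts this comes back small
            if step >= max_steps:
                break
    return max_steps
```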

Comment by TekhneMakre on Why do you believe AI alignment is possible? · 2021-11-15T20:01:38.842Z · LW · GW
even hands and eyes can be existentially dangerous

That seems right, though it's definitely harder to be an x-risk without superintelligence; e.g. even a big nuclear war isn't a guaranteed extinction, nor an extremely infectious and lethal virus (because, like, an island population with a backup of libgen could recapture a significant portion of value).

necessarily be a man-machine hybrid

I hope not, since that seems like an additional requirement that would need independent work. I wouldn't know concretely how to use the hybridizing capability; that seems like a difficult puzzle related to alignment. I think the bad poetry was partly trying to say something like: in alignment theory, you're *trying* to figure out how to safely have the AI be more autonomous---how to design the AI so that when it's making consequential decisions without supervision, it does the right thing or at least not a permanently hidden or catastrophic thing. But this doesn't mean you *have to* "supervise" (or some other high-attention relationship that less connotes separate agents, like "wield" or "harmonize with" or something) the AI less and less; more supervision is good.

subagents (man and machine) able to talk to each other in english?

IDK. Language seems like a very good medium. I wouldn't say subagent though, see below.

Would this be referring to superhuman capabilities that are narrow in nature?

This is a reasonable interpretation. It's not my interpretation; I think the bad poetry is talking about the difference between one organic whole vs two organic wholes. It's trying to say that having the AI be genuinely generally intelligent doesn't analytically imply that the AI is "another agent". Intelligence does seem to analytically imply something like consequentialist reasoning; but the "organization" (whatever that means) of the consequentialist reasoning could take a shape other than "a unified whole that coherently seeks particular ends" (where the alignment problem is to make it seek the right ends). The relationship between the AI's mind and the human's mind could instead look more like the relationship between [the stuff in the human's mind that was there only at or after age 10] and [the stuff in the human's mind that was there only strictly before age 10], or the relationship between [one random subset of the organs and tissues and cells in an animal] and [the rest of the organs and tissues and cells in that animal that aren't in the first set]. (I have very little idea what this would look like, or how to get it, so I have no idea whether it's a useful notion.)

Comment by TekhneMakre on Why do you believe AI alignment is possible? · 2021-11-15T11:21:25.979Z · LW · GW
I'll even accept poetry

I will now drive a small truck through the door you left ajar. (This is indeed bad poetry, so it's not coherent and not an answer and also not true, but it has some chance of being usefully evocative.)

It seems as though when I learn new information, ideas, or thought processes, they become available for my use towards my goals, and don't threaten my goals. To judge between actions, usually most of what I want to attend to figuring out is the likely consequences of the actions, rather than the evaluation of those consequences (excepting evaluations that are only about further consequences), indicating that to the extent my values are able to notice when they are under threat, they are not generally under threat by other thought processes. When I have unsatisfied desires, it seems they're usually mostly unsatisfied because I don't know which actions to take to bring about certain consequences, and I can often more or less see what sort of thing I would do, at some level of abstraction, to figure out which actions to take; suggesting that there is such a thing as "mere problem solving thought", because that's the sort of thought that I think I can see as a meta-level plan that would work, i.e., my experience from being a mind suggests that there is an essentially risk-free process I can undertake to gain fluency in a domain that lays the domain bare to the influence of my values. An FAI isn't an artifact, it's a hand and an eye. The FAI doing recursive self-improvement is the human doing recursive self-improvement. The FAI is densely enmeshed in low-latency high-frequency feedback relationships with the humans that resemble the relationships between different mental elements of my mental model of the room around me, or between those and the micro-tasks I'm performing and the context of those micro-tasks. A sorting algorithm has no malice, a growing crystal has no malice, and likewise a mind crystallizing well-factored ontology, from the primordial medium of low-impact high-context striving, out into new domains, has no malice. The neocortex is sometimes at war with the hardwired reward, but it's not at war with Understanding, unless specifically aimed that way by social forces; there's no such thing as "values" that are incompatible with Understanding, and all that's strictly necessary for AGI is Understanding, though we don't know how to sift baby from bath-lava. The FAI is not an agent! It defers to the human not for "values" or "governance" or "approval" but for context and meaning and continuation; it's the inner loop in an intelligent process, the C code that crunches the numbers. The FAI is a mighty hand growing out of the programmer's forehead. Topologically the FAI is a bubble in space that is connected to another space; metrically the FAI bounds an infinite space (superintelligence), but from our perspective is just a sphere (in particular, it's bounded). The tower of Babylon, but it's an inverted pyramid that the operator balances delicately. Or, a fractal, a tree say, where the human controls the angle of the branches and the relative lengths, propagating infinitely up but with fractally bounded impact. Big brute searches in algorithmically barren domains, small careful searches in algorithmically rich domains. The Understanding doesn't come from consequentialist reasoning; consequentialist reasoning constitutively requires Understanding; so the door is open to just think and not do anything. Algorithms have no malice. Consequentialist reasoning has malice. (Algorithms are shot through with consequentiality, but that's different from being aimed at consequences.) 
I mostly don't gain Understanding and algorithms via consequentialist reasoning, but by search+recognition, or by the play of thoughts against each other. Search is often consequentialist but doesn't have to be. One can attempt to solve a Rubik's cube without inevitably disassembling it and reassembling it in order. The play of thoughts against each other is logical coherence, not consequentialism. The FAI is not a predictive processor with its set-point set by the human, the FAI and the human are a single predictive processor.

Comment by TekhneMakre on What would we do if alignment were futile? · 2021-11-14T23:11:31.832Z · LW · GW
I bet a lot of them are persuadable in the next 2 to 50 years.

They may be persuadable that, in a non-emergency situation, they should slow down when their AI seems like it's teetering on the edge of recursive self-improvement. It's much harder to persuade them to

1. not publish their research that isn't clearly "here's how to make an AGI", and/or

2. not try to get AGI without a good theory of alignment, when "the other guys" seem only a few years away from AGI.

So ~everyone will keep adding to the big pool of ~public information and ideas about AI, until it's not that hard to get the rest of the way to AGI, at which point some people showing restraint doesn't help by that much.

Comment by TekhneMakre on Comments on Carlsmith's “Is power-seeking AI an existential risk?” · 2021-11-13T10:12:44.474Z · LW · GW
“stuff the human brain does easily in a half-second”

This is ambiguous between tasks the brain does in a half-second, vs. everything the brain does in a half-second. In a half-second the brain does a bunch of stuff to perform well in the instance of the half-second-long task it's currently doing, and it's also doing other stuff to, e.g., learn how to perform well in future instances of the task, and to "understand" the elements of the task insofar as those elements will also appear in other tasks. AFAIK ML is systematically more convincing about task performance than about transfer.

Comment by TekhneMakre on Khamisi Wahyu's Shortform · 2021-11-13T07:23:12.430Z · LW · GW

If you want to get famous for being a musician, it helps to just love the music for itself. One mechanism is that, aiming directly at the real goal, you'll reliably devote too much of your resources to bad strategies, doing worse than if you'd just invested in the straightforward subgoal (though it's not clear why this would happen systematically). Another mechanism is that people might be trying to detect and filter for your actual goals; e.g. a politician might do better to actually care about good governance (or whatever) rather than their real goal of gaining power (or whatever), because people might be trying to avoid voting for power-seekers.

Comment by TekhneMakre on Stop button: towards a causal solution · 2021-11-12T23:12:29.992Z · LW · GW
However, it doesn't think that they could mistakenly want to press the button.

Okay. (Seems fine to assume that this makes sense arguendo, since the problem is hard anyway, but worth keeping in mind that this is a vague point in the proposal. In particular, for proposals like this to be progress, ISTM it has to be the case that "the human wants to press the button" is a simpler / easier / more concrete / more reliable / more specifiable thing for us to believe the AI can know, than other instances of "the human wants X". Which seems plausible, but unclear to me.)

Comment by TekhneMakre on Stop button: towards a causal solution · 2021-11-12T22:29:58.842Z · LW · GW
Does the AI think the humans could be mistaken about this variable?

I should have rather said, does the AI think the humans could mistakenly press the button, even when they of course correctly "know whether the AI should stop".

So if it sees a human pressing the button, the policy won't conclude that the human didn't press the button, but instead will conclude that it is in the U=B scenario

I'm saying that it might see what looks like a button press.... but if V is potentially large compared to B, the cost of a mistake (e.g., the camera input was fabricated, the button wasn't actually pressed) is large. So the AI has incentive to disable any mechanism that would *prematurely* shut it down, i.e. shut it down before it has time to be really sure; this is correct from V_f's perspective, which is what the AI is up to before the button is pressed.

Comment by TekhneMakre on Stop button: towards a causal solution · 2021-11-12T21:56:13.860Z · LW · GW

There's also something weird being assumed, about it making sense to define utility functions that only care about some counterfactual worlds. (I mean this is a reasonable assumption that people make, but it seems weird in general.) Like, this seems in tension with acausal bargaining / threats. If V_f wants V, doesn't it want what V says is good, and V might have opinions about other worlds (for example: "there shouldn't be torture, anywhere, even in counterfactual worlds"), and so optimizing for V_f optimizes even worlds where not-f?

Comment by TekhneMakre on Stop button: towards a causal solution · 2021-11-12T21:41:43.807Z · LW · GW

Interesting! (I didn't read the post carefully, FYI.)

the AI will assume that the human changes their mind and stops pressing the button when evaluating Vf, because that's what the counterfactual would involve.

Unfortunately, it will also assume that the human didn't press the button, even after the human did press the button. If B is big, it will assume the human did press the button, even if they didn't. Maybe this is quantitatively ok, because it can be pretty confident one way or the other...??

For the purposes of evaluating the utility function, the AI assumes it has no influence on whether the human wants to press the button or not.

There's some weirdness here with what the AI thinks of this variable "should stop", and how it interacts with the world. Does the AI think the humans could be mistaken about this variable? It might be much more worried about the humans mistakenly pressing, or more worried about mistakenly not pressing, depending on V and B, and so distort the humans' info.

I also don't see why the AI doesn't disable the shutdown button, and then observe whether the humans try to press it, and then go "Ah okay, so B is the one true utility function. Now I will crazily optimize B" and do something crazy rather than actually shutting down *because the button was pressed*.

Comment by TekhneMakre on Discussion with Eliezer Yudkowsky on AGI interventions · 2021-11-12T18:15:45.804Z · LW · GW

I think the model clearly applies, though almost certainly the effect is less strictly binary than in the surprise party example.

I expect the annoyance to make him a little bit biased, but still open to the idea and still maintaining solid epistemics.

This is roughly a crux for me, yeah. I think dozens of people emailing him would cause him to (fairly reasonably, actually!) infer that something weird is going on (e.g., people are in a crazy echo chamber) and that he's being targeted for unwanted attention (which he would be!). And it seems important, in a unilateralist's curse way, that this effect is probably unrelated to the overall size of the group of people who have these beliefs about AI. Like, if you multiply the number of AI-riskers by 10, you also multiply by 10 the number of people who, by some context-unaware individual judgement, think they should cold-email Tao. Some of these people will be correct that they should do something like that, but it seems likely that many such people will be incorrect.

Comment by TekhneMakre on Discussion with Eliezer Yudkowsky on AGI interventions · 2021-11-12T06:11:47.430Z · LW · GW

[epistemic status: just joking around]

corrigibility being "anti-natural" in a certain sense

https://imgur.com/a/yYH3frW.gif

Comment by TekhneMakre on Discussion with Eliezer Yudkowsky on AGI interventions · 2021-11-12T05:18:49.056Z · LW · GW
Every word can be true but It seems overwhelmingly pessimistic in a way that is not helpful, mainly due to nothing in it being actionable. 

Continually describing the situation accurately is a key action item.

Comment by TekhneMakre on Discussion with Eliezer Yudkowsky on AGI interventions · 2021-11-12T05:10:45.350Z · LW · GW

Please keep the unilateralist's curse in mind when considering plans like this. https://nickbostrom.com/papers/unilateralist.pdf

There's a finite resource that gets used up when someone contacts Person in High Demand, which is roughly, that person's openness to thinking about whether your problem is interesting.