Posts

If digital goods in virtual worlds increase GDP, do we actually become richer? 2024-04-19T10:06:55.417Z
No77e's Shortform 2023-02-18T12:45:42.224Z
Could evolution produce something truly aligned with its own optimization standards? What would an answer to this mean for AI alignment? 2023-01-08T11:04:40.642Z
What's wrong with the paperclips scenario? 2023-01-07T17:58:35.866Z
Is recursive self-alignment possible? 2023-01-03T09:15:21.304Z
Is checking that a state of the world is not dystopian easier than constructing a non-dystopian state? 2022-12-27T20:57:27.663Z

Comments

Comment by No77e (no77e-noi) on The first future and the best future · 2024-04-25T12:52:54.043Z · LW · GW

From a purely utilitarian standpoint, I'm inclined to think that the cost of delaying is dwarfed by the number of future lives saved by getting a better outcome, assuming that delaying does increase the chance of a better future.

That said, once we know there's "no chance" of extinction risk, I don't think delaying would be likely to yield better future outcomes. On the contrary, I suspect getting the coordination necessary to delay means it's likely that we're giving up freedoms in a way that may reduce the value of the median future and increase the chance of stuff like totalitarian lock-in, which decreases the value of the average future overall.

I think you're correct that the "other existential risks exist" consideration also needs to be balanced in the calculation, although I don't expect it to be clear-cut.

Comment by No77e (no77e-noi) on Magic by forgetting · 2024-04-24T20:54:13.994Z · LW · GW

Even if you manage to truly forget about the disease, there must exist a mind "somewhere in the universe" that is exactly the same as yours except without knowledge of the disease. That seems quite unlikely to me, because having the disease has interacted causally with the rest of your mind a lot by the time you decide to erase the memory. What you'd really need to do is undo all the consequences of those interactions, which seems a lot harder. You'd have to transform your mind into another one that you somehow know exists "somewhere in the multiverse", which also seems really hard to verify.

Comment by No77e (no77e-noi) on Superexponential Conceptspace, and Simple Words · 2024-04-17T08:21:48.166Z · LW · GW

I deliberately left out a key qualification in that (slightly edited) statement, because I couldn't explain it until today.

I might be missing something crucial because I don't understand why this addition is necessary. Why do we have to specify "simple" boundaries on top of saying that we have to draw them around concentrations of unusually high probability density? Like, aren't probability densities in Thingspace already naturally shaped in such a way that if you draw a boundary around them, it's automatically simple? I don't see how you run the risk of drawing weird, noncontiguous boundaries if you just follow the probability densities.
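
To make the question concrete, here's a toy sketch of what I mean by "just follow the probability densities" (my own construction, not from the post; the cluster locations and threshold are arbitrary illustrative choices): estimate a density over a 2D "thingspace", keep everything above a density threshold, and count how many pieces the resulting region has. Whether those pieces count as "simple" boundaries is exactly what I'm unsure about.

```python
# Toy sketch: draw a "boundary" around high-density regions of a 2D thingspace
# by thresholding an estimated density, then count the connected pieces.

import numpy as np
from scipy.ndimage import label
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)

# Hypothetical "thingspace" samples: two clusters of observed things.
samples = np.vstack([
    rng.normal(loc=(-2, 0), scale=0.5, size=(200, 2)),
    rng.normal(loc=(2, 0), scale=0.5, size=(200, 2)),
])

# Estimate the probability density and evaluate it on a grid.
kde = gaussian_kde(samples.T)
xs, ys = np.meshgrid(np.linspace(-4, 4, 100), np.linspace(-3, 3, 100))
density = kde(np.vstack([xs.ravel(), ys.ravel()])).reshape(xs.shape)

# "Just follow the probability density": keep everything above a threshold.
region = density > 0.5 * density.max()

# Count the connected components of the thresholded region.
_, num_components = label(region)
print(f"high-density region splits into {num_components} component(s)")
```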

Comment by No77e (no77e-noi) on Modern Transformers are AGI, and Human-Level · 2024-03-26T20:14:42.939Z · LW · GW

One way in which "spending a whole lot of time working with a system / idea / domain, and getting to know it and understand it and manipulate it better and better over the course of time" could be achieved automatically is just by having a truly huge context window. An example experiment: teach a particular branch of math to an LLM that has never seen that branch of math.

Maybe humans just have the equivalent of a sort of huge context window spanning selected material from their entire lifetimes, and that's why this kind of learning is possible for them.

Comment by No77e (no77e-noi) on Self-driving car bets · 2023-07-30T13:12:28.036Z · LW · GW

You mention eight cities here. Do they count for the bet? 

Comment by No77e (no77e-noi) on No77e's Shortform · 2023-03-13T10:27:35.754Z · LW · GW

The Waluigi effect also seems bad for s-risk: "Optimize for pleasure, ..." flips into "Optimize for suffering, ...".

Comment by No77e (no77e-noi) on No77e's Shortform · 2023-03-13T08:26:24.319Z · LW · GW

If LLM simulacra resemble humans but are misaligned, that doesn't bode well for s-risk chances. 

Comment by No77e (no77e-noi) on No77e's Shortform · 2023-03-13T08:25:26.711Z · LW · GW

An optimistic way to frame inner alignment is that gradient descent already hits a very narrow target in goal-space, and we just need one last push.

A pessimistic way to frame inner misalignment is that gradient descent already hits a very narrow target in goal-space, and therefore s-risk could be large.

Comment by No77e (no77e-noi) on No77e's Shortform · 2023-03-09T15:57:46.643Z · LW · GW

We should implement Paul Christiano's debate game with alignment researchers instead of ML systems.

Comment by No77e (no77e-noi) on No77e's Shortform · 2023-03-09T15:51:33.296Z · LW · GW

This community has developed a bunch of good tools for helping resolve disagreements, such as double cruxing. It's a waste that they haven't been systematically deployed for the MIRI conversations. Those conversations could have been more productive, and we could've walked away with a succinct and precise understanding of where the disagreements are and why.

Comment by No77e (no77e-noi) on Is recursive self-alignment possible? · 2023-03-05T17:58:18.987Z · LW · GW

Another thing one might wonder about is whether performing iterated amplification with constant input from an aligned human (as "H" in the original iterated amplification paper) would result in a powerful aligned system, provided that system remains corrigible during the training process.
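
For concreteness, here's a minimal toy sketch of the loop I have in mind, with hypothetical stand-in functions (`human_overseer`, `decompose`, `distill`) rather than anything from the original paper; the only point is that H's input stays constant across rounds while the model is repeatedly re-distilled.

```python
# Toy sketch of iterated amplification with a fixed human overseer H.
# All functions are illustrative stand-ins, not a real implementation.

def human_overseer(question, subanswers):
    """Stand-in for the aligned human H, who combines sub-answers into an answer."""
    return f"H's judgment on {question!r} given {subanswers}"

def decompose(question):
    """Stand-in for breaking a question into easier sub-questions."""
    return [f"{question} (part {i})" for i in range(2)]

def amplify(model, question, depth):
    """Amplification: H answers a question with help from the current model."""
    if depth == 0:
        return model(question)
    subanswers = [amplify(model, sub, depth - 1) for sub in decompose(question)]
    return human_overseer(question, subanswers)

def distill(amplified_answers):
    """Stand-in for training a new model to imitate the amplified (H + model) system."""
    answers = dict(amplified_answers)
    return lambda q: answers.get(q, f"model's best guess on {q!r}")

model = lambda q: f"initial model's answer to {q!r}"
questions = ["Is this plan corrigible?"]

# H stays constant across rounds; only the model changes.
for _ in range(3):
    amplified = [(q, amplify(model, q, depth=1)) for q in questions]
    model = distill(amplified)

print(model("Is this plan corrigible?"))
```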

Comment by No77e (no77e-noi) on Robin Hanson’s latest AI risk position statement · 2023-03-04T10:59:54.136Z · LW · GW

The comment about tool-AI vs agent-AI is just ignorant (or incredibly dismissive) of mesa-optimizers and the fact that being asked to predict what an agent would do immediately instantiates such an agent inside the tool-AI. It's obvious that a tool-AI is safer than an explicitly agentic one, but not for arbitrary levels of intelligence.

This seems way too confident to me given the level of generality of your statement. And to be clear, my view is that this could easily happen in LLMs based on transformers, but what about other architectures? If you just talk about how a generic "tool-AI" would or would not behave, it seems to me that you're operating at a level of abstraction far too high to make such specific statements with confidence.

Comment by No77e (no77e-noi) on No77e's Shortform · 2023-02-26T18:17:20.919Z · LW · GW

If you try to write a reward function, or a loss function, that captures human values, that seems hopeless. 

But if you have some interpretability techniques that let you find human values in some simulacrum of a large language model, maybe that's less hopeless.

The difference between constructing something and recognizing it, or between proving and checking, or between producing and criticizing, and so on...
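
As a small worked example of that asymmetry (my own, purely illustrative): checking a claimed factorization takes one multiplication, while producing one requires search.

```python
# Illustrative produce-vs-check asymmetry: verifying a factorization is
# trivial, while producing one requires searching for the factors.

def check_factorization(n, factors):
    """Checking: multiply the claimed factors and compare against n."""
    product = 1
    for f in factors:
        product *= f
    return product == n

def find_factorization(n):
    """Producing: trial division, much more work for large n."""
    factors, d = [], 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)
    return factors

print(check_factorization(91, [7, 13]))   # True, checked instantly
print(find_factorization(91))             # [7, 13], found by search
```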

Comment by No77e (no77e-noi) on No77e's Shortform · 2023-02-18T15:41:30.399Z · LW · GW

Why shouldn't this work? What's the epistemic failure mode being pointed at here?

Comment by No77e (no77e-noi) on Should we cry "wolf"? · 2023-02-18T13:40:44.689Z · LW · GW

While you can "cry wolf" in ways that may be useful, you can also state your detailed understanding of each specific situation as it arises and how it specifically plays into the broader AI risk context.

Comment by No77e (no77e-noi) on On Board Vision, Hollow Words, and the End of the World · 2023-02-18T13:00:19.366Z · LW · GW

As impressive as ChatGPT is on some axes, you shouldn't rely too hard on it for certain things because it's bad at what I'm going to call "board vision" (a term I'm borrowing from chess).

How confident are you that you cannot find some agent within ChatGPT with excellent board vision through more clever prompting than what you've experimented with?

Comment by No77e (no77e-noi) on No77e's Shortform · 2023-02-18T12:45:42.426Z · LW · GW

As a failure mode of specification gaming, agents might modify their own goals. 

As a convergent instrumental goal, agents want to prevent their goals from being modified.

I think I know how to resolve this apparent contradiction, but I'd like to see other people's opinions about it.

Comment by No77e (no77e-noi) on All AGI Safety questions welcome (especially basic ones) [~monthly thread] · 2023-01-12T10:10:15.546Z · LW · GW

I'm going to re-ask all my questions that I don't think have received a satisfactory answer. Some of them are probably basic, others maybe less so:
 

  1. Why would CEV be difficult to learn?
  2. Why is research into decision theories relevant to alignment?
  3. Is checking that a state of the world is not dystopian easier than constructing a non-dystopian state?
  4. Is recursive self-alignment possible?
  5. Could evolution produce something truly aligned with its own optimization standards? What would an answer to this mean for AI alignment?

Comment by No77e (no77e-noi) on Could evolution produce something truly aligned with its own optimization standards? What would an answer to this mean for AI alignment? · 2023-01-09T08:02:42.315Z · LW · GW

I am trying to figure out what is the relation between "alignment with evolution" and "short-term thinking". Like, imagine that some people get hit by magical space rays, which make them fully "aligned with evolution". What exactly would such people do?

I think they would become consequentialists smart enough that they could actually act to maximize inclusive genetic fitness. I think Thou Art Godshatter is convincing.

But what if the art or the philosophy makes it easier to get laid? So maybe in such case they would do the art/philosophy, but they would feel no intrinsic pleasure from doing it, like it would all be purely instrumental, willing to throw it all away if on second thought they find out that this is actually not maximizing reproduction?

Yeah that's what I would expect.

How would they even figure out what is the reproduction-optimal thing to do? Would they spend some time trying to figure out the world? (The time that could otherwise be spent trying to get laid?) Or perhaps, as a result of sufficiently long evolution, they would already do the optimal thing instinctively? (Because those who had the right instincts and followed them, outcompeted those who spent too much time thinking?)

I doubt that being governed by instincts can outperform a sufficiently smart agent reasoning from scratch, given a sufficiently complicated environment. Instincts are just heuristics, after all...

But would that mean that the environment is fixed? Especially, if the most important part of the environment is other people? Maybe the humanity would get locked in an equilibrium where the optimal strategy is found, and everyone who tries doing something else is outcompeted; and afterwards those who do the optimal strategy more instinctively outcompete those who need to figure it out. What would such equilibrium look like?

Ohhh interesting, I have no idea... it seems plausible that it could happen though!

Comment by No77e (no77e-noi) on Could evolution produce something truly aligned with its own optimization standards? What would an answer to this mean for AI alignment? · 2023-01-09T07:53:29.540Z · LW · GW

No, I mean "humans continue to evolve genetically, and they never start self-modifying in a way that makes evolution impossible (e.g., by becoming emulations)."

Comment by No77e (no77e-noi) on Open & Welcome Thread - January 2023 · 2023-01-08T20:01:19.708Z · LW · GW

For some reason I don't get e-mail notifications when someone replies to my posts or comments. My e-mail is verified and I've set all notifications to "immediately". Here's what my e-mail settings look like: 

Comment by No77e (no77e-noi) on What's wrong with the paperclips scenario? · 2023-01-07T20:16:33.742Z · LW · GW

I agree with you here, although something like "predict the next token" seems more and more likely. That said, I'm not sure whether it's in the same class of goals as paperclip maximizing in this context, or whether the kind of failure it could lead to would be similar.

Comment by No77e (no77e-noi) on What's wrong with the paperclips scenario? · 2023-01-07T18:51:15.152Z · LW · GW

Yes, this makes a lot of sense, thank you. 

Comment by No77e (no77e-noi) on What's wrong with the paperclips scenario? · 2023-01-07T18:45:58.135Z · LW · GW

Do you mean that no one will actually create exactly a paperclip maximizer, or that no one will create any agent of that kind, i.e. with goals such as "collect stamps" or "generate images"? I ask because I think Eliezer meant to object to that class of examples, rather than only that specific one, but I'm not sure.

Comment by No77e (no77e-noi) on What's wrong with the paperclips scenario? · 2023-01-07T18:01:28.841Z · LW · GW

The last Twitter reply links to a talk from MIRI which I haven't watched. I wouldn't be surprised if MIRI also used this metaphor in the past, but I can't recall examples off the top of my head right now.

Comment by No77e (no77e-noi) on Is recursive self-alignment possible? · 2023-01-03T09:31:06.935Z · LW · GW

I use Eliezer Yudkowsky in my example because it makes the most sense. Don't read anything else into it, please.

Comment by No77e (no77e-noi) on Is recursive self-alignment possible? · 2023-01-03T09:29:08.185Z · LW · GW

I publish posts like this one to clarify my doubts about alignment. I don't pay attention to whether I'm beating a dead horse or if there's previous literature about my questions or ideas. Do you think this is an OK practice? One pro is that people like me learn faster, and one con is that it may pollute the site with lower-quality posts.

Comment by No77e (no77e-noi) on All AGI Safety questions welcome (especially basic ones) [~monthly thread] · 2022-11-02T14:02:50.006Z · LW · GW

Thanks for the answer. It clarifies a little bit, but I still feel like I don't fully grasp its relevance to alignment. I have the impression that there's more to the story than just that?

Comment by No77e (no77e-noi) on All AGI Safety questions welcome (especially basic ones) [~monthly thread] · 2022-11-02T10:47:21.289Z · LW · GW

Why is research into decision theories relevant to alignment?

Comment by No77e (no77e-noi) on AI as a Civilizational Risk Part 4/6: Bioweapons and Philosophy of Modification · 2022-11-02T10:11:01.011Z · LW · GW

Can someone explain to me why Pasha's posts are downvoted so much? I don't think they are great, but this level of negative karma seems disproportionate to me. 

Comment by No77e (no77e-noi) on publishing alignment research and exfohazards · 2022-10-31T19:24:28.236Z · LW · GW

This looks like something that would also be useful for alignment orgs if they want to organize their research in silos, as Yudkowsky often suggests (assuming they haven't already implemented systems like this one).

Comment by No77e (no77e-noi) on Comment reply: my low-quality thoughts on why CFAR didn't get farther with a "real/efficacious art of rationality" · 2022-06-11T07:39:44.607Z · LW · GW

Ah, I see your point now, and it makes sense. If I had to summarize it (and reword it in a way that appeals to my intuition), I'd say that the choice of seeking the truth is not just about "this helps me," but about "this is what I want/ought to do/choose". Not just about capabilities. I don't think I disagree at this point, although perhaps I should think about it more.

I suspected my question would be met with something at least a bit removed, inference-wise, from where I was starting: my model seemed like the most natural one, so I expected someone who routinely thinks about this topic to have updated away from it rather than simply not to have thought about it.

Regarding the last paragraph: I already believed your line "increasing a person's ability to see and reason and care (vs rationalizing and blaming-to-distract-themselves and so on) probably helps with ethical conduct." It didn't seem to bear on the argument here because, under my previous model, it looks like you're getting alignment for free by improving capabilities; otherwise, it looks like your truth-alignment efforts somehow spill over to other values, which is still getting something for free thanks to how humans are built, I'd guess.

Also... now that I think about it, what Harry was doing with Draco in HPMOR looks a lot like aligning rather than improving capabilities, and there were good spill-over effects (which were almost the whole point in that case perhaps). 
 

Comment by No77e (no77e-noi) on Comment reply: my low-quality thoughts on why CFAR didn't get farther with a "real/efficacious art of rationality" · 2022-06-09T11:09:54.780Z · LW · GW

One is thinking about how to build aligned intelligence in a machine, the other is thinking about how to build aligned intelligence in humans and groups of humans.  

Is this true, though? Teaching rationality improves people's capabilities but shouldn't necessarily align them. People are not AIs, but their morality doesn't have to converge under reflection either. 

And even if the argument is "people are already aligned with people", you're still working on capabilities when dealing with people and on alignment when dealing with AIs.

Teaching rationality looks more similar to AI capabilities research than AI alignment research to me.

Comment by No77e (no77e-noi) on AGI Ruin: A List of Lethalities · 2022-06-07T15:21:58.249Z · LW · GW

Why not shoot for something less ambitious?

I'll give myself a provisional answer. I'm not sure if it satisfies me, but it's enough to make me pause: Anything short of CEV might leave open an unacceptably high chance of fates worse than death.

Comment by No77e (no77e-noi) on AGI Safety FAQ / all-dumb-questions-allowed thread · 2022-06-07T11:50:56.339Z · LW · GW

Should an "ask dumb questions about AGI safety" thread be recurring? Surely people will continue to come up with more questions in the years to come, and the same dynamics outlined in the OP will repeat. Perhaps this post could remain the go-to page, but it would become enormous; on the other hand, recurring posts would lose the FAQ function somewhat. Perhaps both recurring posts and a FAQ post? 

Comment by No77e (no77e-noi) on AGI Ruin: A List of Lethalities · 2022-06-06T19:49:14.321Z · LW · GW

The first thing generally, or CEV specifically, is unworkable because the complexity of what needs to be aligned or meta-aligned for our Real Actual Values is far out of reach for our FIRST TRY at AGI.  Yes I mean specifically that the dataset, meta-learning algorithm, and what needs to be learned, is far out of reach for our first try.  It's not just non-hand-codable, it is unteachable on-the-first-try because the thing you are trying to teach is too weird and complicated.


Why is CEV so difficult? And if CEV is impossible to learn on the first try, why not shoot for something less ambitious? Value is fragile, OK, but aren't there easier utopias?

Many humans would be able to distinguish utopia from dystopia if they saw them, and humanity's only advantage over an AI is that the brain has "evolution presets". 

Humans are relatively dumb, so why can't even a relatively dumb AI learn the same ability to distinguish utopias from dystopias?

To anyone reading: don't interpret these questions as disagreement. If someone doesn't, for example, understand a mathematical proof, they might express disagreement with the proof while knowing full well that they haven't discovered a mistake in it and that they are simply confused.