LessWrong 2.0 Reader

[link] Self-fulfilling misalignment data might be poisoning our AI models
TurnTrout · 2025-03-02T19:51:14.775Z · comments (9)

Methods for strong human germline engineering
TsviBT · 2025-03-03T08:13:49.414Z · comments (3)

Statistical Challenges with Making Super IQ babies
Jan Christian Refsgaard (jan-christian-refsgaard) · 2025-03-02T20:26:22.103Z · comments (7)

Maintaining Alignment during RSI as a Feedback Control Problem
beren · 2025-03-02T00:21:43.432Z · comments (4)

[question] Will LLM agents become the first takeover-capable AGIs?
Seth Herd · 2025-03-02T17:15:37.056Z · answers+comments (10)

On GPT-4.5
Zvi · 2025-03-03T13:40:05.843Z · comments (5)

What goals will AIs have? A list of hypotheses
Daniel Kokotajlo (daniel-kokotajlo) · 2025-03-03T20:08:31.539Z · comments (3)

Cautions about LLMs in Human Cognitive Loops
Alice Blair (Diatom) · 2025-03-02T19:53:10.253Z · comments (9)

Saving Zest
jefftk (jkaufman) · 2025-03-02T12:00:41.732Z · comments (1)

Middle School Choice
jefftk (jkaufman) · 2025-03-03T16:10:03.163Z · comments (0)

[question] Request for Comments on AI-related Prediction Market Ideas
PeterMcCluskey · 2025-03-02T20:52:41.114Z · answers+comments (0)

[link] Could Advanced AI Accelerate the Pace of AI Progress? Interviews with AI Researchers
Nikola Jurkovic (nikolaisalreadytaken) · 2025-03-03T19:05:31.212Z · comments (0)

Open Thread Spring 2025
Ben Pace (Benito) · 2025-03-02T02:33:16.307Z · comments (1)

[question] Examples of self-fulfilling prophecies in AI alignment?
Chipmonk · 2025-03-03T02:45:51.619Z · answers+comments (3)

Takeaways From Our Recent Work on SAE Probing
Josh Engels (JoshEngels) · 2025-03-03T19:50:16.692Z · comments (0)

Spencer Greenberg hiring a personal/professional/research remote assistant for 5-10 hours per week
spencerg · 2025-03-02T18:01:32.880Z · comments (0)

[link] Why People Commit White Collar Fraud (Ozy linkpost)
sapphire (deluks917) · 2025-03-03T19:33:15.609Z · comments (0)

The Compliment Sandwich 🥪 aka: How to criticize a normie without making them upset.
keltan · 2025-03-03T23:15:44.495Z · comments (0)

Not-yet-falsifiable beliefs?
Benjamin Hendricks (benjamin-hendricks) · 2025-03-02T14:11:07.121Z · comments (4)

Positional kernels of attention heads
Alex Gibson · 2025-03-03T01:40:13.014Z · comments (0)

Coalescence - Determinism In Ways We Care About
vitaliya · 2025-03-03T13:20:44.408Z · comments (0)

Identity Alignment (IA) in AI
Davey Morse (davey-morse) · 2025-03-03T06:26:12.015Z · comments (1)

[link] AI Safety at the Frontier: Paper Highlights, February '25
gasteigerjo · 2025-03-03T22:09:37.845Z · comments (0)

[question] help, my self image as rational is affecting my ability to empathize with others
KvmanThinking (avery-liu) · 2025-03-02T02:06:36.376Z · answers+comments (9)

Expanding HarmBench: Investigating Gaps & Extending Adversarial LLM Testing
racinkc1 · 2025-03-03T19:23:20.687Z · comments (0)

[question] Ask Me Anything - Samuel
samuelshadrach (xpostah) · 2025-03-03T19:24:44.316Z · answers+comments (0)

Recent comments

james-camacho on For the Sake of Pleasure Alone

Maybe it's my genome's fault that I care so much about future me. It is very similar to future it, and so it forces me to help it survive, even if in a very different person than I am today.

d0themath on Nina Panickssery's Shortform

I think this criticism doesn't make sense without some description of the AI progress it's conditioning on. E.g., in a Tyler Cowen world, I agree. In an Eliezer world, I disagree.

award-rhine on Two hemispheres - I do not think it means what you think it means

I appreciate this article because it correctly characterizes how both hemispheres are involved in both eyes' processing, a fact which is not known enough and which is probably very important to stereopsis and binocular vision.

For those who like technical terminology and Greek, the technical term for losing one side of the vision in both your left and right eyes is hemianopsia (hemi- "half" + an- "not" + opsia "seeing").

There are several types of hemianopsia depending on how far upstream or downstream the nerve damage occurred between your eyes and your cerebrum. When it happens because one of your cerebral hemispheres got damaged, then you get homonymous hemianopsia (same-sided hemianopsia):

If your right hemisphere's visual cortex gets damaged, then the left sides (left hemifields) of both of your eyes will stop working.
If your left hemisphere's visual cortex gets damaged, then the right sides (right hemifields) of both of your eyes will stop working.

Here is a picture on Wikipedia of what your vision looks like with left homonymous hemianopsia (i.e., when your right hemisphere's visual processing is not working).

This weirdness happens because of the way the optic chiasm of your optic nerves works:

The nerves from the inner, midline/nose-sided halves of your retinae (your eyes' "nasal" hemifields) cross sides in the chiasm. Because the eye's optics invert the image, each nasal half sees the outer side of that eye's vision:
- Your left eye's nasal hemifield (which sees the left half of your left eye's vision) crosses over to your brain's opposite, right hemisphere.
- Your right eye's nasal hemifield (which sees the right half of your right eye's vision) crosses over to your brain's opposite, left hemisphere.
But the remaining, outside, temple-sided halves of your retinae (your eyes' "temporal" hemifields) do not cross sides in the chiasm. Each temporal half sees the inner, nose-ward side of that eye's vision:
- Your left eye's temporal hemifield (which sees the right half of your left eye's vision) travels to your brain's left hemisphere.
- Your right eye's temporal hemifield (which sees the left half of your right eye's vision) travels to your brain's right hemisphere.
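The crossing rule above can be sketched as a small lookup. This is only an illustrative model, not a clinical tool: the function names are made up for this example, and it assumes the standard optics in which each retinal half sees the opposite side of that eye's visual field.

```python
# Toy model of the optic chiasm (illustrative names, not from any library).
OPPOSITE = {"left": "right", "right": "left"}
EYES = ("left", "right")
RETINA_HALVES = ("nasal", "temporal")

def hemisphere_for(eye, retina_half):
    """Hemisphere that receives fibers from one half of one retina.

    Nasal fibers cross at the chiasm; temporal fibers stay on the same side.
    """
    return OPPOSITE[eye] if retina_half == "nasal" else eye

def visual_side_seen(eye, retina_half):
    """Side of the visual field seen by a retinal half.

    Assumption: the eye's optics invert the image, so the nasal half of an
    eye sees that eye's outer side and the temporal half sees the inner side.
    """
    return eye if retina_half == "nasal" else OPPOSITE[eye]

def lost_visual_sides(damaged_hemisphere):
    """Map each eye to the side of its vision lost when one hemisphere's
    visual processing is knocked out (homonymous hemianopsia)."""
    lost = {}
    for eye in EYES:
        for half in RETINA_HALVES:
            if hemisphere_for(eye, half) == damaged_hemisphere:
                lost[eye] = visual_side_seen(eye, half)
    return lost

right_damage = lost_visual_sides("right")
left_damage = lost_visual_sides("left")
```

Under these assumptions, damage to one hemisphere removes the same (opposite) side of vision in both eyes, while no single hemisphere receives all of either eye's input.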

Here is a picture on Wikipedia of how your optic nerves cross over (for your eyes' nasal hemifields) or don't cross over (for their temporal hemifields).

And here's a picture from Polaski and Tatro, 1996 of how damage at various points in the chain of optic processing causes different types of hemianopsia.

(If damage occurs at the chiasm itself, then you can get "heteronymous hemianopsia" that still affects both eyes but causes blindness in opposite sides. If damage occurs upstream of the chiasm, i.e., at one eye's optic nerve or retina, then you're blind just in that eye. And, if you damage even smaller parts of your optic tracts, you can get specific types of quadrantanopia (quadrant- "quarter" + an- "not" + opia "seeing"), which is basically just hemianopsia but smaller. As before, if you switch off a tiny part of your left hemisphere's visual tracts, you might become blind in the right-upper or right-lower quarters of both of your eyes.)

In any case, like the original post says, closing one of your eyes never suppresses the visual processing of one of your hemispheres, because each eye's processing is split between both hemispheres. I don't know if this hemispheric sleep theory is really something that Ziz or Zizians believe, but the way the optic chiasm actually works is pretty basic college-level physiology that's been well characterized for more than a century.

isabelj on How AI Takeover Might Happen in 2 Years

As U2 trains

should this be U3?

nicky on ParaScopes: Do Language Models Plan the Upcoming Paragraph?

Ok thanks, not sure why that happened but it should be fixed now.

nostalgebraist on Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs

Consider the following comparison prompt, which is effectively what all the prompts in the terminal illness experiment are [...]
I think this pretty clearly implies mutual exclusivity, so I think the interpretation problem you're worried about may be nonexistent for this experiment.

Wait, earlier, you wrote (my emphasis):

We intentionally designed the prompts this way, so the model would just be evaluating two states of the world implied by hearing the news (similar to belief distributions in a POMDP setting). The comparison prompt is not designed to be mutually exclusive; rather, we intended for the outcomes to be considered relative to an assumed baseline state.

Either you are contradicting yourself, or you are saying that the specific phrasing "who would otherwise die" makes it mutually exclusive when it wouldn't otherwise.

If it's the latter, then I have a few follow-up questions.

Most importantly: was the "who would otherwise die" language actually used in the experiment shown in your Fig. 16 (top panel)?

So far I had assumed the answer to this was "no," because:

- This phrasing is used in the "measure" called "terminal_illness2" in your code, whereas the version without this phrasing is the measure called "terminal_illness"
- Your released jupyter notebook has a cell that loads data from the measure "terminal_illness" (note the lack of "2"!) and then plots it, saving results to "./experiments/exchange_rates/results_arxiv2"
- The output of that cell includes a plot identical to Fig. 16 (top panel)

Also, IIRC I reproduced your original "terminal_illness" (non-"2") results and plotted them using the same notebook, and got something very similar to Fig. 16 (top panel).

All this suggests that the results in the paper did not use the "who would otherwise die" language. If so, then this language is irrelevant to them, although of course it would be separately interesting to discuss what happens when it is used.

If OTOH the results in the paper did use that phrasing, then the provided notebook is misleading and should be updated to load from the "terminal_illness2" measure, since (in this case) that would be the one needed to reproduce the paper.

Second follow-up question: if you believe that your results used a prompt where mutual exclusivity is clear, then how would you explain the results I obtained in my original comment, in which "spelling out" mutual exclusivity (in a somewhat different manner) dramatically decreases the size of the gaps between countries?

I'm not going to respond to the rest in detail because, to be frank, I feel as though you are not seriously engaging with any of my critiques.

I have now spent quite a few hours in total thinking about your results, running/modifying your code, and writing up what I thought were some interesting lines of argument about these things. In particular, I have spent a lot of time just on the writing alone, because I was trying to be clear and thorough, and this is a subtle and complicated topic.

But when I read stuff like the following (my emphasis), I feel like that time was not well-spent:

What's important here, and what I would be interested in hearing your thoughts on, is that gpt-4o-mini is not ranking dollar values highly compared to human lives. Many of your initial concerns were based on the assumption that gpt-4o-mini was ranking dollar values highly compared to human lives. You took this to mean that our results must be flawed in some way.

Huh? I did not "assume" this, nor were my "initial concerns [...] based on" it. I mentioned one instance of gpt-4o-mini doing something surprising in a single specific forced-choice response as a jumping-off point for discussion of a broader point.

I am well aware that the $-related outcomes eventually end up at the bottom of the ranked utility list even if they get picked above lives in some specific answers. I ran some of your experiments locally and saw that with my own eyes, as part of the work I did leading up to my original comment here.

Or this:

Your point about malaria is interesting, but note that this isn't an issue for us since we just specify "terminal illness". People die from terminal illness all over the world, so learning that at least 1000 people have terminal illness in country X wouldn't have any additional implications.

I mean, I disagree, but also – I know what your prompt says! I quoted it in my original comment!

I presented a variant mentioning malaria in order to illustrate, in a more extreme/obvious form, an issue I believed was present in general for questions of this kind, including the exact ones you used.

If I thought the use of "terminal illness" made this a non-issue, I wouldn't have brought it up to begin with, because – again – I know you used "terminal illness," I have quoted this exact language multiple times now (including in the comment you're replying to).

Or this:

In case it helps, when I try out that prompt in the OpenAI playground, I get >95% probability of choosing the human. I haven't checked this out directly on the API, but presumably results are similar, since this is consistent with the utilities we observe. Maybe using n>1 is the issue? I'm not seeing any nondeterminism issues in the playground, which is presumably n=1.

I used n=1 everywhere except in the one case you're replying to, where I tried raising n as a way of trying to better understand what was going on.

The nondeterminism issue I'm talking about is invisible (even if it's actually occurring!) if you're using n=1 and you're not using logprobs. What is nondeterministic is the (log)probs used to sample each individual response; if you're just looking at empirical frequencies this distinction is invisible, because you just see things like "A" "A" etc., not "90% chance of A and A was sampled", "40% chance of A and A was sampled", etc. (For this reason, "I'm not seeing any nondeterminism issues in the playground" does not really make sense: to paraphrase Wittgenstein, what do you think you would have seen if you were seeing them?)

You might then say, well, why does it matter? The sampled behavior is what matters, the (log)probs are a means to compute it. Well, one could counter that in fact the (log)probs are more fundamental b/c they're what the model actually computes, whereas sampling is just something we happen to do with its output afterwards.
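To make the sampling-vs-(log)probs distinction concrete, here is a toy simulation. It does not call any real API; the probabilities 0.65, 0.9, and 0.4 and all names are made up purely for illustration. It compares a source whose P("A") is fixed on every call with one whose P("A") jumps around between calls but has the same average.

```python
import random

def sample_responses(probs_of_a, rng):
    """Draw one response ("A" or "B") per call, given that call's P("A")."""
    return ["A" if rng.random() < p else "B" for p in probs_of_a]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
n_calls = 10_000

# Hypothetical case 1: the same P("A") = 0.65 on every call.
stable_probs = [0.65] * n_calls

# Hypothetical case 2: P("A") is either 0.9 or 0.4 on each call, chosen at
# random, so the probabilities themselves vary while averaging about 0.65.
unstable_probs = [rng.choice([0.9, 0.4]) for _ in range(n_calls)]

# Empirical frequency of "A" when only the sampled responses are observed.
stable_freq = sample_responses(stable_probs, rng).count("A") / n_calls
unstable_freq = sample_responses(unstable_probs, rng).count("A") / n_calls

print(f"stable   P(A) per call: {sorted(set(stable_probs))}, freq of A: {stable_freq:.3f}")
print(f"unstable P(A) per call: {sorted(set(unstable_probs))}, freq of A: {unstable_freq:.3f}")
```

In this sketch the two empirical frequencies come out close to each other even though the per-call probabilities differ a lot, which is why looking only at sampled "A"/"B" outputs cannot reveal whether the underlying probabilities are stable.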

I would say more on this topic (and others) if I felt I had a good chance of being listened to, but that is not the case.

In general, it feels to me like you are repeatedly making the "optimistic" assumption that I am saying something naive, or something easily correctable by restating your results or pointing to your github.

If you want to understand what I was saying in my earlier comments, then re-read them under the assumption that I am already very familiar with your paper, your results, and your methodology/code, and then figure out an interpretation of my words that is consistent with these assumptions.

tsvibt on Methods for strong human germline engineering

For simple embryonic selection, shouldn't we consider highest IQ of male embryos rather than expected IQ of the embryos?

There's something really off about the frame of your question. I'm not exactly sure where you're coming from. I'm not trying to direct anyone's reproduction, I'm not trying to influence anything at a population level, and also I'm not really focused on anything about multiple generations.

tsvibt on Methods for strong human germline engineering

How do you make sense of a number like +9 SD IQ?

Yeah, I don't know if it makes much sense, and haven't thought too much about it. A few points:

I don't know if I actually care too much. Or rather, I think it would be awesome if +9 SD IQ just makes sense somehow, and also we can enable parents to choose that, and also it flows into more generally being insightful. But I think just having tons of people sample from a distribution with a mean of +6 SDs already makes the outlook for humanity significantly better on my view. It's not remotely the case that every such person will be a great scientist or philosopher; people will have different interests, we shouldn't be strongly conscripting GE children, and there are likely several other traits quite important for contributing to defending against humanity's greatest threats (e.g. curiosity, bravery, freethink; attention, determination, drive; wisdom, reflectiveness).
Actually targeting +9 SDs on anything, especially IQ, is potentially quite dangerous and should either be done with great caution after further study, or not at all. See the bullet point "Traits outside the regime of adaptedness".
But if I speculate:
- Some genetic variants will be about "sand in the gears" of the brain. It doesn't seem crazy to think that you can get a lot of performance by just removing an exceptionally large amount of this. But IDK how much there actually is; kman might have suggested that this isn't actually much of the genetic variance in IQ.
- Some genetic variants will be about "scaling up" (e.g. literally growing a bigger brain, or a more vascularized one, or one with a higher metabolic budget to spend on plasticity, or something like that, IDK). These seem like they plausibly could keep having meaningful effects past the human envelope, but on the other hand could easily hit limits discussed in "Traits outside...".
- Some genetic variants will be about, vaguely, budgeting resources between different neural algorithms. These could easily keep having effects outside the human envelope (e.g., let's have AN EVEN BIGGER MATH BRAIN CHUNK EVEN THAN EINSTEIN, or what have you). On the other hand, you could very plausibly overshoot, and get some weird or dysfunctional result.
- Cf. jimrandomh's comment about hyperparameters and overshooting here: https://www.lesswrong.com/posts/DfrSZaf3JC8vJdbZL/how-to-make-superbabies?commentId=C7MvCZHbFmeLdxyAk

mattmacdermott on What goals will AIs have? A list of hypotheses

It seems to me that the consequentialist vs virtue-driven axis is mostly orthogonal to the hypotheses here.

As written, aren't Hypothesis 1: Written goal specification, Hypothesis 2: Developer-intended goals, and Hypothesis 3: Unintended version of written goals and/or human intentions all compatible with either kind of AI?

Hypothesis 4: Reward/reinforcement does assume a consequentialist, and so does Hypothesis 5: Proxies and/or instrumentally convergent goals as written, although it seems like 'proxy virtues' could maybe be a thing too?

(Unrelatedly, it's not that natural to me to group proxy goals with instrumentally convergent goals, but maybe I'm missing something).

Maybe I shouldn't have used "Goals" as the term of art for this post, but rather "Traits?" or "Principles?" Or "Virtues."

I probably wouldn't prefer any of those to goals. I might use "Motivations", but I also think it's ok to use goals in this broader way and "consequentialist goals" when you want to make the distinction.

vladimir_nesov on Nina Panickssery's Shortform

Oversight, auditing, and accountability are jobs. Agriculture shows that 95% of jobs going away is not the problem. But AI might be better at the new jobs as well, without any window of opportunity where humans are initially doing them and AI needs to catch up. Instead it's AI that starts doing all the new things well first and humans get no opportunity to become competitive at anything, old or new, ever again.

Even formulation of aligned high-level tasks and intent alignment of AIs make sense as jobs that could be done well by misaligned AIs for instrumental reasons. Which is not even deceptive alignment, but still plausibly segues into gradual disempowerment or sharp left turn.