LessWrong 2.0 Reader

[link] Sticker Shortcut Fallacy — The Real Worst Argument in the World
ymeskhout · 2024-06-12T14:52:41.988Z · comments (15)
Economics Roundup #1
Zvi · 2024-03-26T14:00:06.332Z · comments (4)
Fertility Roundup #2
Zvi · 2023-10-17T13:20:01.901Z · comments (30)
[link] Was a Subway in New York City Inevitable?
Jeffrey Heninger (jeffrey-heninger) · 2024-03-30T00:53:21.314Z · comments (4)
Is Yann LeCun strawmanning AI x-risks?
Chris_Leong · 2023-10-19T11:35:08.167Z · comments (4)
[link] The Coming Wave
PeterMcCluskey · 2023-09-28T22:59:58.551Z · comments (1)
Weighing Animal Worth
jefftk (jkaufman) · 2023-09-28T13:50:06.752Z · comments (11)
The Drowning Child
Tomás B. (Bjartur Tómas) · 2023-10-22T16:39:53.016Z · comments (8)
Decent plan prize announcement (1 paragraph, $1k)
lukehmiles (lcmgcd) · 2024-01-12T06:27:44.495Z · comments (19)
Twin Peaks: under the air
KatjaGrace · 2024-05-31T01:20:04.624Z · comments (2)
An experiment on hidden cognition
Olli Järviniemi (jarviniemi) · 2024-07-22T03:26:05.564Z · comments (2)
Control Symmetry: why we might want to start investigating asymmetric alignment interventions
domenicrosati · 2023-11-11T17:27:10.636Z · comments (1)
Beta Tester Request: Rallypoint Bounties
lukemarks (marc/er) · 2024-05-25T09:11:11.446Z · comments (4)
[link] In defence of Helen Toner, Adam D'Angelo, and Tasha McCauley
mrtreasure · 2023-12-06T02:02:32.004Z · comments (3)
Tackling Moloch: How YouCongress Offers a Novel Coordination Mechanism
Hector Perez Arenas (hector-perez-arenas) · 2024-05-15T23:13:48.501Z · comments (8)
Using an LLM perplexity filter to detect weight exfiltration
Adam Karvonen (karvonenadam) · 2024-07-21T18:18:05.612Z · comments (11)
Clipboard Filtering
jefftk (jkaufman) · 2024-04-14T20:50:02.256Z · comments (1)
$250K in Prizes: SafeBench Competition Announcement
ozhang (oliver-zhang) · 2024-04-03T22:07:41.171Z · comments (0)
[link] Let's Design A School, Part 2.2 School as Education - The Curriculum (General)
Sable · 2024-05-07T19:22:21.730Z · comments (3)
A Visual Task that's Hard for GPT-4o, but Doable for Primary Schoolers
Lennart Finke (l-f) · 2024-07-26T17:51:28.202Z · comments (4)
Changing Contra Dialects
jefftk (jkaufman) · 2023-10-26T17:30:10.387Z · comments (2)
[link] Executive Dysfunction 101
DaystarEld · 2024-05-23T12:43:13.785Z · comments (1)
AXRP Episode 30 - AI Security with Jeffrey Ladish
DanielFilan · 2024-05-01T02:50:04.621Z · comments (0)
Testing for consequence-blindness in LLMs using the HI-ADS unit test.
David Scott Krueger (formerly: capybaralet) (capybaralet) · 2023-11-24T23:35:29.560Z · comments (2)
[link] OpenAI Superalignment: Weak-to-strong generalization
Dalmert · 2023-12-14T19:47:24.347Z · comments (3)
If a little is good, is more better?
DanielFilan · 2023-11-04T07:10:05.943Z · comments (16)
[link] Structured Transparency: a framework for addressing use/mis-use trade-offs when sharing information
habryka (habryka4) · 2024-04-11T18:35:44.824Z · comments (0)
A Review of In-Context Learning Hypotheses for Automated AI Alignment Research
alamerton · 2024-04-18T18:29:33.892Z · comments (4)
[question] How to Model the Future of Open-Source LLMs?
Joel Burget (joel-burget) · 2024-04-19T14:28:00.175Z · answers+comments (9)
[link] MIRI's July 2024 newsletter
Harlan · 2024-07-15T21:28:17.343Z · comments (2)
Can Generalized Adversarial Testing Enable More Rigorous LLM Safety Evals?
scasper · 2024-07-30T14:57:06.807Z · comments (0)
[link] Announcing Open Philanthropy's AI governance and policy RFP
Julian Hazell (julian-hazell) · 2024-07-17T02:02:39.933Z · comments (0)
[question] What ML gears do you like?
Ulisse Mini (ulisse-mini) · 2023-11-11T19:10:11.964Z · answers+comments (4)
[link] Report: Evaluating an AI Chip Registration Policy
Deric Cheng (deric-cheng) · 2024-04-12T04:39:45.671Z · comments (0)
[question] Impressions from base-GPT-4?
mishka · 2023-11-08T05:43:23.001Z · answers+comments (25)
Housing Roundup #9: Restricting Supply
Zvi · 2024-07-17T12:50:05.321Z · comments (8)
Evaluating Sparse Autoencoders with Board Game Models
Adam Karvonen (karvonenadam) · 2024-08-02T19:50:21.525Z · comments (1)
[question] Any real toeholds for making practical decisions regarding AI safety?
lukehmiles (lcmgcd) · 2024-09-29T12:03:08.084Z · answers+comments (6)
Alignment by default: the simulation hypothesis
gb (ghb) · 2024-09-25T16:26:00.552Z · comments (17)
I didn't think I'd take the time to build this calibration training game, but with websim it took roughly 30 seconds, so here it is!
mako yass (MakoYass) · 2024-08-02T22:35:21.136Z · comments (2)
[question] Could there be "natural impact regularization" or "impact regularization by default"?
tailcalled · 2023-12-01T22:01:46.062Z · answers+comments (6)
[link] AI Alignment [Progress] this Week (11/05/2023)
Logan Zoellner (logan-zoellner) · 2023-11-07T13:26:21.995Z · comments (0)
[link] Was Partisanship Good for the Environmental Movement?
Jeffrey Heninger (jeffrey-heninger) · 2024-05-15T17:30:54.796Z · comments (0)
Paper Summary: The Koha Code - A Biological Theory of Memory
jakej (jake-jenks) · 2023-12-30T22:37:13.865Z · comments (2)
5 psychological reasons for dismissing x-risks from AGI
Igor Ivanov (igor-ivanov) · 2023-10-26T17:21:48.580Z · comments (6)
[question] Would you have a baby in 2024?
martinkunev · 2023-12-25T01:52:04.358Z · answers+comments (76)
Defense Against The Dark Arts: An Introduction
Lyrongolem (david-xiao) · 2023-12-25T06:36:06.278Z · comments (36)
A conceptual precursor to today's language machines [Shannon]
Bill Benzon (bill-benzon) · 2023-11-15T13:50:51.226Z · comments (6)
Utility is not the selection target
tailcalled · 2023-11-04T22:48:20.713Z · comments (1)
[link] Alignment work in anomalous worlds
Tamsin Leake (carado-1) · 2023-12-16T19:34:26.202Z · comments (4)