LessWrong 2.0 Reader

[question] Doing Nothing Utility Function
k64 · 2024-09-26T22:05:18.821Z · answers+comments (9)

[link] Jailbreaking language models with user roleplay
loops (smitop) · 2024-09-28T23:43:10.870Z · comments (0)

Review: Dr Stone
ProgramCrafter (programcrafter) · 2024-09-29T10:35:53.175Z · comments (1)

[link] AI Safety at the Frontier: Paper Highlights, July '24
gasteigerjo · 2024-08-05T13:00:46.028Z · comments (0)

Four Phases of AGI
Gabe M (gabe-mukobi) · 2024-08-05T13:15:23.406Z · comments (3)

My covid-related beliefs and questions
Severin T. Seehrich (sts) · 2024-07-23T03:27:09.348Z · comments (0)

[question] Any Trump Supporters Want to Dialogue?
k64 · 2024-09-28T19:41:55.370Z · answers+comments (16)

The Geometric Importance of Side Payments
StrivingForLegibility · 2024-08-07T01:38:04.635Z · comments (4)

Steering LLMs' Behavior with Concept Activation Vectors
Ruixuan Huang (sprout_ust) · 2024-09-28T09:53:19.658Z · comments (0)

[link] Michael Streamlines on Buddhism
Chris_Leong · 2024-08-09T04:44:52.126Z · comments (0)

[link] Comparing Forecasting Track Records for AI Benchmarking and Beyond
ChristianWilliams · 2024-09-25T21:01:15.975Z · comments (0)

Thinking About Propensity Evaluations
Maxime Riché (maxime-riche) · 2024-08-19T09:23:55.091Z · comments (0)

[link] [Linkpost] Automated Design of Agentic Systems
Bogdan Ionut Cirstea (bogdan-ionut-cirstea) · 2024-08-19T23:06:06.669Z · comments (1)

An open response to Wittkotter and Yampolskiy
Donald Hobson (donald-hobson) · 2024-09-24T22:27:21.987Z · comments (0)

[question] What do you expect AI capabilities may look like in 2028?
nonzerosum · 2024-08-23T16:59:53.007Z · answers+comments (5)

Measuring Visual Sycophancy in Multimodal Models
Jaehyuk Lim (jason-l) · 2024-08-27T22:02:47.917Z · comments (0)

[link] Universal dimensions of visual representation
Bogdan Ionut Cirstea (bogdan-ionut-cirstea) · 2024-08-28T10:38:58.396Z · comments (0)

[link] Approval-Seeking ⇒ Playful Evaluation
Jonathan Moregård (JonathanMoregard) · 2024-08-28T21:03:51.244Z · comments (0)

[link] Can AI agents learn to be good?
Ram Rachum (ram@rachum.com) · 2024-08-29T14:20:04.336Z · comments (0)

On epistemic autonomy
sanyer (santeri-koivula) · 2024-08-31T18:50:43.377Z · comments (0)

Funding for programs and events on global catastrophic risk, effective altruism, and other topics
abergal · 2024-08-14T23:59:48.146Z · comments (0)

Making Beliefs Pay Rent
Screwtape · 2024-07-28T17:59:52.101Z · comments (2)

[link] Checking public figures on whether they "answered the question" quick analysis from Harris/Trump debate, and a proposal
david reinstein (david-reinstein) · 2024-09-11T20:25:27.845Z · comments (4)

[question] If I ask an LLM to think step by step, how big are the steps?
ryan_b · 2024-09-13T20:30:50.558Z · answers+comments (1)

[link] Boons and banes
dkl9 · 2024-09-23T06:18:38.335Z · comments (0)

Thoughts to niplav on lie-detection, truthfwl mechanisms, and wealth-inequality
Emrik (Emrik North) · 2024-07-11T18:55:46.687Z · comments (8)

One person's worth of mental energy for AI doom aversion jobs. What should I do?
Lorec · 2024-08-26T01:29:01.700Z · comments (16)

Denver USA - ACX Meetups Everywhere Fall 2024
Eneasz · 2024-08-29T18:40:53.332Z · comments (0)

[link] Cooperation and Alignment in Delegation Games: You Need Both!
Oliver Sourbut · 2024-08-03T10:16:51.716Z · comments (0)

[link] Is Redistributive Taxation Justifiable? Part 1: Do the Rich Deserve their Wealth?
Alexander de Vries (alexander-de-vries) · 2024-09-05T10:23:08.958Z · comments (20)

Relativity Theory for What the Future 'You' Is and Isn't
FlorianH (florian-habermacher) · 2024-07-29T02:01:17.736Z · comments (48)

AirBnB Baking
jefftk (jkaufman) · 2024-07-10T12:50:03.381Z · comments (1)

Deception and Jailbreak Sequence: 2. Iterative Refinement Stages of Jailbreaks in LLM
Winnie Yang (winnie-yang) · 2024-08-28T08:41:38.967Z · comments (2)

Moral Trade, Impact Distributions and Large Worlds
Larks · 2024-09-20T03:45:56.273Z · comments (0)

Behavior Cloning for Alignment & Immortality
Dev.Errata (ethan.roland) · 2024-08-17T23:42:56.699Z · comments (1)

Fake Blog Posts as a Problem Solving Device
silentbob · 2024-08-31T09:22:54.513Z · comments (0)

[link] Kinds of Motivation
Sable · 2024-07-13T15:52:44.432Z · comments (2)

Piling bounded arguments
momom2 (amaury-lorin) · 2024-09-19T22:27:41.534Z · comments (0)

[link] Free Will, Determinism, And Choice
Zero Contradictions · 2024-07-06T06:34:41.495Z · comments (3)

[link] Validating / finding alignment-relevant concepts using neural data
Bogdan Ionut Cirstea (bogdan-ionut-cirstea) · 2024-09-20T21:12:49.267Z · comments (0)

Utilitarianism and the replaceability of desires and attachments
MichaelStJules · 2024-07-27T01:57:42.419Z · comments (2)

Sequence overview: Welfare and moral weights
MichaelStJules · 2024-08-15T04:22:32.567Z · comments (0)

[question] On the subject of in-house large language models versus implementing frontier models
Annapurna (jorge-velez) · 2024-09-23T15:00:32.811Z · answers+comments (1)

Broadly human level, cognitively complete AGI
p.b. · 2024-08-06T09:26:13.220Z · comments (0)

Establishing a Connection (Ch. 9-12)
a littoral wizard · 2024-07-17T23:49:25.696Z · comments (3)

[question] Does a time-reversible physical law/Cellular Automaton always imply the First Law of Thermodynamics?
Noosphere89 (sharmake-farah) · 2024-08-30T15:12:28.823Z · answers+comments (11)

Budapest Hungary - ACX Meetups Everywhere Fall 2024
Timothy Underwood (timothy-underwood-1) · 2024-08-29T18:37:41.313Z · comments (0)

[question] Opinions on Eureka Labs
jmh · 2024-07-17T00:16:02.959Z · answers+comments (2)

Limitations on the Interpretability of Learned Features from Sparse Dictionary Learning
Tom Angsten (tom-angsten) · 2024-07-30T16:36:06.518Z · comments (0)

[link] [Linkpost] Interpretable Analysis of Features Found in Open-source Sparse Autoencoder (partial replication)
Fernando Avalos (fernando-avalos) · 2024-09-09T03:33:53.548Z · comments (1)

Recent comments

ape-in-the-coat on Alignment by default: the simulation hypothesis

But luckily that stronger claim is really not necessary to argue against Eliezer’s point: the weaker one suffices.

I don't think it does.

Indeed, if the scenario I’m presenting is more than 4.5e-10 likely (and I do think it’s much more likely than that, probably by a few orders of magnitude), then it is more than enough to outweigh the practical cost of the ASI having to build a Dyson shell with a hole on the order of 4.5e-10 of its surface area.

It is enough to outweigh the practical cost of the ASI having to build a Dyson shell with a hole on the order of 4.5e-10 of its surface area. It's not enough to outweigh all the other alternative simulation hypotheses.

Suppose the entire hypothesis space for the ASI consisted of two possibilities: NotSimulated and SimulatedAndBeingTestedForWillingnessToSpareCreators, with the latter being at least 4.5e-10 probable. Then it works.

But suppose there are also other possibilities:

SimulatedAndBeingTestedForWillingnessToKillCreators

SimulatedAndBeingTestedForOptimalDysonSphereDesign

SimulatedAndBeingTestedForFollowingYourUtilityFunction

...

SimulatedAndBeingTestedForDoingAnyXThatLeadsToTheDeathOfCreators

...

All of these alternative possibilities are incompatible with the first simulation hypothesis. Satisfying its criteria will lead to failing theirs, and vice versa. Therefore, the creators will actually be spared only if the probability of SimulatedAndBeingTestedForWillingnessToSpareCreators is higher than the collective probability of all these alternative hypotheses together.

keltan on Review: Dr Stone

I really loved Dr Stone. It gave me the feeling that the science as magic sequences gave me. A deep appreciation for reality and the power it brings an individual to understand it deeply. I really hope to have more rationalists watch it in future.

I also recommend “Science fell in love, so I tried to prove it” for stats nerds. And “My Hero Academia” as the main character embodies “Tsuyoku Naritai!”

david-matolcsi on You can, in fact, bamboozle an unaligned AI into sparing your life

I think the difference is that I personally, and I think many other humans, have this nonlinearity in our utility function: I'm willing to pay a galaxy in the worlds we win for keeping Earth in the world we lose. If there are other AIs in the multiverse that have similarly non-linear interests in our Universe, they can also bargain for planets, but I suspect these will be quite rare, as they don't already have a thing in our Universe they want to protect. So I think it will be hard to outbid humanity for Earth in particular.

There could be other trades that the AIs who have linear returns can still make, like producing objects that are both paperclips and corkscrews if that's more efficient, but that doesn't really affect our deal about Earth.

david-matolcsi on You can, in fact, bamboozle an unaligned AI into sparing your life

I agree you can't make actually binding commitments. But I think the kid-adult example is actually a good illustration of what I want to do: if a kid makes a solemn commitment to spend a one-in-a-hundred-million fraction of his money on action figures when he becomes a rich adult, I think that would usually work. And that's what we are asking from our future selves.

tailcalled on [LDSL#3] Information-orientation is in tension with magnitude-orientation

I've considered it, though there is a tradeoff between realism, completeness and complexity.

I could code a whole Thing which aims to capture every LDSL dynamic I know of, though in that case the code would be very long and also contain factors that I need to describe in later posts in the series.

Alternatively I could simplify it, e.g. by just taking the near-final distributions without addressing more involved questions of how you end up with such distributions, though then it may seem a bit arbitrary when one can change the distributions to get different results.

Or I could let the simulation be incomplete, only showing one facet of LDSL even though other facets would be verifiably wrong in the simulated data, but then at least that facet would be fairly robust to variations in assumptions.

Obviously to some extent I can balance the approaches but I would be curious what approach most aligns with what you want to see.

Edit: maybe as an example of a complexity to consider, there's the whole "When causation does not imply correlation: robust violations of the Faithfulness axiom" issue I basically haven't discussed yet, where funky statistical dynamics appear in optimized systems. I could either explicitly simulate this despite not having written about it yet, or hard-code distributions that take this into account with no simulation, or just not have this phenomenon at all when it's not absolutely needed for the main point.

tsvibt on You can, in fact, bamboozle an unaligned AI into sparing your life

They got way more of the Everett branches, so to speak. Suppose that the Pseudosuchians had a 20% chance of producing croc-FAI. So starting at the Triassic, you have that 20% of worlds become croc-god worlds, and 80% become a mix of X-god worlds for very many different Xs; maybe only 5% of worlds produce humans, and only .01% produce Humane-gods.

Maybe doing this with Pseudosuchians is less plausible than with humans because you can more easily model what Humane-gods would bargain for, because you have access to humans. But that's eyebrow-raising. What about Corvid-gods, etc.? If you can do more work and get access to vastly more powerful acausal trade partners, that seems worth it; and, on the face of it, the leap from [acausal trade is infeasible, period] to [actually acausal trade with hypothetical Humane-gods is feasible] seems bigger than the jump from [trade with Humane-gods is feasible] to [trade with Corvid-gods is feasible] or [trade with Cetacean-gods is feasible], though IDK of course. (Then there's the jump to [trade with arbitrary gods from the multiverse]. IDK.)

tailcalled on tailcalled's Shortform

To some extent true, but consider the analogy to a thesis like "Quantum chromodynamics is the strongest governing factor for life on Earth." Is this sentence also problematic because it addresses locations and energy levels that have no relevance for Earth?

cbiddulph on the case for CoT unfaithfulness is overstated

I agree that reading the CoT could be very useful, and is a very promising area of research. In fact, I think reading CoTs could be a much more surefire interpretability method than mechanistic interpretability, although the latter is also quite important.

I feel like research showing that CoTs aren't faithful isn't meant to say "we should throw out the CoT." It's more like "naively, you'd think the CoT is faithful, but look, it sometimes isn't. We shouldn't take the CoT at face value, and we should develop methods that ensure that it is faithful."

Personally, what I want most out of a chain of thought is that its literal, human-interpretable meaning contains almost all the value the LLM gets out of the CoT (vs. immediately writing its answer). Secondarily, it would be nice if the CoT didn't include a lot of distracting junk that doesn't really help the LLM (I suspect this was largely solved by o1, since it was trained to generate helpful CoTs).

I don't actually care much about the LLM explaining why it believes things that it can determine in a single forward pass, such as "French is spoken in Paris." It wouldn't be practically useful for the LLM to think these things through explicitly, and these thoughts are likely too simple to be helpful to us.

If we get to the point that LLMs can frequently make huge, accurate logical leaps in a single forward pass that humans can't follow at all, I'd argue that at that point, we should just make our LLMs smaller and focus on improving their explicit CoT reasoning ability, for the sake of maintaining interpretability.

quila on You can, in fact, bamboozle an unaligned AI into sparing your life

by virtue of happening 10 million years ago or whatever

Why would the time it happens at matter?

ape-in-the-coat on Beauty and the Bets

You are creating a related but different and also complicated problem: the Two Child Problem, which is notoriously ambiguous. "Then you are told the state of one of the coins" can have many meanings.
If I ask the experimenter "choose one of the coins randomly and tell me what it is" then I am not able to update my probability. It will still be 1/2 that the coins are the same.
If I ask the experimenter "is there at least one heads?" then I will be able to update. If they say yes I can update to 1/3, if they say no I can update to 1.
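The two readings quoted above can be checked by exact enumeration. The following is a minimal sketch, assuming two fair, independent coins; the function names are hypothetical and chosen only for illustration.

```python
from fractions import Fraction
from itertools import product

# The four equally likely outcomes of two fair coins.
OUTCOMES = list(product("HT", repeat=2))


def p_same_given_random_reveal(shown):
    """P(coins match | one coin, chosen uniformly at random, is revealed as `shown`)."""
    total = Fraction(0)
    same = Fraction(0)
    for coins in OUTCOMES:
        for idx in (0, 1):
            if coins[idx] == shown:
                # Each (outcome, revealed index) pair has weight 1/4 * 1/2.
                weight = Fraction(1, 8)
                total += weight
                if coins[0] == coins[1]:
                    same += weight
    return same / total


def p_same_given_at_least_one_heads(answer):
    """P(coins match | the answer to 'is there at least one heads?' is `answer`)."""
    consistent = [c for c in OUTCOMES if ("H" in c) == answer]
    matching = [c for c in consistent if c[0] == c[1]]
    return Fraction(len(matching), len(consistent))


print(p_same_given_random_reveal("H"))         # 1/2
print(p_same_given_at_least_one_heads(True))   # 1/3
print(p_same_given_at_least_one_heads(False))  # 1
```

The random reveal leaves the probability at 1/2 whichever face is shown, while the "at least one heads?" question moves it to 1/3 or 1, matching the numbers in the quoted text.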

What if no one is answering your questions? You are just shown one coin with no insight about the algorithm according to which the experimenter showed it to you, other than that this is the outcome of the coin toss. There is actually a least presumptuous way to reason about such things. And this is the one I described.

But never mind that. Let's not go on an unnecessary tangent about this problem. For the sake of the argument I'm happy to concede that both 1/2 and 1/3 are reasonable answers to it. However, Conitzer's reasoning would imply that only 1/2 is the correct answer. As a matter of fact, he simply assumes that 1/2 has to be correct, refusing to entertain the idea that it's not the case. Not unlike Briggs in Technicolor.

Conitzer's problem can be simplified further by letting Beauty flip a coin herself on Monday and Tuesday.
She wakes up Monday and flips a coin. She wakes up Tuesday and flips a coin. That's it.
After flipping a coin, what should her credence be that the coin flips are the same?
Do you disagree now that the answer is 1/2?

If she just flipped the coin then the answer is 1/2. If she observed the event "the coin is Tails" then the answer is 1/3. If she observed the event "the coin is Heads" the answer is 1/3. But doesn't she always observe one of these events in every iteration of the experiment? No, she doesn't.

This is the same situation as with Technicolor Sleeping Beauty. She observes the event "Blue" instead of "Blue or Red" only when she's configured her event space in a specific way, by precommitting to this outcome in particular. Likewise here. When the Beauty has precommitted to Tails, flips the coin and sees that the coin is indeed Tails, she has observed the event "the coin is Tails"; when she made no precommitments, or the coin turned out to be the other side, she observed the event "the coin is Heads or Tails".

I think it is clearly 1/2 precisely because there is no new evidence. The violation of the Reflection Principle is secondary. More importantly, something has gone wrong if we think she can flip a coin and update the probability of the coins being the same.

Of course something has gone wrong. This is what you get when you add amnesia to probability theory problems - it messes up the event space in a counterintuitive way. By default you are only able to observe the most general events, which have probability one. Like "I'm awake at least once in the experiment" or "the room is either Blue or Red at least once in the experiment". To observe more specific events you need precommitments.

To see that this actually works, check the betting arguments for Rare Event and Technicolor Sleeping Beauty from the post. Likewise, we can construct a betting argument for Conitzer's example, where you go through an iterated experiment and are asked to make one bet per experiment that the two coins did not produce the same outcome, with betting odds a bit worse than 1:1. The optimal betting strategy is not to always refuse the bet, as it would have been if the probability were always 1/2 and you did not get any new evidence.
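The betting argument can be sketched numerically. This is a minimal illustration under assumptions that the comment does not spell out: the stake is 1, a win pays 9/10 (odds slightly worse than 1:1), the bet is settled once per experiment, and "precommitting to Tails" is read as accepting the bet exactly when Tails is seen on at least one of the two awakenings. All names are hypothetical.

```python
from fractions import Fraction
from itertools import product


def expected_value(accepts, payout):
    """Exact expected profit per experiment for a single bet that the coins differ.

    `accepts(coins)` decides whether the bet is placed in an experiment whose
    Monday/Tuesday flips are `coins`. The stake is 1; a win returns `payout`.
    """
    ev = Fraction(0)
    for coins in product("HT", repeat=2):  # four equally likely experiments
        if accepts(coins):
            differ = coins[0] != coins[1]
            ev += Fraction(1, 4) * (payout if differ else -1)
    return ev


PAYOUT = Fraction(9, 10)  # assumed odds, a bit worse than 1:1


def always_refuse(coins):
    return False


def always_accept(coins):
    return True


def bet_if_tails_seen(coins):
    # Assumed reading of the precommitment: bet only if Tails comes up at least once.
    return "T" in coins


print(expected_value(always_refuse, PAYOUT))      # 0
print(expected_value(always_accept, PAYOUT))      # -1/20
print(expected_value(bet_if_tails_seen, PAYOUT))  # 1/5
```

Under these assumptions the precommitted strategy has positive expected value, while accepting unconditionally loses money, which is what one would expect if observing the precommitted outcome shifts the probability of matching coins from 1/2 to 1/3.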

She isn't even able to observe that a sequence of tosses happens "at least once" in the experiment.

She is able to observe that a particular sequence of tosses happens "at least once" in the experiment only if she has precommitted to guessing this particular sequence. Otherwise she indeed does not observe this event.