LessWrong 2.0 Reader

← previous page (newer posts) · next page (older posts) →

Best-of-n with misaligned reward models for Math reasoning
Fabien Roger (Fabien) · 2024-06-21T22:53:21.243Z · comments (0)

[LDSL#2] Latent variable models, network models, and linear diffusion of sparse lognormals
tailcalled · 2024-08-09T19:57:56.122Z · comments (2)

Analogy Bank for AI Safety
utilistrutil · 2024-01-29T02:35:13.746Z · comments (0)

[link] Managing Emotional Potential Energy
adamShimi · 2024-07-10T18:20:45.640Z · comments (4)

Announcing Convergence Analysis: An Institute for AI Scenario & Governance Research
David_Kristoffersson · 2024-03-07T21:37:00.526Z · comments (1)

Tend to your clarity, not your confusion
Severin T. Seehrich (sts) · 2024-03-11T15:09:24.099Z · comments (1)

New social credit formalizations
KatjaGrace · 2024-03-11T19:00:06.201Z · comments (3)

The Garden of Eden
Alexander Turok · 2024-07-22T16:07:42.509Z · comments (2)

[link] my theory of the industrial revolution
bhauth · 2024-02-28T21:07:55.274Z · comments (7)

[link] Libs vs Frameworks, Middle-Level Regularities vs Theories
adamShimi · 2024-07-04T19:01:59.440Z · comments (0)

Whirlwind Tour of Chain of Thought Literature Relevant to Automating Alignment Research.
sevdeawesome · 2024-07-01T05:50:49.498Z · comments (0)

[link] Increasing IQ by 10 Points is Possible
George3d6 · 2024-03-19T20:48:41.277Z · comments (50)

[link] [Talk transcript] What “structure” is and why it matters
Alex_Altair · 2024-07-25T15:49:00.844Z · comments (0)

Is it justifiable for non-experts to have strong opinions about Gaza?
Yair Halberstadt (yair-halberstadt) · 2024-01-08T17:31:21.934Z · comments (12)

Text Posts from the Kids Group: 2019
jefftk (jkaufman) · 2024-06-23T13:20:01.495Z · comments (0)

[link] The natural boundaries between people
Chipmonk · 2024-02-23T01:09:28.592Z · comments (2)

Bent or Blunt Hoods?
jefftk (jkaufman) · 2024-01-09T20:10:11.545Z · comments (0)

Invest in ACX Grants projects!
Saul Munn (saul-munn) · 2024-03-06T20:27:04.616Z · comments (0)

AXRP Episode 34 - AI Evaluations with Beth Barnes
DanielFilan · 2024-07-28T03:30:07.192Z · comments (0)

[link] How are voluntary commitments on vulnerability reporting going?
Adam Jones (domdomegg) · 2024-02-22T08:43:56.996Z · comments (1)

Extinction-level Goodhart's Law as a Property of the Environment
VojtaKovarik · 2024-02-21T17:56:02.052Z · comments (0)

From the outside, American schooling is weird
Jacob G-W (g-w1) · 2024-03-28T22:45:30.485Z · comments (4)

I Want XMP But I Know Why I Can't Have It
jefftk (jkaufman) · 2024-01-19T15:30:07.492Z · comments (0)

Inducing human-like biases in moral reasoning LMs
Artyom Karpov (artkpv) · 2024-02-20T16:28:11.424Z · comments (3)

[link] Selections From "The Trouble With Being Born"
Arjun Panickssery (arjun-panickssery) · 2024-02-20T10:07:02.780Z · comments (2)

[link] 11 diceware words is enough
DanielFilan · 2024-02-15T00:13:43.420Z · comments (6)

Trying to align humans with inclusive genetic fitness
peterbarnett · 2024-01-11T00:13:29.487Z · comments (5)

[question] Are there high-quality surveys available detailing the rates of polyamory among Americans age 18-45 in metropolitan areas in the United States?
Evan_Gaensbauer · 2024-01-18T23:50:52.053Z · answers+comments (0)

Eliminating Cookie Banners is Hard
jefftk (jkaufman) · 2024-01-13T03:00:04.843Z · comments (15)

[question] Why do so many think deception in AI is important?
Prometheus · 2024-01-13T08:14:58.671Z · answers+comments (12)

Case Studies in Reverse-Engineering Sparse Autoencoder Features by Using MLP Linearization
Jacob Dunefsky (jacob-dunefsky) · 2024-01-14T02:06:00.290Z · comments (0)

Making the "stance" explicit
NicholasKees (nick_kees) · 2024-02-16T23:57:11.265Z · comments (3)

[link] Masculinity—A Case For Courage
James Stephen Brown (james-brown) · 2024-06-04T00:04:48.411Z · comments (0)

[question] Money Pump Arguments assume Memoryless Agents. Isn't this Unrealistic?
Dalcy (Darcy) · 2024-08-16T04:16:23.159Z · answers+comments (6)

Johannes' Biography
Johannes C. Mayer (johannes-c-mayer) · 2024-01-03T13:27:19.329Z · comments (0)

AI #77: A Few Upgrades
Zvi · 2024-08-20T00:20:09.717Z · comments (3)

Red Line Ashmont Train is Now Approaching
jefftk (jkaufman) · 2023-12-14T02:50:11.382Z · comments (2)

[link] Is There Really a Child Penalty in the Long Run?
Maxwell Tabarrok (maxwell-tabarrok) · 2024-05-17T11:56:22.892Z · comments (6)

[question] How much fraud is there in academia?
ChristianKl · 2023-11-16T11:50:41.544Z · answers+comments (10)

[link] [EA xpost] The Rationale-Shaped Hole At The Heart Of Forecasting
dschwarz · 2024-04-02T17:40:44.278Z · comments (2)

Deception and Jailbreak Sequence: 1. Iterative Refinement Stages of Deception in LLMs
Winnie Yang (winnie-yang) · 2024-08-22T07:32:07.600Z · comments (1)

aisafety.info, the Table of Content
Charbel-Raphaël (charbel-raphael-segerie) · 2023-12-31T13:57:15.916Z · comments (1)

Blessed information, garbage information, cursed information
tailcalled · 2024-04-18T16:56:17.370Z · comments (8)

On excluding dangerous information from training
ShayBenMoshe (shay-ben-moshe) · 2023-11-17T11:14:54.847Z · comments (5)

[link] The dangers of reproducing while old
garymm · 2023-11-17T05:55:56.897Z · comments (6)

Less Anti-Dakka
Mateusz Bagiński (mateusz-baginski) · 2024-05-31T09:07:10.450Z · comments (5)

GPT-3.5 judges can supervise GPT-4o debaters in capability asymmetric debates
Charlie George (charlie-george) · 2024-08-27T20:44:08.683Z · comments (7)

[link] Predicting the future with the power of the Internet (and pissing off Rob Miles)
Writer · 2023-12-15T17:37:29.695Z · comments (9)

Would you benefit from, or object to, a page with LW users' reacts?
Raemon · 2024-08-20T16:35:47.568Z · comments (6)

Detecting What's Been Seen
jefftk (jkaufman) · 2023-11-24T03:30:06.691Z · comments (0)


Recent comments

viliam on Making a conservative case for alignment

Going by today's standards, we should have banned Gwern in 2012.

And I think that would have been a mistake.

I wonder how many other mistakes we made. The problem is, we won't get good feedback on this.

joao-ribeiro-medeiros on Neutrality

Love this!!! Inspecting the concept of neutrality as a software-like, process-driven institution is a very illuminating approach.

I would add that there is a very relevant discussion in the AI sphere which is profoundly connected to the points you are making: the existence of bias in AI models.

To identify and measure bias is, in a sense, to identify and measure a lack of neutrality, so it follows that to define bias one must first be very rigorous about the definition of neutrality.

This can seem simple for some of the more pedestrian AI tasks, but can become increasingly sophisticated as we introduce AI as an essential piece in workflows and institutions.

AI algorithms can be heavily biased, datasets can be biased and even data structures can be biased.

I feel this is a topic which you can further explore in the future. Thank you for this!

kave on Why imperfect adversarial robustness doesn't doom AI control

I'm curious about Claude's coauthorship of this post. How did it help?

kave on Why imperfect adversarial robustness doesn't doom AI control

The crucial point is: when the untrusted model gets to step 6, it sees R in the history instead of A, so it knows its attempt was replaced.

Why does it know its attempt was replaced? Because it can guess it wouldn't have produced R? If it can tell that, how come it can't guess it specifically would have produced A?

mr-hire on Ayn Rand’s model of “living money”; and an upside of burnout

also, this 'subconscious parts going on strike' theory makes slightly different predictions than the 'is it good for the whole system/live' theory

for instance, i predict that you can have 'dead parts' that e.g. give people social anxiety based on past trauma, even though it's no longer actually relevant to their current situation.

and that if you override this social anxiety using 'live willpower' for a while, you can get burnout, even though the willpower is in some sense 'correct' about what would be good for the overall flourishing of the system given the current reality.

egor-timatkov on "It's a 10% chance which I did 10 times, so it should be 100%"

I think what Justin is saying is that finding a single monogamous partner is not significantly different from finding two, three, etc. For some things you only care about succeeding once. So a 63% chance of success (any number of times) means a .63 expected value (because all successes after the first have a value of 0).

Meanwhile, for other things, such as polyamorous partners, 2 partners is meaningfully better than one, so the expected value truly is 1, because you will get one partner on average. (Though this assumes 2 partners is twice as good as one; we can complicate this even more if we assume that 2 partners is better, but not twice as good.)
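
As a rough illustration of the distinction drawn in this comment, here is a minimal Python sketch. The function names and the choice of p = 0.1 with n = 10 tries are assumptions taken from the post's title, not from the comment itself. Note that for exactly ten tries at 10% each, the chance of at least one success comes out to about 65%; the 63% figure matches the large-n limit 1 - 1/e for n tries at 1/n each.

```python
import math


def p_at_least_one(p: float, n: int) -> float:
    """Probability of one or more successes in n independent tries."""
    return 1.0 - (1.0 - p) ** n


def expected_successes(p: float, n: int) -> float:
    """Expected number of successes in n independent tries."""
    return p * n


p, n = 0.1, 10  # hypothetical values, taken from the post's title

# "Only the first success counts" (e.g. one monogamous partner):
# the value is the chance of succeeding at least once.
value_if_only_first_counts = p_at_least_one(p, n)

# "Every success counts equally" (the comment's polyamory example):
# the value is the expected number of successes.
value_if_all_count = expected_successes(p, n)

# Large-n limit for n tries at 1/n each, which is where ~63% comes from.
limit = 1.0 - 1.0 / math.e

print(round(value_if_only_first_counts, 4))  # 0.6513
print(round(value_if_all_count, 4))          # 1.0
print(round(limit, 4))                       # 0.6321
```

The in-between case the comment mentions (a second success is worth something, but less than the first) would sit between these two numbers and needs an explicit value assigned to each additional success.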

aprilsr on "The Solomonoff Prior is Malign" is a special case of a simpler argument

In case it's a helpful data point: lines of reasoning sorta similar to the ones around the infohazard warning seemed to have interesting and intense psychological effects on me one time. It's hard to separate out from other factors, though, and I think it had something to do with the fact that lately I've been spending a lot of time learning to take ideas seriously on an emotional level instead of only an abstract one.

jonas-hallgren on Leon Lang's Shortform

If you look at the Active Inference community there's a lot of work going into PPL-based languages to do more efficient world modelling but that shit ain't easy and as you say it is a lot more compute heavy.

I think there'll be a scaling break due to this, but when it is algorithmically figured out again we will be back, and back with a vengeance, as I think most safety challenges have a self vs. environment model as a necessary condition to be properly engaged. (which currently isn't engaged with LLMs' world modelling)

p-b-1 on Rauno's Shortform

There was one comment on Twitter that the RLHF-finetuned models also still have the ability to play chess pretty well, but their input/output formatting made it impossible for them to access this ability (or something along these lines). But apparently it can be recovered with a little finetuning.

habryka4 on Monthly Roundup #24: November 2024

Indeed. I fixed it. Let's see whether it repeats itself (we got kind of malformed HTML from the RSS feed).