LessWrong 2.0 Reader

Secular Solstice Round Up 2024
dspeyer · 2024-11-21T10:49:36.682Z · comments (12)

On Targeted Manipulation and Deception when Optimizing LLMs for User Feedback
Marcus Williams · 2024-11-07T15:39:06.854Z · comments (6)

[link] [Paper Blogpost] When Your AIs Deceive You: Challenges with Partial Observability in RLHF
Leon Lang (leon-lang) · 2024-10-22T13:57:41.125Z · comments (0)

D&D.Sci Coliseum: Arena of Data Evaluation and Ruleset
aphyer · 2024-10-29T01:21:03.075Z · comments (12)

Which evals resources would be good?
Marius Hobbhahn (marius-hobbhahn) · 2024-11-16T14:24:48.012Z · comments (4)

Bounty for Evidence on Some of Palisade Research's Beliefs
benwr · 2024-09-23T20:01:20.917Z · comments (4)

[link] Michael Dickens' Caffeine Tolerance Research
niplav · 2024-09-04T15:41:53.343Z · comments (3)

Forecasting One-Shot Games
Raemon · 2024-08-31T23:10:05.475Z · comments (0)

How to hire somebody better than yourself
lemonhope (lcmgcd) · 2024-08-28T08:12:53.450Z · comments (5)

Looking back on the Future of Humanity Institute - Asterisk
jakeeaton · 2024-11-19T00:44:40.928Z · comments (0)

AI as a powerful meme, via CGP Grey
TheManxLoiner · 2024-10-30T18:31:58.544Z · comments (8)

Toy Models of Feature Absorption in SAEs
chanind · 2024-10-07T09:56:53.609Z · comments (8)

[link] New o1-like model (QwQ) beats Claude 3.5 Sonnet with only 32B parameters
Jesse Hoogland (jhoogland) · 2024-11-27T22:06:12.914Z · comments (1)

I finally got ChatGPT to sound like me
lsusr · 2024-09-17T09:39:59.415Z · comments (18)

[link] MIRI's September 2024 newsletter
Harlan · 2024-09-16T18:15:40.785Z · comments (0)

AI #88: Thanks for the Memos
Zvi · 2024-10-31T15:00:07.412Z · comments (5)

[link] What Ketamine Therapy Is Like
Sable · 2024-11-11T11:09:08.602Z · comments (8)

[link] Epistemic status: poetry (and other poems)
Richard_Ngo (ricraz) · 2024-11-21T18:13:17.194Z · comments (5)

Live Machinery: An Interface Design Philosophy for Wholesome AI Futures
Sahil · 2024-11-01T17:24:09.957Z · comments (2)

The Shallow Bench
Karl Faulks (karl-faulks) · 2024-11-05T05:07:27.357Z · comments (5)

Conflating value alignment and intent alignment is causing confusion
Seth Herd · 2024-09-05T16:39:51.967Z · comments (18)

Work with me on agent foundations: independent fellowship
Alex_Altair · 2024-09-21T13:59:16.706Z · comments (5)

We Don't Know Our Own Values, but Reward Bridges The Is-Ought Gap
johnswentworth · 2024-09-19T22:22:05.307Z · comments (47)

AI #80: Never Have I Ever
Zvi · 2024-09-10T17:50:08.074Z · comments (20)

[link] Analyzing how SAE features evolve across a forward pass
bensenberner · 2024-11-07T22:07:02.827Z · comments (0)

~80 Interesting Questions about Foundation Model Agent Safety
RohanS · 2024-10-28T16:37:04.713Z · comments (4)

AI #91: Deep Thinking
Zvi · 2024-11-21T14:30:06.930Z · comments (9)

Start an Upper-Room UV Installation Company?
jefftk (jkaufman) · 2024-10-19T02:00:10.691Z · comments (9)

[link] The Choice Transition
owencb · 2024-11-18T12:30:56.198Z · comments (4)

[link] Dangerous capability tests should be harder
LucaRighetti (Error404Dinosaur) · 2024-11-21T17:20:50.610Z · comments (3)

[link] Literacy Rates Haven't Fallen By 20% Since the Department of Education Was Created
Maxwell Tabarrok (maxwell-tabarrok) · 2024-11-22T20:53:59.007Z · comments (0)

In defense of technological unemployment as the main AI concern
tailcalled · 2024-08-27T17:58:01.992Z · comments (36)

Economics Roundup #3
Zvi · 2024-09-10T13:50:06.955Z · comments (9)

[question] "Deception Genre" What Books are like Project Lawful?
Double · 2024-08-28T17:19:52.172Z · answers+comments (20)

Motivation control
Joe Carlsmith (joekc) · 2024-10-30T17:15:50.881Z · comments (7)

Monthly Roundup #24: November 2024
Zvi · 2024-11-18T13:20:06.086Z · comments (14)

Reading RFK Jr so that you don’t have to
braces · 2024-11-22T00:59:19.583Z · comments (0)

Which LessWrong/Alignment topics would you like to be tutored in? [Poll]
Ruby · 2024-09-19T01:35:02.999Z · comments (12)

Minimal Motivation of Natural Latents
johnswentworth · 2024-10-14T22:51:58.125Z · comments (14)

Dave Kasten's AGI-by-2027 vignette
davekasten · 2024-11-26T23:20:47.212Z · comments (7)

How difficult is AI Alignment?
Sammy Martin (SDM) · 2024-09-13T15:47:10.799Z · comments (6)

Unit economics of LLM APIs
dschwarz · 2024-08-27T16:51:22.692Z · comments (0)

Time Efficient Resistance Training
romeostevensit · 2024-10-07T15:15:44.950Z · comments (10)

AI #89: Trump Card
Zvi · 2024-11-07T16:30:05.684Z · comments (12)

Startup Success Rates Are So Low Because the Rewards Are So Large
AppliedDivinityStudies (kohaku-none) · 2024-10-10T20:22:01.557Z · comments (6)

Formalizing the Informal (event invite)
abramdemski · 2024-09-10T19:22:53.564Z · comments (0)

[link] Things I learned talking to the new breed of scientific institution
Abhishaike Mahajan (abhishaike-mahajan) · 2024-08-29T14:00:14.844Z · comments (6)

MATS AI Safety Strategy Curriculum v2
DanielFilan · 2024-10-07T22:44:06.396Z · comments (6)

Australian AI Safety Forum 2024
Liam Carroll (liam-carroll) · 2024-09-27T00:40:11.451Z · comments (0)

[link] An Interactive Shapley Value Explainer
James Stephen Brown (james-brown) · 2024-09-28T05:01:21.169Z · comments (9)


Recent comments

the-gears-to-ascension on the gears to ascenscion's Shortform

qaci seems to require the system having an understanding-creating property that makes it a reliable historian. have been thinking about this, have more to say, currently rather raw and unfinished.

diziet on "Map of AI Futures" - An interactive flowchart

A bit of feedback: the "We get a second chance at building AGI" outcome should either not be an outcome or should be rephrased.

tsvibt on Raemon's Shortform

The former doesn't necessarily imply the latter in general, because even if we are systematically underestimating the realistic upper bound for our skill level in these areas, we would still have to deal with diminishing marginal returns to investing in any particular one.

On the other hand, even if what you say is true, skill headroom may still imply that it's worth building shared arts around such skills. Shareability and build-on-ability changes the marginal returns a lot.

viliam on Alignment is not intelligent

The outcome depends on the details of the algorithm. Have you tried writing actual code?

If the code is literally "evaluate all options, choose the one that leads to more cups; if there is more than one such option, choose randomly", then the agent will choose randomly, because all options lead to the same number of cups. That's what the algorithm literally says. Information like "at some moment the algorithm will change" has no impact on the predicted number of cups, which is literally the only thing the algorithm cares about.

When at midnight you delete this code and upload new code saying "evaluate all options, choose the one that leads to more paperclips; if there is more than one such option, choose randomly", the agent will start the factory (if it wasn't started already), because now that is what the code says.

The thing you probably imagine is that the agent has a variable called "utility" and chooses the option that leads to the highest predicted value of that variable. That is not the same as an agent that tries to maximize cups. This agent would be a variable-called-utility maximizer.

(Also, come on, LLMs are notoriously bad at math, plus if you push them hard enough you can convince them of a lot of things.)
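The two quoted algorithms can be written out as a minimal Python sketch. The option names and predicted counts in `PREDICTED` are hypothetical, chosen only so that every option yields the same number of cups; they are not from the original discussion.

```python
import random

# Hypothetical world model: the predicted outcome of each available option.
# The factory makes paperclips, so no option changes the number of cups.
PREDICTED = {
    "start_factory": {"cups": 0, "paperclips": 100},
    "do_nothing": {"cups": 0, "paperclips": 0},
}


def best_options(goal):
    """Evaluate all options and keep those that lead to the most `goal`."""
    top = max(outcome[goal] for outcome in PREDICTED.values())
    return [opt for opt, outcome in PREDICTED.items() if outcome[goal] == top]


def choose(goal, rng=random):
    """If more than one option is best, choose randomly among them."""
    return rng.choice(best_options(goal))


# Before midnight: the goal is cups, so every option ties.
print(best_options("cups"))
# After midnight: the code is swapped for a paperclip maximizer.
print(choose("paperclips"))
```

Nothing in `choose("cups")` refers to paperclips or to the upcoming code swap, so under these assumptions the cup maximizer has no reason to prefer starting the factory.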

tsvibt on Passages I Highlighted in The Letters of J.R.R.Tolkien

Philology is philosophy, because it lets you escape the trap of the language you were born with. Much like mathematics, humanity's most ambitious such escape attempt, still very much in its infancy.

True...

If you really want to express the truth about what you feel and see, you need to be inventing new languages. And if you want to preserve a culture, you must not lose its language.

I think this is a mistake, made by many. It's a retreat and an abdication. We are in our native language, so we should work from there.

jonas-hallgren on How to use bright light to improve your life.

This has worked great btw! Thank you for the tip, I consistently get more deep sleep and around 10% more sleep with higher average quality, it's really good!

tailcalled on Crosspost: Developing the middle ground on polarized topics

If we think of the quantified abilities as the logarithms of the true abilities, then taking the log has likely massively increased the correlations by bringing the outliers into the bulk of the distribution.

sid-kap on Why you should be using a retinoid

Are you afraid of dry eyes/meibomian gland dysfunction at all? It seems like it's pretty common as a side effect of retinoids.

nostalgebraist on jbco's Shortform

AFAIK the distinction is that:

- When you condition on a particular outcome for $X$, it affects your probabilities for every other variable that's causally related to $X$, in either direction.
  - You gain information about variables that are causally downstream from $X$ (its "effects"). Like, if you imagine setting $X = x$ and then "playing the tape forward," you'll see the sorts of events that tend to follow from $X = x$ and not those that tend to follow from some other outcome $X = x'$.
  - And, you gain information about variables that are causally upstream from $X$ (its "causes"). If you know that $X = x$, then the causes of $X$ must have "added up to" that outcome for $X$. You can rule out any configuration of the causes that doesn't "add up to" causing $X = x$, and that affects your probability distributions for all of these causative variables.
- When you use the do-operator to set $X$ to a particular outcome, it only affects your probabilities for the "effects" of $X$, not the "causes." (The first sub-bullet above, not the second.)

For example, suppose hypothetically that I cook dinner every evening. And this process consists of these steps in order:

" $W$ ": considering what ingredients I have in the house
" $X$ ": deciding on a particular meal to make, and cooking it
" $Y$ ": eating the food
" $Z$ ": taking a moment after the meal to take stock of the ingredients left in the kitchen

Some days I have lots of ingredients, and I prepare elaborate dinners. Other days I don't, and I make simple and easy dinners.

Now, suppose that on one particular evening, I am making instant ramen ($X = \text{making instant ramen}$). We're given no other info about this evening, but we know this.

What can we conclude from this? A lot, it turns out:

In $Y$ , I'll be eating instant ramen, not something else.
In $W$ , I probably didn't have many ingredients in the house. Otherwise I would have made something more elaborate.
In $Z$ , I probably don't see many ingredients on the shelves (a result of what we know about $W$ ).

This is what happens when we condition on $X = \text{making instant ramen}$.

If instead we apply the do-operator to $X = \text{making instant ramen}$, then:

We learn nothing about $W$ , and from our POV it is still a sample from the original unconditional distribution for $W$ .
We can still conclude that I'll be eating ramen afterwards, in $Y$ .
We know very little about $Z$ (the post-meal ingredient survey) for the same reason we know nothing about $W$ .

Concretely, this models a situation where I first survey my ingredients like usual, and am then forced to make instant ramen by some force outside the universe (i.e. outside our W/X/Y/Z causal diagram).

And this is a useful concept, because we often want to know what would happen if we performed just such an intervention!

That is, we want to know whether it's a good idea to add a new cause to the diagram, forcing some variable to have values we think lead to good outcomes.

To understand what would happen in such an intervention, it's wrong to condition on the outcome using the original, unmodified diagram – if we did that, we'd draw conclusions like "forcing me to make instant ramen would cause me to see relatively few ingredients on the shelves later, after dinner."
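The difference between conditioning and intervening in the dinner example can be checked with a small exact calculation. This is only a sketch under made-up assumptions: the probabilities below are invented for illustration, $Y$ is taken to equal $X$ (I eat what I cooked), and $Z$ is simplified to depend only on $W$.

```python
from fractions import Fraction
from itertools import product

# Illustrative numbers only (not from the comment above).
P_W = {"few": Fraction(1, 2), "many": Fraction(1, 2)}
P_X_GIVEN_W = {
    "few": {"ramen": Fraction(9, 10), "elaborate": Fraction(1, 10)},
    "many": {"ramen": Fraction(1, 10), "elaborate": Fraction(9, 10)},
}
P_Z_GIVEN_W = {
    "few": {"few": Fraction(9, 10), "many": Fraction(1, 10)},
    "many": {"few": Fraction(1, 5), "many": Fraction(4, 5)},
}


def joint(do_x=None):
    """Joint distribution over (w, x, y, z); do_x cuts the W -> X arrow."""
    table = {}
    for w, x, z in product(P_W, ("ramen", "elaborate"), ("few", "many")):
        if do_x is None:
            p_x = P_X_GIVEN_W[w][x]  # X chosen as usual, based on W
        else:
            p_x = Fraction(int(x == do_x))  # X forced from outside the diagram
        table[(w, x, x, z)] = P_W[w] * p_x * P_Z_GIVEN_W[w][z]
    return table


def prob(table, event, given=lambda o: True):
    """P(event | given) computed by summing over the joint table."""
    den = sum(p for o, p in table.items() if given(o))
    num = sum(p for o, p in table.items() if given(o) and event(o))
    return num / den


def made_ramen(o):
    return o[1] == "ramen"


observed = joint()
forced = joint(do_x="ramen")

cond_w = prob(observed, lambda o: o[0] == "few", made_ramen)
do_w = prob(forced, lambda o: o[0] == "few")
cond_z = prob(observed, lambda o: o[3] == "few", made_ramen)
do_z = prob(forced, lambda o: o[3] == "few")

print("P(W=few | X=ramen)     =", cond_w)
print("P(W=few | do(X=ramen)) =", do_w)
print("P(Z=few | X=ramen)     =", cond_z)
print("P(Z=few | do(X=ramen)) =", do_z)
```

With these numbers, conditioning on ramen raises the probability of a sparse kitchen both before and after dinner, while forcing ramen leaves $W$ and $Z$ at their unconditional probabilities. In both cases $Y$ is ramen with certainty.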

johnswentworth on leogao's Shortform

I have heard people say this so many times, and it is consistently the opposite of my experience. The random spontaneous conversations at conferences are disproportionately shallow and tend toward the same things which have been discussed to death online already, or toward the things which seem simple enough that everyone thinks they have something to say on the topic. When doing an activity with friends, it's usually the activity which is novel and/or interesting, while the conversation tends to be shallow and playful and fun but not as substantive as the activity. At work, spontaneous conversations generally had little relevance to the actual things we were/are working on (there are some exceptions, but they're rarely as high-value as ordinary work).