LessWrong 2.0 Reader

View: New · Old · Top

Restrict date range: Today · This week · This month · Last three months · This year · All time

← previous page (newer posts) · next page (older posts) →

Goal-Completeness is like Turing-Completeness for AGI
Liron · 2023-12-19T18:12:29.947Z · comments (26)

The Shortest Path Between Scylla and Charybdis
Thane Ruthenis · 2023-12-18T20:08:34.995Z · comments (8)

They are made of repeating patterns
quetzal_rainbow · 2023-11-13T18:17:43.189Z · comments (4)

GPT-2030 and Catastrophic Drives: Four Vignettes
jsteinhardt · 2023-11-10T07:30:06.480Z · comments (5)

[link] Can AI Outpredict Humans? Results From Metaculus's Q3 AI Forecasting Benchmark
ChristianWilliams · 2024-10-10T18:58:46.041Z · comments (2)

Applications of Chaos: Saying No (with Hastings Greer)
Elizabeth (pktechgirl) · 2024-09-21T16:30:07.415Z · comments (16)

AI #82: The Governor Ponders
Zvi · 2024-09-19T13:30:04.863Z · comments (8)

On Complexity Science
Garrett Baker (D0TheMath) · 2024-04-05T02:24:32.039Z · comments (19)

Unlearning via RMU is mostly shallow
Andy Arditi (andy-arditi) · 2024-07-23T16:07:52.223Z · comments (3)

[link] Finding Backward Chaining Circuits in Transformers Trained on Tree Search
abhayesian · 2024-05-28T05:29:46.777Z · comments (1)

Observations on Teaching for Four Weeks
ClareChiaraVincent · 2024-05-06T16:55:59.315Z · comments (14)

[link] on the dollar-yen exchange rate
bhauth · 2024-04-07T04:49:53.920Z · comments (21)

[link] Announcing Human-aligned AI Summer School
Jan_Kulveit · 2024-05-22T08:55:10.839Z · comments (0)

Paper in Science: Managing extreme AI risks amid rapid progress
JanB (JanBrauner) · 2024-05-23T08:40:40.678Z · comments (2)

Changes in College Admissions
Zvi · 2024-04-24T13:50:03.487Z · comments (11)

Consent across power differentials
Ramana Kumar (ramana-kumar) · 2024-07-09T11:42:03.177Z · comments (12)

Why you should learn a musical instrument
cata · 2024-05-15T20:36:16.034Z · comments (23)

Sherlockian Abduction Master List
Cole Wyeth (Amyr) · 2024-07-11T20:27:00.000Z · comments (63)

[link] DM Parenting
Shoshannah Tekofsky (DarkSym) · 2024-07-16T08:50:08.144Z · comments (4)

Interoperable High Level Structures: Early Thoughts on Adjectives
johnswentworth · 2024-08-22T21:12:38.223Z · comments (1)

AI #58: Stargate AGI
Zvi · 2024-04-04T13:10:06.342Z · comments (9)

The Dunning-Kruger of disproving Dunning-Kruger
kromem · 2024-05-16T10:11:33.108Z · comments (0)

[link] in defense of Linus Pauling
bhauth · 2024-06-03T21:27:43.962Z · comments (8)

[link] On scalable oversight with weak LLMs judging strong LLMs
zac_kenton (zkenton) · 2024-07-08T08:59:58.523Z · comments (18)

[link] Anthropic announces interpretability advances. How much does this advance alignment?
Seth Herd · 2024-05-21T22:30:52.638Z · comments (4)

So you want to work on technical AI safety
gw · 2024-06-24T14:29:57.481Z · comments (3)

An issue with training schemers with supervised fine-tuning
Fabien Roger (Fabien) · 2024-06-27T15:37:56.020Z · comments (12)

[LDSL#0] Some epistemological conundrums
tailcalled · 2024-08-07T19:52:55.688Z · comments (10)

AI #67: Brief Strange Trip
Zvi · 2024-06-06T18:50:03.514Z · comments (6)

Notes on control evaluations for safety cases
ryan_greenblatt · 2024-02-28T16:15:17.799Z · comments (0)

The Broken Screwdriver and other parables
bhauth · 2024-03-04T03:34:38.807Z · comments (1)

Bounty: Diverse hard tasks for LLM agents
Beth Barnes (beth-barnes) · 2023-12-17T01:04:05.460Z · comments (31)

Should rationalists be spiritual / Spirituality as overcoming delusion
Kaj_Sotala · 2024-03-25T16:48:08.397Z · comments (57)

Wrong answer bias
lukehmiles (lcmgcd) · 2024-02-01T20:05:38.573Z · comments (24)

[question] why did OpenAI employees sign
bhauth · 2023-11-27T05:21:28.612Z · answers+comments (23)

[link] Chapter 1 of How to Win Friends and Influence People
gull · 2024-01-28T00:32:52.865Z · comments (5)

[link] Evaluating Stability of Unreflective Alignment
james.lucassen · 2024-02-01T22:15:40.902Z · comments (10)

Low Probability Estimation in Language Models
Gabriel Wu (gabriel-wu) · 2024-10-18T15:50:05.947Z · comments (0)

[link] The Evals Gap
Marius Hobbhahn (marius-hobbhahn) · 2024-11-11T16:42:46.287Z · comments (7)

The Fragility of Life Hypothesis and the Evolution of Cooperation
KristianRonn · 2024-09-04T21:04:49.878Z · comments (6)

[link] Active Recall and Spaced Repetition are Different Things
Saul Munn (saul-munn) · 2024-11-08T20:14:56.092Z · comments (2)

AI and the Technological Richter Scale
Zvi · 2024-09-04T14:00:08.625Z · comments (8)

An alternative approach to superbabies
Towards_Keeperhood (Simon Skade) · 2024-11-05T22:56:15.740Z · comments (19)

Interested in Cognitive Bootcamp?
Raemon · 2024-09-19T22:12:13.348Z · comments (0)

[question] If I wanted to spend WAY more on AI, what would I spend it on?
Logan Zoellner (logan-zoellner) · 2024-09-15T21:24:46.742Z · answers+comments (16)

Evaluating the truth of statements in a world of ambiguous language.
Hastings (hastings-greer) · 2024-10-07T18:08:09.920Z · comments (19)

[link] Book review: Xenosystems
jessicata (jessica.liu.taylor) · 2024-09-16T20:17:56.670Z · comments (18)

[link] Contra Acemoglu on AI
Maxwell Tabarrok (maxwell-tabarrok) · 2024-06-28T13:13:15.796Z · comments (0)

Mechanistic Interpretability Workshop Happening at ICML 2024!
Neel Nanda (neel-nanda-1) · 2024-05-03T01:18:26.936Z · comments (6)

D&D.Sci: The Mad Tyrant's Pet Turtles [Evaluation and Ruleset]
abstractapplic · 2024-04-09T14:01:34.426Z · comments (6)

← previous page (newer posts) · next page (older posts) →

Archive

Recent comments

linch on Seven lessons I didn't learn from election day

Some people I know are much more pessimistic about the polls this cycle, due to herding. For example, nonresponse bias might just be massive for Trump voters (across demographic groups), so pollsters end up having to make a series of unprincipled choices with their thumbs on the scales.

abramdemski on o1 is a bad idea

I don't get why you think this is true? EG, it seems like almost no insights about how to train faithful CoT would transfer to systems speaking pure neuralese. It seems to me that what little safety/alignment we have for LLMs is largely a byproduct of how language-focused they are (which gives us a weak sort of interpretability, a very useful safety resource which we are at risk of losing soon).

To put it a different way: human imitation is a relatively safe outer-optimization target, so we should be concerned to the extent that the industry is moving away from it. It sounds like you think safety lessons from the human-imitation regime generalize beyond the human-imitation regime. Maybe I agree that we can derive some abstract lessons like "RL with really dense feedback can avoid instrumental convergence", but we're moving away from the regime where such dense feedback is available, so I don't see what lessons transfer.

abramdemski on 5 ways to improve CoT faithfulness

I don't get why you think actively training the model to be deceptive is "worth the cost". The deceptive behavior can generalize to other cases, and make the system overall much harder to shape in ways you want (because it is actively hiding things from you so your RL is less accurate). It seems like a choice between two bad options: actively deceptive CoT or actively deceptive Face.

abramdemski on 5 ways to improve CoT faithfulness

--...but the reward process (RLHF, RLAIF, whatever) gets to see the outputs of the Face but not the Shoggoth.
--So, the hope is that skills and propensities towards deception and dishonesty will accumulate in the Face rather than the Shoggoth.

It seems pretty important to put some incentives on the CoT to avoid steganography, though. The division of labor here seems aimed at not actively incentivizing instrumentally useful thoughts to get hidden. However, the pre-trained LLM will already be responsive to some steganographic patterns, in addition to the human-readable patterns. So, there is still an incentive optimizing the non-human-readable patterns in the CoT to be as useful as possible, in addition to the human-readable patterns.

abramdemski on 5 ways to improve CoT faithfulness

However, with this naive approach, the paraphrasings will probably still consistently contain certain kinds of noise from the initial thought. For instance, maybe the paraphraser never replaces an exclamation point with a period, even though this end-of-sentence punctuation doesn't really change the meaning. By modulating whether the step ends with a period or an exclamation point, the model can encode one bit of information per step, despite the paraphraser.
One way to solve this problem is to map every piece of text with a given meaning onto the same "canonical paraphrasing." For instance, we'd want a paraphraser that, given many texts with approximately-identical meanings (like "I own a large home!" "My house is spacious!" and "I possess a sizable dwelling."), maps them all onto a single "canonical" text (like "I have a big house.").

I don't think this proposal improves the thing you claim it improves.

If the initial paraphraser doesn't change between '!' and '.', then learning a canonical-paraphraser from that initial paraphraser won't do it either.

I think what canonical-paraphrasing does is help with cases where the paraphraser stochastically removes specific information, but in an information-theoretically leaky way. For example, if the paraphraser changes between '!' and '.' with only 10% probability, then steganography can use error-correcting codes to still get through.

Turning a paraphraser into a canonical-paraphraser probably does something like force details into "always removed" vs "never removed" based on a threshhold; if '!' and '.' are swapped with 10% probability in the initial paraphraser, then they're always swapped in the final paraphraser; if they're swapped with only 1% probability in the initial paraphraser, then they are never swapped in the final paraphraser.

(Of course this oversimplifies things a lot, and depends completely on how the training to turn paraphrasers into canonical-paraphrasers works.)

I think the only solution to the '.' vs '!' problem, fundamentally, is to train the paraphraser to modify those patterns. If the initial paraphraser doesn't ever change specific details, then those details can be used for steganography. If the paraphraser does change those details, then it is already hard to use them for steganography. So the canonical-paraphrasing idea helps somewhat, but fundamentally, the better the paraphraser (the more closely it captures the human notion of semantic equivalence), the better the anti-steganography guarantee we can get.

unexpectedvalues on Seven lessons I didn't learn from election day

Fair enough, I guess? For context, I wrote this for my own blog and then decided I might as well cross-post to LW. In doing so, I actually softened the language of that section a little bit. But maybe I should've softened it more, I'm not sure.

[Edit: in response to your comment, I've further softened the language.]

ryan_greenblatt on Rauno's Shortform

OpenAI models are seemingly trained on huge amounts of chess data [LW(p) · GW(p)], perhaps 1-4% of documents are chess (though chess documents are short, so the fraction of tokens which are chess is smaller than this).

habryka4 on Habryka's Shortform Feed

A bunch of very interesting emails between Elon, Sam Altman, Ilya and Greg were released (I think in some legal proceedings, but not sure). It would IMO be cool for someone to gather them all and do some basic analysis of them.

https://x.com/TechEmails/status/1857456137156669765

https://x.com/TechEmails/status/1857285960997712356

drake-morrison on The Third Fundamental Question

I like this! Especially the Past, Present, Future framing. I usually split along epistemic and instrumental lines. So my fundamental questions were:
1. Epistemic: What do you think you know and how do you think you know it?
2. Instrumental: What are you trying to protect, and how are you trying to protect it?

I've had some notion of a third thing, but now I've got a better handle on it, thanks!

romeostevensit on Anvil Problems

Relatedly: people often discount improvements with large startup costs even if those costs are one time cost for an ongoing benefit. One of the worst is when it's something one is definitely going to do eventually, so delaying paying the startup cost is simply reducing the amount of time for diffuse benefits. Exercise and learning to cook are like this.