LessWrong 2.0 Reader

View: New · Old · Top

Restrict date range: Today · This week · This month · Last three months · This year · All time

← previous page (newer posts) · next page (older posts) →

Intent alignment as a stepping-stone to value alignment
Seth Herd · 2024-11-05T20:43:24.950Z · comments (4)

An argument that consequentialism is incomplete
cousin_it · 2024-10-07T09:45:12.754Z · comments (27)

Comparing Quantized Performance in Llama Models
NickyP (Nicky) · 2024-07-15T16:01:24.960Z · comments (2)

5 ways to improve CoT faithfulness
CBiddulph (caleb-biddulph) · 2024-10-05T20:17:12.637Z · comments (12)

A path to human autonomy
Nathan Helm-Burger (nathan-helm-burger) · 2024-10-29T03:02:42.475Z · comments (12)

Open Thread Fall 2024
habryka (habryka4) · 2024-10-05T22:28:50.398Z · comments (110)

Resolving von Neumann-Morgenstern Inconsistent Preferences
niplav · 2024-10-22T11:45:20.915Z · comments (5)

Bay Winter Solstice 2024: Speech Auditions
ozymandias · 2024-11-04T22:31:38.680Z · comments (0)

Book Review: What Even Is Gender?
Joey Marcellino · 2024-09-01T16:09:27.773Z · comments (14)

AI labs can boost external safety research
Zach Stein-Perlman · 2024-07-31T19:30:16.207Z · comments (1)

The slingshot helps with learning
Wilson Wu (wilson-wu) · 2024-10-31T23:18:16.762Z · comments (0)

[link] Stone Age Herbalist's notes on ant warfare and slavery
trevor (TrevorWiesinger) · 2024-11-09T02:40:01.128Z · comments (0)

Apply to MATS 7.0!
Ryan Kidd (ryankidd44) · 2024-09-21T00:23:49.778Z · comments (0)

Balancing Label Quantity and Quality for Scalable Elicitation
Alex Mallen (alex-mallen) · 2024-10-24T16:49:00.939Z · comments (1)

Context-dependent consequentialism
Jeremy Gillen (jeremy-gillen) · 2024-11-04T09:29:24.310Z · comments (6)

[question] What's the Deal with Logical Uncertainty?
Ape in the coat · 2024-09-16T08:11:43.588Z · answers+comments (23)

[link] What is it like to be psychologically healthy? Podcast ft. DaystarEld
Chipmonk · 2024-10-05T19:14:04.743Z · comments (8)

A more systematic case for inner misalignment
Richard_Ngo (ricraz) · 2024-07-20T05:03:03.500Z · comments (4)

Music in the AI World
Martin Sustrik (sustrik) · 2024-08-16T04:20:01.706Z · comments (8)

[LDSL#6] When is quantification needed, and when is it hard?
tailcalled · 2024-08-13T20:39:45.481Z · comments (0)

Extracting SAE task features for in-context learning
Dmitrii Kharlapenko (dmitrii-kharlapenko) · 2024-08-12T20:34:13.747Z · comments (1)

[question] When is reward ever the optimization target?
Noosphere89 (sharmake-farah) · 2024-10-15T15:09:20.912Z · answers+comments (12)

[link] Epistemic states as a potential benign prior
Tamsin Leake (carado-1) · 2024-08-31T18:26:14.093Z · comments (2)

[LDSL#1] Performance optimization as a metaphor for life
tailcalled · 2024-08-08T16:16:27.349Z · comments (4)

Incentive design and capability elicitation
Joe Carlsmith (joekc) · 2024-11-12T20:56:05.088Z · comments (0)

Inference-Only Debate Experiments Using Math Problems
Arjun Panickssery (arjun-panickssery) · 2024-08-06T17:44:27.293Z · comments (0)

SAE Probing: What is it good for? Absolutely something!
Subhash Kantamneni (subhashk) · 2024-11-01T19:23:55.418Z · comments (0)

Meme Talking Points
ymeskhout · 2024-11-06T15:27:54.024Z · comments (0)

Some comments on intelligence
Viliam · 2024-08-01T15:17:07.215Z · comments (5)

AI #74: GPT-4o Mini Me and Llama 3
Zvi · 2024-07-25T13:50:06.528Z · comments (6)

AI Constitutions are a tool to reduce societal scale risk
Sammy Martin (SDM) · 2024-07-25T11:18:17.826Z · comments (2)

[link] Safety tax functions
owencb · 2024-10-20T14:08:38.099Z · comments (0)

Fun With CellxGene
sarahconstantin · 2024-09-06T22:00:03.461Z · comments (2)

[link] [Paper] Hidden in Plain Text: Emergence and Mitigation of Steganographic Collusion in LLMs
Yohan Mathew (ymath) · 2024-09-25T14:52:48.263Z · comments (2)

[link] Baking vs Patissing vs Cooking, the HPS explanation
adamShimi · 2024-07-17T20:29:09.645Z · comments (16)

AI #85: AI Wins the Nobel Prize
Zvi · 2024-10-10T13:40:07.286Z · comments (6)

AIS terminology proposal: standardize terms for probability ranges
eggsyntax · 2024-08-30T15:43:39.857Z · comments (12)

[link] AI forecasting bots incoming
Dan H (dan-hendrycks) · 2024-09-09T19:14:31.050Z · comments (44)

[link] Liquid vs Illiquid Careers
vaishnav92 · 2024-10-20T23:03:49.725Z · comments (6)

[question] What Other Lines of Work are Safe from AI Automation?
RogerDearnaley (roger-d-1) · 2024-07-11T10:01:12.616Z · answers+comments (35)

Examples of How I Use LLMs
jefftk (jkaufman) · 2024-10-14T17:10:04.597Z · comments (2)

[question] Where to find reliable reviews of AI products?
Elizabeth (pktechgirl) · 2024-09-17T23:48:25.899Z · answers+comments (6)

[link] Why Recursion Pharmaceuticals abandoned cell painting for brightfield imaging
Abhishaike Mahajan (abhishaike-mahajan) · 2024-11-05T14:51:41.310Z · comments (1)

Searching for phenomenal consciousness in LLMs: Perceptual reality monitoring and introspective confidence
EuanMcLean (euanmclean) · 2024-10-29T12:16:18.448Z · comments (7)

[LDSL#4] Root cause analysis versus effect size estimation
tailcalled · 2024-08-11T16:12:14.604Z · comments (0)

Paper Summary: Princes and Merchants: European City Growth Before the Industrial Revolution
Jeffrey Heninger (jeffrey-heninger) · 2024-07-15T21:30:04.043Z · comments (1)

[link] My Methodological Turn
adamShimi · 2024-09-29T15:01:45.986Z · comments (0)

Trading Candy
jefftk (jkaufman) · 2024-11-01T01:10:08.024Z · comments (4)

Winning isn't enough
JesseClifton · 2024-11-05T11:37:39.486Z · comments (14)

[link] Arithmetic Models: Better Than You Think
kqr · 2024-10-26T09:42:07.185Z · comments (5)

← previous page (newer posts) · next page (older posts) →

Archive

Recent comments

abramdemski on 5 ways to improve CoT faithfulness

I don't get why you think actively training the model to be deceptive is "worth the cost". The deceptive behavior can generalize to other cases, and make the system overall much harder to shape in ways you want (because it is actively hiding things from you so your RL is less accurate). It seems like a choice between two bad options: actively deceptive CoT or actively deceptive Face.

abramdemski on 5 ways to improve CoT faithfulness

--...but the reward process (RLHF, RLAIF, whatever) gets to see the outputs of the Face but not the Shoggoth.
--So, the hope is that skills and propensities towards deception and dishonesty will accumulate in the Face rather than the Shoggoth.

It seems pretty important to put some incentives on the CoT to avoid steganography, though. The division of labor here seems aimed at not actively incentivizing instrumentally useful thoughts to get hidden. However, the pre-trained LLM will already be responsive to some steganographic patterns, in addition to the human-readable patterns. So, there is still an incentive optimizing the non-human-readable patterns in the CoT to be as useful as possible, in addition to the human-readable patterns.

abramdemski on 5 ways to improve CoT faithfulness

However, with this naive approach, the paraphrasings will probably still consistently contain certain kinds of noise from the initial thought. For instance, maybe the paraphraser never replaces an exclamation point with a period, even though this end-of-sentence punctuation doesn't really change the meaning. By modulating whether the step ends with a period or an exclamation point, the model can encode one bit of information per step, despite the paraphraser.
One way to solve this problem is to map every piece of text with a given meaning onto the same "canonical paraphrasing." For instance, we'd want a paraphraser that, given many texts with approximately-identical meanings (like "I own a large home!" "My house is spacious!" and "I possess a sizable dwelling."), maps them all onto a single "canonical" text (like "I have a big house.").

I don't think this proposal improves the thing you claim it improves.

If the initial paraphraser doesn't change between '!' and '.', then learning a canonical-paraphraser from that initial paraphraser won't do it either.

I think what canonical-paraphrasing does is help with cases where the paraphraser stochastically removes specific information, but in an information-theoretically leaky way. For example, if the paraphraser changes between '!' and '.' with only 10% probability, then steganography can use error-correcting codes to still get through.

Turning a paraphraser into a canonical-paraphraser probably does something like force details into "always removed" vs "never removed" based on a threshhold; if '!' and '.' are swapped with 10% probability in the initial paraphraser, then they're always swapped in the final paraphraser; if they're swapped with only 1% probability in the initial paraphraser, then they are never swapped in the final paraphraser.

(Of course this oversimplifies things a lot, and depends completely on how the training to turn paraphrasers into canonical-paraphrasers works.)

I think the only solution to the '.' vs '!' problem, fundamentally, is to train the paraphraser to modify those patterns. If the initial paraphraser doesn't ever change specific details, then those details can be used for steganography. If the paraphraser does change those details, then it is already hard to use them for steganography. So the canonical-paraphrasing idea helps somewhat, but fundamentally, the better the paraphraser (the more closely it captures the human notion of semantic equivalence), the better the anti-steganography guarantee we can get.

unexpectedvalues on Seven lessons I didn't learn from election day

Fair enough, I guess? For context, I wrote this for my own blog and then decided I might as well cross-post to LW. In doing so, I actually softened the language of that section a little bit. But maybe I should've softened it more, I'm not sure.

[Edit: in response to your comment, I've further softened the language.]

ryan_greenblatt on Rauno's Shortform

OpenAI models are seemingly trained on huge amounts of chess data [LW(p) · GW(p)], perhaps 1-4% of documents are chess (though chess documents are short, so the fraction of tokens which are chess is smaller than this).

habryka4 on Habryka's Shortform Feed

A bunch of very interesting emails between Elon, Sam Altman, Ilya and Greg were released (I think in some legal proceedings, but not sure). It would IMO be cool for someone to gather them all and do some basic analysis of them.

https://x.com/TechEmails/status/1857456137156669765

https://x.com/TechEmails/status/1857285960997712356

drake-morrison on The Third Fundamental Question

I like this! Especially the Past, Present, Future framing. I usually split along epistemic and instrumental lines. So my fundamental questions were:
1. Epistemic: What do you think you know and how do you think you know it?
2. Instrumental: What are you trying to protect, and how are you trying to protect it?

I've had some notion of a third thing, but now I've got a better handle on it, thanks!

romeostevensit on Anvil Problems

Relatedly: people often discount improvements with large startup costs even if those costs are one time cost for an ongoing benefit. One of the worst is when it's something one is definitely going to do eventually, so delaying paying the startup cost is simply reducing the amount of time for diffuse benefits. Exercise and learning to cook are like this.

abramdemski on 5 ways to improve CoT faithfulness

Note that the optimized planner is only rewarded for producing steps that help the frozen planner. Since the frozen planner doesn't understand steganography, there is no reason for the optimized planner to learn to use it.

I think this conclusion is overstated. The frozen planner may react to steganography, in the same way that image-recognition NNs react to adversarial examples. The optimized planner is very much like an adversarial example, in the way it is trained to generate outputs which nudge a frozen NN in a specific direction.

Imagine reversing the roles of the optimized vs frozen planner during training: the frozen planner is used for a random number of steps, and then we switch to the optimized planner. Then, it really is the case that the optimized planner has no incentive to use stegonography to nudge the frozen planner, since the frozen planner does not get the chance to react at all to the optimized planner.

The downside of this training method is that we can't use your suggestion to only reinforce the very last step rather than all the optimized-planner steps.

Another idea would be to only use the optimized planner to write a single randomly chosen step in the CoT.

All of these ideas still leave some incentive to use steganography in the adversarial-example sense. At the very least, the final answer (after the CoT) might be influenced by some subtle non-human-readable patterns in the CoT, so the optimized planner will be incentivized to optimize those patterns. In my view, what Incremental Steering does is two things:

It prevents the model from inventing new steganography; it can only leverage subtle patterns which already influence the LLM before this step in the training.
It makes it very difficult for the model to learn steganography which requires coordination over multiple steps. It is not being rewarded for such coordination.

rhollerith_dot_com on Thoughts after the Wolfram and Yudkowsky discussion

Although I agree with another comment that Wolfram has not "done the reading" on AI extinction risk, my being able to watch his face while he confronts some of the considerations and arguments for the first time made it easier, not harder, for me to predict where his stance on the AI project will end up 18 months from now. It is hard for me to learn anything about anyone by watching them express a series of cached thoughts.

Near the end of the interview, Wolfram say that he cannot do much processing of what was discussed "in real time", which strongly suggests to me that he expects to process it slowly over the next days and weeks. I.e., he is now trying to reassure himself that the AI project won't kill his four children or any grandchildren he has or will have. Because Wolfram is much better AFAICT at this kind of slow "strategic rational" deliberation than most people at his level of life accomplishment, there is a good chance he will fail to find his slow deliberations reassuring, in which case he probably then declares himself an AI doomer. Specifically, my probability is .2 that 18 months from now, Wolfram will have come out publicly against allowing ambitious frontier AI research to continue. P = .2 is much much higher than my P for the average 65-year-old of his intellectual stature who is not specialized in AI. My P is much higher mostly because I watched this interview; i.e., I was impressed by Wolfram's performance in this interview despite his spending the majority of his time on rabbit holes than I could quickly tell had no possible relevance to AI extinction risk.

My probability that he will become more optimistic about the AI project over the next 18 months is .06: mostly likely, he goes silent on the issue or continues to take an inquisitive non-committal stance in his public discussions of it.

If Wolfram had a history of taking responsibility for his community, e.g., campaigning against drunk driving or running for any elected office, my P of his declaring himself an AI doomer (i.e., becoming someone trying to stop AI) would go up to .5. (He might in fact have done something to voluntarily take responsibility for his community, but if so, I haven't been able to learn about it.) If Wolfram were somehow forced to take sides, and had enough time to deliberate calmly on the choice (after he learned that he needed to publicly choose a side), he would with p = .88 take the side of the AI doomers.