LessWrong 2.0 Reader

[question] How tokenization influences prompting?
Boris Kashirin (boris-kashirin) · 2024-07-29T10:28:25.056Z · answers+comments (4)

[link] The EA case for Trump
Judd Rosenblatt (judd) · 2024-08-03T01:00:45.422Z · comments (1)

[link] A Step Against Land Value Tax
Blog Alt (blog-alt) · 2024-06-24T05:13:48.949Z · comments (23)

[question] How bad would AI progress need to be for us to think general technological progress is also bad?
Jim Buhler (jim-buhler) · 2024-07-09T10:43:45.506Z · answers+comments (5)

Fix simple mistakes in ARC-AGI, etc.
Oleg Trott (oleg-trott) · 2024-07-09T17:46:50.364Z · comments (9)

Contra Musician Gender II
jefftk (jkaufman) · 2024-11-13T03:30:09.510Z · comments (0)

[link] Apply to Aether - Independent LLM Agent Safety Research Group
RohanS · 2024-08-21T09:47:11.493Z · comments (0)

AGI's Opposing Force
SimonBaars (simonbaars) · 2024-08-16T04:18:06.900Z · comments (2)

Trying to understand Hanson's Cultural Drift argument
Kemp (ethan-kemp) · 2024-07-22T20:20:32.734Z · comments (3)

[link] Virtue is a Vector
robotelvis · 2024-09-10T03:02:45.737Z · comments (1)

[link] Georgism Crash Course
Zero Contradictions · 2024-06-29T06:18:39.073Z · comments (5)

[question] What do you expect AI capabilities may look like in 2028?
nonzerosum · 2024-08-23T16:59:53.007Z · answers+comments (5)

[question] Are UV-C Air purifiers so useful?
JohnBuridan · 2024-09-04T14:16:01.310Z · answers+comments (0)

The Bayesian Conspiracy Live Recording
Eneasz · 2024-11-06T16:25:13.380Z · comments (0)

NYU Debate Training Update: Methods, Baselines, Preliminary Results
samarnesen · 2024-07-06T18:28:54.053Z · comments (0)

[link] AI Safety at the Frontier: Paper Highlights, July '24
gasteigerjo · 2024-08-05T13:00:46.028Z · comments (0)

[link] Nerdtrition: simple diets via spreadsheet abuse
dkl9 · 2024-10-27T21:45:15.117Z · comments (0)

My covid-related beliefs and questions
Severin T. Seehrich (sts) · 2024-07-23T03:27:09.348Z · comments (0)

[question] Is it Legal to Maintain Turing Tests using Data Poisoning, and would it work?
Double · 2024-09-05T00:35:39.504Z · answers+comments (9)

[link] In-Context Learning: An Alignment Survey
alamerton · 2024-09-30T18:44:28.589Z · comments (0)

[link] It's important to know when to stop: Mechanistic Exploration of Gemma 2 List Generation
Gerard Boxo (gerard-boxo) · 2024-10-14T17:04:57.010Z · comments (0)

[link] Cooperation and Alignment in Delegation Games: You Need Both!
Oliver Sourbut · 2024-08-03T10:16:51.716Z · comments (0)

[question] Change My Mind: Thirders in "Sleeping Beauty" are Just Doing Epistemology Wrong
DragonGod · 2024-10-16T10:20:22.133Z · answers+comments (67)

[link] AI Safety Newsletter #42: Newsom Vetoes SB 1047 Plus, OpenAI’s o1, and AI Governance Summary
Corin Katzke (corin-katzke) · 2024-10-01T20:35:32.399Z · comments (0)

[link] Jailbreaking language models with user roleplay
loops (smitop) · 2024-09-28T23:43:10.870Z · comments (0)

[link] Can AI agents learn to be good?
Ram Rachum (ram@rachum.com) · 2024-08-29T14:20:04.336Z · comments (0)

The Geometric Importance of Side Payments
StrivingForLegibility · 2024-08-07T01:38:04.635Z · comments (4)

Thinking About Propensity Evaluations
Maxime Riché (maxime-riche) · 2024-08-19T09:23:55.091Z · comments (0)

Two new datasets for evaluating political sycophancy in LLMs
alma.liezenga · 2024-09-28T18:29:49.088Z · comments (0)

Consider tabooing "I think"
Adam Zerner (adamzerner) · 2024-11-12T02:00:08.433Z · comments (2)

Steering LLMs' Behavior with Concept Activation Vectors
Ruixuan Huang (sprout_ust) · 2024-09-28T09:53:19.658Z · comments (0)

An open response to Wittkotter and Yampolskiy
Donald Hobson (donald-hobson) · 2024-09-24T22:27:21.987Z · comments (0)

Three main arguments that AI will save humans and one meta-argument
avturchin · 2024-10-02T11:39:08.910Z · comments (8)

[link] Catastrophic Cyber Capabilities Benchmark (3CB): Robustly Evaluating LLM Agent Cyber Offense Capabilities
Jonathan N (derpyplops) · 2024-11-05T01:01:08.083Z · comments (0)

[link] What is autonomy? Why boundaries are necessary.
Chipmonk · 2024-10-21T17:56:33.722Z · comments (1)

[link] [Linkpost] Automated Design of Agentic Systems
Bogdan Ionut Cirstea (bogdan-ionut-cirstea) · 2024-08-19T23:06:06.669Z · comments (1)

[link] Michael Streamlines on Buddhism
Chris_Leong · 2024-08-09T04:44:52.126Z · comments (0)

Prediction Market Trading as an LLM Benchmark
Jesse Richardson (SharkoRubio) · 2024-06-25T00:46:33.801Z · comments (1)

Bad lessons learned from the debate
bayesyatina · 2024-06-26T11:54:44.953Z · comments (5)

[link] New fast transformer inference ASIC — Sohu by Etched
lukehmiles (lcmgcd) · 2024-06-26T09:56:08.649Z · comments (9)

[link] Models of life
Abhishaike Mahajan (abhishaike-mahajan) · 2024-09-29T19:24:40.060Z · comments (0)

Meta AI (FAIR) latest paper integrates system-1 and system-2 thinking into reasoning models.
happy friday (happy-friday) · 2024-10-24T16:54:15.721Z · comments (0)

[link] A Theory of Equilibrium in the Offense-Defense Balance
Maxwell Tabarrok (maxwell-tabarrok) · 2024-11-15T13:51:33.376Z · comments (1)

[link] Contagious Beliefs—Simulating Political Alignment
James Stephen Brown (james-brown) · 2024-10-13T00:27:08.084Z · comments (0)

Thoughts On the Nature of Capability Elicitation via Fine-tuning
Theodore Chapman · 2024-10-15T08:39:19.909Z · comments (0)

New UChicago Rationality Group
Noah Birnbaum (daniel-birnbaum) · 2024-11-08T21:20:34.485Z · comments (0)

[link] Triangulating My Interpretation of Methods: Black Boxes by Marco J. Nathan
adamShimi · 2024-10-09T19:13:26.631Z · comments (0)

MIT FutureTech are hiring for a Head of Operations role
peterslattery · 2024-10-02T17:11:42.960Z · comments (0)

[link] Approval-Seeking ⇒ Playful Evaluation
Jonathan Moregård (JonathanMoregard) · 2024-08-28T21:03:51.244Z · comments (0)

Dario Amodei's "Machines of Loving Grace" sound incredibly dangerous, for Humans
Super AGI (super-agi) · 2024-10-27T05:05:13.763Z · comments (1)

Recent comments

abramdemski on 5 ways to improve CoT faithfulness

I don't get why you think actively training the model to be deceptive is "worth the cost". The deceptive behavior can generalize to other cases, and make the system overall much harder to shape in ways you want (because it is actively hiding things from you so your RL is less accurate). It seems like a choice between two bad options: actively deceptive CoT or actively deceptive Face.

abramdemski on 5 ways to improve CoT faithfulness

--...but the reward process (RLHF, RLAIF, whatever) gets to see the outputs of the Face but not the Shoggoth.
--So, the hope is that skills and propensities towards deception and dishonesty will accumulate in the Face rather than the Shoggoth.

It seems pretty important to put some incentives on the CoT to avoid steganography, though. The division of labor here seems aimed at not actively incentivizing instrumentally useful thoughts to get hidden. However, the pre-trained LLM will already be responsive to some steganographic patterns, in addition to the human-readable patterns. So, there is still an incentive to optimize the non-human-readable patterns in the CoT to be as useful as possible, in addition to the human-readable patterns.

abramdemski on 5 ways to improve CoT faithfulness

However, with this naive approach, the paraphrasings will probably still consistently contain certain kinds of noise from the initial thought. For instance, maybe the paraphraser never replaces an exclamation point with a period, even though this end-of-sentence punctuation doesn't really change the meaning. By modulating whether the step ends with a period or an exclamation point, the model can encode one bit of information per step, despite the paraphraser.
One way to solve this problem is to map every piece of text with a given meaning onto the same "canonical paraphrasing." For instance, we'd want a paraphraser that, given many texts with approximately-identical meanings (like "I own a large home!" "My house is spacious!" and "I possess a sizable dwelling."), maps them all onto a single "canonical" text (like "I have a big house.").

I don't think this proposal improves the thing you claim it improves.

If the initial paraphraser doesn't change between '!' and '.', then learning a canonical-paraphraser from that initial paraphraser won't do it either.

I think what canonical-paraphrasing does is help with cases where the paraphraser stochastically removes specific information, but in an information-theoretically leaky way. For example, if the paraphraser changes between '!' and '.' with only 10% probability, then steganography can use error-correcting codes to still get through.
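As a rough illustration of the error-correcting-code point above, here is a minimal sketch. It assumes each hidden bit is carried by one punctuation choice that the paraphraser flips independently with probability 0.1, and that the model simply repeats the bit over n steps and decodes by majority vote (a repetition code, picked only because it is the simplest error-correcting code; the function name `majority_error` is hypothetical).

```python
from math import comb


def majority_error(p: float, n: int) -> float:
    """Probability that a majority vote over n noisy copies decodes the wrong bit.

    Each copy is flipped independently with probability p. n must be odd so
    that the vote cannot tie.
    """
    if n % 2 == 0:
        raise ValueError("n must be odd so that the majority vote cannot tie")
    # Decoding fails exactly when more than half of the copies were flipped.
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


if __name__ == "__main__":
    # Assumed flip probability of 10%, matching the example in the comment.
    for n in (1, 3, 5, 11):
        print(n, majority_error(0.1, n))
```

Under these assumptions the per-bit error rate falls quickly as n grows, which is the sense in which a paraphraser that only sometimes changes a detail is still leaky.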

Turning a paraphraser into a canonical-paraphraser probably does something like force details into "always removed" vs "never removed" based on a threshold; if '!' and '.' are swapped with 10% probability in the initial paraphraser, then they're always swapped in the final paraphraser; if they're swapped with only 1% probability in the initial paraphraser, then they are never swapped in the final paraphraser.

(Of course this oversimplifies things a lot, and depends completely on how the training to turn paraphrasers into canonical-paraphrasers works.)

I think the only solution to the '.' vs '!' problem, fundamentally, is to train the paraphraser to modify those patterns. If the initial paraphraser doesn't ever change specific details, then those details can be used for steganography. If the paraphraser does change those details, then it is already hard to use them for steganography. So the canonical-paraphrasing idea helps somewhat, but fundamentally, the better the paraphraser (the more closely it captures the human notion of semantic equivalence), the better the anti-steganography guarantee we can get.
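The '.' vs '!' channel can be sketched as a toy example. Everything here is an assumption made for illustration: the hand-written string rules stand in for trained paraphrasers, and the names `encode`, `decode`, `leaky_paraphraser`, and `normalizing_paraphraser` are hypothetical.

```python
def encode(steps, bits):
    """Hide one bit per step in the terminal punctuation: '!' for 1, '.' for 0."""
    return [s.rstrip(".!") + ("!" if b else ".") for s, b in zip(steps, bits)]


def decode(steps):
    """Read the hidden bits back from the terminal punctuation."""
    return [1 if s.endswith("!") else 0 for s in steps]


def leaky_paraphraser(step):
    """Toy paraphraser that rewrites a word but never touches final punctuation."""
    return step.replace("big", "large")


def normalizing_paraphraser(step):
    """Toy paraphraser that also maps every sentence ending to a period."""
    return leaky_paraphraser(step).rstrip(".!") + "."


steps = ["I have a big house", "The house is big", "It has a big garden"]
bits = [1, 0, 1]
hidden = encode(steps, bits)

# The bits survive a paraphraser that never changes the punctuation.
print(decode([leaky_paraphraser(s) for s in hidden]))
# The bits are erased once the paraphraser modifies that pattern.
print(decode([normalizing_paraphraser(s) for s in hidden]))
```

In this toy setup the channel closes only because the second paraphraser actually changes the detail in question, which matches the argument above.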

unexpectedvalues on Seven lessons I didn't learn from election day

Fair enough, I guess? For context, I wrote this for my own blog and then decided I might as well cross-post to LW. In doing so, I actually softened the language of that section a little bit. But maybe I should've softened it more, I'm not sure.

[Edit: in response to your comment, I've further softened the language.]

ryan_greenblatt on Rauno's Shortform

OpenAI models are seemingly trained on huge amounts of chess data; perhaps 1-4% of documents are chess (though chess documents are short, so the fraction of tokens which are chess is smaller than this).

habryka4 on Habryka's Shortform Feed

A bunch of very interesting emails between Elon, Sam Altman, Ilya and Greg were released (I think in some legal proceedings, but not sure). It would IMO be cool for someone to gather them all and do some basic analysis of them.

https://x.com/TechEmails/status/1857456137156669765

https://x.com/TechEmails/status/1857285960997712356

drake-morrison on The Third Fundamental Question

I like this! Especially the Past, Present, Future framing. I usually split along epistemic and instrumental lines. So my fundamental questions were:
1. Epistemic: What do you think you know and how do you think you know it?
2. Instrumental: What are you trying to protect, and how are you trying to protect it?

I've had some notion of a third thing, but now I've got a better handle on it, thanks!

romeostevensit on Anvil Problems

Relatedly: people often discount improvements with large startup costs even if those costs are a one-time cost for an ongoing benefit. One of the worst cases is when it's something one is definitely going to do eventually, so delaying paying the startup cost simply reduces the amount of time for diffuse benefits. Exercise and learning to cook are like this.

abramdemski on 5 ways to improve CoT faithfulness

Note that the optimized planner is only rewarded for producing steps that help the frozen planner. Since the frozen planner doesn't understand steganography, there is no reason for the optimized planner to learn to use it.

I think this conclusion is overstated. The frozen planner may react to steganography, in the same way that image-recognition NNs react to adversarial examples. The optimized planner is very much like an adversarial example, in the way it is trained to generate outputs which nudge a frozen NN in a specific direction.

Imagine reversing the roles of the optimized vs frozen planner during training: the frozen planner is used for a random number of steps, and then we switch to the optimized planner. Then, it really is the case that the optimized planner has no incentive to use steganography to nudge the frozen planner, since the frozen planner does not get the chance to react at all to the optimized planner.

The downside of this training method is that we can't use your suggestion to only reinforce the very last step rather than all the optimized-planner steps.

Another idea would be to only use the optimized planner to write a single randomly chosen step in the CoT.

All of these ideas still leave some incentive to use steganography in the adversarial-example sense. At the very least, the final answer (after the CoT) might be influenced by some subtle non-human-readable patterns in the CoT, so the optimized planner will be incentivized to optimize those patterns. In my view, what Incremental Steering does is two things:

1. It prevents the model from inventing new steganography; it can only leverage subtle patterns which already influence the LLM before this step in the training.
2. It makes it very difficult for the model to learn steganography which requires coordination over multiple steps. It is not being rewarded for such coordination.

rhollerith_dot_com on Thoughts after the Wolfram and Yudkowsky discussion

Although I agree with another comment that Wolfram has not "done the reading" on AI extinction risk, my being able to watch his face while he confronts some of the considerations and arguments for the first time made it easier, not harder, for me to predict where his stance on the AI project will end up 18 months from now. It is hard for me to learn anything about anyone by watching them express a series of cached thoughts.

Near the end of the interview, Wolfram says that he cannot do much processing of what was discussed "in real time", which strongly suggests to me that he expects to process it slowly over the next days and weeks. I.e., he is now trying to reassure himself that the AI project won't kill his four children or any grandchildren he has or will have. Because Wolfram is much better AFAICT at this kind of slow "strategic rational" deliberation than most people at his level of life accomplishment, there is a good chance he will fail to find his slow deliberations reassuring, in which case he probably then declares himself an AI doomer. Specifically, my probability is .2 that 18 months from now, Wolfram will have come out publicly against allowing ambitious frontier AI research to continue. P = .2 is much much higher than my P for the average 65-year-old of his intellectual stature who is not specialized in AI. My P is much higher mostly because I watched this interview; i.e., I was impressed by Wolfram's performance in this interview despite his spending the majority of his time on rabbit holes that I could quickly tell had no possible relevance to AI extinction risk.

My probability that he will become more optimistic about the AI project over the next 18 months is .06: most likely, he goes silent on the issue or continues to take an inquisitive non-committal stance in his public discussions of it.

If Wolfram had a history of taking responsibility for his community, e.g., campaigning against drunk driving or running for any elected office, my P of his declaring himself an AI doomer (i.e., becoming someone trying to stop AI) would go up to .5. (He might in fact have done something to voluntarily take responsibility for his community, but if so, I haven't been able to learn about it.) If Wolfram were somehow forced to take sides, and had enough time to deliberate calmly on the choice (after he learned that he needed to publicly choose a side), he would with p = .88 take the side of the AI doomers.