LessWrong 2.0 Reader

Natural Latents Are Not Robust To Tiny Mixtures
johnswentworth · 2024-06-07T18:53:36.643Z · comments (8)

Offering AI safety support calls for ML professionals
Vael Gates · 2024-02-15T23:48:12.797Z · comments (1)

[link] DeepMind: Evaluating Frontier Models for Dangerous Capabilities
Zach Stein-Perlman · 2024-03-21T03:00:31.599Z · comments (8)

[link] electric turbofans
bhauth · 2024-11-02T22:50:59.807Z · comments (2)

The proper response to mistakes that have harmed others?
Ruby · 2023-12-31T04:06:31.505Z · comments (12)

Balancing Games
jefftk (jkaufman) · 2024-02-24T14:40:04.237Z · comments (18)

[question] We might be dropping the ball on Autonomous Replication and Adaptation.
Charbel-Raphaël (charbel-raphael-segerie) · 2024-05-31T13:49:11.327Z · answers+comments (30)

Inspired by: Failures in Kindness
X4vier · 2024-07-27T01:21:42.848Z · comments (2)

[link] on bacteria, on teeth
bhauth · 2024-09-30T15:56:56.830Z · comments (9)

AI #78: Some Welcome Calm
Zvi · 2024-08-22T14:20:10.812Z · comments (15)

[link] Research Report: Sparse Autoencoders find only 9/180 board state features in OthelloGPT
Robert_AIZI · 2024-03-05T13:55:33.483Z · comments (24)

AI Safety Chatbot
markov (markovial) · 2023-12-21T14:06:48.981Z · comments (11)

0th Person and 1st Person Logic
Adele Lopez (adele-lopez-1) · 2024-03-10T00:56:14.446Z · comments (28)

Interdictor Ship
lsusr · 2024-08-19T04:59:18.487Z · comments (9)

[link] An Opinionated Evals Reading List
Marius Hobbhahn (marius-hobbhahn) · 2024-10-15T14:38:58.778Z · comments (0)

Self-explaining SAE features
Dmitrii Kharlapenko (dmitrii-kharlapenko) · 2024-08-05T22:20:36.041Z · comments (13)

An Actually Intuitive Explanation of the Oberth Effect
Isaac King (KingSupernova) · 2024-01-10T20:23:17.216Z · comments (33)

"Epistemic range of motion" and LessWrong moderation
habryka (habryka4) · 2023-11-27T21:58:40.834Z · comments (3)

What is "True Love"?
johnswentworth · 2024-08-18T16:05:47.358Z · comments (9)

[link] How do open AI models affect incentive to race?
jessicata (jessica.liu.taylor) · 2024-05-07T00:33:20.658Z · comments (13)

5 Physics Problems
DaemonicSigil · 2024-03-18T08:05:45.971Z · comments (0)

Originality vs. Correctness
alkjash · 2023-12-06T18:51:49.531Z · comments (17)

Showing SAE Latents Are Not Atomic Using Meta-SAEs
Bart Bussmann (Stuckwork) · 2024-08-24T00:56:46.048Z · comments (9)

Base LLMs refuse too
Connor Kissane (ckkissane) · 2024-09-29T16:04:21.343Z · comments (20)

[question] What do we know about the AI knowledge and views, especially about existential risk, of the new OpenAI board members?
Zvi · 2024-03-11T14:55:05.128Z · answers+comments (2)

Toward Safety Cases For AI Scheming
Mikita Balesni (mykyta-baliesnyi) · 2024-10-31T17:20:06.019Z · comments (1)

[link] Is Claude a mystic?
jessicata (jessica.liu.taylor) · 2024-06-07T04:27:09.118Z · comments (23)

Against empathy-by-default
Steven Byrnes (steve2152) · 2024-10-16T16:38:49.926Z · comments (24)

AI Alignment via Slow Substrates: Early Empirical Results With StarCraft II
Lester Leong (lester-leong) · 2024-10-14T04:05:05.096Z · comments (9)

There Should Be More Alignment-Driven Startups
Vaniver · 2024-05-31T02:05:06.799Z · comments (14)

[link] Linkpost: Memorandum on Advancing the United States’ Leadership in Artificial Intelligence
Nisan · 2024-10-25T04:37:00.828Z · comments (2)

Approaching Human-Level Forecasting with Language Models
Fred Zhang (fred-zhang) · 2024-02-29T22:36:34.012Z · comments (6)

Pollsters Should Publish Question Translations
jefftk (jkaufman) · 2024-09-08T22:10:04.932Z · comments (3)

[link] Results from an Adversarial Collaboration on AI Risk (FRI)
Josh Rosenberg (josh-rosenberg) · 2024-03-11T20:00:24.642Z · comments (3)

New paper shows truthfulness & instruction-following don't generalize by default
joshc (joshua-clymer) · 2023-11-19T19:27:30.735Z · comments (0)

D&D.Sci: The Mad Tyrant's Pet Turtles
abstractapplic · 2024-03-29T16:22:13.732Z · comments (18)

What's next for the field of Agent Foundations?
Nora_Ammann · 2023-11-30T17:55:13.982Z · comments (23)

Understanding SAE Features with the Logit Lens
Joseph Bloom (Jbloom) · 2024-03-11T00:16:57.429Z · comments (0)

[link] Are There Examples of Overhang for Other Technologies?
Jeffrey Heninger (jeffrey-heninger) · 2023-12-13T21:48:08.954Z · comments (50)

AI #48: Exponentials in Geometry
Zvi · 2024-01-18T14:20:07.869Z · comments (9)

Measuring Coherence of Policies in Toy Environments
dx26 (dylan-xu) · 2024-03-18T17:59:08.118Z · comments (9)

Rationalists are missing a core piece for agent-like structure (energy vs information overload)
tailcalled · 2024-08-17T09:57:19.370Z · comments (9)

Does AI risk “other” the AIs?
Joe Carlsmith (joekc) · 2024-01-09T17:51:47.020Z · comments (3)

[link] More people getting into AI safety should do a PhD
AdamGleave · 2024-03-14T22:14:48.855Z · comments (24)

Some negative steganography results
Fabien Roger (Fabien) · 2023-12-09T20:22:52.323Z · comments (5)

Feature Targeted LLC Estimation Distinguishes SAE Features from Random Directions
Lidor Banuel Dabbah · 2024-07-19T20:32:15.095Z · comments (6)

Thoughts on SB-1047
ryan_greenblatt · 2024-05-29T23:26:14.392Z · comments (1)

AI #81: Alpha Proteo
Zvi · 2024-09-12T13:00:07.958Z · comments (3)

[link] Towards shutdownable agents via stochastic choice
EJT (ElliottThornley) · 2024-07-08T10:14:24.452Z · comments (7)

LessOnline Festival Updates Thread
Ben Pace (Benito) · 2024-04-18T21:55:08.003Z · comments (26)


Recent comments

p-b-1 on Leon Lang's Shortform

I think the evidence mostly points towards 3+4,

But if 3 is due to 1 it would have bigger implications about 6 and probably also 5.

And there must be a whole bunch of people out there who know whether the curves bend.

satron on Sabotage Evaluations for Frontier Models

I do disagree with your proposed standard. It is good enough for me that the critics' arguments are engaged with by someone on your side. Going there personally seems unnecessary. After all, if the goal is to build safe AI, you personally knowing a niche technical solution isn't necessary, if you have people on your team who are aware of publicly produced solutions as well as internal ones.

And there is engagement with people like Yudkowsky from the optimistic side. There are at least proposed solutions to the problems he is presenting. Just because Sam Altman personally didn't invent or voice them doesn't really make him unethical (I accept that he personally may be unethical for a different reason).

benito on Sabotage Evaluations for Frontier Models

What probability do you think the ~25,000 Enron employees had about it collapsing and being fraudulent? Do you think it was above 10%? As another case study, here's a link to one of the biggest recent examples of people having numbers way off.

Re: government, I will point out that a non-zero number of the people in these governmental departments who are on track to be approving those models had secret agreements with one of the leading companies to never criticize them, so I'd take some of their supposed reliability as external auditors with a grain of salt.

satron on Sabotage Evaluations for Frontier Models

The founders of the companies accept money for the same reason any other business accepts money. You can build something genuinely good for humanity while making yourself richer at the same time. This has happened many times already (Apple phones for example).

I concede that the founders of the companies didn't personally publicly engage with the arguments of people like Yudkowsky, but that's a really high bar. Historically, CEOs aren't usually known for getting into technical debates. For that reason, they create security councils that monitor the perceived threats.

And it's not like there was no response whatsoever from people who are optimistic about AI. There were plenty of them. I am a believer that arguments matter, not the people who make them. If a successful argument has been made, then I don't see a need for CEOs to repeat it. And I certainly don't think that just because CEOs don't go to debates, that makes them unethical.

benito on Sabotage Evaluations for Frontier Models

The issue is that no matter how much good stuff they do, one can always call it marginal and not enough.

I think this is unfairly glossing my position. Simply because someone could always ask for more work does not imply that this particular request for more work is necessarily unreasonable.

"I wish you would be a little less selfish"
"People always ask others to do more for them! When will this stop!"
"Well, in this case you have murdered many people for personal gratification and stolen the life savings of many people. Surely I get to ask it in this case."

The basic standard I am proposing is that if you are getting billions-of-dollars rich by building machines you believe may lead to omnicide or omnicide-adjacent outcomes for literally everyone, you should have a debate with a serious critic (e.g. a Nobel Prize winner in your field who believes you shouldn't be doing so, or perhaps the person who saw the alignment problem a ~decade ahead of everyone and worked on it as a top priority the entire time) about (a) why you are doing this, and (b) whether the outcome will be good or bad.

I have many other basic standards that are not being met, but on the question of whether omnicide or omnicide-adjacent things are the default outcome when you are unilaterally running humanity towards it, this is perhaps the lowest standard for public discourse and honest debate. Publicly show up to engage with your strongest critics, at least literally once (though you are correct to assume that actually I think it should happen more than once).

cbiddulph on 5 ways to improve CoT faithfulness

Another idea would be to only use the optimized planner to write a single randomly chosen step in the CoT.

The only difference between this and my proposal is that the steps up until that randomly chosen step are written by the frozen planner, right? This variant would just make it so the optimized planner is now OOD during deployment, without any advantage AFAICT.

sam-marks on Lao Mein's Shortform

For what it's worth—even granting that Musk pursuing a personal vendetta against Altman or OAI happens to be good for the world—I think this is a pretty uncomfortable thing to root for, let alone to actively influence. I think this for the same reason that I think it's uncomfortable to hope for—and immoral to contribute to—assassination of political leaders, even assuming that their assassination would be net good.

cbiddulph on 5 ways to improve CoT faithfulness

The frozen planner may react to steganography, in the same way that image-recognition NNs react to adversarial examples.

This is a good point!

Still, IMO, exploiting the frozen planner through adversarial inputs in a single step seems pretty unlikely to be a fruitful strategy for the optimized planner. If the optimized planner is simply trying to accurately convey information to the frozen planner, probably >99% of that information is best to convey through human-understandable text.

This is especially likely to hold under the following conditions:

The step size should be small. I alluded to this in the post, but it seems very hard to select a single token that conveys some nonobvious information to the frozen planner in a way that somehow improves performance. The more tokens per step, the easier this becomes.
We should apply regularization to the planner, to make it stay somewhat close to the base LLM's logprobs. That can rule out especially crazy sequences of tokens like in the research at https://llm-attacks.org.
We can continue training the frozen planner on human data, where the humans write continuations of CoT prefixes from the optimized planner. That way, if the optimized planner puts the frozen planner off distribution, the humans can provide examples that will put it back on track. This data collection might be pretty expensive though - maybe there's a smart way to choose optimized CoT prefixes that will be most helpful to collect human continuations for.
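
The regularization condition above can be sketched as a KL penalty on the planner's next-token distribution. This is only a minimal illustration under stated assumptions, not the commenter's method: the function names, the `beta` coefficient, and the choice of KL divergence as the measure of "closeness" to the base LLM's logprobs are all hypothetical.

```python
import math


def log_softmax(logits):
    """Convert raw logits to log-probabilities in a numerically stable way."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_divergence(planner_logprobs, base_logprobs):
    """KL(planner || base) for a single next-token distribution."""
    return sum(
        math.exp(lp) * (lp - lb)
        for lp, lb in zip(planner_logprobs, base_logprobs)
    )


def regularized_reward(task_reward, planner_logits, base_logits, beta=0.1):
    """Task reward minus a penalty for drifting away from the base model.

    `beta` is an assumed hyperparameter controlling the penalty strength.
    """
    penalty = kl_divergence(log_softmax(planner_logits), log_softmax(base_logits))
    return task_reward - beta * penalty
```

When the planner's distribution matches the base model's, the penalty is zero and the reward is unchanged; the more probability mass the planner moves onto tokens the base model finds unlikely, the larger the penalty becomes.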

satron on Sabotage Evaluations for Frontier Models

That depends on how bad things are perceived to be. They might be more optimistic than is warranted, but probably not by that much. I don't believe that so many people could decrease their worries by 40%, for example (just my opinion). Not to mention all of the government monitoring people who approve of the models.

satron on Sabotage Evaluations for Frontier Models

The issue is that no matter how much good stuff they do, one can always call it marginal and not enough. There aren't really any objective benchmarks for measuring it.

The justification for creating AI (at least in the heads of its developers) could look something like this (really simplified): Our current models seem safe, we have a plan on how to proceed and we believe that we will achieve a utopia.

You can argue that they are wrong and nothing like that will work, but that's their justification.