LessWrong 2.0 Reader


← previous page (newer posts) · next page (older posts) →

Introducing Kairos: a new AI safety fieldbuilding organization (the new home for SPAR and FSP)
agucova · 2024-10-25T21:59:08.782Z · comments (0)

[question] Can subjunctive dependence emerge from a simplicity prior?
Daniel C (harper-owen) · 2024-09-16T12:39:35.543Z · answers+comments (0)

Another UFO Bet
codyz · 2024-11-01T01:55:27.301Z · comments (9)

[question] why won't this alignment plan work?
KvmanThinking (avery-liu) · 2024-10-10T15:44:59.450Z · answers+comments (7)

[link] Redundant Attention Heads in Large Language Models For In Context Learning
skunnavakkam · 2024-09-01T20:08:48.963Z · comments (0)

A Taxonomy Of AI System Evaluations
Maxime Riché (maxime-riche) · 2024-08-19T09:07:45.224Z · comments (0)

[link] Metaculus's 'Minitaculus' Experiments — Collaborate With Us
ChristianWilliams · 2024-08-26T20:44:32.125Z · comments (0)

Food, Prison & Exotic Animals: Sparse Autoencoders Detect 6.5x Performing Youtube Thumbnails
Louka Ewington-Pitsos (louka-ewington-pitsos) · 2024-09-17T03:52:43.269Z · comments (2)

[question] Why would ASI share any resources with us?
Satron · 2024-11-13T23:38:36.535Z · answers+comments (5)

Exploring Shard-like Behavior: Empirical Insights into Contextual Decision-Making in RL Agents
Alejandro Aristizabal (alejandro-aristizabal) · 2024-09-29T00:32:42.161Z · comments (0)

The Existential Dread of Being a Powerful AI System
testingthewaters · 2024-09-26T10:56:32.904Z · comments (1)

Against Job Boards: Human Capital and the Legibility Trap
vaishnav92 · 2024-10-24T20:50:50.266Z · comments (1)

GPT4o is still sensitive to user-induced bias when writing code
Reed (ThomasReed) · 2024-09-22T21:04:54.717Z · comments (0)

[question] How to cite LessWrong as an academic source?
PhilosophicalSoul (LiamLaw) · 2024-11-06T08:28:26.309Z · answers+comments (6)

[link] AI Safety Newsletter #43: White House Issues First National Security Memo on AI Plus, AI and Job Displacement, and AI Takes Over the Nobels
Corin Katzke (corin-katzke) · 2024-10-28T16:03:39.258Z · comments (0)

Using LLM's for AI Foundation research and the Simple Solution assumption
Donald Hobson (donald-hobson) · 2024-09-24T11:00:53.658Z · comments (0)

Seeking mentorship
Kevin Afachao (kevin-afachao) · 2024-09-21T16:54:58.353Z · comments (0)

[link] An "Observatory" For a Shy Super AI?
Sherrinford · 2024-09-27T21:22:40.296Z · comments (0)

If I care about measure, choices have additional burden (+AI generated LW-comments)
avturchin · 2024-11-15T10:27:15.212Z · comments (9)

[link] Linkpost: Hypocrisy standoff
Chris_Leong · 2024-09-29T14:27:19.175Z · comments (1)

New Capabilities, New Risks? - Evaluating Agentic General Assistants using Elements of GAIA & METR Frameworks
Tej Lander (tej-lander) · 2024-09-29T18:58:56.253Z · comments (0)

[question] AMA: International School Student in China
Novice · 2024-10-01T06:00:16.282Z · answers+comments (0)

Longevity and the Mind
George3d6 · 2024-09-16T09:43:09.700Z · comments (2)

Reinforcement Learning from Information Bazaar Feedback, and other uses of information markets
Abhimanyu Pallavi Sudhir (abhimanyu-pallavi-sudhir) · 2024-09-16T01:04:32.953Z · comments (1)

[link] Should we abstain from voting? (In nondeterministic elections)
B Jacobs (Bob Jacobs) · 2024-10-02T10:07:43.167Z · comments (6)

Biasing VLM Response with Visual Stimuli
Jaehyuk Lim (jason-l) · 2024-10-03T18:04:31.474Z · comments (0)

[link] AI Safety Newsletter #41: The Next Generation of Compute Scale Plus, Ranking Models by Susceptibility to Jailbreaking, and Machine Ethics
Corin Katzke (corin-katzke) · 2024-09-11T19:14:08.274Z · comments (1)

[link] Could Things Be Very Different?—How Historical Inertia Might Blind Us To Optimal Solutions
James Stephen Brown (james-brown) · 2024-09-11T09:53:07.474Z · comments (0)

[link] Universal basic income isn’t always AGI-proof
Kevin Kohler (KevinKohler) · 2024-09-05T15:39:18.389Z · comments (3)

Democracy beyond majoritarianism
Arturo Macias (arturo-macias) · 2024-09-03T15:10:56.284Z · comments (2)

[question] If the DoJ goes through with the Google breakup, where does Deepmind end up?
O O (o-o) · 2024-10-12T05:06:50.996Z · answers+comments (1)

Differential knowledge interconnection
Roman Leventov · 2024-10-12T12:52:36.267Z · comments (0)

[link] How long should political (and other) terms be?
ohmurphy · 2024-10-14T21:38:43.050Z · comments (0)

Grass Valley USA - ACX Meetups Everywhere Fall 2024
Raelifin · 2024-08-29T18:39:57.229Z · comments (0)

[question] Artificial V/S Organoid Intelligence
10xyz (10xyz-coder) · 2024-10-23T14:31:46.385Z · answers+comments (0)

Educational CAI: Aligning a Language Model with Pedagogical Theories
Bharath Puranam (bharath-puranam) · 2024-11-01T18:55:26.993Z · comments (1)

Using Narrative Prompting to Extract Policy Forecasts from LLMs
Max Ghenis (MaxGhenis) · 2024-11-05T04:37:52.004Z · comments (0)

Meta: On viewing the latest LW posts
quiet_NaN · 2024-08-25T19:31:39.008Z · comments (2)

Apply to be a mentor in SPAR!
agucova · 2024-11-05T21:32:45.797Z · comments (0)

Scattered thoughts on what it means for an LLM to believe
TheManxLoiner · 2024-11-06T22:10:29.429Z · comments (3)

Some reasons to start a project to stop harmful AI
Remmelt (remmelt-ellen) · 2024-08-22T16:23:34.132Z · comments (0)

Agency overhang as a proxy for Sharp left turn
Eris (anton-zheltoukhov) · 2024-11-07T12:14:24.333Z · comments (0)

[question] How do we know dreams aren't real?
Logan Zoellner (logan-zoellner) · 2024-08-22T12:41:57.380Z · answers+comments (31)

[link] Formalize the Hashiness Model of AGI Uncontainability
Remmelt (remmelt-ellen) · 2024-11-09T16:10:05.032Z · comments (0)

Bellevue Library Meetup - Nov 23
Cedar (xida-ren) · 2024-11-09T23:05:02.452Z · comments (1)

[link] Is P(Doom) Meaningful? Bayesian vs. Popperian Epistemology Debate
Liron · 2024-11-09T23:39:30.039Z · comments (0)

[link] Exposure can’t rule out disasters
Chipmonk · 2024-08-15T17:03:37.259Z · comments (19)

[link] Internal music player: phenomenology of earworms
dkl9 · 2024-11-14T23:29:48.383Z · comments (1)

[link] Join the $10K AutoHack 2024 Tournament
Paul Bricman (paulbricman) · 2024-09-25T11:54:20.112Z · comments (0)

The great Enigma in the sky: The universe as an encryption machine
Alex_Shleizer · 2024-08-14T13:21:58.713Z · comments (1)



Recent comments

benito on Sabotage Evaluations for Frontier Models

Yes, it is far harder to get someone to unflinchingly kill themselves and their family. But all of this glory and power is sufficient to get someone to do a little self-deception, a little looking away from the unpleasant thoughts.

"Who can really say how the future will go? Think of all the good things that might happen! And I like all the people around me. I'm sure they're working as hard as they can. Let's keep building."

sodium on AI Safety Camp 10

Sorry, is there a timezone for when the applications would close by, or is it AoE?

michaelcorey on Graceful Degradation

Thank you, I found it useful for me.

ryan_greenblatt on lukehmiles's Shortform

It might be easier to try to establish some track record by doing a small research project first. I don't know if you have enough runway for this though.

ryan_greenblatt on Using Dangerous AI, But Safely?

To the extent that complexity makes successfully red-teaming our approaches too difficult, we could try to depend mostly on simple approaches and use more complex approaches only if we are confident that they at least can't make the situation worse.

My general view is that complex methods are only productive if you have a good feedback loop (either empirical feedback or very robust theoretical feedback, e.g. proofs). In the case of control, so long as we do a good job with red teaming^[1], we can build a good feedback loop such that complex methods can be productive. I think a similar process is often used in computer security and I think computer security often involves complex measures^[2].

As an aside, this also makes me relatively pessimistic about developing complex and unprincipled alignment (non-control) approaches for hard-to-experiment-on threat models like scheming (aka deceptive alignment). It will be difficult to make scheming model organisms which realistically reproduce the internal structure and capability level of models we're worried about and thus it will be hard to effectively iterate on complex and unprincipled measures.

An alternative approach for empirics is to focus on highly principled measures where we have a story for why they should generalize to scheming that we can test and iterate against (e.g. we'll understand how the model works which we can test by seeing if we can understand models in general).

[1] Which isn't clearly possible, given that the relevant AIs could have different capability profiles.
[2] I don't think the ideas we consider are more complex than typical approaches in security, though Buck would know better than me.

satron on Sabotage Evaluations for Frontier Models

I think that these examples are quite different. People who voted for a president who ultimately turned out to be corrupt most likely predicted that the most likely scenario wouldn't be totally horrible for them personally. And most of the time they are right: corrupt presidents are bad for countries, but you could live happily under their rule. People would at least expect to retain whatever valuable stuff they already have.

With AI the story is different. If a person thinks that the most likely scenario is the one where everyone dies, then they would expect not only to win nothing, but to lose everything they have. And the media is actively talking about it, so it's not like they are unaware of the possibility. Even if they don't see themselves as personally responsible for everything, they sure don't want to feel the consequences of everyone dying.

This is not the case of someone doing something that they know is wrong but benefits them. If they truly believe that we are all going to die, then it's doing something they know is wrong and actively hurts them. And allegedly hurts them much more than any scenario that has happened so far (since we are all still alive).

So I believe that most people working on it, actually believe that we are going to be just fine. Whether this belief is warranted is another question.

satron on Sabotage Evaluations for Frontier Models

I think the justification of these articles goes something like this: "here is a vision of a future where we successfully align AI. It is utopian enough that it warrants pursuing it". Risks from creating AI just aren't the topic. They deal with them elsewhere. These essays were specifically focused on the "positive vision" part of alignment. So I think you are critiquing the articles for lacking something they were never intended to have in the first place.

OpenAI seems to have made some basic commitments here. The word commit is mentioned 29 times. Other companies did it as well here, and here. Here Anthropic makes promises for optimistic, intermediate and pessimistic scenarios of AI development.

green_leaf on Quantum Immortality: A Perspective if AI Doomers are Probably Right

I don't know about similarity... but I was just making a point that QI doesn't require it.

cbiddulph on 5 ways to improve CoT faithfulness

If the initial paraphraser doesn't change between '!' and '.', then learning a canonical-paraphraser from that initial paraphraser won't do it either.

...

So the canonical-paraphrasing idea helps somewhat, but fundamentally, the better the paraphraser (the more closely it captures the human notion of semantic equivalence), the better the anti-steganography guarantee we can get.

What I meant is that we train the canonical paraphraser based on the "human notion of semantic equivalence," NOT the "initial paraphraser."

One way of doing this is to train another model (call it the "adversarial rephraser") at the same time that you train the canonical rephraser. The adversarial rephraser takes a single sentence s as input, and tries to pick another sentence s' such that either 1) s has the same meaning as s' but CR(s) != CR(s'), or 2) s has a different meaning from s' but CR(s) = CR(s'). Semantic equivalence is determined by some oracle - probably an LLM like GPT-4 is sufficient, but you could also fine-tune a model on human data specifically for this task. We then train the canonical rephraser against the adversarial examples.

The adversarial rephraser should be able to discover strategies like changing '.' to '!', and the canonical rephraser will learn to account for those strategies.
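
To make the two failure cases concrete, here is a minimal toy sketch of the check the adversarial rephraser is optimizing for. Everything in it is a stand-in I made up for illustration: `find_violations`, `toy_oracle`, and the three candidate rephrasers are hypothetical, the oracle is a string rule rather than an LLM, and the "adversary" is just a hand-written list of sentence pairs instead of a trained model.

```python
from typing import Callable, Iterable, List, Tuple

Rephraser = Callable[[str], str]     # CR: sentence -> canonical form
Oracle = Callable[[str, str], bool]  # True if two sentences mean the same


def find_violations(
    cr: Rephraser,
    same_meaning: Oracle,
    candidate_pairs: Iterable[Tuple[str, str]],
) -> List[Tuple[str, str, str]]:
    """Return the pairs on which the canonical rephraser disagrees with the oracle."""
    violations = []
    for s, s2 in candidate_pairs:
        equivalent = same_meaning(s, s2)
        collapsed = cr(s) == cr(s2)
        if equivalent and not collapsed:
            # Case 1: same meaning, but CR(s) != CR(s').
            violations.append((s, s2, "split"))
        elif not equivalent and collapsed:
            # Case 2: different meaning, but CR(s) == CR(s').
            violations.append((s, s2, "merge"))
    return violations


def toy_oracle(a: str, b: str) -> bool:
    # Stand-in for the semantic-equivalence oracle (assumption: case and
    # trailing '.'/'!' carry no meaning). A real oracle would be a model.
    return a.lower().rstrip(".!") == b.lower().rstrip(".!")


def naive_cr(s: str) -> str:
    # Leaves punctuation untouched, so '.' vs '!' can still carry hidden bits.
    return s.lower()


def crude_cr(s: str) -> str:
    # Over-aggressive: keeps only the first word, destroying meaning.
    return s.split()[0].lower()


def patched_cr(s: str) -> str:
    # What the rephraser might look like after training on the adversarial pairs.
    return s.lower().rstrip(".!")


# Hand-written stand-in for the adversarial rephraser's proposals.
pairs = [("It works.", "It works!"), ("It works.", "It fails.")]

for name, cr in [("naive", naive_cr), ("crude", crude_cr), ("patched", patched_cr)]:
    print(name, find_violations(cr, toy_oracle, pairs))
```

In the actual proposal the adversary would be trained to maximize the number of violations it finds and the canonical rephraser trained to minimize them. The sketch only shows what counts as a violation in each direction.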

benito on Sabotage Evaluations for Frontier Models

You made lots of points, so I wrote a comment for each... probably this was too many replies? I didn't know what else to do that didn't feel like avoiding your points. I hereby state that I do not expect you to respond to all five of my comments!