LessWrong 2.0 Reader

Is AI Alignment Enough?
Aram Panasenco (panasenco) · 2025-01-10T18:57:48.409Z · comments (6)
Corrigibility's Desirability is Timing-Sensitive
RobertM (T3t) · 2024-12-26T22:24:17.435Z · comments (4)
Book Summary: Zero to One
bilalchughtai (beelal) · 2024-12-29T16:13:52.922Z · comments (2)
Will bird flu be the next Covid? "Little chance" says my dashboard.
Nathan Young · 2025-01-07T20:10:50.080Z · comments (0)
[link] AI as systems, not just models
Andy Arditi (andy-arditi) · 2024-12-21T23:19:05.507Z · comments (0)
[link] Impact in AI Safety Now Requires Specific Strategic Insight
MiloSal (milosal) · 2024-12-29T00:40:53.780Z · comments (1)
[link] The Roots of Progress 2024 in review
jasoncrawford · 2025-01-01T00:02:06.441Z · comments (0)
Learning Multi-Level Features with Matryoshka SAEs
Bart Bussmann (Stuckwork) · 2024-12-19T15:59:00.036Z · comments (4)
On the OpenAI Economic Blueprint
Zvi · 2025-01-15T14:30:06.773Z · comments (0)
Living with Rats in College
lsusr · 2024-12-25T10:44:13.085Z · comments (0)
Voluntary Salary Reduction
jefftk (jkaufman) · 2025-01-15T03:40:02.909Z · comments (0)
Intranasal mRNA Vaccines?
J Bostock (Jemist) · 2025-01-01T23:46:40.524Z · comments (2)
Preface
Allison Duettmann (allison-duettmann) · 2025-01-02T18:59:46.290Z · comments (1)
[link] debating buying NVDA in 2019
bhauth · 2025-01-04T05:06:54.047Z · comments (0)
Elevating Air Purifiers
jefftk (jkaufman) · 2024-12-17T01:40:05.401Z · comments (0)
[link] The Alignment Simulator
Yair Halberstadt (yair-halberstadt) · 2024-12-22T11:45:55.220Z · comments (3)
[link] Genetically edited mosquitoes haven't scaled yet. Why?
alexey · 2024-12-30T21:37:32.942Z · comments (0)
Good Reasons for Alts
jefftk (jkaufman) · 2024-12-21T01:30:03.113Z · comments (2)
[link] Being Present is Not a Skill
Chipmonk · 2024-12-18T01:11:04.715Z · comments (8)
[link] NAO Updates, January 2025
jefftk (jkaufman) · 2025-01-10T03:37:36.698Z · comments (0)
The Second Gemini
Zvi · 2024-12-17T15:50:06.373Z · comments (0)
[link] Letter from an Alien Mind
Shoshannah Tekofsky (DarkSym) · 2024-12-27T13:20:49.277Z · comments (7)
[link] Human-AI Complementarity: A Goal for Amplified Oversight
rishubjain · 2024-12-24T09:57:55.111Z · comments (3)
[link] Job Opening: SWE to help improve grant-making software
Ethan Ashkie (ethan-ashkie-1) · 2025-01-08T00:54:22.820Z · comments (1)
The average rationalist IQ is about 122
Rockenots (Ekefa) · 2024-12-28T15:42:07.067Z · comments (23)
Elon Musk and Solar Futurism
transhumanist_atom_understander · 2024-12-21T02:55:28.554Z · comments (27)
[link] PCR retrospective
bhauth · 2024-12-26T21:20:56.484Z · comments (0)
Non-Obvious Benefits of Insurance
jefftk (jkaufman) · 2024-12-23T03:40:02.184Z · comments (5)
The absolute basics of representation theory of finite groups
Dmitry Vaintrob (dmitry-vaintrob) · 2025-01-08T09:47:13.136Z · comments (0)
[link] LLMs for language learning
Benquo · 2025-01-15T14:08:54.620Z · comments (0)
The Alignment Mapping Program: Forging Independent Thinkers in AI Safety - A Pilot Retrospective
Alvin Ånestrand (alvin-anestrand) · 2025-01-10T16:22:16.905Z · comments (0)
[link] Our new video about goal misgeneralization, plus an apology
Writer · 2025-01-14T14:07:21.648Z · comments (0)
Broken Latents: Studying SAEs and Feature Co-occurrence in Toy Models
chanind · 2024-12-30T22:50:54.964Z · comments (3)
[question] Meal Replacements in 2025?
alkjash · 2025-01-06T15:37:25.041Z · answers+comments (9)
Grading my 2024 AI predictions
Nikola Jurkovic (nikolaisalreadytaken) · 2025-01-02T05:01:46.587Z · comments (1)
[link] I read every major AI lab’s safety plan so you don’t have to
sarahhw · 2024-12-16T18:51:38.499Z · comments (0)
[link] It looks like there are some good funding opportunities in AI safety right now
Benjamin_Todd · 2024-12-22T12:41:02.151Z · comments (0)
Is AI Physical?
Lauren Greenspan (LaurenGreenspan) · 2025-01-14T21:21:39.999Z · comments (2)
NYC Congestion Pricing: Early Days
Zvi · 2025-01-14T14:00:07.445Z · comments (0)
Latent Adversarial Training (LAT) Improves the Representation of Refusal
alexandraabbas · 2025-01-06T10:24:53.419Z · comments (6)
A Generalization of the Good Regulator Theorem
Alfred Harwood · 2025-01-04T09:55:25.432Z · comments (6)
Proof Explained for "Robust Agents Learn Causal World Model"
Dalcy (Darcy) · 2024-12-22T15:06:16.880Z · comments (0)
subfunctional overlaps in attentional selection history implies momentum for decision-trajectories
Emrik (Emrik North) · 2024-12-22T14:12:49.027Z · comments (1)
Theoretical Alignment's Second Chance
lunatic_at_large · 2024-12-22T05:03:51.653Z · comments (0)
Can we rescue Effective Altruism?
Elizabeth (pktechgirl) · 2025-01-09T16:40:02.405Z · comments (0)
Measuring Nonlinear Feature Interactions in Sparse Crosscoders [Project Proposal]
Jason Gross (jason-gross) · 2025-01-06T04:22:12.633Z · comments (0)
AGI with RL is Bad News for Safety
Nadav Brandes (nadav-brandes) · 2024-12-21T19:36:03.970Z · comments (22)
[link] Why OpenAI’s Structure Must Evolve To Advance Our Mission
stuhlmueller · 2024-12-28T04:24:19.937Z · comments (1)
Turning up the Heat on Deceptively-Misaligned AI
J Bostock (Jemist) · 2025-01-07T00:13:28.191Z · comments (16)
Don’t Legalize Drugs
Declan Molony (declan-molony) · 2025-01-14T06:51:14.005Z · comments (7)