LessWrong 2.0 Reader

Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
Jan Betley (jan-betley) · 2025-02-25T17:39:31.059Z · comments (18)

[link] what an efficient market feels from inside
DMMF · 2025-02-25T02:38:40.129Z · comments (8)

Economics Roundup #5
Zvi · 2025-02-25T13:40:07.086Z · comments (6)

[link] Upcoming Protest for AI Safety
Matt Vincent (matthew-milone) · 2025-02-25T03:04:03.153Z · comments (0)

Revisiting Conway's Law
annebrandes (annebrandes1@gmail.com) · 2025-02-25T08:33:52.421Z · comments (0)

Three Levels for Large Language Model Cognition
Eleni Angelou (ea-1) · 2025-02-25T23:14:00.306Z · comments (0)

[link] We Can Build Compassionate AI
Gordon Seidoh Worley (gworley) · 2025-02-25T16:37:06.160Z · comments (1)

Technical comparison of Deepseek, Novasky, S1, Helix, P0
Juliezhanggg · 2025-02-25T04:20:40.413Z · comments (0)

Levels of analysis for thinking about agency
Cole Wyeth (Amyr) · 2025-02-26T04:24:24.583Z · comments (0)

[question] Intellectual lifehacks repo
Antoine de Scorraille (Etoile de Scauchy) · 2025-02-25T16:32:09.814Z · answers+comments (4)

[link] The Stag Hunt—cultivating cooperation to reap rewards
James Stephen Brown (james-brown) · 2025-02-25T23:45:07.472Z · comments (0)

[link] [Crosspost] Strategic wealth accumulation under transformative AI expectations
arden446 · 2025-02-25T21:50:11.458Z · comments (0)

Making alignment a law of the universe
juggins · 2025-02-25T10:44:11.632Z · comments (1)

Demystifying the Pinocchio Paradox
Novak Zukowski (Zantarus) · 2025-02-25T06:16:57.219Z · comments (0)

Recent comments

pchvykov on AI Apocalypse and the Buddha

Yes - but I also find there are a number of dogmas in the LW community which are getting entrenched in groupthink now and get immediate backlash when confronted. I feel like there used to be more openness to critically engage with unorthodox opinions 10 years ago or so...

ryan_greenblatt on Training AI to do alignment research we don’t already know how to do

I certainly agree it isn't clear, just my current best guess.

jeremy-gillen on Training AI to do alignment research we don’t already know how to do

It's not entirely clear to me that the math works out for AIs being helpful on net relative to humans just doing it, because of the supervision required, and the trust and misalignment issues.

But on this question (for AIs that are just capable of "prosaic and relatively unenlightened ML research") it feels like shot-in-the-dark guesses. It's very unclear to me what is and isn't possible.

angela-pretorius on what an efficient market feels from inside

Some landlords offer cheap rent but only rent to childless non-smokers who have a perfect credit score and a well-paid job (preferably a job that requires security clearance or an enhanced DBS check) and whose demographic characteristics are not too similar to those of their previous bad tenants.

Other landlords are less selective about who they rent to but charge much higher rents to compensate for the risk of property damage and rent arrears. This is why properties that are very similar in quality may charge wildly different rents.

ryan_greenblatt on Training AI to do alignment research we don’t already know how to do

On some axes, but won't there also be axes where AIs are more difficult than humans? Sycophancy and slop being the most salient. Misalignment issues being another.

Yes, I just meant on net. (Relative to the current ML community and given a similar fraction of resources to spend on AI compute.)

nikola-jurkovic on nikola's Shortform

All of the above but it seems pretty hard to have an impact as a high schooler, and many impact avenues aren't technically "positions" (e.g. influencer)
I think that everyone except "Extremely resilient individuals who expect to get an impactful position (including independent research) very quickly" is probably better off following the strategy.

lsusr on Luna Lovegood and the Chamber of Secrets - Part 10

That's a good point. I've changed it to "wokking".

samuelshadrach on xpostah's Shortform

If I understand correctly, it is possible to find $300/mo/bedroom accommodation in rural US today, and a large enough city will compress city rents down to rural rents. A govt willing to pursue a plan as interesting as this one may also be able to increase immigrant labour to build the houses and relax housing regulations. US residential rents are artificially high compared to the global average. (In some parts of the world, a few steel sheets (4 walls + roof) are sufficient to count as a house, and even water and sewage piping in every house is not mandatory as long as residents can access toilets and water supply within walking distance.)

(A gigacity could also increase rents because it'll increase the incomes of even its lowest-income members. But yeah, in general you now need to track the median incomes of 1B people to find the new equilibrium.)

ricraz on nikola's Shortform

I found this tricky to parse because of two phrasing issues:

The post depends a lot on what you mean by "school" (high school versus undergrad).
I feel confused about what claim you're making about the waiting room strategy: you say that some people shouldn't use it, but you don't actually claim that anyone in particular should use it. So are you just mentioning that it's a possible strategy? Or are you implying that it should be the default strategy?

jeremy-gillen on Training AI to do alignment research we don’t already know how to do

Thanks, I appreciate the draft. I see why it's not plausible to get started on now, since much of it depends on having AGIs or proto-AGIs to play with.

I guess I shouldn't respond too much in public until you've published the doc, but:

If I'm interpreting correctly, a number of the things you intend to try involve having a misaligned (but controlled) proto-AGI run experiments involving training (or otherwise messing with in some way) an AGI. I hope you have some empathy for the internal screaming I have toward this category of things.
A bunch of the ideas do seem reasonable to want to try (given that you had AGIs to play with, and were very confident that doing so wouldn't allow them to escape or otherwise gain influence). I am sympathetic to the various ideas that involve gaining understanding of how to influence goals better by training in various ways.
There are chunks of these ideas that definitely aren't "prosaic and relatively unenlightened ML research", and involve very-high-trust security stuff or non-trivial epistemic work.
I'd be a little more sympathetic to these kinda desperate last-minute things if I had no hope in literally just understanding how to build task-AGI properly, in a well understood way. We can do this now. I'm baffled that almost all of the EA-alignment-sphere has given up on even trying to do this. From talking to people this weekend this shift seems downstream of thinking that we can make AGIs do alignment work, without thinking this through in detail.

The total quantity of risk reduction is unclear, but seems substantial to me. I'd guess takeover risk goes from 50% to 5% if you do a very good job at executing on huge amounts of prosaic and relatively unenlightened ML research at the relevant time

Agree it's unclear. I think the chance of most of the ideas being helpful depends on some variables that we don't clearly know yet. I think 90% risk improvement can't be right, because there's a lot of correlation between each of the things working or failing. And a lot of the risk comes from imperfect execution of the control scheme, which adds on top.

One underlying intuition that I want to express: The world where we are making proto-AGIs run all these experiments is pure chaos. Politically and epistemically and with all the work we need to do. I think pushing toward this chaotic world is much worse than other worlds we could push for right now.

But if I thought control was likely to work very well and saw a much more plausible path to alignment among the "stuff to try", I'd think it was a reasonable strategy.

I also think that getting the ML community to work on things effectively is probably substantially harder than getting AIs to work on things effectively

On some axes, but won't there also be axes where AIs are more difficult than humans? Sycophancy and slop being the most salient. Misalignment issues being another.

This work isn't extremely easy to verify or scale up (such that I don't think "throw a billion dollars at it" just works),

This makes sense now. But I think this line should make you worry about whether you can make controlled AIs do it.