LessWrong 2.0 Reader


Should AI safety be a mass movement?
mhampton · 2025-03-13T20:36:59.284Z · comments (1)
Auditing language models for hidden objectives
Sam Marks (samuel-marks) · 2025-03-13T19:18:32.638Z · comments (15)
Reducing LLM deception at scale with self-other overlap fine-tuning
Marc Carauleanu (Marc-Everin Carauleanu) · 2025-03-13T19:09:43.620Z · comments (40)
Vacuum Decay: Expert Survey Results
JessRiedel · 2025-03-13T18:31:17.434Z · comments (26)
[link] A Frontier AI Risk Management Framework: Bridging the Gap Between Current AI Practices and Established Risk Management
simeon_c (WayZ) · 2025-03-13T18:29:52.776Z · comments (0)
Creating Complex Goals: A Model to Create Autonomous Agents
theraven · 2025-03-13T18:17:58.519Z · comments (1)
[link] Habermas Machine
NicholasKees (nick_kees) · 2025-03-13T18:16:50.453Z · comments (7)
The Other Alignment Problem: Maybe AI Needs Protection From Us
Peterpiper · 2025-03-13T18:03:43.086Z · comments (0)
AI #107: The Misplaced Hype Machine
Zvi · 2025-03-13T14:40:05.318Z · comments (10)
[link] Intelsat as a Model for International AGI Governance
rosehadshar · 2025-03-13T12:58:11.692Z · comments (0)
[link] Stacity: a Lock-In Risk Benchmark for Large Language Models
alamerton · 2025-03-13T12:08:47.329Z · comments (0)
The prospect of accelerated AI safety progress, including philosophical progress
Mitchell_Porter · 2025-03-13T10:52:13.745Z · comments (0)
[link] The "Reversal Curse": you still aren't anthropomorphising enough.
lumpenspace (lumpen-space) · 2025-03-13T10:24:45.965Z · comments (0)
Formalizing Space-Faring Civilizations Saturation: concepts and metrics
Maxime Riché (maxime-riche) · 2025-03-13T09:40:03.465Z · comments (0)
The Economics of p(doom)
Jakub Growiec (jakub-growiec) · 2025-03-13T07:33:50.940Z · comments (0)
Social Media: How to fix them before they become the biggest news platform
Sam G (sam-g) · 2025-03-13T07:28:51.487Z · comments (2)
Penny Whistle in E?
jefftk (jkaufman) · 2025-03-13T02:40:02.653Z · comments (1)
Anthropic, and taking "technical philosophy" more seriously
Raemon · 2025-03-13T01:48:54.184Z · comments (29)
LW/ACX Social Meetup
Stefan (stefan-1) · 2025-03-12T23:13:43.163Z · comments (0)
I grade every NBA basketball game I watch based on enjoyability
proshowersinger · 2025-03-12T21:46:26.791Z · comments (2)
Kairos is hiring a Head of Operations/Founding Generalist
agucova · 2025-03-12T20:58:49.661Z · comments (0)
[link] USAID Outlook: A Metaculus Forecasting Series
ChristianWilliams · 2025-03-12T20:34:03.495Z · comments (0)
[link] What is instrumental convergence?
Vishakha (vishakha-agrawal) · 2025-03-12T20:28:35.556Z · comments (0)
Revising Stages-Oversight Reveals Greater Situational Awareness in LLMs
Sanyu Rajakumar (sanyu-rajakumar) · 2025-03-12T17:56:31.910Z · comments (0)
Why Obedient AI May Be the Real Catastrophe
G~ (gal-1) · 2025-03-12T17:50:09.577Z · comments (2)
[link] Your Communication Preferences Aren’t Law
Jonathan Moregård (JonathanMoregard) · 2025-03-12T17:20:11.117Z · comments (4)
Reflections on Neuralese
Alice Blair (Diatom) · 2025-03-12T16:29:31.230Z · comments (0)
Field tests of semi-rationality in Brazilian military training
P. João (gabriel-brito) · 2025-03-12T16:14:12.590Z · comments (0)
[link] Many life-saving drugs fail for lack of funding. But there’s a solution: desperate rich people
Mvolz (mvolz) · 2025-03-12T15:24:46.889Z · comments (0)
The Most Forbidden Technique
Zvi · 2025-03-12T13:20:04.732Z · comments (9)
[link] You don't actually need a physical multiverse to explain anthropic fine-tuning.
Fraser · 2025-03-12T07:33:43.278Z · comments (8)
[link] AI Can't Write Good Fiction
JustisMills · 2025-03-12T06:11:57.786Z · comments (19)
Existing UDTs test the limits of Bayesianism (and consistency)
Cole Wyeth (Amyr) · 2025-03-12T04:09:11.615Z · comments (20)
[link] (Anti)Aging 101
George3d6 · 2025-03-12T03:59:21.859Z · comments (2)
[link] The Grapes of Hardness
adamShimi · 2025-03-11T21:01:14.963Z · comments (0)
Don't over-update on FrontierMath results
David Matolcsi (matolcsid) · 2025-03-11T20:44:04.459Z · comments (5)
Response to Scott Alexander on Imprisonment
Zvi · 2025-03-11T20:40:06.250Z · comments (4)
[link] Paths and waystations in AI safety
Joe Carlsmith (joekc) · 2025-03-11T18:52:57.772Z · comments (1)
Meridian Cambridge Visiting Researcher Programme: Turn AI safety ideas into funded projects in one week!
Meridian Cambridge · 2025-03-11T17:46:29.656Z · comments (0)
Elon Musk May Be Transitioning to Bipolar Type I
Cyborg25 · 2025-03-11T17:45:06.599Z · comments (22)
Scaling AI Regulation: Realistically, What Can (and Can't) Be Regulated?
Katalina Hernandez (katalina-hernandez) · 2025-03-11T16:51:41.651Z · comments (1)
[link] How Language Models Understand Nullability
Anish Tondwalkar (anish-tondwalkar) · 2025-03-11T15:57:28.686Z · comments (0)
Forethought: a new AI macrostrategy group
Max Dalton (max-dalton) · 2025-03-11T15:39:25.086Z · comments (0)
[link] Preparing for the Intelligence Explosion
fin · 2025-03-11T15:38:29.524Z · comments (17)
stop solving problems that have already been solved
dhruvmethi · 2025-03-11T15:30:41.896Z · comments (3)
AI Control May Increase Existential Risk
Jan_Kulveit · 2025-03-11T14:30:05.972Z · comments (13)
When is it Better to Train on the Alignment Proxy?
dil-leik-og (samuel-buteau) · 2025-03-11T13:35:51.152Z · comments (0)
[link] A different take on the Musk v OpenAI preliminary injunction order
TFD · 2025-03-11T12:46:23.497Z · comments (0)
[link] Do reasoning models use their scratchpad like we do? Evidence from distilling paraphrases
Fabien Roger (Fabien) · 2025-03-11T11:52:38.994Z · comments (23)
A Hogwarts Guide to Citizenship
WillPetillo · 2025-03-11T05:50:02.768Z · comments (1)