LessWrong 2.0 Reader

Dragon Agnosticism
jefftk (jkaufman) · 2024-08-01T17:00:06.434Z · comments (75)

We might be missing some key feature of AI takeoff; it'll probably seem like "we could've seen this coming"
Lukas_Gloor · 2024-05-09T15:43:11.490Z · comments (36)

Catastrophic sabotage as a major threat model for human-level AI systems
evhub · 2024-10-22T20:57:11.395Z · comments (11)

OpenAI: The Board Expands
Zvi · 2024-03-12T14:00:04.110Z · comments (1)

Science advances one funeral at a time
Cameron Berg (cameron-berg) · 2024-11-01T23:06:19.381Z · comments (9)

Towards Multimodal Interpretability: Learning Sparse Interpretable Features in Vision Transformers
hugofry · 2024-04-29T20:57:35.127Z · comments (8)

[question] Am I confused about the "malign universal prior" argument?
nostalgebraist · 2024-08-27T23:17:22.779Z · answers+comments (35)

SB 1047: Final Takes and Also AB 3211
Zvi · 2024-08-27T22:10:07.647Z · comments (11)

Everything Wrong with Roko's Claims about an Engineered Pandemic
WitheringWeights (EZ97) · 2024-02-22T15:59:08.439Z · comments (10)

Just admit that you’ve zoned out
joec · 2024-06-04T02:51:27.594Z · comments (22)

New page: Integrity
Zach Stein-Perlman · 2024-07-10T15:00:41.050Z · comments (3)

Introducing Squiggle AI
ozziegooen · 2025-01-03T17:53:42.915Z · comments (15)

How to train your own "Sleeper Agents"
evhub · 2024-02-07T00:31:42.653Z · comments (11)

Quotes from Leopold Aschenbrenner’s Situational Awareness Paper
Zvi · 2024-06-07T11:40:03.981Z · comments (10)

Defining alignment research
Richard_Ngo (ricraz) · 2024-08-19T20:42:29.279Z · comments (23)

Circular Reasoning
abramdemski · 2024-08-05T18:10:32.736Z · comments (37)

The Game Board has been Flipped: Now is a good time to rethink what you’re doing
LintzA (alex-lintz) · 2025-01-28T23:36:18.106Z · comments (18)

Zvi’s Thoughts on His 2nd Round of SFF
Zvi · 2024-11-20T13:40:08.092Z · comments (2)

[link] Introducing METR's Autonomy Evaluation Resources
Megan Kinniment (megan-kinniment) · 2024-03-15T23:16:59.696Z · comments (0)

Natural Latents: The Concepts
johnswentworth · 2024-03-20T18:21:19.878Z · comments (18)

Anvil Shortage
Screwtape · 2024-11-13T22:57:41.974Z · comments (16)

[link] "AI Safety for Fleshy Humans" an AI Safety explainer by Nicky Case
habryka (habryka4) · 2024-05-03T18:10:12.478Z · comments (11)

A very strange probability paradox
notfnofn · 2024-11-22T14:01:36.587Z · comments (26)

Stagewise Development in Neural Networks
Jesse Hoogland (jhoogland) · 2024-03-20T19:54:06.181Z · comments (1)

The subset parity learning problem: much more than you wanted to know
Dmitry Vaintrob (dmitry-vaintrob) · 2025-01-03T09:13:59.245Z · comments (18)

Review: Conor Moreton's "Civilization & Cooperation"
Duncan Sabien (Deactivated) (Duncan_Sabien) · 2024-05-26T19:32:43.131Z · comments (8)

Teaching CS During Take-Off
andrew carle (andrew-carle) · 2024-05-14T22:45:39.447Z · comments (13)

Covert Malicious Finetuning
Tony Wang (tw) · 2024-07-02T02:41:51.698Z · comments (4)

story-based decision-making
bhauth · 2024-02-07T02:35:27.286Z · comments (11)

[link] Should you be worried about H5N1?
gw · 2024-12-05T21:11:06.996Z · comments (2)

AI #73: Openly Evil AI
Zvi · 2024-07-18T14:40:05.770Z · comments (20)

[link] New report: Safety Cases for AI
joshc (joshua-clymer) · 2024-03-20T16:45:27.984Z · comments (14)

Partial value takeover without world takeover
KatjaGrace · 2024-04-05T06:20:03.961Z · comments (23)

Three Notions of "Power"
johnswentworth · 2024-10-30T06:10:08.326Z · comments (44)

AIs Will Increasingly Fake Alignment
Zvi · 2024-12-24T13:00:07.770Z · comments (0)

[link] Self-Help Corner: Loop Detection
adamShimi · 2024-10-02T08:33:23.487Z · comments (6)

Tips On Empirical Research Slides
James Chua (james-chua) · 2025-01-08T05:06:44.942Z · comments (4)

Is "VNM-agent" one of several options, for what minds can grow up into?
AnnaSalamon · 2024-12-30T06:36:20.890Z · comments (54)

Singular learning theory: exercises
Zach Furman (zfurman) · 2024-08-30T20:00:03.785Z · comments (5)

[link] Debating with More Persuasive LLMs Leads to More Truthful Answers
Akbir Khan (akbir-khan) · 2024-02-07T21:28:10.694Z · comments (14)

Matryoshka Sparse Autoencoders
Noa Nabeshima (noa-nabeshima) · 2024-12-14T02:52:32.017Z · comments (15)

[Intuitive self-models] 1. Preliminaries
Steven Byrnes (steve2152) · 2024-09-19T13:45:27.976Z · comments (20)

There is a globe in your LLM
jacob_drori (jacobcd52) · 2024-10-08T00:43:40.300Z · comments (4)

Agent Foundations 2025 at CMU
Alexander Gietelink Oldenziel (alexander-gietelink-oldenziel) · 2025-01-19T23:48:22.569Z · comments (10)

[link] The Cognitive-Theoretic Model of the Universe: A Partial Summary and Review
jessicata (jessica.liu.taylor) · 2024-03-27T19:59:27.893Z · comments (37)

(Salt) Water Gargling as an Antiviral
Elizabeth (pktechgirl) · 2024-11-22T18:00:02.765Z · comments (6)

Thoughts on the conservative assumptions in AI control
Buck · 2025-01-17T19:23:38.575Z · comments (5)

[link] Detecting Genetically Engineered Viruses With Metagenomic Sequencing
jefftk (jkaufman) · 2024-06-27T14:01:34.868Z · comments (10)

[link] Hardshipification
Jonathan Moregård (JonathanMoregard) · 2024-05-28T20:02:29.709Z · comments (17)

[link] More Hyphenation
Arjun Panickssery (arjun-panickssery) · 2024-02-07T19:43:29.086Z · comments (19)

Recent comments

milan-w on bgold's Shortform

I've seen Reddit ads from multiple companies recruiting people to do freelance annotation / high-quality-text-data generation for them.

daniel-tan on Why Don't We Just... Shoggoth+Face+Paraphraser?

IMO deepseek-r1 follows this proposal pretty closely.

Proposal part 1: Shoggoth/Face Distinction

r1's CoT is the "shoggoth". Its CoT is often highly exploratory, frequently backtracking ("wait") or switching to a different thread ("alternatively"). This seems similar to how humans might think through a problem internally.

r1 is also the "face", since it clearly demarcates the boundary between thinking ("<think> ... </think>") and action (everything else). This separation is enforced by a "format reward".

Proposal part 2: Blind the Evaluation Process to the Internal Reasoning

This also seems to be satisfied, since r1 was mainly trained via RL against rules-based reward, which only evaluates the outcome. I.e. everything inside <think> tags (the "shoggoth") is not evaluated.

Proposal part 3: Use Paraphraser

This is the only thing that seems missing. But it seems like you can easily add this into the r1 recipe, i.e. just train r1 with its CoT paraphrased by some other LLM.
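The three parts above can be sketched in a few lines. This is only an illustration under stated assumptions, not the actual r1 training code: the function names, the exact tag pattern, and the exact-match outcome check are all hypothetical stand-ins, and `paraphrase` is any callable you supply (in practice it would be another LLM).

```python
import re

# Assumed format: one <think>...</think> block, then the visible answer.
THINK_RE = re.compile(r"^<think>(.*?)</think>(.*)$", re.DOTALL)


def split_response(text):
    """Return (cot, answer), or None if the response violates the format."""
    match = THINK_RE.match(text.strip())
    if match is None:
        return None
    return match.group(1).strip(), match.group(2).strip()


def format_reward(text):
    """Part 1: reward a clean boundary between thinking and action."""
    return 1.0 if split_response(text) is not None else 0.0


def outcome_reward(text, expected):
    """Part 2: score only the answer; the CoT is never inspected."""
    parts = split_response(text)
    if parts is None:
        return 0.0
    _cot, answer = parts
    return 1.0 if answer == expected else 0.0


def paraphrased_response(text, paraphrase):
    """Part 3: replace the CoT with a paraphrase before it is reused."""
    parts = split_response(text)
    if parts is None:
        return text
    cot, answer = parts
    return f"<think>{paraphrase(cot)}</think>{answer}"
```

Two responses with different reasoning but the same final answer get the same `outcome_reward`, which is the sense in which the evaluation is blind to the shoggoth.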

johnswentworth on The Case Against AI Control Research

Even if all of those are true, the argument in the post would still imply that control research (at least of the sort people do today) cannot have very high expected value. Like, sure, let's assume for sake of discussion that most total AI safety research will be done by early transformative AI, that the only chance of aligning superintelligent AIs is to delegate, that control research is unusually tractable, and that for some reason we're going to use the AIs to pursue formal verification (not a good idea, but whatever).

Even if we assume all that, we still have the problem that control research of the sort people do today does basically-nothing to address slop; it is basically-exclusively focused on intentional scheming. Insofar as intentional scheming is not the main thing which makes outsourcing to early AIs fail, all that control research cannot have very high expected value. None of your bullet points address that core argument at all.

ryan_greenblatt on Gradual Disempowerment: Systemic Existential Risks from Incremental AI Development

I don't think it's worth adjudicating the question of how relevant Vanessa's response is (though I do think Vanessa's response is directly relevant).

if the AIs are aligned to these structures, human disempowerment is likely because these structures are aligned to humans way less than they seem

My claim would be that if single-single alignment is solved, this problem won't be existential. I agree that if you literally aligned all AIs to (e.g.) the mission of a non-profit as well as you can, you're in trouble. However, if you have single-single alignment:

- At the most basic level, I expect we'll train AIs to give advice and ask them what they think will happen with various possible governance and alignment structures. If they think a governance structure will yield total human disempowerment, we'll do something else. This is a basic reason not to expect large classes of problems so long as we have single-single aligned AIs which are wise. (Though problems that require coordination to resolve might not be like this.) I'm very skeptical of a world where single-single alignment is well described as being solved and people don't ask for advice (or consider this advice seriously) because they never get around to asking AIs or there are no AIs aligned in such a way that they should try to give good advice.
- I expect organizations will be explicitly controlled by people and (some of) those people will have AI representatives to represent their interests as I discuss here. If you think getting good AI representation is unlikely, that would be a crux, but this would be my proposed solution at least.
  - The explicit mission of for-profit companies is to empower the shareholders. It clearly doesn't serve the interests of the shareholders to end up dead or disempowered.
  - Democratic governments have similar properties.
  - At a more basic level, I think people running organizations won't decide "oh, we should put the AI in charge of running this organization aligned to some mission rather than the preferences of the people (like me) who currently have de facto or de jure power over this organization". This is a crazily disempowering move that I expect people will by default be too savvy to make in almost all cases. (Both for people with substantial de facto and with de jure power.)
- Even independent of the advice consideration, people will probably want AIs running organizations to be honest to at least the people controlling the organization. Given that I expect explicit control by people in almost all cases, if things are going in an existential direction, people can vote to change them in almost all cases.
- I don't buy that there will be some sort of existential multi-polar trap even without coordination (though I also expect coordination) due to things like the strategy stealing assumption as I also discuss in that comment.
- If a subset of organizations diverge from a reasonable interpretation of what they were supposed to do (but are still basically obeying the law and some interpretation of what they were intentionally aligned to) and this is clear to the rest of the world (as I expect would be the case given some type of AI advisors), then the rest of the world can avoid problems from this subset via the court system or other mechanisms. Even if this subset of organizations run by effectively rogue AIs just runs away with resources successfully, this is probably only a subset of resources.

I think your response to a lot of this will be something like:

- People won't have or won't listen to AI advisors.
- Institutions will intentionally delude relevant people to acquire more power.

But the key thing is that I expect at least some people will keep power, even if large subsets are deluded. E.g., I expect that corporate shareholders, board members, or governments will be very interested in the question of whether they will be disempowered by changes in structure. It does seem plausible (or even likely) to me that some people will engage in power grabs via ensuring AIs are aligned to them, deluding the world about what's going on using a variety of mechanisms (including, e.g., denying or manipulating access to AI representation/advisors), and expanding their (hard) power over time. The thing I don't buy is a case where very powerful people don't ask for advice at all prior to having been deluded by the organizations that they themselves run!

I think human power grabs like this are concerning and there are a variety of plausible solutions which seem somewhat reasonable.

Maybe your response is that the solutions that will be implemented in practice given concerns about human power grabs will involve aligning AIs to institutions in ways that yield the dynamics you describe? I'm skeptical given the dynamics discussed above about asking AIs for advice.

milan-w on Milan W's Shortform

NVIDIA CEO Jensen Huang to meet with Donald Trump today

Financial Times

2025-01-31

nathan-helm-burger on ryan_greenblatt's Shortform

I'm not the only one thinking along these lines... https://x.com/8teAPi/status/1885340234352910723

ete on plex's Shortform

Nice! I haven't read a ton of Buddhism, cool that this fits into a known framework.

I'm uncertain of how you use the word consciousness here; do you mean our blob of sensory experience or something else?

Yeah, ~subjective experience.

ollie_ on DeepSeek: Don’t Panic

I believe that open-source advancements like R1 will drive wider adoption of AI systems.

I think that the pricing models will change soon. Everyone talks about cost per million tokens for calling a hosted service, but I think it'll switch to the cloud cost of providing infrastructure that can run models: virtual machines running something like Ollama.

This solves another huge problem: privacy and how prompt data is handled. If you're using an API to a hosted service, you need to have a very good understanding of how your submitted prompt data is handled. This is key for organisations. I feel like the lack of understanding here is preventing widespread adoption, especially for communication tools that handle sensitive data.

For example, you could run DeepSeek R1 using Ollama on an Azure virtual machine (NC series) that you pay for by the hour, and then your cost isn't based on how much you use your AI. Right now it's expensive to provision the infra to support decent models, but these costs fall continuously.

I can imagine a world where organisations provision cloud infrastructure in their environments running open source models.

https://learn.microsoft.com/en-us/azure/virtual-machines/sizes/gpu-accelerated/nc-series?tabs=sizebasic

https://huggingface.co/deepseek-ai/DeepSeek-R1
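To make the per-token versus per-hour comparison concrete, here is a rough break-even sketch. All prices in the usage note are made-up placeholders, not real Azure or API prices, and the function names are hypothetical; it also ignores throughput limits, i.e. whether one VM could actually serve that many tokens.

```python
def per_token_cost(tokens, price_per_million):
    """Cost of a hosted API billed per million tokens."""
    return tokens / 1_000_000 * price_per_million


def per_hour_cost(hours, price_per_hour):
    """Cost of a self-managed VM billed per hour, regardless of usage."""
    return hours * price_per_hour


def break_even_tokens(hours, price_per_hour, price_per_million):
    """Token volume at which the VM and the hosted API cost the same."""
    return per_hour_cost(hours, price_per_hour) / price_per_million * 1_000_000
```

With placeholder prices of 4.00 per VM-hour and 2.00 per million tokens, a VM left on for 720 hours costs 2,880, so `break_even_tokens(720, 4.0, 2.0)` comes out at 1.44 billion tokens. Below that volume the hosted API is cheaper on price alone, so in this toy setup the case for self-hosting rests on high usage or on the privacy argument.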

vladimir_nesov on CBiddulph's Shortform

This is an obvious thing to try, but it's not what currently already works, and it's not certain to work without some additional ideas. You can do a little bit of this, but not nearly to the extent that o1/R1 inch towards saturating benchmarks on math/coding olympiad-like problems. So long as using LLMs as reward for scalable RL doesn't work yet, supercharged capabilities of o1/R1-like models plausibly remain restricted to verifiable tasks.

daniel-tan on Daniel Tan's Shortform

What if the correct way to do safety post-training is to train a different aligned model on top (the face) instead of directly trying to align the base model?