LessWrong 2.0 Reader

Review: Conor Moreton's "Civilization & Cooperation"
Duncan Sabien (Deactivated) (Duncan_Sabien) · 2024-05-26T19:32:43.131Z · comments (8)

[link] Sabotage Evaluations for Frontier Models
David Duvenaud (david-duvenaud) · 2024-10-18T22:33:14.320Z · comments (11)

Prediction Markets aren't Magic
SimonM · 2023-12-21T12:54:07.754Z · comments (29)

[link] Large Language Models can Strategically Deceive their Users when Put Under Pressure.
ReaderM · 2023-11-15T16:36:04.446Z · comments (8)

[link] Finishing The SB-1047 Documentary In 6 Weeks
Michaël Trazzi (mtrazzi) · 2024-10-28T20:17:47.465Z · comments (4)

Based Beff Jezos and the Accelerationists
Zvi · 2023-12-06T16:00:08.380Z · comments (29)

Partial value takeover without world takeover
KatjaGrace · 2024-04-05T06:20:03.961Z · comments (23)

Towards Multimodal Interpretability: Learning Sparse Interpretable Features in Vision Transformers
hugofry · 2024-04-29T20:57:35.127Z · comments (8)

Public Call for Interest in Mathematical Alignment
Davidmanheim · 2023-11-22T13:22:09.558Z · comments (9)

[link] New report: Safety Cases for AI
joshc (joshua-clymer) · 2024-03-20T16:45:27.984Z · comments (13)

Dragon Agnosticism
jefftk (jkaufman) · 2024-08-01T17:00:06.434Z · comments (60)

AI #73: Openly Evil AI
Zvi · 2024-07-18T14:40:05.770Z · comments (20)

story-based decision-making
bhauth · 2024-02-07T02:35:27.286Z · comments (11)

[link] Debating with More Persuasive LLMs Leads to More Truthful Answers
Akbir Khan (akbir-khan) · 2024-02-07T21:28:10.694Z · comments (14)

On the abolition of man
Joe Carlsmith (joekc) · 2024-01-18T18:17:06.201Z · comments (18)

[link] Executable philosophy as a failed totalizing meta-worldview
jessicata (jessica.liu.taylor) · 2024-09-04T22:50:18.294Z · comments (40)

Covert Malicious Finetuning
Tony Wang (tw) · 2024-07-02T02:41:51.698Z · comments (4)

Stagewise Development in Neural Networks
Jesse Hoogland (jhoogland) · 2024-03-20T19:54:06.181Z · comments (1)

Live Theory Part 0: Taking Intelligence Seriously
Sahil · 2024-06-26T21:37:10.479Z · comments (3)

Singular learning theory: exercises
Zach Furman (zfurman) · 2024-08-30T20:00:03.785Z · comments (5)

Teaching CS During Take-Off
andrew carle (andrew-carle) · 2024-05-14T22:45:39.447Z · comments (13)

[link] I found >800 orthogonal "write code" steering vectors
Jacob G-W (g-w1) · 2024-07-15T19:06:17.636Z · comments (19)

We might be missing some key feature of AI takeoff; it'll probably seem like "we could've seen this coming"
Lukas_Gloor · 2024-05-09T15:43:11.490Z · comments (36)

Growth and Form in a Toy Model of Superposition
Liam Carroll (liam-carroll) · 2023-11-08T11:08:04.359Z · comments (7)

Solving adversarial attacks in computer vision as a baby version of general AI alignment
Stanislav Fort (stanislavfort) · 2024-08-29T17:17:47.136Z · comments (8)

Natural Latents: The Concepts
johnswentworth · 2024-03-20T18:21:19.878Z · comments (18)

[link] Re: Anthropic's suggested SB-1047 amendments
RobertM (T3t) · 2024-07-27T22:32:39.447Z · comments (13)

[link] More Hyphenation
Arjun Panickssery (arjun-panickssery) · 2024-02-07T19:43:29.086Z · comments (19)

How well do truth probes generalise?
mishajw · 2024-02-24T14:12:19.729Z · comments (11)

[link] Detecting Genetically Engineered Viruses With Metagenomic Sequencing
jefftk (jkaufman) · 2024-06-27T14:01:34.868Z · comments (10)

Research update: Towards a Law of Iterated Expectations for Heuristic Estimators
Eric Neyman (UnexpectedValues) · 2024-10-07T19:29:29.033Z · comments (2)

[link] Self-Help Corner: Loop Detection
adamShimi · 2024-10-02T08:33:23.487Z · comments (6)

2024 Petrov Day Retrospective
Ben Pace (Benito) · 2024-09-28T21:30:14.952Z · comments (25)

I'm a bit skeptical of AlphaFold 3
Oleg Trott (oleg-trott) · 2024-06-25T00:04:41.274Z · comments (14)

OpenAI: Helen Toner Speaks
Zvi · 2024-05-30T21:10:02.938Z · comments (8)

GPT-o1
Zvi · 2024-09-16T13:40:06.236Z · comments (34)

The Aspiring Rationalist Congregation
maia · 2024-01-10T22:52:54.298Z · comments (23)

There is a globe in your LLM
jacob_drori (jacobcd52) · 2024-10-08T00:43:40.300Z · comments (4)

Apply to be a Safety Engineer at Lockheed Martin!
yanni kyriacos (yanni) · 2024-03-31T21:02:08.499Z · comments (3)

Addressing Feature Suppression in SAEs
Benjamin Wright (Benw8888) · 2024-02-16T18:32:51.927Z · comments (3)

[link] Dario Amodei’s prepared remarks from the UK AI Safety Summit, on Anthropic’s Responsible Scaling Policy
Zac Hatfield-Dodds (zac-hatfield-dodds) · 2023-11-01T18:10:31.110Z · comments (1)

[Valence series] 2. Valence & Normativity
Steven Byrnes (steve2152) · 2023-12-07T16:43:49.919Z · comments (5)

[question] What are the best arguments for/against AIs being "slightly 'nice'"?
Raemon · 2024-09-24T02:00:19.605Z · answers+comments (51)

Scalable oversight as a quantitative rather than qualitative problem
Buck · 2024-07-06T17:42:41.325Z · comments (11)

A simple case for extreme inner misalignment
Richard_Ngo (ricraz) · 2024-07-13T15:40:37.518Z · comments (41)

Rejecting Television
Declan Molony (declan-molony) · 2024-04-23T04:59:50.253Z · comments (10)

[link] Environmentalism in the United States Is Unusually Partisan
Jeffrey Heninger (jeffrey-heninger) · 2024-05-13T21:23:10.755Z · comments (26)

[link] Anxiety vs. Depression
Sable · 2024-03-17T00:15:08.255Z · comments (35)

Reflections on Less Online
Error · 2024-07-07T03:49:44.534Z · comments (15)

[link] Nietzsche's Morality in Plain English
Arjun Panickssery (arjun-panickssery) · 2023-12-04T00:57:42.839Z · comments (13)

Recent comments

johnswentworth on Three Notions of "Power"

Good guess, but that's not cruxy for me. Yes, LDT/FDT-style things are one possibility. But even if those fail, I still expect non-hierarchical coordination mechanisms among highly capable agents.

Gesturing more at where the intuition comes from: compare hierarchical management to markets, as a control mechanism. Markets require clean factorization - a production problem needs to be factored into production of standardized, verifiable intermediate goods in order for markets to handle the production pipeline well. If that can be done, then markets scale very well, they pass exactly the information and incentives people need (in the form of prices). Hierarchies, in contrast, scale very poorly. They provide basically-zero built-in mechanisms for passing the right information between agents, or for providing precise incentives to each agent. They're the sort of thing which can work ok at small scale, where the person at the top can track everything going on everywhere, but quickly become extremely bottlenecked on the top person as you scale up. And you can see this pretty clearly at real-world companies: past a very small size, companies are usually extremely bottlenecked on the attention of top executives, because lower-level people lack the incentives/information to coordinate on their own across different parts of the company.

(Now, you might think that an AI in charge of e.g. a company could make the big hierarchy work efficiently by just being capable enough to track everything themselves. But at that point, I wouldn't expect to see a hierarchy at all; the AI can just do everything itself and not have multiple agents in the first place. Unlike humans, AIs will not be limited by their number of hands. If there is to be some arrangement involving multiple agents coordinating in the first place, then it shouldn't be possible for one mind to just do everything itself.)

On the other hand, while dominance relations scale very poorly as a coordination mechanism, they are algorithmically relatively simple. Thus my claim from the post that dominance seems like a hack for low-capability agents, and higher-capability agents will mostly rely on some other coordination mechanism.

warty on avturchin's Shortform

burning the dog defense commons 😔

zach-stein-perlman on Zach Stein-Perlman's Shortform

Some not-super-ambitious asks for labs (in progress):

Do evals on dangerous-capabilities-y and agency-y tasks; look at the score before releasing or internally deploying the model
Have a high-level capability threshold at which securing model weights is very important
Do safety research at all
Have a safety plan at all; talk publicly about safety strategy at all
- Have a post like Core Views on AI Safety
- Have a post like The Checklist [AF · GW]
- On the website, clearly describe a worst-case plausible outcome from AI and state its credence in such outcomes (perhaps unconditionally, conditional on no major government action, and/or conditional on leading labs being unable or unwilling to pay a large alignment tax [EA · GW]).
Whistleblower protections?
- Not sure what the ask is. Right to Warn is a starting point. In particular, an anonymous internal reporting pipeline for failures to comply with safety policies is clearly good (but likely inadequate).
Publicly explain the processes and governance structures that determine deployment decisions
- (And ideally make those processes and structures good)

sharmake-farah on Three Notions of "Power"

Maybe, but I'm not sure it's even necessary to invoke LDT/FDT/UDT, and instead argue that coordinating even through solely causal methods is very cheap for AIs to the point where coordination, and as a side effect, interfaces become quite a lot less of a bottleneck compared to today.

In essence, I think the diff between John's models and tailcalled's models is plausibly in how easy coordination in a more general sense can ever be for AIs, and whether AIs have much better ability to coordinate compared to humans today, where John thinks that coordination is a taut constraint for humans but not for AI, but tailcalled thinks it's hard to coordinate for both AIs and humans due to fundamental limits.

tailcalled on Three Notions of "Power"

LDT/FDT is a central example of rationalist-Gnostic heresy [LW · GW].

d0themath on Three Notions of "Power"

I am going to guess that the diff between your and John's models here is that John thinks LDT/FDT solves this, and you don't.

sharmake-farah on How might we solve the alignment problem? (Part 1: Intro, summary, ontology)

The real answer to Sorites paradox for intelligence is that memory is an issue, and if a subject requires you to learn more information than can be stored in memory, you can't learn it no matter how much time you invest into the subject, and differing intelligences also usually differ in memory capacity.

However, assuming memory isn't an issue, then the answer to the question is that intelligence really is a continuum, and the better metric is, say, rate of learning per time step, since any difference in intelligence is solely a difference in time, so no hard barriers exist.

I think that memory will not be the bottleneck in an intelligence explosion for understanding AI, and instead time will be the bottleneck.

sharmake-farah on Noosphere89's Shortform

This also BTW explains why we cannot rely on economic arguments on AI to make the future go well.

avturchin on avturchin's Shortform

Yes. It is an important point.

sharmake-farah on Noosphere89's Shortform

Here's an underrated frame for why AI alignment is likely necessary for the future to go very well under human values, even though in our current society we don't need human-to-human alignment to make modern capitalism be good and can rely on selfishness instead.

The reason is that there's a real likelihood that human labor, and more generally human existence, is not economically valuable or even has negative economic value, say where the addition of a human to the AI company makes that company worse in the near future.

The reason this matters is that once labor is much easier to scale than capital, as is likely in an AI future, it's now economically viable or even beneficial to break a lot of the rules that help humans survive, contra Matthew Barnett's view, and this is even more incentivized by the fact that an unaligned AI released into society would likely not be punishable/incentivizable by mere humans, solely due to controlling robotic armies and robotic workforces that allow it to dispense with societal constraints humans have to accept.

dr_s talks about the equilibrium that is totally valid under AI automation economics that is very bad for humans, and avoiding these sorts of equilibriums can't be done through economic forces, because of the fact that the companies doing this are too powerful to have any real incentives work on them, since they can either neutralize or turn the attempted boycott/shopping around to their own benefit, and thus avoiding this outcome requires alignment to your values, and can't work with selfishness:

Consider a scenario in which AGI and human-equivalent robotics are developed and end up owned (via e.g. controlling exclusively the infrastructure that runs it, and being closed source) by a group of, say, 10,000 people overall who have some share in this automation capital. If these people have exclusive access to it, a perfectly functional equilibrium is "they trade among peers goods produced by their automated workers and leave everyone else to fend for themselves".

This framing of the alignment problem, of how to get an AI that values humans such that this outcome is prevented, also has an important implication:

It's not enough to solve the technical problem of alignment, absent modeling the social situation, because of suffering risk issues plus catastrophic risk issues. It also means the level of alignment of AI needs to be closer to the fictional benevolent angels than it is to humans in relationship to other humans, so it motivates a more ambitious version of the alignment objectives than making AIs merely not break the law or steal from humans.

I'm actually reasonably hopeful the more ambitious versions of alignment are possible, and think there's a realistic chance we can actually do them.

But we actually need to do the work, and AI that automates everything might come in your lifetime, so we should prepare the foundations soon.