LessWrong 2.0 Reader

View: New · Old · Top

Restrict date range: Today · This week · This month · Last three months · This year · All time

← previous page (newer posts) · next page (older posts) →

Motivation control
Joe Carlsmith (joekc) · 2024-10-30T17:15:50.881Z · comments (7)

[link] What Ketamine Therapy Is Like
Sable · 2024-11-11T11:09:08.602Z · comments (5)

Which LessWrong/Alignment topics would you like to be tutored in? [Poll]
Ruby · 2024-09-19T01:35:02.999Z · comments (12)

[link] Analyzing how SAE features evolve across a forward pass
bensenberner · 2024-11-07T22:07:02.827Z · comments (0)

How difficult is AI Alignment?
Sammy Martin (SDM) · 2024-09-13T15:47:10.799Z · comments (6)

[link] cancer rates after gene therapy
bhauth · 2024-10-16T15:32:53.949Z · comments (0)

Minimal Motivation of Natural Latents
johnswentworth · 2024-10-14T22:51:58.125Z · comments (14)

[link] [Paper Blogpost] When Your AIs Deceive You: Challenges with Partial Observability in RLHF
Leon Lang (leon-lang) · 2024-10-22T13:57:41.125Z · comments (0)

Australian AI Safety Forum 2024
Liam Carroll (liam-carroll) · 2024-09-27T00:40:11.451Z · comments (0)

Startup Success Rates Are So Low Because the Rewards Are So Large
AppliedDivinityStudies (kohaku-none) · 2024-10-10T20:22:01.557Z · comments (6)

Formalizing the Informal (event invite)
abramdemski · 2024-09-10T19:22:53.564Z · comments (0)

Time Efficient Resistance Training
romeostevensit · 2024-10-07T15:15:44.950Z · comments (8)

MATS AI Safety Strategy Curriculum v2
DanielFilan · 2024-10-07T22:44:06.396Z · comments (6)

[link] An Interactive Shapley Value Explainer
James Stephen Brown (james-brown) · 2024-09-28T05:01:21.169Z · comments (9)

Monthly Roundup #23: October 2024
Zvi · 2024-10-16T13:50:05.869Z · comments (12)

Reflections on the Metastrategies Workshop
gw · 2024-10-24T18:30:46.255Z · comments (5)

[link] [Paper] Programming Refusal with Conditional Activation Steering
Bruce W. Lee (bruce-lee) · 2024-09-11T20:57:08.714Z · comments (0)

D&D Sci Coliseum: Arena of Data
aphyer · 2024-10-18T22:02:54.305Z · comments (23)

[link] Point of Failure: Semiconductor-Grade Quartz
Annapurna (jorge-velez) · 2024-09-30T15:57:40.495Z · comments (8)

[link] IAPS: Mapping Technical Safety Research at AI Companies
Zach Stein-Perlman · 2024-10-24T20:30:41.159Z · comments (12)

Metastatic Cancer Treatment Since 2010: The Success Stories
sarahconstantin · 2024-11-04T22:50:09.386Z · comments (0)

[question] Implications of China's recession on AGI development?
Eric Neyman (UnexpectedValues) · 2024-09-28T01:12:36.443Z · answers+comments (3)

[Linkpost] Play with SAEs on Llama 3
Tom McGrath · 2024-09-25T22:35:44.824Z · comments (2)

2025 Color Trends
sarahconstantin · 2024-10-07T21:20:03.962Z · comments (7)

Winners of the Essay competition on the Automation of Wisdom and Philosophy
AI Impacts (AI Imacts) · 2024-10-28T17:10:04.272Z · comments (3)

Anthropic rewrote its RSP
Zach Stein-Perlman · 2024-10-15T14:25:12.518Z · comments (19)

Signaling with Small Orange Diamonds
jefftk (jkaufman) · 2024-11-07T20:20:08.026Z · comments (1)

Are we dropping the ball on Recommendation AIs?
Charbel-Raphaël (charbel-raphael-segerie) · 2024-10-23T17:48:00.000Z · comments (16)

[link] An X-Ray is Worth 15 Features: Sparse Autoencoders for Interpretable Radiology Report Generation
hugofry · 2024-10-07T08:53:14.658Z · comments (0)

Compelling Villains and Coherent Values
Cole Wyeth (Amyr) · 2024-10-06T19:53:47.891Z · comments (4)

[link] Characterizing stable regions in the residual stream of LLMs
Jett Janiak (jett) · 2024-09-26T13:44:58.792Z · comments (4)

0.202 Bits of Evidence In Favor of Futarchy
niplav · 2024-09-29T21:57:59.896Z · comments (0)

Book Review: On the Edge: The Business
Zvi · 2024-09-25T12:20:06.230Z · comments (0)

[link] Generative ML in chemistry is bottlenecked by synthesis
Abhishaike Mahajan (abhishaike-mahajan) · 2024-09-16T16:31:34.801Z · comments (2)

[link] AISafety.info: What is the "natural abstractions hypothesis"?
Algon · 2024-10-05T12:31:14.195Z · comments (2)

Open Source Replication of Anthropic’s Crosscoder paper for model-diffing
Connor Kissane (ckkissane) · 2024-10-27T18:46:21.316Z · comments (1)

The murderous shortcut: a toy model of instrumental convergence
Thomas Kwa (thomas-kwa) · 2024-10-02T06:48:06.787Z · comments (0)

COT Scaling implies slower takeoff speeds
Logan Zoellner (logan-zoellner) · 2024-09-28T16:20:00.320Z · comments (56)

Exploring SAE features in LLMs with definition trees and token lists
mwatkins · 2024-10-04T22:15:28.108Z · comments (5)

LASR Labs Spring 2025 applications are open!
Erin Robertson · 2024-10-04T13:44:20.524Z · comments (0)

OODA your OODA Loop
Raemon · 2024-10-11T00:50:48.119Z · comments (3)

[link] A Percentage Model of a Person
Sable · 2024-10-12T17:55:07.560Z · comments (3)

AI Safety Camp 10
Robert Kralisch (nonmali-1) · 2024-10-26T11:08:09.887Z · comments (7)

I'm creating a deep dive podcast episode about the original Leverage Research - would you like to take part?
spencerg · 2024-09-22T14:03:22.164Z · comments (2)

Glitch Token Catalog - (Almost) a Full Clear
Lao Mein (derpherpize) · 2024-09-21T12:22:16.403Z · comments (3)

A New Class of Glitch Tokens - BPE Subtoken Artifacts (BSA)
Lao Mein (derpherpize) · 2024-09-20T13:13:26.181Z · comments (7)

Eye contact is effortless when you’re no longer emotionally blocked on it
Chipmonk · 2024-09-27T21:47:01.970Z · comments (24)

[link] Big tech transitions are slow (with implications for AI)
jasoncrawford · 2024-10-24T14:25:06.873Z · comments (16)

Is the Power Grid Sustainable?
jefftk (jkaufman) · 2024-10-26T02:30:06.612Z · comments (37)

Video and transcript of presentation on Otherness and control in the age of AGI
Joe Carlsmith (joekc) · 2024-10-08T22:30:38.054Z · comments (1)

← previous page (newer posts) · next page (older posts) →

Archive

Recent comments

martin-randall on Abstractions are not Natural

Let's expand on this line of argument and look at your example of bee waggle-dances. You question whether the abstractions represented by the various dances are natural. I agree! Using a Cartesian frame that treats bees and humans as separate agents, not part of Nature, they are not Natural Abstractions. With an Embedded frame they are a Natural Abstraction for anyone seeking to understand bees, but in a trivial way. As you say, "one of the systems explicitly values and works towards understanding the abstractions the other system is using".

Also, the meter is not a natural abstraction, which we can see by observing other cultures using yards, cubits, and stadia. If we re-ran cultural evolution, we'd expect to see different measurements of distance chosen. The Natural Abstraction isn't the meter, it's Distance. Related concepts like relative distance are also Natural Abstractions. If we re-ran cultural evolution, we would still think that trees are taller than grass.

I'm not a bee expect, but Wikipedia says:

In the case of Apis mellifera ligustica, the round dance is performed until the resource is about 10 meters away from the hive, transitional dances are performed when the resource is at a distance of 20 to 30 meters away from the hive, and finally, when it is located at distances greater than 40 meters from the hive, the waggle dance is performed

The dance doesn't actually mean "greater than 40 meters", because bees don't use the metric system. There is some distance, the Waggle Distance, where bees switch from a transitional dance to a waggle dance. Claude says, with low confidence, that the Waggle Distance varies based on energy expenditure. In strong winds, the Waggle Distance goes down.

Humans also have ways of communicating energy expenditure or effort. I don't know enough about bees or humans to know if there is a shared abstraction of Effort here. It may be that the Waggle Distance is bee-specific. And that's an important limitation on the NAH, it says, as you quote, "there exist abstractions which are natural", but I think we should also believe the Artificial Abstraction Hypothesis that says that there exist abstractions which are not natural.

This confusion is on display in the discussion around My AI Model Delta Compared To Yudkowsky [LW · GW], where Yudkowsky is quoted as apparently rejecting the NAH:

The AI does not think like you do, the AI doesn't have thoughts built up from the same concepts you use, it is utterly alien on a staggering scale. Nobody knows what the hell GPT-3 is thinking, not only because the matrices are opaque, but because the stuff within that opaque container is, very likely, incredibly alien - nothing that would translate well into comprehensible human thinking, even if we could see past the giant wall of floating-point numbers to what lay behind.

But then in a comment on that post he appears to partially endorse the NAH:

I think that the AI's internal ontology is liable to have some noticeable alignments to human ontology w/r/t the purely predictive aspects of the natural world; it wouldn't surprise me to find distinct thoughts in there about electrons.

But also endorses the AAH:

As the internal ontology takes on any reflective aspects, parts of the representation that mix with facts about the AI's internals, I expect to find much larger differences -- not just that the AI has a different concept boundary around "easy to understand", say, but that it maybe doesn't have any such internal notion as "easy to understand" at all, because easiness isn't in the environment and the AI doesn't have any such thing as "effort".

james-lucassen on Context-dependent consequentialism

I agree that I wouldn't want to lean on the sweet-spot-by-default version of this, and I agree that the example is less strong than I thought it was. I still think there might be safety gains to be had from blocking higher level reflection if you can do it without damaging lower level reflection. I don't think that requires a task where the AI doesn't try and fail and re-evaluate - it just requires that the re-evalution never climbs above a certain level in the stack.

There's such a thing as being pathologically persistent, and such a thing as being pathologically flaky. It doesn't seem too hard to train a model that will be pathologically persistent in some domains while remaining functional in others. A lot of my current uncertainty is bound up in how robust these boundaries are going to have to be.

tsvibt on Daniel Kokotajlo's Shortform

we've checked for various forms of funny business and our tools would notice if it was happening.

I think it's a high bar due to the nearest unblocked strategy problem and alienness.

I agree that when AGI R&D starts to 2x or 5x due to AI automating much of the process, that's when we need the slowdown/pause)

If you start stopping proliferation when you're a year away from some runaway thing, then everyone has the tech that's one year away from the thing. That makes it more impossible that no one will do the remaining research, compared to if the tech everyone has is 5 or 20 years away from the thing.

linch on Survival without dignity

There's also a comic series with explicitly this premise, unfortunately this is a major plot point so revealing it will be a spoiler:

Transdimensional Brain-Chip

martin-randall on Abstractions are not Natural

I appreciate the brevity of the title as it stands. It's normal for a title to summarize the thesis of a post or paper and this is also standard practice on LessWrong. For example:

The sun is big but superintelligences will not spare the Earth a little sunlight [LW · GW].
The point of trade [LW · GW]
There's no fire alarm for AGI [LW · GW]

The introductory paragraphs sufficiently described the epistemic status of the author for my purposes. Overall, I found the post easier to engage with because it made its arguments without hedging.

jenniferrm on Jan_Kulveit's Shortform

I already think that "the entire shape of the zeitgeist in America" is downstream of non-trivial efforts by more than one state actor. Those links explain documented cases of China and Russia both trying to foment race war in the US, but I could pull links for other subdimensions of culture (in science, around the second amendment, and in other areas) where this has been happening since roughly 2014.

My personal response is to reiterate over and over in public that there should be a coherent response by the governance systems of free people, so that, for example, TikTok should either (1) be owned by human people who themselves have free speech rights and rights to a jury trial, or else (2) should be shut down by the USG via taxes, withdrawal of corporate legal protections, etc...

...and also I just track actual specific people, and what they have personally seen and inferred and probably want and so on, in order to build a model of the world from "second hand info".

I've met you personally, Jan, at a conference, and you seemed friendly and weird and like you had original thoughts based on original seeing, and so even if you were on the payroll of the Russians somehow... (which to me clear I don't think you are) ....hey: Cyborgs [AF · GW]! Neat idea! Maybe true. Maybe not. Maybe useful. Maybe not.

Whether or not your cyborg ideas are good or bad can be screened off [LW · GW] from whether or not you're on the payroll of a hostile state actor. Basically, attending primarily to local validity [LW · GW] is basically always possible, and nearly always helpful :-)

charlie-steiner on Flipping Out: The Cosmic Coinflip Thought Experiment Is Bad Philosophy

I think you can do some steelmanning of the anti-flippers with something like Lara Buchak's arguments on risk and rationality. Then you'd be replacing the vague "the utility maximizing policy seems bad" argument with a more concrete "I want to do population ethics over the multiverse" argument.

martin-randall on Abstractions are not Natural

I appreciate the clarity of the pixel game as a concrete thought experiment. Its clarity makes it easier for me to see where I disagree with your understanding of the Natural Abstraction Hypothesis.

The Natural Abstraction Hypothesis is about the abstractions available in Nature, that is to say, the environment. So we have to decide where to draw the boundary around Nature. Options:

Nature is just the pixel game itself (Cartesian)
Nature is the pixel game and the agent(s) flipping pixels (Embedded)
Nature if the pixel game and the utility function(s) but not the decision algorithms (Hybrid)

In the Cartesian frame, none of "top half", "bottom half", "outer rim", and "middle square" are all Unnatural Abstractions, because they're not in Nature, they're in the utility functions.

In the Hybrid and Embedded frames, when System A is playing the game, then "top half" and "bottom half" are Natural Abstractions, but "outer rim" and "middle square" are not. The opposite is true when System B is playing the game.

Let's make this a multi-player game, and have both systems playing on the same board. In that case all of "top half", "bottom half", "outer rim", and "middle square" are Natural Abstractions. We expect system A to learn "outer rim" and "middle square" as it needs to predict the actions of system B, at least given sufficient learning capabilities. I think this is a clean counter-example to your claim:

Two systems require similar utility functions in order to converge on similar abstractions.

david-matolcsi on Jan_Kulveit's Shortform

I'm from Hungary that is probably politically the closest to Russia among Central European countries, but I don't really know of any significant figure who turned out to be a Russian asset, or any event that seemed like a Russian intelligence operation. (Apart from one of our far-right politicians in the EU Parliament being a Russian spy, which was a really funny event, but its not like the guy was significantly shaping the national conversation or anything, I don't think many have heard of him before his cover was blown.) What are prominent examples in Czechia or other Central European countries, of Russian assets or operations?

james-lucassen on Evaluating Stability of Unreflective Alignment

I want to flag this as an assumption that isn't obvious. If this were true for the problems we care about, we could solve them by employing a lot of humans.
humans provides a pretty strong intuitive counterexample

Yup not obvious. I do in fact think a lot more humans would be helpful. But I also agree that my mental picture of "transformative human level research assistant" relies heavily on serial speedup, and I can't immediately picture a version that feels similarly transformative without speedup. Maybe evhub or Ethan Perez or one of the folks running a thousand research threads at once would disagree.