LessWrong 2.0 Reader



Bigger Livers?
sarahconstantin · 2024-11-08T21:50:09.814Z · comments (13)
MIRI’s 2024 End-of-Year Update
Rob Bensinger (RobbBB) · 2024-12-03T04:33:47.499Z · comments (2)
The "Think It Faster" Exercise
Raemon · 2024-12-11T19:14:10.427Z · comments (13)
[link] Seven lessons I didn't learn from election day
Eric Neyman (UnexpectedValues) · 2024-11-14T18:39:07.053Z · comments (33)
The purposeful drunkard
Dmitry Vaintrob (dmitry-vaintrob) · 2025-01-12T12:27:51.952Z · comments (13)
[question] What are the strongest arguments for very short timelines?
Kaj_Sotala · 2024-12-23T09:38:56.905Z · answers+comments (74)
How do you deal w/ Super Stimuli?
Logan Riggs (elriggs) · 2025-01-14T15:14:51.552Z · comments (25)
Deep Causal Transcoding: A Framework for Mechanistically Eliciting Latent Behaviors in Language Models
Andrew Mack (andrew-mack) · 2024-12-03T21:19:42.333Z · comments (7)
[link] Anthropic: Three Sketches of ASL-4 Safety Case Components
Zach Stein-Perlman · 2024-11-06T16:00:06.940Z · comments (33)
[link] Sabotage Evaluations for Frontier Models
David Duvenaud (david-duvenaud) · 2024-10-18T22:33:14.320Z · comments (55)
Reasons for and against working on technical AI safety at a frontier AI lab
bilalchughtai (beelal) · 2025-01-05T14:49:53.529Z · comments (12)
[link] Finishing The SB-1047 Documentary In 6 Weeks
Michaël Trazzi (mtrazzi) · 2024-10-28T20:17:47.465Z · comments (5)
LLMs Look Increasingly Like General Reasoners
eggsyntax · 2024-11-08T23:47:28.886Z · comments (45)
We probably won't just play status games with each other after AGI
Matthew Barnett (matthew-barnett) · 2025-01-15T04:56:38.330Z · comments (20)
Catastrophic sabotage as a major threat model for human-level AI systems
evhub · 2024-10-22T20:57:11.395Z · comments (11)
Science advances one funeral at a time
Cameron Berg (cameron-berg) · 2024-11-01T23:06:19.381Z · comments (9)
Comment on "Death and the Gorgon"
Zack_M_Davis · 2025-01-01T05:47:30.730Z · comments (32)
Zvi’s Thoughts on His 2nd Round of SFF
Zvi · 2024-11-20T13:40:08.092Z · comments (2)
Introducing Squiggle AI
ozziegooen · 2025-01-03T17:53:42.915Z · comments (15)
A very strange probability paradox
notfnofn · 2024-11-22T14:01:36.587Z · comments (26)
Anvil Problems
Screwtape · 2024-11-13T22:57:41.974Z · comments (13)
The subset parity learning problem: much more than you wanted to know
Dmitry Vaintrob (dmitry-vaintrob) · 2025-01-03T09:13:59.245Z · comments (18)
AIs Will Increasingly Fake Alignment
Zvi · 2024-12-24T13:00:07.770Z · comments (0)
[link] Should you be worried about H5N1?
gw · 2024-12-05T21:11:06.996Z · comments (2)
Three Notions of "Power"
johnswentworth · 2024-10-30T06:10:08.326Z · comments (44)
(Salt) Water Gargling as an Antiviral
Elizabeth (pktechgirl) · 2024-11-22T18:00:02.765Z · comments (6)
Agent Foundations 2025 at CMU
Alexander Gietelink Oldenziel (alexander-gietelink-oldenziel) · 2025-01-19T23:48:22.569Z · comments (10)
Matryoshka Sparse Autoencoders
Noa Nabeshima (noa-nabeshima) · 2024-12-14T02:52:32.017Z · comments (15)
Is "VNM-agent" one of several options, for what minds can grow up into?
AnnaSalamon · 2024-12-30T06:36:20.890Z · comments (54)
Tips On Empirical Research Slides
James Chua (james-chua) · 2025-01-08T05:06:44.942Z · comments (4)
Thoughts on the conservative assumptions in AI control
Buck · 2025-01-17T19:23:38.575Z · comments (5)
Parable of the vanilla ice cream curse (and how it would prevent a car from starting!)
Mati_Roy (MathieuRoy) · 2024-12-08T06:57:45.783Z · comments (21)
[link] Five Recent AI Tutoring Studies
Arjun Panickssery (arjun-panickssery) · 2025-01-19T03:53:47.714Z · comments (0)
Circling as practice for “just be yourself”
Kaj_Sotala · 2024-12-16T07:40:04.482Z · comments (5)
5 homegrown EA projects, seeking small donors
Austin Chen (austin-chen) · 2024-10-28T23:24:25.745Z · comments (4)
[link] The Manhattan Trap: Why a Race to Artificial Superintelligence is Self-Defeating
Corin Katzke (corin-katzke) · 2025-01-21T16:57:00.998Z · comments (6)
Self-prediction acts as an emergent regularizer
Cameron Berg (cameron-berg) · 2024-10-23T22:27:03.664Z · comments (6)
Scaling Sparse Feature Circuit Finding to Gemma 9B
Diego Caples (diego-caples) · 2025-01-10T11:08:11.999Z · comments (10)
JargonBot Beta Test
Raemon · 2024-11-01T01:05:26.552Z · comments (55)
[link] On Eating the Sun
jessicata (jessica.liu.taylor) · 2025-01-08T04:57:20.457Z · comments (92)
🇫🇷 Announcing CeSIA: The French Center for AI Safety
Charbel-Raphaël (charbel-raphael-segerie) · 2024-12-20T14:17:13.104Z · comments (2)
[link] Is Deep Learning Actually Hitting a Wall? Evaluating Ilya Sutskever's Recent Claims
garrison · 2024-11-13T17:00:01.005Z · comments (14)
Stargate AI-1
Zvi · 2025-01-24T15:20:18.752Z · comments (1)
Remap your caps lock key
bilalchughtai (beelal) · 2024-12-15T14:03:33.623Z · comments (18)
Some arguments against a land value tax
Matthew Barnett (matthew-barnett) · 2024-12-29T15:17:00.740Z · comments (39)
[question] What are the good rationality films?
Ben Pace (Benito) · 2024-11-20T06:04:56.757Z · answers+comments (53)
[link] Gwern Branwen interview on Dwarkesh Patel’s podcast: “How an Anonymous Researcher Predicted AI's Trajectory”
Said Achmiz (SaidAchmiz) · 2024-11-14T23:53:34.922Z · comments (0)
AI #92: Behind the Curve
Zvi · 2024-11-28T14:40:05.448Z · comments (7)
Implications of the inference scaling paradigm for AI safety
Ryan Kidd (ryankidd44) · 2025-01-14T02:14:53.562Z · comments (61)
Dentistry, Oral Surgeons, and the Inefficiency of Small Markets
GeneSmith · 2024-11-01T17:26:06.466Z · comments (16)