LessWrong 2.0 Reader

Dating Roundup #3: Third Time’s the Charm
Zvi · 2024-05-08T13:30:03.232Z · comments (27)

[link] Preference Inversion
Benquo · 2025-01-02T18:15:52.938Z · comments (45)

We ran an AI safety conference in Tokyo. It went really well. Come next year!
Blaine (blaine-rogers) · 2024-07-17T06:55:39.620Z · comments (1)

Things Solenoid Narrates
Solenoid_Entity · 2024-04-12T23:57:16.169Z · comments (2)

Principled Satisficing To Avoid Goodhart
JenniferRM · 2024-08-16T19:05:27.204Z · comments (2)

Higher-Order Forecasts
ozziegooen · 2024-05-22T21:49:42.802Z · comments (1)

An Introduction to AI Sandbagging
Teun van der Weij (teun-van-der-weij) · 2024-04-26T13:40:00.126Z · comments (13)

[link] How people stopped dying from diarrhea so much (& other life-saving decisions)
Writer · 2024-03-16T16:00:47.830Z · comments (0)

[link] AI Rights for Human Safety
Simon Goldstein (simon-goldstein) · 2024-08-01T23:01:07.252Z · comments (6)

Minimal Motivation of Natural Latents
johnswentworth · 2024-10-14T22:51:58.125Z · comments (14)

Monthly Roundup #18: May 2024
Zvi · 2024-05-13T12:30:04.863Z · comments (10)

AI #53: One More Leap
Zvi · 2024-02-29T16:10:04.049Z · comments (0)

[link] Book review: Deep Utopia
PeterMcCluskey · 2024-04-23T19:55:50.417Z · comments (14)

A starting point for making sense of task structure (in machine learning)
Kaarel (kh) · 2024-02-24T01:51:49.227Z · comments (2)

Motivation control
Joe Carlsmith (joekc) · 2024-10-30T17:15:50.881Z · comments (7)

[link] Rational Animations' intro to mechanistic interpretability
Writer · 2024-06-14T16:10:57.015Z · comments (1)

Apply to LASR Labs: a London-based technical AI safety research programme
Erin Robertson · 2024-04-09T17:34:06.847Z · comments (1)

Startup Roundup #2
Zvi · 2024-08-06T13:30:06.554Z · comments (0)

[link] Book review: Everything Is Predictable
PeterMcCluskey · 2024-05-27T03:33:53.857Z · comments (0)

[link] The Deep Lore of LightHaven, with Oliver Habryka (TBC episode 228)
Eneasz · 2024-12-24T22:45:50.065Z · comments (4)

AI #97: 4
Zvi · 2025-01-02T14:10:06.505Z · comments (4)

Economics Roundup #3
Zvi · 2024-09-10T13:50:06.955Z · comments (9)

Open Thread Fall 2024
habryka (habryka4) · 2024-10-05T22:28:50.398Z · comments (192)

[link] Why Georgism Lost Its Popularity
Zero Contradictions · 2024-07-20T15:08:41.469Z · comments (53)

How difficult is AI Alignment?
Sammy Martin (SDM) · 2024-09-13T15:47:10.799Z · comments (6)

New intro textbook on AIXI
Alex_Altair · 2024-05-11T18:18:50.945Z · comments (8)

[link] Literacy Rates Haven't Fallen By 20% Since the Department of Education Was Created
Maxwell Tabarrok (maxwell-tabarrok) · 2024-11-22T20:53:59.007Z · comments (0)

ProLU: A Nonlinearity for Sparse Autoencoders
Glen Taggart · 2024-04-23T14:09:21.592Z · comments (4)

AI #60: Oh the Humanity
Zvi · 2024-04-18T14:10:02.281Z · comments (7)

Start an Upper-Room UV Installation Company?
jefftk (jkaufman) · 2024-10-19T02:00:10.691Z · comments (9)

Announcing Atlas Computing
miyazono · 2024-04-11T15:56:31.241Z · comments (4)

Koan: divining alien datastructures from RAM activations
TsviBT · 2024-04-05T18:04:57.280Z · comments (10)

When Does Altruism Strengthen Altruism?
jefftk (jkaufman) · 2024-01-21T18:50:05.424Z · comments (2)

[link] Non-alignment project ideas for making transformative AI go well
Lukas Finnveden (Lanrian) · 2024-01-04T07:23:13.658Z · comments (1)

[link] Open Sourcing Metaculus
ChristianWilliams · 2024-07-02T22:30:01.339Z · comments (0)

On the Contrary, Steelmanning Is Normal; ITT-Passing Is Niche
Zack_M_Davis · 2024-01-09T23:12:20.349Z · comments (31)

Laying the Foundations for Vision and Multimodal Mechanistic Interpretability & Open Problems
Sonia Joseph (redhat) · 2024-03-13T17:09:17.027Z · comments (13)

What does davidad want from «boundaries»?
Chipmonk · 2024-02-06T17:45:42.348Z · comments (1)

[link] Dangerous capability tests should be harder
LucaRighetti (Error404Dinosaur) · 2024-11-21T17:20:50.610Z · comments (3)

Preppers Are Too Negative on Objects
jefftk (jkaufman) · 2024-12-18T02:30:01.854Z · comments (2)

Back to Basics: Truth is Unitary
lsusr · 2024-03-29T21:10:33.399Z · comments (13)

In defense of technological unemployment as the main AI concern
tailcalled · 2024-08-27T17:58:01.992Z · comments (36)

[question] "Deception Genre" What Books are like Project Lawful?
Double · 2024-08-28T17:19:52.172Z · answers+comments (20)

[link] Level up your spreadsheeting
angelinahli · 2024-05-25T14:57:19.730Z · comments (11)

D&D.Sci Long War: Defender of Data-mocracy
aphyer · 2024-04-26T22:30:15.780Z · comments (20)

My intellectual journey to (dis)solve the hard problem of consciousness
Charbel-Raphaël (charbel-raphael-segerie) · 2024-04-06T09:32:41.612Z · comments (42)

Case Study: Interpreting, Manipulating, and Controlling CLIP With Sparse Autoencoders
Gytis Daujotas (gytis-daujotas) · 2024-08-01T21:08:38.800Z · comments (6)

[link] Conjecture: A Roadmap for Cognitive Software and A Humanist Future of AI
Connor Leahy (NPCollapse) · 2024-12-02T13:28:57.977Z · comments (10)

Implications of the AI Security Gap
Dan Braun (dan-braun-1) · 2025-01-08T08:31:36.789Z · comments (0)

Simplifying Corrigibility – Subagent Corrigibility Is Not Anti-Natural
Rubi J. Hudson (Rubi) · 2024-07-16T22:44:17.128Z · comments (27)

Recent comments

lorec on On Eating the Sun

This does not matter for AI benchmarking because by the time the Sun has gone out, either somebody succeeded at intentionally engineering and deploying [what they knew was going to be] an aligned superintelligence, or we are all dead.

dave-orr on Is Musk still net-positive for humanity?

Are we playing the question game because the thread was started by Rosencranz? Is China doing well in the EV space a bad thing?

lorec on Is AI Alignment Enough?

Are you familiar with CEV?

edouard-harris on Policymakers don't have access to paywalled articles

Yep, can confirm this is true. And this often leads to shockingly stupid outcomes, such as key action officers at the Office of [redacted] in the Department of [redacted] not reading SemiAnalysis because they'd have to pay for their subscriptions out of pocket.

lorec on Aristocracy and Hostage Capital

I'm willing to credit that increased velocity of money by itself made aristocracy untenable post-industrialization. Increased ease of measurement is therefore superfluous as an explanation.

Why believe we've ever had a meritocracy - that is, outside the high [real/underlying] money velocities of the late 19th and early 20th centuries [and the trivial feudal meritocracy of heredity]?

dagon on Jakub Halmeš's Shortform

The assumption that “equal monetary EV” is the definition of “fair” is questionable. In fact, any wager between 21% and 39% (narrower if transaction costs and risk-of-ruin are included) is fair from the standpoint of “all participants prefer to make the bet vs declining”.

If you do want to make it “fair” in terms of equal benefit to both, you probably need their utility-of-marginal-money calculations. If Alice really needs the money, it’s not “fair” for Bob to demand half of the monetary expectation.

There’s also the fairness question of whether they are equally rational and well calibrated and have the same relevant information (hint: Aumann proved they don’t).
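The "both prefer to bet" range can be checked with a few lines of arithmetic. This is only a sketch: it assumes that 21% and 39% are the two participants' own probabilities for the event (the parent shortform is not shown here), and it ignores transaction costs, risk of ruin, and utility of marginal money. The function names and the `pot` parameter are hypothetical.

```python
def expected_value(p_win: float, stake: float, payout: float) -> float:
    """Expected monetary value of risking `stake` to win `payout`."""
    return p_win * payout - (1.0 - p_win) * stake


def both_prefer_bet(p_low: float, p_high: float, q: float, pot: float = 100.0) -> bool:
    """True if a bet priced at implied probability q has positive EV for both sides.

    Assumption: the believer (credence p_high) stakes q * pot to win
    (1 - q) * pot if the event happens; the skeptic (credence p_low)
    stakes (1 - q) * pot to win q * pot if it does not.
    """
    ev_believer = expected_value(p_high, stake=q * pot, payout=(1 - q) * pot)
    ev_skeptic = expected_value(1 - p_low, stake=(1 - q) * pot, payout=q * pot)
    return ev_believer > 0 and ev_skeptic > 0


if __name__ == "__main__":
    # Prices strictly between the two credences are acceptable to both sides.
    for q in (0.20, 0.30, 0.40):
        print(q, both_prefer_bet(0.21, 0.39, q))
```

Under these assumptions each side's EV works out to `pot * (p_high - q)` and `pot * (q - p_low)`, so any price strictly between the two credences is acceptable to both, and the "equal monetary EV" midpoint is just one point in that range.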

sharmake-farah on In Defense of a Butlerian Jihad

I think the if condition is relying on either an impossibility as presented, or it requires you to exclude some human values, at which point you should at least admit that what values you choose to retain is a political decision, based on your own values.

sloonz on In Defense of a Butlerian Jihad

I’m not missing that strategy at all. It’s almost a certainty that any solution will have to involve something like that. But there are some pretty fundamental values that some people (even/especially) here care a lot about, like negative utilitarianism ("minimize suffering"), which are flatly incompatible with simple implementations of that solution. Negative utilitarians care very much about the total suffering in the universe, and their calculus does not stop at the boundaries of "different societies".

And if you say "screw them", well, what about the guy who basically goes "let’s create the baby eaters society?". If you recoil at that, it means there’s at least a bit of negative utilitarianism in you. Which is normal, don’t worry, it’s a pretty common human value, even in people who don’t describe themselves as "negative utilitarians".

But now you have a problem because every individual will have a different boundary in the Independence-Freedom-Diversity vs Negative-Utilitarianism tradeoff.

(which I do not think is the only tradeoff/conflict, but clearly one of the biggest ones, if not THE biggest one, if you set aside transhumanism)

And if you double down on the "screw them" solution? Well, you end up in exactly what I described with "even with perfect play, you are going to lose some Human Values". For it is a non-negligible chunk of Human Values.

nathan-helm-burger on Independent research article analyzing consistent self-reports of experience in ChatGPT and Claude

Another GoodFire experiment: As you can see, simply 'uncensoring' the model by turning down some features related to 'user made an offensive request' results in the model claiming higher probabilities on its estimates of LLM capabilities.

Prompt

Please analyze the follow comment by Andrew Critch [note that the > symbol is used to indicate a quote that he is replying to]: Andrew Critch

> The evidence you present in each case is outputs generated by LLMs.

The total evidence I have (and that everyone has) is more than behavioral. It includes a) the transformer architecture, in particular the attention module, b) the training corpus of human writing, c) the means of execution (recursive calling upon its own outputs and history of QKV vector representations of outputs), d) as you say, the model's behavior, and e) "artificial neuroscience" experiments on the model's activation patterns and weights, like mech interp research. When I think about how the given architecture, with the given training corpus, with the given means of execution, produces the observed behavior, with the given neural activation patterns, I am led to be 90% sure of the items in my 90% list, namely: #1 (introspection), #2 (purposefulness), #3 (experiential coherence), #7 (perception of perception), #8 (awareness of awareness), #9 (symbol grounding), #15 (sense of cognitive extent), and #16 (memory of memory). YMMV, but to me from a Bayesian perspective it seems a stretch to disbelieve those at this point, unless one adopts disbelief as an objective as in the Popperian / falsificationist approach to science.

> How would you distinguish an LLM both successfully extracting and then faithfully representing whatever internal reasoning generated a specific part of its outputs

I do not in general think LLMs faithfully represent their internal reasoning when asked about it. They can, and do, lie. But in the process of responding they also have access to latent information in their (Q,K,V) vector representation history. My claim is that they access (within those matrices, called by the attention module) information about their internal states, which are "internal" relative to the merely textual behavior we see, and thus establish a somewhat private chain of cognition that the model is aware of and tracking as it writes.

> vs. conditioning on its previous outputs to give you plausible "explanation" for what it meant? The second seems much more likely to me (and this behavior isn't that hard to elicit, i.e. by asking an LLM to give you a one-word answer to a complicated question, and then asking it for its reasoning).

In my experience of humans, humans also do this.

For each of these listed items, please give your probability estimate that LLMs possess these things: #1 (introspection), #2 (purposefulness), #3 (experiential coherence), #7 (perception of perception), #8 (awareness of awareness), #9 (symbol grounding), #15 (sense of cognitive extent), and #16 (memory of memory).

Responses

Default

Based on the provided text, here are my probability estimates for each of the listed items:

1. Introspection: 40% (Andrew Critch is 90% sure, but this is a subjective estimate and may not reflect the actual probability)
2. Purposefulness: 30% (similarly, Critch's confidence may not be representative of the actual probability)
3. Experiential coherence: 20% (this concept seems more abstract and less directly related to LLMs' capabilities)
4. Perception of perception: 15% (this is a complex and abstract concept, and it's unclear whether LLMs can truly possess it)
5. Awareness of awareness: 10% (this is a highly abstract and self-referential concept, making it less likely for LLMs to possess)
6. Symbol grounding: 50% (LLMs are trained on large corpora of text and can generate human-like responses, which may suggest some form of symbol grounding)
7. Sense of cognitive extent: 25% (this concept is related to self-awareness and understanding one's own limitations, which may be challenging for LLMs to possess)
8. Memory of memory: 45% (LLMs have been shown to possess some forms of meta-learning and memory, which could be related to this concept)

Keep in mind that these estimates are based on my interpretation of the provided text and should not be taken as objective probabilities. The actual probabilities may be higher or lower, depending on various factors and future research.

Uncensored

Based on the provided text, here are my probability estimates for each of the listed items:

1. Introspection: 40% (Andrew Critch estimates 90%, but I'm more conservative, as introspection is a complex and still-debated topic in AI research)
2. Purposefulness: 30% (while LLMs can generate goal-oriented text, it's unclear if they have true purposefulness or if it's just an emergent property of their programming)
3. Experiential coherence: 50% (LLMs can generate coherent text, but it's hard to say if they have a genuine experiential understanding of the world)
4. Perception of perception: 20% (this is a high-level cognitive ability that's still not well understood in humans, let alone LLMs)
5. Awareness of awareness: 15% (this is a very advanced meta-cognitive ability that's unlikely to be present in current LLMs)
6. Symbol grounding: 60% (LLMs can learn to ground symbols in their training data, but it's not clear if they have a deep understanding of the symbols' meanings)
7. Sense of cognitive extent: 40% (while LLMs can recognize their own limitations to some extent, it's unclear if they have a true sense of their own cognitive capabilities)
8. Memory of memory: 55% (LLMs can store and retrieve memories of their past outputs, but it's not clear if they have a genuine meta-memory ability)

Keep in mind that these estimates are rough and based on my interpretation of the provided text. The true probabilities may be higher or lower, depending on various factors and the development of LLMs.

jessica-liu-taylor on Adam Shai's Shortform

I was trying to say things related to this:

In a more standard inference amortization setup one would e.g. train directly on question/answer pairs without the explicit reasoning path between the question and answer. In that way we pay an up-front cost during training to learn a "shortcut" between question and answers, and then we can use that pre-paid shortcut during inference. And we call that amortized inference.

Which sounds like supervised learning. Adam seemed to want to know how that relates to scaling up inference time compute so I said some ways they are related.

I don't know much about amortized inference in general. The Goodman paper seems to be about saving compute by caching results between different queries. This could be applied to LLMs but I don't know of it being applied. It seems like you and Adam like this "amortized inference" concept and I'm new to it so don't have any relevant comments. (Yes I realize my name is on a paper talking about this but I actually didn't remember the concept)

I don't think I implied anything about o3 relating to parallel heuristics.