LessWrong 2.0 Reader

[link] Podcast discussing Hanson's Cultural Drift Argument
vaishnav92 · 2024-10-20T17:58:41.416Z · comments (0)
How to Teach Your Brain to Hate Procrastination
10xyz (10xyz-coder) · 2024-10-21T20:12:40.809Z · comments (0)
[link] What is Confidence—in Game Theory and Life?
James Stephen Brown (james-brown) · 2024-12-10T23:06:24.072Z · comments (0)
Hamiltonian Dynamics in AI: A Novel Approach to Optimizing Reasoning in Language Models
Javier Marin Valenzuela (javier-marin-valenzuela) · 2024-10-09T19:14:56.162Z · comments (0)
Thoughts on the In-Context Scheming AI Experiment
ExCeph · 2025-01-09T02:19:09.558Z · comments (0)
ACI#9: What is Intelligence
Akira Pyinya · 2024-12-09T21:54:41.077Z · comments (0)
Methodology: Contagious Beliefs
James Stephen Brown (james-brown) · 2024-10-19T03:58:17.966Z · comments (0)
Near-death experiences
Declan Molony (declan-molony) · 2024-10-08T06:34:04.107Z · comments (1)
Some implications of radical empathy
MichaelStJules · 2025-01-07T16:10:16.755Z · comments (0)
In Defense of a Butlerian Jihad
sloonz · 2025-01-11T19:30:17.641Z · comments (3)
[question] How do we quantify non-philanthropic contributions from Buffet and Soros?
Philosophistry (philip-dhingra) · 2024-12-20T22:50:32.260Z · answers+comments (0)
[link] Social Science in its epistemological context
Arturo Macias (arturo-macias) · 2024-12-05T16:12:29.034Z · comments (0)
Bellevue Meetup
Cedar (xida-ren) · 2024-10-16T01:07:58.761Z · comments (0)
On the Practical Applications of Interpretability
Nick Jiang (nick-jiang) · 2024-10-15T17:18:25.280Z · comments (1)
[link] The Polite Coup
Charlie Sanders (charlie-sanders) · 2024-12-04T14:03:36.663Z · comments (0)
AI Training Opt-Outs Reinforce Global Power Asymmetries
kushagra (kushagra-tiwari) · 2024-11-30T22:08:06.426Z · comments (0)
Sexual Selection as a Mesa-Optimizer
Lorec · 2024-11-29T23:34:45.739Z · comments (0)
Understanding Emergence in Large Language Models
[deleted] · 2024-11-29T19:42:43.790Z · comments (1)
Exploring the Platonic Representation Hypothesis Beyond In-Distribution Data
rokosbasilisk · 2024-10-20T08:40:04.404Z · comments (2)
Enabling New Applications with Today's Mechanistic Interpretability Toolkit
ananya_joshi · 2024-10-25T17:53:23.960Z · comments (0)
A proposal for iterated interpretability with known-interpretable narrow AIs
Peter Berggren (peter-berggren) · 2025-01-11T14:43:05.423Z · comments (0)
Personal Philosophy
Xor · 2024-10-13T03:01:59.324Z · comments (0)
[link] When the Scientific Method Doesn't Really Help...
casualphysicsenjoyer (hatta_afiq) · 2024-11-27T19:52:30.023Z · comments (1)
Hope to live or fear to die?
Knight Lee (Max Lee) · 2024-11-27T10:42:37.070Z · comments (0)
5. Uphold Voluntarism: Digital Defense
Allison Duettmann (allison-duettmann) · 2025-01-02T19:05:33.963Z · comments (0)
Workshop Report: Why current benchmarks approaches are not sufficient for safety?
Tom DAVID (tom-david) · 2024-11-26T17:20:47.453Z · comments (1)
Should you increase AI alignment funding, or increase AI regulation?
Knight Lee (Max Lee) · 2024-11-26T09:17:01.809Z · comments (1)
notes on prioritizing tasks & cognition-threads
Emrik (Emrik North) · 2024-11-26T00:28:03.400Z · comments (1)
3. Improve Cooperation: Better Technologies
Allison Duettmann (allison-duettmann) · 2025-01-02T19:03:16.588Z · comments (2)
Interview with Bill O’Rourke - Russian Corruption, Putin, Applied Ethics, and More
JohnGreer · 2024-10-27T17:11:28.891Z · comments (0)
San Francisco ACX Meetup “First Saturday”
Nate Sternberg (nate-sternberg) · 2024-10-28T05:05:36.757Z · comments (0)
Your memory eventually drives confidence in each hypothesis to 1 or 0
Crazy philosopher (commissar Yarrick) · 2024-10-28T09:00:27.084Z · comments (6)
[question] 2025 Alignment Predictions
anaguma · 2025-01-02T05:37:36.912Z · answers+comments (3)
[question] How do you decide to phrase predictions you ask of others? (and how do you make your own?)
CstineSublime · 2025-01-10T02:44:26.737Z · answers+comments (0)
[question] How might language influence how an AI "thinks"?
bodry (plosique) · 2024-10-30T17:41:04.460Z · answers+comments (0)
[link] AI Safety at the Frontier: Paper Highlights, October '24
gasteigerjo · 2024-10-31T00:09:33.522Z · comments (0)
The boat
RomanS · 2024-11-22T12:56:45.050Z · comments (0)
[link] Both-Sidesism—When Fair & Balanced Goes Wrong
James Stephen Brown (james-brown) · 2024-11-02T03:04:03.820Z · comments (15)
[link] Solving Newcomb's Paradox In Real Life
Alice Wanderland (alice-wanderland) · 2024-12-11T19:48:44.486Z · comments (0)
[question] EndeavorOTC legit?
FinalFormal2 · 2024-10-17T01:33:12.606Z · answers+comments (0)
[question] Are Sparse Autoencoders a good idea for AI control?
Gerard Boxo (gerard-boxo) · 2024-12-26T17:34:55.617Z · answers+comments (2)
[link] Higher Order Signs, Hallucination and Schizophrenia
Nicolas Villarreal (nicolas-villarreal) · 2024-11-02T16:33:10.574Z · comments (0)
[question] How should I optimize my decision making model for 'ideas'?
CstineSublime · 2024-12-18T04:09:58.025Z · answers+comments (0)
San Francisco ACX Meetup “First Saturday”
Nate Sternberg (nate-sternberg) · 2024-09-29T03:13:34.615Z · comments (0)
[link] Technical Risks of (Lethal) Autonomous Weapons Systems
Heramb · 2024-10-23T20:41:13.238Z · comments (0)
Singular Learning Theory for Dummies
Rahul Chand (rahul-chand) · 2024-10-15T21:13:55.842Z · comments (0)
[question] Most capable publicly available agents?
Gabe · 2024-09-30T00:04:24.480Z · answers+comments (0)
Reminder: AI Safety is Also a Behavioral Economics Problem
zoop · 2024-12-20T01:40:53.847Z · comments (0)
The Technist Reformation: A Discussion with o1 About The Coming Economic Event Horizon
Yuli_Ban · 2024-12-11T02:34:22.329Z · comments (1)
The CARLIN Method: Teaching AI How to Be Genuinely Funny
Greg Robison (grobison) · 2024-12-09T21:51:05.504Z · comments (0)