LessWrong 2.0 Reader


Two LessWrong speed friending experiments
mikko (morrel) · 2024-06-15T10:52:26.081Z · comments (3)

[link] The Mysterious Trump Buyers on Polymarket
Annapurna (jorge-velez) · 2024-10-18T13:26:25.565Z · comments (9)

Gradient Descent on the Human Brain
Jozdien · 2024-04-01T22:39:24.862Z · comments (5)

Schelling points in the AGI policy space
mesaoptimizer · 2024-06-26T13:19:25.186Z · comments (2)

Announcing the Double Crux Bot
sanyer (santeri-koivula) · 2024-01-09T18:54:15.361Z · comments (8)

Anthropical Paradoxes are Paradoxes of Probability Theory
Ape in the coat · 2023-12-06T08:16:26.846Z · comments (18)

Reflections on my first year of AI safety research
Jay Bailey · 2024-01-08T07:49:08.147Z · comments (3)

The case for stopping AI safety research
catubc (cat-1) · 2024-05-23T15:55:18.713Z · comments (38)

Was Releasing Claude-3 Net-Negative?
Logan Riggs (elriggs) · 2024-03-27T17:41:56.245Z · comments (5)

AI #45: To Be Determined
Zvi · 2024-01-04T15:00:05.936Z · comments (4)

[link] OpenAI Staff (including Sutskever) Threaten to Quit Unless Board Resigns
Seth Herd · 2023-11-20T14:20:33.539Z · comments (28)

Parental Writing Selection Bias
jefftk (jkaufman) · 2024-10-13T14:00:03.225Z · comments (3)

BatchTopK: A Simple Improvement for TopK-SAEs
Bart Bussmann (Stuckwork) · 2024-07-20T02:20:51.848Z · comments (0)

Pseudonymity and Accusations
jefftk (jkaufman) · 2023-12-21T19:20:19.944Z · comments (20)

AI #43: Functional Discoveries
Zvi · 2023-12-21T15:50:04.442Z · comments (26)

Can we build a better Public Doublecrux?
Raemon · 2024-05-11T19:21:53.326Z · comments (6)

Claude Sonnet 3.5.1 and Haiku 3.5
Zvi · 2024-10-24T14:50:06.286Z · comments (9)

Polysemantic Attention Head in a 4-Layer Transformer
Jett Janiak (jett) · 2023-11-09T16:16:35.132Z · comments (0)

Does literacy remove your ability to be a bard as good as Homer?
Adrià Garriga-alonso (rhaps0dy) · 2024-01-18T03:43:14.994Z · comments (19)

Rewilding the Gut VS the Autoimmune Epidemic
GGD · 2024-08-16T18:00:46.239Z · comments (0)

[link] The Good Balsamic Vinegar
jenn (pixx) · 2024-01-26T19:30:57.435Z · comments (4)

Book Review: Righteous Victims - A History of the Zionist-Arab Conflict
Yair Halberstadt (yair-halberstadt) · 2024-06-24T11:02:03.490Z · comments (8)

[link] Anthropic's updated Responsible Scaling Policy
Zac Hatfield-Dodds (zac-hatfield-dodds) · 2024-10-15T16:46:48.727Z · comments (3)

Model evals for dangerous capabilities
Zach Stein-Perlman · 2024-09-23T11:00:00.866Z · comments (9)

Cooperating with aliens and AGIs: An ECL explainer
Chi Nguyen · 2024-02-24T22:58:47.345Z · comments (8)

Provably Safe AI: Worldview and Projects
bgold · 2024-08-09T23:21:02.763Z · comments (43)

[link] how birds sense magnetic fields
bhauth · 2024-06-27T18:59:35.075Z · comments (4)

How to Give in to Threats (without incentivizing them)
Mikhail Samin (mikhail-samin) · 2024-09-12T15:55:50.384Z · comments (26)

Llama Llama-3-405B?
Zvi · 2024-07-24T19:40:07.565Z · comments (9)

On Lex Fridman’s Second Podcast with Altman
Zvi · 2024-03-25T12:20:08.780Z · comments (10)

Will 2024 be very hot? Should we be worried?
A.H. (AlfredHarwood) · 2023-12-29T11:22:50.200Z · comments (12)

[link] Prices are Bounties
Maxwell Tabarrok (maxwell-tabarrok) · 2024-10-12T14:51:40.689Z · comments (13)

The Shutdown Problem: Incomplete Preferences as a Solution
EJT (ElliottThornley) · 2024-02-23T16:01:16.378Z · comments (23)

Applying refusal-vector ablation to a Llama 3 70B agent
Simon Lermen (dalasnoin) · 2024-05-11T00:08:08.117Z · comments (14)

On OpenAI’s Preparedness Framework
Zvi · 2023-12-21T14:00:05.144Z · comments (4)

[link] Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems
Gunnar_Zarncke · 2024-05-16T13:09:39.265Z · comments (20)

Reformative Hypocrisy, and Paying Close Enough Attention to Selectively Reward It.
Andrew_Critch · 2024-09-11T04:41:24.872Z · comments (7)

How might we solve the alignment problem? (Part 1: Intro, summary, ontology)
Joe Carlsmith (joekc) · 2024-10-28T21:57:12.063Z · comments (5)

[link] Bed Time Quests & Dinner Games for 3-5 year olds
Gunnar_Zarncke · 2024-06-22T07:53:38.989Z · comments (0)

D&D.Sci Alchemy: Archmage Anachronos and the Supply Chain Issues Evaluation & Ruleset
aphyer · 2024-06-17T21:29:08.778Z · comments (11)

Toy models of AI control for concentrated catastrophe prevention
Fabien Roger (Fabien) · 2024-02-06T01:38:19.865Z · comments (2)

AI #52: Oops
Zvi · 2024-02-22T21:50:07.393Z · comments (9)

[link] Can AI Outpredict Humans? Results From Metaculus's Q3 AI Forecasting Benchmark
ChristianWilliams · 2024-10-10T18:58:46.041Z · comments (2)

Sherlockian Abduction Master List
Cole Wyeth (Amyr) · 2024-07-11T20:27:00.000Z · comments (63)

Gemini 1.0
Zvi · 2023-12-07T14:40:05.243Z · comments (7)

On Complexity Science
Garrett Baker (D0TheMath) · 2024-04-05T02:24:32.039Z · comments (19)

[link] A starter guide for evals
Marius Hobbhahn (marius-hobbhahn) · 2024-01-08T18:24:23.913Z · comments (2)

Transfer learning and generalization-qua-capability in Babbage and Davinci (or, why division is better than Spanish)
RP (Complex Bubble Tea) · 2024-02-09T07:00:45.825Z · comments (6)

Vipassana Meditation and Active Inference: A Framework for Understanding Suffering and its Cessation
Benjamin Sturgeon (benjamin-sturgeon) · 2024-03-21T12:32:22.475Z · comments (8)

AI #82: The Governor Ponders
Zvi · 2024-09-19T13:30:04.863Z · comments (8)


Recent comments

abramdemski on Complete Feedback

How about “purely epistemic” means “updated by self-supervised learning”, i.e. the updates (gradients, trader bankrolls, whatever) are derived from “things being true vs false” as opposed to “things being good vs bad”. Right?

You can define it that way, but then I don't think it's highly relevant for this context.

The story I'm telling here is that partial feedback (typically: learning some sort of input-output relation via some sort of latents) always leaves us with undesired hypotheses which we can't rule out using the restricted feedback mechanism.

Reinforcement Learning cannot rule out the wireheading hypothesis or human-manipulation hypothesis.
Updating on things being true or false cannot rule out agentic hypotheses (the inner optimizer problem).

Any sufficiently rich hypothesis space contains agentic policies, which can't be ruled out by the feedback. "Purely epistemic" in your sense filters for hypotheses which make good predictions, but this doesn't constrain things to be non-agentic. The system can learn to use predictions as actions in some way.

[I learned the term teleosemantics from you! :) ]

I think it would be fair to define a teleosemantic notion of "purely epistemic" as something like "there is no optimization (anywhere in the system -- 'inner' or 'outer') except optimization for epistemic accuracy".

The obvious application of my main point is that some form of "complete feedback" is a necessary (but insufficient) condition for this.

"Epistemic accuracy" here has to be defined in such a way as to capture the one-way "direction-of-fit" optimization of the map to fit the territory, but never the territory to fit the map. IE the optimization algorithm has to ignore the causal impact of its predictions.

However, I don't particularly endorse this as the correct design choice -- although a system with this property would be relatively safe in the sense of eliminating inner-alignment concerns and (in a sense) outer-alignment concerns, it is doing so by ignoring its impact on the world, which creates its own set of dangers. If such a system were widely deployed and became highly trusted for its predictions, it could stumble into bad self-fulfilling prophecies.

So, in my view, "epistemic" systems should be as transparent as possible with human users about possible multiple-fixed-point issues, try to keep humans in the loop and give the important decisions to humans; but ultimately, we need to view even "purely epistemic" systems as making some important (instrumental) decisions, and have them take some responsibility for making those decisions well instead of poorly.

The original LI paper was in that category, IIUC. The updates (to which traders had more vs less money) are derived from mathematical propositions being true vs false.

I was going to remind you that the paper didn't say how the fixed-point selection works, and we can do that part in an agentic way, but then you go on to say basically the same thing (with the caveat that where you say "put the LI into a larger system that follows the rule: whatever the expectations are about the AI's own actions, make that actually happen" I would say the more general "put the LI into an environment which somehow reacts to its predictions"):

LI defines a notion of logically uncertain variable, which can be used to represent desires
I would say that they don’t really represent desires. They represent expectations about what’s going to happen, possibly including expectations about an AI’s own actions.
And then you can put the LI into a larger system that follows the rule: whatever the expectations are about the AI’s own actions, make that actually happen.
The important thing that changes in this situation is that the convergence of the algorithm is underdetermined—you can have multiple fixed points. I can expect to stand up, and then I stand up, and my expectation was validated. No update. I can expect to stay seated, and then I stay seated, and my expectation was validated. No update.
(I don’t think I’m saying anything you don’t already know well.)
Anyway, if you do that, then I guess you could say that the LI’s expectations “can be used” to represent desires … but I maintain that that’s a somewhat confused and unproductive way to think about what’s going on. If I intervene to change the LI variable, it would be analogous to changing habits (what do I expect myself to do ≈ which action plans seem most salient and natural), not analogous to changing desires.
(I think the human brain has a system vaguely like LI, and that it resolves the underdetermination by a separate valence system, which evaluates expectations as being good vs bad, and applies reinforcement learning to systematically seek out the good ones.)
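
A minimal sketch of the multiple-fixed-point situation described in the quoted passage, under loud assumptions: the belief is a single probability `p_stand`, the surrounding system enacts whatever the belief favors, and the update simply moves the belief toward the observed outcome. The function name, the 0.5 threshold, and the update rule are all illustrative and do not come from the LI paper.

```python
def run_self_fulfilling(p_stand: float, steps: int = 50, lr: float = 0.5) -> float:
    """Return the final belief that 'I will stand up' after repeated updates.

    Assumption: the wrapper system makes the expected action actually happen.
    """
    for _ in range(steps):
        # The larger system enacts whatever the predictor currently expects.
        stood_up = 1.0 if p_stand >= 0.5 else 0.0
        # Purely epistemic update: move the belief toward what was observed.
        p_stand += lr * (stood_up - p_stand)
    return p_stand


belief_a = run_self_fulfilling(0.9)  # starts out expecting to stand up
belief_b = run_self_fulfilling(0.1)  # starts out expecting to stay seated
```

Under these assumptions, a belief starting at 0.9 settles near 1 and a belief starting at 0.1 settles near 0. Both end states are validated by experience, so the accuracy-driven update alone does not choose between them.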

I don't understand what you're trying to accomplish in these paragraphs. To me you sound sorta like Bob in the following:

Alice: Here's my computer model of an agent.
Bob: Uh oh, that sounds sort of like active inference. How did you represent values? Did you confuse them with beliefs?
Alice: I used floating-point numbers to represent the expected value of a state. Here, look at my code. It's a Q-learning algorithm.
Bob: You realize that "expected value" is a statistics thing, right? That makes it epistemic, not really value-laden in the sense that makes something agentic. It's a prediction of what a number will be. Indeed, we can justify expected values as min-quadratic-loss estimates. That makes them epistemic!
Alice: Well, I agree that expected values aren't automatically "values" in the agentic sense, but look, my code can solve mazes and stuff -- like a rat learning to get cheese.
Bob: Of course I agree that it can be used in an instrumental way, but that's a really misleading way to describe it overall, right? If you changed one Q-value estimated by the system, that would be analogous to changing habits, not desires, right?
Alice: um??? If we agree that it can be used in an instrumental way, then what are you saying is misleading?
Bob: I mean, sure, the human brain does something like this.
Alice: Ok??
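
For readers who have not seen one, here is a rough sketch of the kind of program Alice might be pointing at: tabular Q-learning on a five-cell corridor with a reward at the far end. Everything here (the corridor, the names `step`, `train`, `greedy_policy`, and the hyperparameters) is a hypothetical illustration, not code from the discussion.

```python
import random

N_STATES = 5            # corridor cells 0..4; cell 4 holds the "cheese"
GOAL = N_STATES - 1
LEFT, RIGHT = 0, 1


def step(state, action):
    """Deterministic corridor dynamics with a wall at cell 0."""
    nxt = max(0, state - 1) if action == LEFT else state + 1
    reward = 1.0 if nxt == GOAL else 0.0
    return nxt, reward


def train(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    """Learn Q-values from uniformly random exploration (off-policy)."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]  # q[state][action]
    for _ in range(episodes):
        state = 0
        while state != GOAL:
            action = rng.choice((LEFT, RIGHT))
            nxt, reward = step(state, action)
            # Each Q-value is an estimate of expected discounted reward.
            target = reward + gamma * max(q[nxt])
            q[state][action] += alpha * (target - q[state][action])
            state = nxt
    return q


def greedy_policy(q):
    """Act on the estimates: pick the higher-valued action in each cell."""
    return [RIGHT if q[s][RIGHT] > q[s][LEFT] else LEFT for s in range(GOAL)]


q = train()
policy = greedy_policy(q)   # heads RIGHT, toward the reward, in every cell

q[2][RIGHT] = 0.0           # overwrite a single estimate by hand
edited_policy = greedy_policy(q)  # behavior in cell 2 flips to LEFT
```

The Q-values are statistical estimates, yet acting greedily on them is what steers the system toward the reward. Overwriting a single entry changes what it does in that one cell while leaving the reward it is trained on untouched.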

It seems possible that you think we have some disagreement that we don't have?

Also, it is easy for end users to build agentlike things out of belieflike things by making queries about how to accomplish things. Thus, we need to train epistemic systems to be responsible about how such queries are answered (as is already apparent in existing chatbots).
I’m not sure that this is coming from a coherent threat model (or else I don’t follow).
If Dr. Evil trains his own AGI, then this whole thing is moot, because he wants the AGI to have accurate beliefs about bioweapons.
If Benevolent Bob trains the AGI and gives API access to Dr. Evil, then Bob can design the AGI to (1) have accurate beliefs about bioweapons, and (2) not answer Dr. Evil’s questions about bioweapons. That might ideally look like what we’re used to in the human world: the AGI says things because it wants to say those things, all things considered, and it doesn’t want Dr. Evil to build bioweapons, either directly or because it’s guessing what Bob would want.

I'm not clear on what you're trying to disagree with here. It sounds like we both agree that if Benevolent Bob builds a powerful "purely epistemic system" (by whatever definition), without limiting its knowledge, then Dr. Evil can misuse it; and we both agree that as a consequence of this, it makes sense to instead build some agency into the system, so that the system can decide not to give users dangerous information.

Possibly you disagree with the claim "it is easy to build agentlike things out of belieflike things"? What I have in mind is a powerful epistemic oracle. As a simple example, let's say it can give highly accurate guesses to mathematically-posed problems. Then Dr. Evil can implement AIXI by feeding in AIXI's mathematical definition, for example. This is the sort of thing I had in mind, but generalized to the nonmathematical case. (EG, "conditional on my owning a super-powerful death ray soon, what actions do I take now")
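
As a toy illustration of that claim, the sketch below wraps a question-answering oracle in an argmax loop. The `Oracle` interface, `agent_from_oracle`, and `toy_oracle` are hypothetical names invented for this example; no real system's API is implied.

```python
from typing import Callable, Sequence

# Hypothetical oracle interface: the estimated probability that `goal`
# comes about, given that `action` is taken.
Oracle = Callable[[str, str], float]


def agent_from_oracle(oracle: Oracle, actions: Sequence[str], goal: str) -> str:
    """Pick the action the oracle rates most likely to achieve the goal."""
    return max(actions, key=lambda a: oracle(goal, a))


def toy_oracle(goal: str, action: str) -> float:
    """Stand-in for a powerful predictor: a tiny lookup table."""
    table = {
        ("door is open", "push door"): 0.9,
        ("door is open", "wait"): 0.1,
    }
    return table.get((goal, action), 0.0)


choice = agent_from_oracle(toy_oracle, ["wait", "push door"], "door is open")
```

The oracle only answers conditional questions. The goal-directedness comes entirely from the few lines that query it and act on the best answer.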

jozdien on johnswentworth's Shortform

For what it's worth, and for the purpose of making a public prediction in case I'm wrong, my median prediction is that [some mixture of scaling + algorithmic improvements still in the LLM regime, with at least 25% gains coming from the former] will continue for another couple years. And that's separate from my belief that if we did try to only advance through the same mixture of scale and algorithmic advancement, we'd still get much more powerful models, just slower.

I'm not very convinced by the claims about scaling hitting a wall, considering we haven't had the compute to train models significantly larger than GPT-4 until recently. Plus other factors like post-training taking a lot of time (GPT-4 took ~6 months from the base model being completed to release, I think? And this was a lot longer than GPT-3), labs just not being good at understanding how good their models are, etc. Though I'm not sure how much of your position is closer to "scaling will be <25-50% of future gains" than "scaling gains will be marginal / negligible", especially since a large part of this trajectory involves e.g. self-play or curated data for overcoming the data wall (would that count more as an algorithmic improvement or scaling?)

haiku-1 on The Compendium, A full argument about extinction risk from AGI

I used to be a creationist, and I have put some thought into this stumbling block. I came to the conclusion that it isn't worth leaving out analogies to evolution, because the style of argument that would work best for most creationists is completely different to begin with. Creationism is correlated with religious conservatism, and most religious conservatives outright deny that human extinction is a possibility.

The Compendium isn't meant for that audience, because it explicitly presents a worldview, and religious conservatives tend to strongly resist shifts to their worldviews or the adoption of new worldviews (more so than others already do). I think it is best left to other orgs to make arguments about AI Risk that are specifically friendly to religious conservatism. (This isn't entirely hypothetical. PauseAI US has recently begun to make inroads with religious organizations.)

unexpectedvalues on Seven lessons I didn't learn from election day

Yeah, if you were to use the neighbor method, the correct way to do so would involve post-processing, like you said. My guess, though, is that you would get essentially no value from it even if you did that, and that the information you get from normal polls would pretty much screen off any information you'd get from the neighbor method.

d0themath on D0TheMath's Shortform

The mistake you are making is assuming that "ZFC is consistent" = Consistent(ZFC), where the latter is the Gödel encoding of "ZFC is consistent" specified within the language of ZFC.

If your logic were valid, it would just as well break the entirety of the second incompleteness theorem. That is, you would say "well of course ZFC can prove Consistent(ZFC) if it is consistent, for either ZFC is consistent, and we're done, or ZFC is not consistent, but that is a contradiction since 'ZFC is consistent' => Consistent(ZFC)".

The fact is that ZFC itself cannot recognize that Consistent(ZFC) is equivalent to "ZFC is consistent".

@Morpheus you too seem confused by this, so tagging you as well.

unexpectedvalues on Seven lessons I didn't learn from election day

I think this just comes down to me having a narrower definition of a city.

johnswentworth on johnswentworth's Shortform

Regarding the recent memes about the end of LLM scaling: David and I have been planning on this as our median world since about six months ago. The data wall has been a known issue for a while now, updates from the major labs since GPT-4 already showed relatively unimpressive qualitative improvements by our judgement, and attempts to read the tea leaves of Sam Altman's public statements pointed in the same direction too. I've also talked to others (who were not LLM capability skeptics in general) who had independently noticed the same thing and come to similar conclusions.

Our guess at that time was that LLM scaling was already hitting a wall, and this would most likely start to be obvious to the rest of the world around roughly December of 2024, when the expected GPT-5 either fell short of expectations or wasn't released at all. Then, our median guess was that a lot of the hype would collapse, and a lot of the investment with it. That said, since somewhere between 25%-50% of progress has been algorithmic all along, it wouldn't be that much of a slowdown to capabilities progress, even if the memetic environment made it seem pretty salient. In the happiest case a lot of researchers would move on to other things, but that's an optimistic take, not a median world.

(To be clear, I don't think you should be giving us much prediction-credit for that, since we didn't talk about it publicly. I'm posting mostly because I've seen a decent number of people for whom the death of scaling seems to be a complete surprise and they're not sure whether to believe it. For those people: it's not a complete surprise, this has been quietly broadcast for a while now.)

leogao on Seven lessons I didn't learn from election day

People generally assume those around them agree with them (even when they don't see loud support of their position - see "silent majority"). So when you ask what their neighbors think, they will guess their neighbors have the same views as themselves, and will report their own beliefs with plausible deniability.

jsd on Win/continue/lose scenarios and execute/replace/audit protocols

This distinction reminds me of Evading Black-box Classifiers Without Breaking Eggs, in the black box adversarial examples setting.

buck on Fields that I reference when thinking about AI takeover prevention

FWIW, there are some people around the AI safety space, especially people who work on safety cases, who have that experience. E.g. UK AISI works with some people who are experienced safety analysts from other industries.