LessWrong 2.0 Reader

On Dwarkesh’s Podcast with OpenAI’s John Schulman
Zvi · 2024-05-21T17:30:04.332Z · comments (4)

Mistakes people make when thinking about units
Isaac King (KingSupernova) · 2024-06-25T03:39:20.138Z · comments (14)

When "yang" goes wrong
Joe Carlsmith (joekc) · 2024-01-08T16:35:50.607Z · comments (6)

In Defense of Open-Minded UDT
abramdemski · 2024-08-12T18:27:36.220Z · comments (27)

Interpreting Preference Models w/ Sparse Autoencoders
Logan Riggs (elriggs) · 2024-07-01T21:35:40.603Z · comments (12)

[link] LK-99 in retrospect
bhauth · 2024-07-07T02:06:27.660Z · comments (21)

[link] Excerpts from "A Reader's Manifesto"
Arjun Panickssery (arjun-panickssery) · 2024-09-06T22:37:40.254Z · comments (1)

Laziness death spirals
PatrickDFarley · 2024-09-19T15:58:30.252Z · comments (6)

AXRP Episode 31 - Singular Learning Theory with Daniel Murfet
DanielFilan · 2024-05-07T03:50:05.001Z · comments (4)

[link] Soft Nationalization: how the USG will control AI labs
Deric Cheng (deric-cheng) · 2024-08-27T15:11:14.601Z · comments (7)

Intro to Superposition & Sparse Autoencoders (Colab exercises)
CallumMcDougall (TheMcDouglas) · 2023-11-29T12:56:21.608Z · comments (8)

AI for Bio: State Of The Field
sarahconstantin · 2024-08-30T18:00:02.187Z · comments (2)

Announcing Suffering For Good
Garrett Baker (D0TheMath) · 2024-04-01T17:08:12.322Z · comments (5)

FarmKind's Illusory Offer
jefftk (jkaufman) · 2024-08-09T11:30:07.082Z · comments (5)

Testbed evals: evaluating AI safety even when it can’t be directly measured
joshc (joshua-clymer) · 2023-11-15T19:00:41.908Z · comments (2)

LW Frontpage Experiments! (aka "Take the wheel, Shoggoth!")
Ruby · 2024-04-23T03:58:43.443Z · comments (27)

Survey for alignment researchers!
Cameron Berg (cameron-berg) · 2024-02-02T20:41:44.323Z · comments (11)

We need a Science of Evals
Marius Hobbhahn (marius-hobbhahn) · 2024-01-22T20:30:39.493Z · comments (13)

Guide to SB 1047
Zvi · 2024-08-20T13:10:07.408Z · comments (18)

Some Rules for an Algebra of Bayes Nets
johnswentworth · 2023-11-16T23:53:11.650Z · comments (30)

Do sparse autoencoders find "true features"?
Demian Till · 2024-02-22T18:06:59.630Z · comments (33)

Related Discussion from Thomas Kwa's MIRI Research Experience
Raemon · 2023-10-07T06:25:00.994Z · comments (140)

Linking Alt Accounts
jefftk (jkaufman) · 2023-10-06T17:00:09.802Z · comments (33)

[link] The True Story of How GPT-2 Became Maximally Lewd
Writer · 2024-01-18T21:03:08.167Z · comments (7)

AI Safety is Dropping the Ball on Clown Attacks
trevor (TrevorWiesinger) · 2023-10-22T20:09:31.810Z · comments (73)

Secular interpretations of core perennialist claims
zhukeepa · 2024-08-25T23:41:02.683Z · comments (31)

Eliezer's example on Bayesian statistics is wr... oops!
Zane · 2023-10-17T18:38:18.327Z · comments (13)

If we solve alignment, do we die anyway?
Seth Herd · 2024-08-23T13:13:10.933Z · comments (65)

Update on Chinese IQ-related gene panels
Lao Mein (derpherpize) · 2023-12-14T10:12:21.212Z · comments (7)

[link] Yoshua Bengio: Reasoning through arguments against taking AI safety seriously
Judd Rosenblatt (judd) · 2024-07-11T23:53:17.187Z · comments (3)

[link] [Repost] The Copenhagen Interpretation of Ethics
mesaoptimizer · 2024-01-25T15:20:08.162Z · comments (4)

[link] OpenAI: Preparedness framework
Zach Stein-Perlman · 2023-12-18T18:30:10.153Z · comments (23)

Creating unrestricted AI Agents with Command R+
Simon Lermen (dalasnoin) · 2024-04-16T14:52:50.917Z · comments (12)

D&D.Sci Scenario Index
aphyer · 2024-07-23T02:00:43.483Z · comments (0)

“Artificial General Intelligence”: an extremely brief FAQ
Steven Byrnes (steve2152) · 2024-03-11T17:49:02.496Z · comments (6)

Claude 3 claims it's conscious, doesn't want to die or be modified
Mikhail Samin (mikhail-samin) · 2024-03-04T23:05:00.376Z · comments (113)

Adam Optimizer Causes Privileged Basis in Transformer LM Residual Stream
Diego Caples (diego-caples) · 2024-09-06T17:55:34.265Z · comments (7)

Epistemic Hell
rogersbacon · 2024-01-27T17:13:09.578Z · comments (20)

Dumbing down
Martin Sustrik (sustrik) · 2024-06-09T06:50:47.469Z · comments (0)

[link] A framing for interpretability
Nina Panickssery (NinaR) · 2023-11-14T16:14:15.713Z · comments (5)

[link] InterLab – a toolkit for experiments with multi-agent interactions
Tomáš Gavenčiak (tomas-gavenciak) · 2024-01-22T18:23:35.661Z · comments (0)

AXRP Episode 27 - AI Control with Buck Shlegeris and Ryan Greenblatt
DanielFilan · 2024-04-11T21:30:04.244Z · comments (10)

Finding Sparse Linear Connections between Features in LLMs
Logan Riggs (elriggs) · 2023-12-09T02:27:42.456Z · comments (5)

[link] We're all in this together
Tamsin Leake (carado-1) · 2023-12-05T13:57:46.270Z · comments (65)

[link] Davidad's Provably Safe AI Architecture - ARIA's Programme Thesis
simeon_c (WayZ) · 2024-02-01T21:30:44.090Z · comments (17)

[link] Former OpenAI Superalignment Researcher: Superintelligence by 2030
Julian Bradshaw · 2024-06-05T03:35:19.251Z · comments (30)

How We Picture Bayesian Agents
johnswentworth · 2024-04-08T18:12:48.595Z · comments (14)

[link] The Inner Ring by C. S. Lewis
Saul Munn (saul-munn) · 2024-04-24T22:48:09.228Z · comments (6)

[link] Paper: Understanding and Controlling a Maze-Solving Policy Network
TurnTrout · 2023-10-13T01:38:09.147Z · comments (0)

Flagging Potentially Unfair Parenting
jefftk (jkaufman) · 2023-12-26T12:40:05.099Z · comments (1)

Recent comments

nathan-helm-burger on Reinforcement Learning from Information Bazaar Feedback, and other uses of information markets

Neat idea! I've been thinking about things along this line for years because of science fiction writers like Charles Stross, who wrote about the idea of having digital clones of yourself that you could spawn to act as assistants and do things like enter a simulation with a set of information to analyze, and return a report after being run faster-than-realtime. For example, to analyze potentially infohazardous information sent by an untrusted party.

Also, the idea that people might assign their digital clones to go on a date, and then agree to go on a date if both their clones came back with a positive recommendation.

Of course, literally using a digital clone of yourself requires being rather cavalier about destroying conscious beings after you are done using them as a tool. Seems like it makes a lot more sense to use a non-conscious tool-AI without emotions for this sort of purpose.

david-lorell on We Don't Know Our Own Values, but Reward Bridges The Is-Ought Gap

Not quite what we were trying to say in the post. Rather than tradeoffs being decided on reflection, we were trying to talk about the causal-inference-style "explaining away" which the reflection gives enough compute for. In Johannes's example, the idea is that the sadist might model the reward as coming potentially from two independent causes: a hardcoded sadist response, and "actually" valuing the pain caused. Since the probability of one cause, given the effect, goes down when we also know that the other cause definitely obtained, the sadist might lower their probability that they actually value hurting people given that (after reflection) they're quite sure they are hardcoded to get reward for it. That's how it's analogous to the ant thing.
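
To make the "explaining away" step concrete, here is a minimal sketch in Python. It is not from the post: the priors, the noisy-OR likelihood, and the names `P_HARDCODED`, `P_VALUES`, `p_reward`, and `posterior_values` are all assumptions picked for illustration. It only shows that, with two independent causes of one observed effect, learning that one cause definitely obtained lowers the posterior on the other.

```python
# Illustrative numbers only (assumptions, not taken from the post).
P_HARDCODED = 0.3  # prior: a hardcoded response produces the reward
P_VALUES = 0.3     # prior: the agent "actually" values the outcome


def p_reward(hardcoded: bool, values: bool) -> float:
    """Noisy-OR likelihood of observing reward given the two causes."""
    p_no_reward = 0.95  # with neither cause, reward is rare (5% leak)
    if hardcoded:
        p_no_reward *= 0.1  # each active cause fails to fire 10% of the time
    if values:
        p_no_reward *= 0.1
    return 1.0 - p_no_reward


def posterior_values(hardcoded_known=None) -> float:
    """P(values | reward observed), optionally also conditioning on hardcoded."""
    if hardcoded_known is None:
        hardcoded_options = [True, False]
    else:
        hardcoded_options = [hardcoded_known]
    weight = {True: 0.0, False: 0.0}  # unnormalized mass for values = True / False
    for hardcoded in hardcoded_options:
        for values in (True, False):
            prior = (P_HARDCODED if hardcoded else 1 - P_HARDCODED) * (
                P_VALUES if values else 1 - P_VALUES
            )
            weight[values] += prior * p_reward(hardcoded, values)
    return weight[True] / (weight[True] + weight[False])


reward_only = posterior_values()
reward_and_hardcoded = posterior_values(hardcoded_known=True)
print(f"P(values | reward)            = {reward_only:.3f}")
print(f"P(values | reward, hardcoded) = {reward_and_hardcoded:.3f}")
```

Under these assumed numbers the second probability comes out lower than the first, which is the pattern the comment describes. How large the drop is depends entirely on the chosen priors and likelihoods.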

nathan-helm-burger on RLHF is the worst possible thing done when facing the alignment problem

I just want to say that I am someone who is afraid that the world is currently in a very offense-dominant strategic position, but I don't think defense is pointless at all. I think it's quite tractable and should be heavily invested in! Let's get some d/acc going, people!

In fact, a lot of my hope for good outcomes for the future routes through Good Actors (probably also making a good profit) using powerful tool-AI to do defensive acceleration of R&D in a wide range of fields. Automated Science-for-Good, including automated alignment research. Getting there without the AI causing catastrophe in the meantime is a challenge, but not an intractable one.

tailcalled on tailcalled's Shortform

The universe has many conserved and approximately-conserved quantities, yet among them energy feels "special" to me. Some speculations why:

The sun bombards the earth with a steady stream of free energy, which then leaves again into the night.
Time-evolution is determined by a 90-degree rotation of energy (Schrödinger equation/Hamiltonian mechanics).
Breaking a system down into smaller components primarily requires energy.
While aspects of thermodynamics could apply to many conserved quantities, we usually apply it to energy only, and it was first discovered in the context of energy.

I guess the standard rationalist-empiricist-reductionist answer would be to say that this is all caused by the second point combined with some sort of space symmetry. I would have agreed until recently, but now it feels circular to me since the reduction into energy relies on our energy-centered way of perceiving the world. So instead I'm wondering if the first point is closer to the core.

dxu on We Don't Know Our Own Values, but Reward Bridges The Is-Ought Gap

These two kinds of “learning” are not synonymous. Adaptive systems “learn” things, but they don’t necessarily “learn about” things; they don’t necessarily have an internal map of the external territory. (Yes, the active inference folks will bullshit about how any adaptive system must have a map of the territory, but their math does not substantively support that interpretation.) The internal heuristics or behaviors “learned” by an adaptive system are not necessarily “about” any particular external thing, and don’t necessarily represent any particular external thing.

I think I am confused both about whether I think this is true, and about how to interpret it in such a way that it might be true. Could you go into more detail on what it means for a learner to learn something without there being some representational semantics that could be used to interpret what it's learned, even if the learner itself doesn't explicitly represent those semantics? Or is the lack of explicit representation actually the core substance of the claim here?

johannes-c-mayer on We Don't Know Our Own Values, but Reward Bridges The Is-Ought Gap

Yes exactly. The larva example illustrates that there are different kinds of values. I thought it was underexplored in the OP to characterize exactly what these different kinds of values are.

In the sadist example we have:

the hardcoded pleasure of hurting people.
And we have, let's assume, the wish to make other people happy.

These two things both seem like values. However, they seem to be qualitatively different kinds of values. I intuit that more precisely characterizing this difference is important. I have a bunch of thoughts on this that I failed to write up so far.

richard_kennaway on tailcalled's Shortform

I will look forward to that. I have read the LDSL posts, but I cannot say that I understand them, or guess what the connection might be with destiny and higher powers.

raemon on We Don't Know Our Own Values, but Reward Bridges The Is-Ought Gap

Okay, I think one crystallization here for me is that "explaining away" is a matter of degree. (I think I found the second half of the comment less helpful, but the combo of the first half + John's response is helpful both for my own updating, and seeing where you guys are currently at)

richard_kennaway on Secular interpretations of core perennialist claims

I prefer to see Reality as "nihil supernum" rather than Goodness. Reality does not speak. It promises me nothing. It owes me nothing. If it is not as I wish it to be, it is up to me, and me only, to act to make it more to my liking. There is no-one and nothing to magically make things right. Or wrong, for that matter.

This does not have the problem that the Goodness idea has, of how to justify calling it Good to people who are in very bad circumstances. Nor is there any question of dropping a letter and calling it God.

sharmake-farah on Alexander Gietelink Oldenziel's Shortform

We have been shown that this search algorithm works, and we have not yet been shown that the other approaches don't work.

Remember, technological development is disjunctive, and just because you've shown that 1 approach works, doesn't mean that we have been shown that only that approach works.

Of course, people will absolutely try to scale this one up now that they have found success, and I think that timelines have definitely been shortened, but remember that AI progress is closer to a disjunctive scenario than a conjunctive one:

I agree with this quote below, but I wanted to point out the disjunctiveness of AI progress:

As we spoke earlier - it was predictable that this was going to be the next step. It was likely it was going to work, but there was a hopeful world in which doing the obvious thing turned out to be harder. That hope has been dashed - it suggests longer horizons might be easy too. This means superintelligence within two years is not out of the question.

https://gwern.net/forking-path