LessWrong 2.0 Reader

View: New · Old · Top

Restrict date range: Today · This week · This month · Last three months · This year · All time

← previous page (newer posts) · next page (older posts) →

[link] An Opinionated Evals Reading List
Marius Hobbhahn (marius-hobbhahn) · 2024-10-15T14:38:58.778Z · comments (0)

2. Corrigibility Intuition
Max Harms (max-harms) · 2024-06-08T15:52:29.971Z · comments (10)

[link] Static Analysis As A Lifestyle
adamShimi · 2024-07-03T18:29:37.384Z · comments (11)

How a chip is designed
YM (Yannick_Muehlhaeuser_duplicate0.05902100825326273) · 2024-06-28T08:04:27.392Z · comments (4)

Occupational Licensing Roundup #1
Zvi · 2024-10-30T11:00:04.516Z · comments (11)

[link] The Perceptron Controversy
Yuxi_Liu · 2024-01-10T23:07:23.341Z · comments (18)

Schelling game evaluations for AI control
Olli Järviniemi (jarviniemi) · 2024-10-08T12:01:24.389Z · comments (5)

Interpreting and Steering Features in Images
Gytis Daujotas (gytis-daujotas) · 2024-06-20T18:33:59.512Z · comments (6)

Do Not Mess With Scarlett Johansson
Zvi · 2024-05-22T15:10:03.215Z · comments (7)

[link] Drexler's Nanotech Software
PeterMcCluskey · 2024-12-02T04:55:20.432Z · comments (9)

Social status part 2/2: everything else
Steven Byrnes (steve2152) · 2024-03-05T16:29:19.072Z · comments (2)

SAEs are highly dataset dependent: a case study on the refusal direction
Connor Kissane (ckkissane) · 2024-11-07T05:22:18.807Z · comments (4)

Book Review: On the Edge: The Fundamentals
Zvi · 2024-09-23T13:40:11.058Z · comments (3)

On the Debate Between Jezos and Leahy
Zvi · 2024-02-06T14:40:05.487Z · comments (6)

AI Craftsmanship
abramdemski · 2024-11-11T22:17:01.112Z · comments (7)

Retrospective: PIBBSS Fellowship 2024
DusanDNesic · 2024-12-20T15:55:24.194Z · comments (1)

A Qualitative Case for LTFF: Filling Critical Ecosystem Gaps
Linch · 2024-12-03T21:57:23.597Z · comments (2)

Another argument against maximizer-centric alignment paradigms
Fiora from Rosebloom · 2024-09-22T07:28:27.856Z · comments (39)

[link] Learn to write well BEFORE you have something worth saying
eukaryote · 2024-12-29T23:42:31.906Z · comments (18)

On the Gladstone Report
Zvi · 2024-03-20T19:50:05.186Z · comments (11)

[link] AI, centralization, and the One Ring
owencb · 2024-09-13T14:00:16.126Z · comments (11)

[Intuitive self-models] 8. Rooting Out Free Will Intuitions
Steven Byrnes (steve2152) · 2024-11-04T18:16:26.736Z · comments (16)

[question] What's with all the bans recently?
[deleted] · 2024-04-04T06:16:49.062Z · answers+comments (83)

[link] DeepMind: Frontier Safety Framework
Zach Stein-Perlman · 2024-05-17T17:30:02.504Z · comments (0)

AI research assistants competition 2024Q3: Tie between Elicit and You.com
Elizabeth (pktechgirl) · 2024-10-12T15:10:05.417Z · comments (4)

[link] Moving on from community living
Vika · 2024-04-17T17:02:11.357Z · comments (7)

All About Concave and Convex Agents
mako yass (MakoYass) · 2024-03-24T21:37:17.922Z · comments (23)

Intricacies of Feature Geometry in Large Language Models
7vik (satvik-golechha) · 2024-12-07T18:10:51.375Z · comments (0)

[link] Outrage Bonding
Jonathan Moregård (JonathanMoregard) · 2024-08-09T13:46:59.818Z · comments (12)

[link] RL, but don't do anything I wouldn't do
Gunnar_Zarncke · 2024-12-07T22:54:50.714Z · comments (5)

Against most, but not all, AI risk analogies
Matthew Barnett (matthew-barnett) · 2024-01-14T03:36:16.267Z · comments (41)

[question] Is cybercrime really costing trillions per year?
Fabien Roger (Fabien) · 2024-09-27T08:44:07.621Z · answers+comments (28)

What mistakes has the AI safety movement made?
EuanMcLean (euanmclean) · 2024-05-23T11:19:02.717Z · comments (29)

On Llama-3 and Dwarkesh Patel’s Podcast with Zuckerberg
Zvi · 2024-04-22T13:10:02.645Z · comments (4)

Neuroscience of human social instincts: a sketch
Steven Byrnes (steve2152) · 2024-11-22T16:16:52.552Z · comments (0)

[link] Pay-on-results personal growth: first success
Chipmonk · 2024-09-14T03:39:12.975Z · comments (8)

[link] Superforecasting the Origins of the Covid-19 Pandemic
DanielFilan · 2024-03-12T19:01:15.914Z · comments (0)

AiPhone
Zvi · 2024-06-12T22:20:02.141Z · comments (4)

[link] A primer on why computational predictive toxicology is hard
Abhishaike Mahajan (abhishaike-mahajan) · 2024-08-19T17:16:37.735Z · comments (2)

Self-Awareness: Taxonomy and eval suite proposal
Daniel Kokotajlo (daniel-kokotajlo) · 2024-02-17T01:47:01.802Z · comments (2)

[link] Improving Dictionary Learning with Gated Sparse Autoencoders
Senthooran Rajamanoharan (SenR) · 2024-04-25T18:43:47.003Z · comments (38)

Bayesian updating in real life is mostly about understanding your hypotheses
Max H (Maxc) · 2024-01-01T00:10:30.978Z · comments (4)

Perils of Generalizing from One's Social Group
localdeity · 2024-11-24T15:31:18.332Z · comments (1)

[link] Electrostatic Airships?
DaemonicSigil · 2024-10-27T04:32:34.852Z · comments (13)

Managing catastrophic misuse without robust AIs
ryan_greenblatt · 2024-01-16T17:27:31.112Z · comments (17)

RTFB: California’s AB 3211
Zvi · 2024-07-30T13:10:03.853Z · comments (2)

Do not delete your misaligned AGI.
mako yass (MakoYass) · 2024-03-24T21:37:07.724Z · comments (13)

[link] Twitter thread on AI safety evals
Richard_Ngo (ricraz) · 2024-07-31T00:18:14.076Z · comments (3)

[link] Ice: The Penultimate Frontier
Roko · 2024-07-13T23:44:56.827Z · comments (56)

[link] Slightly More Than You Wanted To Know: Pregnancy Length Effects
JustisMills · 2024-10-21T01:26:02.030Z · comments (4)

← previous page (newer posts) · next page (older posts) →

Archive

Recent comments

quila on quila's Shortform

The key features here in this future is that the superhuman equals optimal assumption is false [...]

oh, well to clarify then, i was trying to say that i didn't mean 'superhuman' at all, i directly meant optimal. i don't believe that superhuman = optimal.

tailcalled on Is Musk still net-positive for humanity?

What does it mean to assess it at any point, as distinct from in the long run? And was he really ever good for humanity if assessed through your one-point method? (E.g. climate impacts seems intrinsically a long-run thing...)

david-matolcsi on On Eating the Sun

Fair, I also haven't made any specific commitments, I phrased it wrongly. I agree there can be extreme scenarios with trillions of digital minds tortured where you'd maybe want to declare war on the. rest of society. But I would still like people to write down that "of course, I wouldn't want to destroy Earth before we can save all the people who want to live in their biological bodies, just to get a few years of acceleration in the cosmic conquest". I feel a sentence like this should really have been included in the original post about dismantling the Sun, and until people are not willing to write this down, I remain paranoid that they would in fact haul the Amish the extermination camps if it feels like a good idea at the time. (As I said, I met people who really held this position.)

habryka4 on On Eating the Sun

As I mentioned in the other thread, it seems right to me that some people will want the sun to continue being the sun, but my sense is that within the set of people who don't want to leave the solar system, don't want to be uploads, don't want to be cryogenically shipped to other solar systems, or otherwise for some reason will have strong preferences over what happens with this specific solar system, this will be a much less important preference than using the sun for things that people care about more.

sharmake-farah on quila's Shortform

i'm not sure what this means. my values basically refer to other beings having not-tormentful (and next in order of priority, happy/good) existences. (tried to formalize this more but it's hard)

That would immediately exclude quite a bit of people, from both the far left and far right, because I predict a lot of people definitely want at least some people to have tormentful lives.

in particular, i'm not sure if you're saying something which would seem trivially true to me or not. (example trivially true thing: someone who wants to tile literally the entire lightcone with happy humans not being able to do that is losing out under 'cosmopolitan' values relative to if their values controlled the entire lightcone. example trivially true thing 2: "the best possible world is relative to a given value set")

I was trying to say something trivially true in your ontology, but far too many people tend to deny that you do in fact have to make other values lose out, and people usually think the best possible world is absolute, not relative, and in particular I think a lot of people use the idea of value-aligned superintelligence as though it was a magic wand that could solve all conflict.

maxwell-peterson on Drake Thomas's Shortform

Thanks!

sharmake-farah on quila's Shortform

One example of such a future is a case where in 2028, OpenAI managed to scale up enough to make an AI that while not as good as a human worker in general (at least without heavy inference costs), it is good enough to act as a notable accelerant to AI research, such that by 2030-2031, AI research has been more or less automated away by Open AI, with competitors having such systems by 2031-2032, meaning AI progress becomes notably faster such that by 2033, we are on the brink of AI that can do a lot of job work, but the best models at this point are instead reinvested in AI R&D such that by 2035, superhuman AI is broadly achieved, and this is when the economy starts getting seriously disrupted.

The key features here in this future is that the superhuman equals optimal assumption is false, intent alignment works well enough that AI generally takes instructions from specific humans, and it's easy for others to get their own superintelligences with different values, such that conflict doesn't go away.

nathan-helm-burger on Human takeover might be worse than AI takeover

Yeah, I definitely don't think we could trust a continually learning or self-improving AI to stay trustworthy over a long period of time.

Indeed, the ability to appoint a static mind to a particular role is a big plus. It wouldn't be vulnerable to corruption by power dynamics.

Maybe we don't need a genius-level AI, maybe just a reasonably smart and very well aligned AI would be good enough. If the governance system was able to prevent superintelligent AI from ever being created (during the pre-agreed upon timeframe for pause), then we could manage a steady-state world peace.

benito on On Eating the Sun

It is good to have deontological commitments about what you would do with a lot of power. But this situation is very different from "a lot of power", it's also "if you were to become wiser and more knowledgeable than anyone in history so far". One can imagine the Christians of old asking for a commitment that "If you get this new scientific and industrial civilization that you want in 2,000 years from now, will you commit to following the teachings of Jesus?" and along the way I sadly find out that even though it seemed like a good and moral commitment at the time, it totally screwed my ability to behave morally in the future because Christianity is necessarily predicated on tons of falsehoods and many of its teachings are immoral.

But there is some version of this commitment I think is good to make... something like "Insofar as the players involved are all biological humans, I will respect the legal structures that exist and the existence of countries, and will not relate to them in ways that would be considered worthy of starting a war in its defense". But I'm not certain about this, for instance what if most countries in the world build 10^10 digital minds and are essentially torturing them? I may well wish to overthrow a country that is primarily torture with a small number of biological humans sitting on thrones on top of these people, and I am not willing to commit not to do that presently.

I understand that there are bad ethical things one can do with post-singularity power, but I do not currently see a clear way to commit to certain ethical behaviors that will survive contact with massive increases in knowledge and wisdom. I am interested if anyone has made other commitments about post-singularity life (or "on the cusp of singularity life") that they expect to survive contact with reality?

Added: At the very least I can say that I am not going to make commitments to do specific things that violate my current ethics. I have certainly made no positive commitment to violate people's bodily autonomy nor have such an intention.

dmitry-vaintrob on Dmitry Vaintrob's Shortform

Why you should try degrading NN behavior in experiments.

I got some feedback on the post I wrote yesterday [LW · GW] that seems right. The post is trying to do too many things, and not properly explaining what it is doing, why this is reasonable, and how the different parts are related.

I want to try to fix this, since I think the main piece of advice in this post is important, but gets lost in all the mess.

This main point is:

experimentalists should in many cases run an experiment on multiple neural nets with a variable complexity dial that allows some "natural" degradations of the NN's performance, and certain dials are better than others depending on context.

I am eventually planning splitting out the post into a few parts, one of which explains this more carefully. When I do this I will replace the current version of the post with just a discussion of the "koan" itself: i.e., nitpicks about work that isn't careful about thinking about the scale at which it is performing interpretability.

For now I want to give a quick reductive take on what I hope to be the main takeaway of this discussion. Namely, why I think "interpretability on degraded networks" is important for better interpretability.

Basically: when ML experiments modify a neural net to identify or induce a particular behavior, this always degrades performance. Now there are two hypotheses for what is going on:

You are messily pulling your NN in the direction of a particular behavior, and confusing this spurious messy phenomenon with finding a "genuine" phenomenon from the program's point of view.
You are messily pulling your NN in the direction of a particular behavior, but also singling out a few "real" internal circuits of the NN that are carrying out this behavior.

Because of how many parameters you have to play with and the polysemanticity of everything in a NN, it's genuinely hard to tell these two behaviors apart. You might find stuff that "looks" like a core circuit, but actually is just bits of other circuits combined together, and your circuit-fitting experiment makes look like a coherent behavior, and any nice properties of the resulting behavior that make it seem like an "authentic" circuit are just artefacts of the way you set up the experiment.

Now the idea behind running this experiment at "natural" degradations of network performance is to try to separate out these two possibilities more cleanly. Namely, an ideal outcome is that in running your experiment on some class of natural degradation of your neural net, you find a regime such that

the intervention you are running no longer significantly affects the (naturally degraded) performance
the observed effect still takes place.

Then what you've done is effectively "cleaned up" your experiment such that you are still probably finding interpretable behaviors in the original neural net (since a good degradation is likely to contain a subset of circuits/behaviors of your original net and not many "new behaviors), in a way that sufficiently reduced the complexity that the behavior you're seeking is no longer "entangled" with a bunch of other behaviors; this should significantly update you that the behavior is indeed "natural" and not spurious.

This is of course a very small, idealized sketch. But the basic idea behind looking at neural nets with degraded performance is to "squeeze" the complexity in a controlled way to suitably match the complexity of the circuit (and how it's embedded in the rest of the network/how it interacts with other circuits). If you then have a circuit of "the correct complexity" that explains a behavior, there is in some sense no "complexity room" for other sneaky phenomena to confound it.

In the post, the natural degradation I suggested is the physics-inspired "SLGD sampling" process which in some sense tries to add a maximal amount of noise to your NN while only having a limited impact on performance (measured by loss); this has a bias of keeping "generally useful" circuits and interactions and noising more inessential/ memorize-y circuits. Other interventions that have different properties are "just adding random noise" (either to weights or activations) to suitable reduce performance, or looking at earlier training checkpoints. I suspect that different degradations (or combinations thereof) are appropriate to isolate the relevant complexity of different experiments.