LessWrong 2.0 Reader

Definition of alignment science I like
quetzal_rainbow · 2025-01-06T20:40:38.187Z · comments (0)

Turning up the Heat on Deceptively-Misaligned AI
J Bostock (Jemist) · 2025-01-07T00:13:28.191Z · comments (16)

Balsa Research 2024 Update
Zvi · 2024-12-03T12:30:06.829Z · comments (0)

D/acc AI Security Salon
Allison Duettmann (allison-duettmann) · 2024-10-19T22:17:57.067Z · comments (0)

Write Good Enough Code, Quickly
Oliver Daniels (oliver-daniels-koch) · 2024-12-15T04:45:56.797Z · comments (10)

[link] Update on the Mysterious Trump Buyers on Polymarket
Annapurna (jorge-velez) · 2024-11-04T19:22:06.540Z · comments (9)

[link] Fragile, Robust, and Antifragile Preference Satisfaction
adamShimi · 2024-11-02T17:25:55.986Z · comments (0)

[link] Forecast 2025 With Vox's Future Perfect Team — $2,500 Prize Pool
ChristianWilliams · 2024-12-20T23:00:35.334Z · comments (0)

Open Thread Winter 2024/2025
habryka (habryka4) · 2024-12-25T21:02:41.760Z · comments (7)

Really radical empathy
MichaelStJules · 2025-01-06T17:46:31.269Z · comments (0)

Whistleblowing Twitter Bot
Mckiev · 2024-12-26T04:09:45.493Z · comments (5)

Review: “The Case Against Reality”
David Gross (David_Gross) · 2024-10-29T13:13:29.643Z · comments (9)

Higher and lower pleasures
Chris_Leong · 2024-12-05T13:13:46.526Z · comments (3)

Bridging the VLM and mech interp communities for multimodal interpretability
Sonia Joseph (redhat) · 2024-10-28T14:41:41.969Z · comments (5)

Economics Roundup #4
Zvi · 2024-10-15T13:20:06.923Z · comments (4)

Theoretical Alignment's Second Chance
lunatic_at_large · 2024-12-22T05:03:51.653Z · comments (0)

subfunctional overlaps in attentional selection history implies momentum for decision-trajectories
Emrik (Emrik North) · 2024-12-22T14:12:49.027Z · comments (1)

AGI with RL is Bad News for Safety
Nadav Brandes (nadav-brandes) · 2024-12-21T19:36:03.970Z · comments (22)

An exhaustive list of cosmic threats
Jordan Stone (jordan-stone) · 2025-01-09T19:59:08.368Z · comments (2)

[link] Chess As The Model Game
criticalpoints · 2024-11-17T19:45:26.499Z · comments (0)

[link] Why OpenAI’s Structure Must Evolve To Advance Our Mission
stuhlmueller · 2024-12-28T04:24:19.937Z · comments (1)

minifest
Austin Chen (austin-chen) · 2024-12-07T03:50:38.573Z · comments (1)

[link] To Be Born in a Bag
Niko_McCarty (niko-2) · 2024-10-06T17:21:00.605Z · comments (1)

Measuring Nonlinear Feature Interactions in Sparse Crosscoders [Project Proposal]
Jason Gross (jason-gross) · 2025-01-06T04:22:12.633Z · comments (0)

Proof Explained for "Robust Agents Learn Causal World Model"
Dalcy (Darcy) · 2024-12-22T15:06:16.880Z · comments (0)

[link] Can o1-preview find major mistakes amongst 59 NeurIPS '24 MLSB papers?
Abhishaike Mahajan (abhishaike-mahajan) · 2024-12-18T14:21:03.661Z · comments (0)

In the Name of All That Needs Saving
pleiotroth · 2024-11-07T15:26:12.252Z · comments (2)

Advisors for Smaller Major Donors?
jefftk (jkaufman) · 2024-11-06T14:30:06.187Z · comments (2)

[question] Does the "ancient wisdom" argument have any validity? If a particular teaching or tradition is old, to what extent does this make it more trustworthy?
SpectrumDT · 2024-11-04T15:20:14.822Z · answers+comments (49)

Economic Post-ASI Transition
[deleted] · 2025-01-01T22:37:31.722Z · comments (11)

[link] AI safety content you could create
Adam Jones (domdomegg) · 2025-01-06T15:35:56.167Z · comments (0)

[link] GPT-4o Guardrails Gone: Data Poisoning & Jailbreak-Tuning
ChengCheng (ccstan99) · 2024-11-01T00:10:50.718Z · comments (0)

[link] AI & wisdom 3: AI effects on amortised optimisation
L Rudolf L (LRudL) · 2024-10-28T21:08:56.604Z · comments (0)

[link] AI & wisdom 2: growth and amortised optimisation
L Rudolf L (LRudL) · 2024-10-28T21:07:39.449Z · comments (0)

Reality is Fractal-Shaped
silentbob · 2024-12-17T13:52:16.946Z · comments (1)

Latent Adversarial Training (LAT) Improves the Representation of Refusal
alexandraabbas · 2025-01-06T10:24:53.419Z · comments (5)

Monthly Roundup #25: December 2024
Zvi · 2024-12-23T14:20:04.682Z · comments (3)

Announcing the CLR Foundations Course and CLR S-Risk Seminars
JamesFaville (elephantiskon) · 2024-11-19T01:18:10.085Z · comments (0)

[link] Genesis
PeterMcCluskey · 2024-12-31T22:01:17.277Z · comments (0)

[link] AI & Liability Ideathon
Kabir Kumar (kabir-kumar) · 2024-11-26T13:54:01.820Z · comments (2)

[link] From the Archives: a story
Richard_Ngo (ricraz) · 2024-12-27T16:36:50.735Z · comments (1)

Word Spaghetti
Gordon Seidoh Worley (gworley) · 2024-10-23T05:39:20.105Z · comments (9)

2024 NYC Secular Solstice & Megameetup
Joe Rogero · 2024-11-12T17:46:18.674Z · comments (0)

Dmitry's Koan
Dmitry Vaintrob (dmitry-vaintrob) · 2025-01-10T04:27:30.346Z · comments (0)

[link] We are in a New Paradigm of AI Progress - OpenAI's o3 model makes huge gains on the toughest AI benchmarks in the world
garrison · 2024-12-22T21:45:52.026Z · comments (3)

OpenAI defected, but we can take honest actions
Remmelt (remmelt-ellen) · 2024-10-21T08:41:25.728Z · comments (16)

A Collection of Empirical Frames about Language Models
Daniel Tan (dtch1997) · 2025-01-02T02:49:05.965Z · comments (0)

Computational functionalism probably can't explain phenomenal consciousness
EuanMcLean (euanmclean) · 2024-12-10T17:11:28.044Z · comments (34)

[link] A primer on machine learning in cryo-electron microscopy (cryo-EM)
Abhishaike Mahajan (abhishaike-mahajan) · 2024-12-22T15:11:58.860Z · comments (0)

Is Text Watermarking a lost cause?
egor.timatkov · 2024-10-01T16:20:51.113Z · comments (13)

Recent comments

habryka4 on On Eating the Sun

As I mentioned in the other thread, it seems right to me that some people will want the sun to continue being the sun, but my sense is that within the set of people who don't want to leave the solar system, don't want to be uploads, don't want to be cryogenically shipped to other solar systems, or otherwise for some reason will have strong preferences over what happens with this specific solar system, this will be a much less important preference than using the sun for things that people care about more.

sharmake-farah on quila's Shortform

i'm not sure what this means. my values basically refer to other beings having not-tormentful (and next in order of priority, happy/good) existences. (tried to formalize this more but it's hard)

That would immediately exclude quite a few people, from both the far left and far right, because I predict a lot of people definitely want at least some people to have tormentful lives.

in particular, i'm not sure if you're saying something which would seem trivially true to me or not. (example trivially true thing: someone who wants to tile literally the entire lightcone with happy humans not being able to do that is losing out under 'cosmopolitan' values relative to if their values controlled the entire lightcone. example trivially true thing 2: "the best possible world is relative to a given value set")

I was trying to say something trivially true in your ontology, but far too many people tend to deny that you do in fact have to make other values lose out, and people usually think the best possible world is absolute, not relative, and in particular I think a lot of people use the idea of value-aligned superintelligence as though it was a magic wand that could solve all conflict.

maxwell-peterson on Drake Thomas's Shortform

Thanks!

sharmake-farah on quila's Shortform

One example of such a future: in 2028, OpenAI manages to scale up enough to make an AI that, while not as good as a human worker in general (at least without heavy inference costs), is good enough to act as a notable accelerant to AI research. By 2030-2031, AI research has been more or less automated away by OpenAI, with competitors having such systems by 2031-2032, meaning AI progress becomes notably faster. By 2033, we are on the brink of AI that can do a lot of job work, but the best models at this point are instead reinvested in AI R&D, such that by 2035 superhuman AI is broadly achieved, and this is when the economy starts getting seriously disrupted.

The key features of this future are that the superhuman-equals-optimal assumption is false, intent alignment works well enough that AI generally takes instructions from specific humans, and it's easy for others to get their own superintelligences with different values, such that conflict doesn't go away.

nathan-helm-burger on Human takeover might be worse than AI takeover

Yeah, I definitely don't think we could trust a continually learning or self-improving AI to stay trustworthy over a long period of time.

Indeed, the ability to appoint a static mind to a particular role is a big plus. It wouldn't be vulnerable to corruption by power dynamics.

Maybe we don't need a genius-level AI, maybe just a reasonably smart and very well aligned AI would be good enough. If the governance system was able to prevent superintelligent AI from ever being created (during the pre-agreed upon timeframe for pause), then we could manage a steady-state world peace.

benito on On Eating the Sun

It is good to have deontological commitments about what you would do with a lot of power. But this situation is very different from "a lot of power", it's also "if you were to become wiser and more knowledgeable than anyone in history so far". One can imagine the Christians of old asking for a commitment that "If you get this new scientific and industrial civilization that you want in 2,000 years from now, will you commit to following the teachings of Jesus?" and along the way I sadly find out that even though it seemed like a good and moral commitment at the time, it totally screwed my ability to behave morally in the future because Christianity is necessarily predicated on tons of falsehoods and many of its teachings are immoral.

But there is some version of this commitment I think is good to make... something like "Insofar as the players involved are all biological humans, I will respect the legal structures that exist and the existence of countries, and will not relate to them in ways that would be considered worthy of starting a war in its defense". But I'm not certain about this, for instance what if most countries in the world build 10^10 digital minds and are essentially torturing them? I may well wish to overthrow a country that is primarily torture with a small number of biological humans sitting on thrones on top of these people, and I am not willing to commit not to do that presently.

I understand that there are bad ethical things one can do with post-singularity power, but I do not currently see a clear way to commit to certain ethical behaviors that will survive contact with massive increases in knowledge and wisdom. I am interested if anyone has made other commitments about post-singularity life (or "on the cusp of singularity life") that they expect to survive contact with reality?

Added: At the very least I can say that I am not going to make commitments to do specific things that violate my current ethics. I have certainly made no positive commitment to violate people's bodily autonomy nor have such an intention.

dmitry-vaintrob on Dmitry Vaintrob's Shortform

Why you should try degrading NN behavior in experiments.

I got some feedback on the post I wrote yesterday that seems right. The post is trying to do too many things, and not properly explaining what it is doing, why this is reasonable, and how the different parts are related.

I want to try to fix this, since I think the main piece of advice in this post is important, but gets lost in all the mess.

This main point is:

experimentalists should in many cases run an experiment on multiple neural nets with a variable complexity dial that allows some "natural" degradations of the NN's performance, and certain dials are better than others depending on context.

I am eventually planning to split the post into a few parts, one of which explains this more carefully. When I do this I will replace the current version of the post with just a discussion of the "koan" itself: i.e., nitpicks about work that isn't careful about the scale at which it is performing interpretability.

For now I want to give a quick reductive take on what I hope to be the main takeaway of this discussion. Namely, why I think "interpretability on degraded networks" is important for better interpretability.

Basically: when ML experiments modify a neural net to identify or induce a particular behavior, this always degrades performance. Now there are two hypotheses for what is going on:

You are messily pulling your NN in the direction of a particular behavior, and confusing this spurious messy phenomenon with finding a "genuine" phenomenon from the program's point of view.
You are messily pulling your NN in the direction of a particular behavior, but also singling out a few "real" internal circuits of the NN that are carrying out this behavior.

Because of how many parameters you have to play with and the polysemanticity of everything in a NN, it's genuinely hard to tell these two behaviors apart. You might find stuff that "looks" like a core circuit, but actually is just bits of other circuits combined together, which your circuit-fitting experiment makes look like a coherent behavior, and any nice properties of the resulting behavior that make it seem like an "authentic" circuit are just artefacts of the way you set up the experiment.

Now the idea behind running this experiment at "natural" degradations of network performance is to try to separate out these two possibilities more cleanly. Namely, an ideal outcome is that in running your experiment on some class of natural degradation of your neural net, you find a regime such that

the intervention you are running no longer significantly affects the (naturally degraded) performance
the observed effect still takes place.

Then what you've done is effectively "cleaned up" your experiment such that you are still probably finding interpretable behaviors in the original neural net (since a good degradation is likely to contain a subset of circuits/behaviors of your original net and not many "new" behaviors), in a way that sufficiently reduces the complexity that the behavior you're seeking is no longer "entangled" with a bunch of other behaviors; this should significantly update you that the behavior is indeed "natural" and not spurious.

This is of course a very small, idealized sketch. But the basic idea behind looking at neural nets with degraded performance is to "squeeze" the complexity in a controlled way to suitably match the complexity of the circuit (and how it's embedded in the rest of the network/how it interacts with other circuits). If you then have a circuit of "the correct complexity" that explains a behavior, there is in some sense no "complexity room" for other sneaky phenomena to confound it.

In the post, the natural degradation I suggested is the physics-inspired "SGLD sampling" process, which in some sense tries to add a maximal amount of noise to your NN while only having a limited impact on performance (measured by loss); this has a bias of keeping "generally useful" circuits and interactions and noising more inessential/memorize-y circuits. Other interventions that have different properties are "just adding random noise" (either to weights or activations) to suitably reduce performance, or looking at earlier training checkpoints. I suspect that different degradations (or combinations thereof) are appropriate to isolate the relevant complexity of different experiments.
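To make the "complexity dial" idea concrete, here is a minimal sketch of the simplest degradation mentioned above, "just adding random noise" to the weights, swept over several noise scales. Everything in it is an assumption made for illustration: the toy linear model, the names `mse_loss`, `degrade` and `sweep`, and the choice of noise scales are hypothetical and stand in for a real network and a real loss. It is not the SGLD procedure from the post.

```python
import random

def mse_loss(weights, data):
    """Mean squared error of a toy linear model y = w . x on (x, y) pairs."""
    total = 0.0
    for x, y in data:
        pred = sum(w * xi for w, xi in zip(weights, x))
        total += (pred - y) ** 2
    return total / len(data)

def degrade(weights, noise_scale, rng):
    """Return a copy of the weights with i.i.d. Gaussian noise added.

    noise_scale is the hypothetical "dial": 0.0 leaves the model intact.
    """
    return [w + rng.gauss(0.0, noise_scale) for w in weights]

def sweep(weights, data, noise_scales, seed=0):
    """Loss of the degraded model at each setting of the noise dial."""
    rng = random.Random(seed)
    return [(s, mse_loss(degrade(weights, s, rng), data)) for s in noise_scales]

# Toy data generated from an assumed "trained" weight vector.
data_rng = random.Random(1)
true_w = [2.0, -1.0, 0.5]
data = []
for _ in range(200):
    x = [data_rng.uniform(-1, 1) for _ in range(3)]
    data.append((x, sum(w * xi for w, xi in zip(true_w, x))))

results = sweep(true_w, data, [0.0, 0.1, 0.5, 1.0])
for scale, loss in results:
    print(f"noise={scale:.1f}  loss={loss:.4f}")
```

In an actual experiment one would run the interpretability intervention at each dial setting and look for a regime where the intervention no longer changes the (already degraded) loss much while the observed effect persists; the sweep here only shows the degradation half of that loop.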

benito on On Eating the Sun

Analogously: "I am claiming that people when informed will want horses to continue being the primary mode of transportation. I also think that most people when informed will not really care that much about economic growth, will continue to believe that you're more responsible for changing things than for maintaining the status quo, etc. And that this is a coherent view that will add up to a large set of people wanting things in cities to remain conservatively the same. I separately claim that if this is true, then other people should just respect this preference, and go find new continents / planets on which to build cars that people in the cities don't care about."

Sometimes it's good to be conservative when you're changing things, like if you're changing lots of social norms or social institutions, but I don't get it at all in this case. The sun is not a complicated social institution, it's primarily a source of heat and light and much of what we need can be easily replicated especially when you have nanobots. I am much more likely to grant that we should be slow to change things like democracy and the legal system than I am that we should be slow to change exactly how and where we get heat and light. Would you have wanted conservatism around moving from candles to lightbulbs? Installing heaters and cookers in the house instead of fire pits? I don't think so.

mikbp on Is Musk still net-positive for humanity?

Sure, but one can assess it at any point. I'm not asking about whether he will end up being net-positive or net-negative overall in the long run.

mikbp on Is Musk still net-positive for humanity?

I'd agree. But he certainly does not seem to even be trying anymore to have a positive impact on solving alignment, no?