LessWrong 2.0 Reader

[link] DeepMind: Frontier Safety Framework
Zach Stein-Perlman · 2024-05-17T17:30:02.504Z · comments (0)

On the Debate Between Jezos and Leahy
Zvi · 2024-02-06T14:40:05.487Z · comments (6)

[Intuitive self-models] 8. Rooting Out Free Will Intuitions
Steven Byrnes (steve2152) · 2024-11-04T18:16:26.736Z · comments (16)

On the Gladstone Report
Zvi · 2024-03-20T19:50:05.186Z · comments (11)

Complex systems research as a field (and its relevance to AI Alignment)
Nora_Ammann · 2023-12-01T22:10:25.801Z · comments (11)

How to Control an LLM's Behavior (why my P(DOOM) went down)
RogerDearnaley (roger-d-1) · 2023-11-28T19:56:49.679Z · comments (30)

Superposition is not "just" neuron polysemanticity
LawrenceC (LawChan) · 2024-04-26T23:22:06.066Z · comments (4)

Book Review: On the Edge: The Fundamentals
Zvi · 2024-09-23T13:40:11.058Z · comments (3)

AI Craftsmanship
abramdemski · 2024-11-11T22:17:01.112Z · comments (7)

Please do not use AI to write for you
Richard_Kennaway · 2024-08-21T09:53:34.425Z · comments (34)

Bayesian updating in real life is mostly about understanding your hypotheses
Max H (Maxc) · 2024-01-01T00:10:30.978Z · comments (4)

[link] Improving Dictionary Learning with Gated Sparse Autoencoders
Senthooran Rajamanoharan (SenR) · 2024-04-25T18:43:47.003Z · comments (38)

Self-Awareness: Taxonomy and eval suite proposal
Daniel Kokotajlo (daniel-kokotajlo) · 2024-02-17T01:47:01.802Z · comments (2)

[question] Is cybercrime really costing trillions per year?
Fabien Roger (Fabien) · 2024-09-27T08:44:07.621Z · answers+comments (28)

AiPhone
Zvi · 2024-06-12T22:20:02.141Z · comments (4)

[link] Moving on from community living
Vika · 2024-04-17T17:02:11.357Z · comments (7)

Perils of Generalizing from One's Social Group
localdeity · 2024-11-24T15:31:18.332Z · comments (1)

On Llama-3 and Dwarkesh Patel’s Podcast with Zuckerberg
Zvi · 2024-04-22T13:10:02.645Z · comments (4)

All About Concave and Convex Agents
mako yass (MakoYass) · 2024-03-24T21:37:17.922Z · comments (23)

Generalization, from thermodynamics to statistical physics
Jesse Hoogland (jhoogland) · 2023-11-30T21:28:50.089Z · comments (9)

Against most, but not all, AI risk analogies
Matthew Barnett (matthew-barnett) · 2024-01-14T03:36:16.267Z · comments (41)

[link] AI, centralization, and the One Ring
owencb · 2024-09-13T14:00:16.126Z · comments (11)

Another argument against maximizer-centric alignment paradigms
Fiora from Rosebloom · 2024-09-22T07:28:27.856Z · comments (39)

What mistakes has the AI safety movement made?
EuanMcLean (euanmclean) · 2024-05-23T11:19:02.717Z · comments (29)

AI research assistants competition 2024Q3: Tie between Elicit and You.com
Elizabeth (pktechgirl) · 2024-10-12T15:10:05.417Z · comments (4)

[link] A primer on why computational predictive toxicology is hard
Abhishaike Mahajan (abhishaike-mahajan) · 2024-08-19T17:16:37.735Z · comments (2)

[link] Superforecasting the Origins of the Covid-19 Pandemic
DanielFilan · 2024-03-12T19:01:15.914Z · comments (0)

[link] Slightly More Than You Wanted To Know: Pregnancy Length Effects
JustisMills · 2024-10-21T01:26:02.030Z · comments (4)

[link] Electrostatic Airships?
DaemonicSigil · 2024-10-27T04:32:34.852Z · comments (13)

AI #55: Keep Clauding Along
Zvi · 2024-03-14T15:40:09.335Z · comments (16)

E.T. Jaynes Probability Theory: The logic of Science I
Jan Christian Refsgaard (jan-christian-refsgaard) · 2023-12-27T23:47:52.579Z · comments (20)

RTFB: California’s AB 3211
Zvi · 2024-07-30T13:10:03.853Z · comments (2)

Black Box Biology
GeneSmith · 2023-11-29T02:27:29.794Z · comments (30)

[link] on bacteria, on teeth
bhauth · 2024-09-30T15:56:56.830Z · comments (9)

Do not delete your misaligned AGI.
mako yass (MakoYass) · 2024-03-24T21:37:07.724Z · comments (13)

Catastrophic Goodhart in RL with KL penalty
Thomas Kwa (thomas-kwa) · 2024-05-15T00:58:20.763Z · comments (10)

[Intuitive self-models] 6. Awakening / Enlightenment / PNSE
Steven Byrnes (steve2152) · 2024-10-22T13:23:08.836Z · comments (6)

A framework for thinking about AI power-seeking
Joe Carlsmith (joekc) · 2024-07-24T22:41:01.685Z · comments (15)

[link] Outrage Bonding
Jonathan Moregård (JonathanMoregard) · 2024-08-09T13:46:59.818Z · comments (12)

On coincidences and Bayesian reasoning, as applied to the origins of COVID-19
viking_math · 2024-02-19T01:14:06.772Z · comments (28)

What is a Tool?
johnswentworth · 2024-06-25T23:40:07.483Z · comments (4)

[link] Twitter thread on AI safety evals
Richard_Ngo (ricraz) · 2024-07-31T00:18:14.076Z · comments (3)

SAEs are highly dataset dependent: a case study on the refusal direction
Connor Kissane (ckkissane) · 2024-11-07T05:22:18.807Z · comments (4)

Don't sleep on Coordination Takeoffs
trevor (TrevorWiesinger) · 2024-01-27T19:55:26.831Z · comments (24)

[link] Ice: The Penultimate Frontier
Roko · 2024-07-13T23:44:56.827Z · comments (56)

[link] Pay-on-results personal growth: first success
Chipmonk · 2024-09-14T03:39:12.975Z · comments (5)

Book Review: On the Edge: The Future
Zvi · 2024-09-27T14:00:05.279Z · comments (1)

Managing risks while trying to do good
Wei Dai (Wei_Dai) · 2024-02-01T18:08:46.506Z · comments (26)

Balsa Update and General Thank You
Zvi · 2023-12-12T20:30:03.980Z · comments (8)

Why imperfect adversarial robustness doesn't doom AI control
Buck · 2024-11-18T16:05:06.763Z · comments (27)

Recent comments

vladimir_nesov on INTELLECT-1 Release: The First Globally Trained 10B Parameter Model

This is DiLoCo (Nov 2023 paper), a local SGD setup where the outer optimizer updates much more rarely (every 100-500 steps of the inner optimizers), asking for much less bandwidth (it uses Nesterov momentum in its state). The inner optimizers run within individual clusters, and the outer optimizer aggregates updates from individual clusters, using a much slower network that connects the clusters. The experiments were done with models of up to 400M parameters. (See also this paper on asynchronous variants of DiLoCo.)
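Below is a toy sketch of the DiLoCo-style loop described above, written in Python with NumPy. Several assumptions are not taken from the comment or from the paper's code. Each cluster gets a simple quadratic objective, and the inner optimizer is plain noisy SGD (the paper uses AdamW). The name `diloco` and the hyperparameter values are hypothetical choices for this toy. The outer step treats the averaged parameter change across clusters as a gradient and applies Nesterov momentum to it. Clusters exchange information only once every `inner_steps` steps.

```python
import numpy as np


def diloco(targets, outer_steps=20, inner_steps=100, inner_lr=0.05,
           outer_lr=0.7, momentum=0.9, seed=0):
    """Toy DiLoCo-style local SGD.

    Cluster k minimizes f_k(theta) = 0.5 * ||theta - targets[k]||^2, so the
    global optimum is the mean of the targets (an assumption for illustration).
    """
    rng = np.random.default_rng(seed)
    dim = targets.shape[1]
    theta = np.zeros(dim)       # global parameters shared by all clusters
    velocity = np.zeros(dim)    # outer optimizer state (Nesterov momentum buffer)
    for _ in range(outer_steps):
        local_params = []
        # Clusters are simulated one after another; none of them communicates here.
        for target in targets:
            local = theta.copy()
            for _ in range(inner_steps):
                # Noisy gradient stands in for a minibatch gradient.
                grad = (local - target) + 0.01 * rng.standard_normal(dim)
                local -= inner_lr * grad
            local_params.append(local)
        # The only communication: average how far the clusters moved.
        outer_grad = theta - np.mean(local_params, axis=0)
        # Nesterov momentum applied to the "outer gradient" (PyTorch-SGD style).
        velocity = momentum * velocity + outer_grad
        theta = theta - outer_lr * (outer_grad + momentum * velocity)
    return theta


targets = np.array([[1.0, -2.0], [3.0, 0.0], [2.0, 2.0], [-2.0, 4.0]])  # 4 hypothetical clusters
print(diloco(targets))  # close to the mean target, [1.0, 1.0]
```

Raising `inner_steps` reduces communication proportionally. The trade-off the comment discusses is how much extra compute this costs at a given perplexity, and a toy quadratic like this one cannot measure that.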

The original paper lacks good compute efficiency measurements. The distributed training experiments start from a checkpoint trained for 24K steps, continuing for 64K more steps (to a total of 88K) in various distributed configurations. Even for the non-distributed configuration the perplexity keeps increasing to step 29K (Figure 7b, Figure 9). The compute expended in a non-distributed run between steps 24K and 88K gets expended in an 8-cluster run between steps 24K and 32K, when perplexity barely starts going down from the global maximum. So there is no way of comparing how well an 8-cluster run uses its compute, because the non-distributed experiment stops so early (at 88K steps) that the uninformative poorly optimized early state of the model still dominates the distributed configuration that uses the same amount of compute (at 32K steps).
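The compute matching in the previous paragraph works out as follows. This assumes, as the comment implicitly does, that each cluster in the 8-cluster run does the same per-step compute as the single non-distributed run:

$$
\underbrace{(88\text{K} - 24\text{K})}_{\text{non-distributed steps}} \times 1 = 64\text{K} = 8 \times \underbrace{(32\text{K} - 24\text{K})}_{\text{8-cluster steps}}
$$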

Prime Intellect first reproduced DiLoCo in Jul 2024 (blog post, paper) on models of up to 1.1B parameters, taking training efficiency measurements. The largest experiment with a 1.1B model runs across 4 nodes that communicate only every 125 steps, and matches the perplexity of a similar training run within a single cluster (with communication every step) using 20% more compute (Figure 7, comparing with 4x batch size baseline).

The new 10B model lacks baselines for comparison, so it doesn't help with understanding how training efficiency depends on scale, but the results on benchmarks seem similar to those of other models with similar size and number of training tokens (Table 4 in the blog post).

gwern on China Hawks are Manufacturing an AI Arms Race

Completely missing the point. This is not about being too stupid to think of >0 strategies, this is about being able & willing to execute strategies.

I too can think of 100 things, and I listed several diverse ways of responding and threw in a historical parallel just in case that wasn't clear after several paragraphs of discussing the problem with not having a viable strategy you can execute. Smartness is not the limit here: we are already smart enough to come up with strategies which could achieve the goal. All of those could potentially work. But none of them seem realistically on the table, and you will note that few critics - and no one serious - are responding with something like, "no no, all part of the plan already: after we win, we would then order the AGIs to hack the planet and ensure our perpetual hegemony, that is indeed the exit plan. We botched it last time with nukes, but we'll follow through this time."

There is no difference between "won't execute a strategy" and "can't execute a strategy": they are the same thing. The point is that a strategy has to be executable or else it's not an actual strategy. And acting as if you can execute a strategy can lead you to take terrible decisions. Otherwise, you are like the cat who thinks before climbing a tree, "obviously, I will just climb back down", and who then proceeds to climb up, does not climb back down, and mews piteously. Well, maybe you shouldn't've climbed up in the first place...?

("arms race bros will srsly launch a global arms race by saying they'll use the decisive advantage from winning the arms race to conquer the world, and then will not conquer the world")

dagon on Is the mind a program?

Hmm, still not following, or maybe not agreeing. I think that "if the reasoning used to solve the problem is philosophical" then "correct solution" is not available. "useful", "consensus", or "applicable in current societal context" might be better evaluations of a philosophical reasoning.

sweenesm on Understanding Emergence in Large Language Models

Thanks for the post. I think it'd be helpful if you could add some links to references for some of the things you say, such as:

For instance, between 10^10 and 10^11 parameters, models showed dramatic improvements in their ability to interpret emoji sequences representing movies.

joseph-miller on Joseph Miller's Shortform

There are two types of people in this world.

There are people who treat the lock on a public bathroom as a tool for communicating occupancy and a safeguard against accidental attempts to enter when the room is unavailable. For these people the standard protocol is to discern the likely state of engagement of the inner room and then tentatively proceed inside if they detect no signs of human activity.

And there are people who view the lock on a public bathroom as a physical barricade with which to temporarily defend possessed territory. They start by giving the door a hearty push to test the tensile strength of the barrier. On meeting resistance they engage with full force, wringing the handle up and down and slamming into the door with their full body weight. Only once their attempts are thwarted do they reluctantly retreat to find another stall.

cbiddulph on You should consider applying to PhDs (soon!)

Thanks, this post made me seriously consider applying to a PhD, and I strong-upvoted. I had vaguely assumed that PhDs take way too long and don't allow enough access to compute compared to industry AI labs. But considering the long lead time required for the application process and the reminder that you can always take new opportunities as they come up, I now think applying is worth it.

However, looking into it, putting together a high-quality application starting now and finishing by the deadline seems approximately impossible? If the deadline were December 15, that would give you two weeks; other schools like Berkeley have even earlier deadlines. I asked ChatGPT how long it would take to apply to just a single school, and it said it would take 43–59 hours of time spent working, or ~4–6 weeks in real time. Claude said 37-55 hours/4-6 weeks.

Not to discourage anyone from starting their application now if they think they can do it - I guess if you're sufficiently productive and agentic and maybe take some amphetamines, you can do anything. But this seems like a pretty crazy timeline. Just the thought of asking someone to write me a recommendation letter in a two-week timeframe makes me feel bad.

Your post does make me think "if I were going to be applying to a PhD next December, what would I want to do now?" That seems pretty clarifying, and would probably be a helpful frame even if it turns out that a better opportunity comes along and I never apply to a PhD.

I think it'd be a good idea for you to repost this in August or early September of next year!

nicholas-heather-kross on Why and When Interpretability Work is Dangerous

Kinda, my current mainline-doom-case is "some AI gets controlled --> powerful people use it to prop themselves up --> world gets worse until AI gets uncontrollably bad --> doom". I would call it a different yet also-important doom case of "perpetual low-grade-AI dictatorship where the AI is controlled by humans in a surveillance state".

sunwillrise on gwern's Shortform

All of these ideas sound awesome and exciting, and precisely the right kind of use of LLMs that I would like to see on LW!

sunwillrise on A shot at the diamond-alignment problem

It's looking like the values of humans are far, far simpler than a lot of evopsych literature and Yudkowsky thought, and related to this, values are less fragile than people thought 15-20 years ago, in the sense that values generalize far better OOD than people used to think 15-20 years ago

I'm not sure I like this argument very much, as it currently stands. It's not that I believe anything you wrote in this paragraph is wrong per se, but more like this misses the mark a bit in terms of framing.

Yudkowsky had (and, AFAICT, still has) a specific theory of human values in terms of what they mean in a reductionist framework, where it makes sense (and is rather natural) to think of (approximate) utility functions of humans and of Coherent Extrapolated Volition as things-that-exist-in-the-territory.

I think a lot of writing and analysis, summarized by me here, has cast a tremendous amount of doubt on the viability of this way of thinking and has revealed what seem to me to be impossible-to-patch holes at the core of these theories. I do not believe "human values" in the Yudkowskian sense ultimately make sense as a coherent concept that carves reality at the joints; I instead observe a tremendous number of unanswered questions and apparent contradictions that throw the entire edifice into disarray.

But supplementing this reorientation of thinking around what it means to satisfy human values has been "prosaic" alignment researchers pivoting more towards intent alignment as opposed to doomed-from-the-start paradigms like "learning the true human utility function" or ambitious value learning, a recognition that realism about (AGI) rationality is likely just straight-up false and that the very specific conclusions MIRI-clustered alignment researchers have reached about what AGI cognition will be like are entirely overconfident and seem contradicted by our modern observations of LLMs, and ultimately an increased focus on the basic observation that full value alignment simply is not required for a good AI outcome (or at the very least to prevent AI takeover). So it's not so much that human values (to the extent such a thing makes sense) are simpler, but more so that fulfilling those values is just not needed to nearly as high a degree as people used to think.

nadroj on Mechanistically Eliciting Latent Behaviors in Language Models

Couldn't you do something like fit a Gaussian to the model's activations, then restrict the steered activations to be high likelihood (low Mahalanobis distance)? Or (almost) equivalently, you could just do a whitening transformation to activation space before you constrain the L2 distance of the perturbation.

(If a Gaussian isn't expressive enough you could model the manifold in some other way, e.g. with a VAE anomaly detector or a mixture of Gaussians or whatever)
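Here is a minimal NumPy sketch of the Gaussian/whitening suggestion above. It assumes the activations are available as an `(n, d)` array; synthetic data stands in for them here. The helper names `fit_gaussian`, `whitening_matrix`, `mahalanobis` and `constrain_perturbation` are hypothetical.

The sketch fits a Gaussian and whitens with W = Σ^(-1/2). It then rescales a steering perturbation δ so that ‖Wδ‖₂ ≤ r. Since ‖Wδ‖₂ equals the Mahalanobis length of δ, constraining the L2 norm after whitening is (almost) the same as constraining Mahalanobis distance. The two coincide exactly when the unsteered activation sits at the fitted mean.

```python
import numpy as np


def fit_gaussian(acts, eps=1e-5):
    """Fit a mean and a ridge-regularized covariance to activations of shape (n, d)."""
    mu = acts.mean(axis=0)
    cov = np.cov(acts, rowvar=False) + eps * np.eye(acts.shape[1])
    return mu, cov


def whitening_matrix(cov):
    """Symmetric (ZCA) whitening matrix W = cov^(-1/2)."""
    eigvals, eigvecs = np.linalg.eigh(cov)
    return eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T


def mahalanobis(x, mu, cov):
    """Mahalanobis distance of x from the fitted Gaussian."""
    diff = x - mu
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))


def constrain_perturbation(delta, W, radius):
    """Rescale delta so that its L2 norm in whitened space is at most `radius`."""
    z = W @ delta
    norm = np.linalg.norm(z)
    if norm > radius:
        z = z * (radius / norm)
    return np.linalg.solve(W, z)  # map back to the original activation space


# Synthetic, strongly anisotropic "activations" for illustration only.
rng = np.random.default_rng(0)
acts = rng.standard_normal((5000, 3)) @ np.array([[3.0, 0.0, 0.0],
                                                   [1.0, 1.0, 0.0],
                                                   [0.0, 0.0, 0.2]])
mu, cov = fit_gaussian(acts)
W = whitening_matrix(cov)
delta = constrain_perturbation(np.array([5.0, -3.0, 1.0]), W, radius=1.0)
print(mahalanobis(mu + delta, mu, cov))  # ≈ 1.0: delta now lies on the radius-1 ellipsoid
```

The small `eps` ridge keeps the covariance invertible, which matters when the number of samples is small relative to the activation dimension. A mixture of Gaussians would replace the single `mu`/`cov` with per-component parameters. In that case the constraint would no longer reduce to a single linear whitening.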