LessWrong 2.0 Reader


[Intuitive self-models] 8. Rooting Out Free Will Intuitions
Steven Byrnes (steve2152) · 2024-11-04T18:16:26.736Z · comments (16)

AI Craftsmanship
abramdemski · 2024-11-11T22:17:01.112Z · comments (7)

Perils of Generalizing from One's Social Group
localdeity · 2024-11-24T15:31:18.332Z · comments (1)

SAEs are highly dataset dependent: a case study on the refusal direction
Connor Kissane (ckkissane) · 2024-11-07T05:22:18.807Z · comments (4)

[link] New o1-like model (QwQ) beats Claude 3.5 Sonnet with only 32B parameters
Jesse Hoogland (jhoogland) · 2024-11-27T22:06:12.914Z · comments (3)

[link] electric turbofans
bhauth · 2024-11-02T22:50:59.807Z · comments (2)

Why imperfect adversarial robustness doesn't doom AI control
Buck · 2024-11-18T16:05:06.763Z · comments (27)

Toward Safety Cases For AI Scheming
Mikita Balesni (mykyta-baliesnyi) · 2024-10-31T17:20:06.019Z · comments (1)

Why our politicians aren't Median
Yair Halberstadt (yair-halberstadt) · 2024-11-03T14:03:33.779Z · comments (15)

[link] "Map of AI Futures" - An interactive flowchart
swante · 2024-11-27T21:31:40.269Z · comments (3)

Training AI agents to solve hard problems could lead to Scheming
Marius Hobbhahn (marius-hobbhahn) · 2024-11-19T00:10:55.522Z · comments (12)

Seeking Collaborators
abramdemski · 2024-11-01T17:13:36.162Z · comments (14)

[question] Could orcas be (trained to be) smarter than humans? 
Towards_Keeperhood (Simon Skade) · 2024-11-04T23:29:26.677Z · answers+comments (20)

U.S.-China Economic and Security Review Commission pushes Manhattan Project-style AI initiative
Phib · 2024-11-19T18:42:43.296Z · comments (7)

[link] The Evals Gap
Marius Hobbhahn (marius-hobbhahn) · 2024-11-11T16:42:46.287Z · comments (7)

Toward Safety Case Inspired Basic Research
Lucas Teixeira · 2024-10-31T23:06:32.854Z · comments (2)

Neuroscience of human social instincts: a sketch
Steven Byrnes (steve2152) · 2024-11-22T16:16:52.552Z · comments (0)

Win/continue/lose scenarios and execute/replace/audit protocols
Buck · 2024-11-15T15:47:24.868Z · comments (2)

A Qualitative Case for LTFF: Filling Critical Ecosystem Gaps
Linch · 2024-11-18T00:44:57.133Z · comments (2)

You should consider applying to PhDs (soon!)
bilalchughtai (beelal) · 2024-11-29T20:33:12.462Z · comments (2)

[link] a space habitat design
bhauth · 2024-11-25T17:28:48.481Z · comments (12)

Metastatic Cancer Treatment Since 2010: The Success Stories
sarahconstantin · 2024-11-04T22:50:09.386Z · comments (2)

A Conflicted Linkspost
Screwtape · 2024-11-21T00:37:54.035Z · comments (0)

An alternative approach to superbabies
Towards_Keeperhood (Simon Skade) · 2024-11-05T22:56:15.740Z · comments (19)

[link] Active Recall and Spaced Repetition are Different Things
Saul Munn (saul-munn) · 2024-11-08T20:14:56.092Z · comments (2)

Dave Kasten's AGI-by-2027 vignette
davekasten · 2024-11-26T23:20:47.212Z · comments (8)

On Targeted Manipulation and Deception when Optimizing LLMs for User Feedback
Marcus Williams · 2024-11-07T15:39:06.854Z · comments (6)

Secular Solstice Round Up 2024
dspeyer · 2024-11-21T10:49:36.682Z · comments (13)

AI #91: Deep Thinking
Zvi · 2024-11-21T14:30:06.930Z · comments (9)

Which evals resources would be good?
Marius Hobbhahn (marius-hobbhahn) · 2024-11-16T14:24:48.012Z · comments (4)

Looking back on the Future of Humanity Institute - Asterisk
jakeeaton · 2024-11-19T00:44:40.928Z · comments (0)

[link] What Ketamine Therapy Is Like
Sable · 2024-11-11T11:09:08.602Z · comments (8)

[link] Epistemic status: poetry (and other poems)
Richard_Ngo (ricraz) · 2024-11-21T18:13:17.194Z · comments (5)

Live Machinery: An Interface Design Philosophy for Wholesome AI Futures
Sahil · 2024-11-01T17:24:09.957Z · comments (2)

The Shallow Bench
Karl Faulks (karl-faulks) · 2024-11-05T05:07:27.357Z · comments (5)

AI #88: Thanks for the Memos
Zvi · 2024-10-31T15:00:07.412Z · comments (5)

[link] Analyzing how SAE features evolve across a forward pass
bensenberner · 2024-11-07T22:07:02.827Z · comments (0)

[link] Dangerous capability tests should be harder
LucaRighetti (Error404Dinosaur) · 2024-11-21T17:20:50.610Z · comments (3)

[link] The Choice Transition
owencb · 2024-11-18T12:30:56.198Z · comments (4)

[link] Literacy Rates Haven't Fallen By 20% Since the Department of Education Was Created
Maxwell Tabarrok (maxwell-tabarrok) · 2024-11-22T20:53:59.007Z · comments (0)

Reading RFK Jr so that you don’t have to
braces · 2024-11-22T00:59:19.583Z · comments (0)

Estimates of GPU or equivalent resources of large AI players for 2024/5
CharlesD · 2024-11-28T23:01:58.522Z · comments (1)

Monthly Roundup #24: November 2024
Zvi · 2024-11-18T13:20:06.086Z · comments (14)

AI #89: Trump Card
Zvi · 2024-11-07T16:30:05.684Z · comments (12)

[link] Intrinsic Power-Seeking: AI Might Seek Power for Power’s Sake
TurnTrout · 2024-11-19T18:36:20.721Z · comments (5)

Signaling with Small Orange Diamonds
jefftk (jkaufman) · 2024-11-07T20:20:08.026Z · comments (1)

[link] FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI
Tamay · 2024-11-14T06:13:22.042Z · comments (0)

How to use bright light to improve your life.
Nat Martin (nat-martin) · 2024-11-18T19:32:10.667Z · comments (10)

[link] College technical AI safety hackathon retrospective - Georgia Tech
yix (Yixiong Hao) · 2024-11-15T00:22:53.159Z · comments (2)

Drug development costs can range over two orders of magnitude
rossry · 2024-11-03T23:13:17.685Z · comments (0)


Recent comments

vladimir_nesov on INTELLECT-1 Release: The First Globally Trained 10B Parameter Model

This is DiLoCo (Nov 2023 paper), a local SGD setup where the outer optimizer updates much more rarely (every 100-500 steps of the inner optimizers), asking for much less bandwidth (it uses Nesterov momentum in its state). The inner optimizers run within individual clusters, and the outer optimizer aggregates updates from individual clusters, using a much slower network that connects the clusters. The experiments were done with models of up to 400M parameters. (See also this paper on asynchronous variants of DiLoCo.)
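The inner/outer structure described above can be sketched on a toy problem. This is only an illustration under stated assumptions: the inner optimizer here is plain gradient descent on a least-squares loss (a stand-in for whatever the real runs use), the Nesterov update follows one common formulation, and every function name and hyperparameter is hypothetical rather than taken from the paper.

```python
import numpy as np

def inner_steps(theta, shard, steps, lr):
    """Run `steps` of plain gradient descent on one cluster's data (toy inner optimizer)."""
    X, y = shard
    theta = theta.copy()
    for _ in range(steps):
        grad = X.T @ (X @ theta - y) / len(y)
        theta -= lr * grad
    return theta

def outer_round(theta, velocity, shards, h, inner_lr, outer_lr, momentum):
    """One communication round: local training, then a single outer Nesterov step."""
    # Every cluster starts from the shared parameters and trains independently.
    local = [inner_steps(theta, shard, h, inner_lr) for shard in shards]
    # The outer "gradient" is the shared parameters minus the averaged local result.
    delta = theta - np.mean(local, axis=0)
    # Nesterov momentum on the outer gradient (assumed formulation).
    velocity = momentum * velocity + delta
    theta = theta - outer_lr * (delta + momentum * velocity)
    return theta, velocity

def mean_loss(theta, shards):
    return float(np.mean([np.mean((X @ theta - y) ** 2) for X, y in shards]))

def run_demo(rounds=50, seed=0):
    rng = np.random.default_rng(seed)
    true_theta = rng.normal(size=5)
    shards = []
    for _ in range(4):  # four simulated clusters
        X = rng.normal(size=(64, 5))
        shards.append((X, X @ true_theta))
    theta, velocity = np.zeros(5), np.zeros(5)
    start = mean_loss(theta, shards)
    for _ in range(rounds):
        # Clusters communicate once per round, i.e. once every h=10 inner steps.
        theta, velocity = outer_round(theta, velocity, shards, 10, 0.05, 0.7, 0.9)
    return start, mean_loss(theta, shards)
```

Calling `run_demo()` returns the loss before and after training; the point is only that parameters cross the slow link once per round instead of once per step.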

The original paper lacks good compute efficiency measurements. The distributed training experiments start from a checkpoint trained for 24K steps, continuing for 64K more steps (to a total of 88K) in various distributed configurations. Even for the non-distributed configuration the perplexity keeps increasing to step 29K (Figure 7b, Figure 9). The compute expended in a non-distributed run between steps 24K and 88K gets expended in an 8-cluster run between steps 24K and 32K, when perplexity barely starts going down from the global maximum. So there is no way of comparing how well an 8-cluster run uses its compute, because the non-distributed experiment stops so early (at 88K steps) that the uninformative poorly optimized early state of the model still dominates the distributed configuration that uses the same amount of compute (at 32K steps).
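The step arithmetic in this comparison can be checked with a short sketch. It assumes that one step of an N-cluster run costs N times the compute of one non-distributed step, which is how the comment's numbers work out; the helper name is hypothetical.

```python
def equal_compute_step(start_step, end_step, num_clusters):
    """Step at which an N-cluster run has spent the compute a
    single-cluster run spends between start_step and end_step."""
    single_cluster_steps = end_step - start_step
    # Each distributed step is assumed to cost num_clusters single-cluster steps.
    return start_step + single_cluster_steps // num_clusters

# Non-distributed run: 24K -> 88K steps; 8-cluster run starting from the same checkpoint.
print(equal_compute_step(24_000, 88_000, 8))  # 32000
```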

Prime Intellect first reproduced DiLoCo in Jul 2024 (blog post, paper) on models of up to 1.1B parameters, taking training efficiency measurements. The largest experiment with a 1.1B model runs across 4 nodes that communicate only every 125 steps, and matches perplexity of a similar training run within a single cluster (with communication every step) using 20% more compute (Figure 7, comparing with 4x batch size baseline).

The new 10B model lacks baselines for comparison, so doesn't help with understanding how training efficiency depends on scale, but the results on benchmarks seem similar to those of other models with similar size and number of training tokens (Table 4 in the blog post).

gwern on China Hawks are Manufacturing an AI Arms Race

Missing the point. This is not about being too stupid to think of >0 strategies, this is about being able & willing to execute strategies.

I too can think of 100 things, and I listed several diverse ways of responding and threw in a historical parallel just in case that wasn't clear after several paragraphs of discussing the problem with not having a viable strategy you can execute. Smartness is not the limit here: we are already smart enough to come up with strategies which could achieve the goal. All of those could potentially work. But none of them seem realistically on the table as something that the USA as it currently exists would be willing to commit to and see through to completion, and you will note that few critics - and no one serious - is responding something like, "oh sure, all part of the plan already, see our white paper laying out the roadmap: after we win, we would then order the AGIs to hack the planet and ensure our perpetual hegemony; that is indeed the exit plan. We botched it last time with nukes and stood by and let everyone else get nukes, but we'll follow through this time."

There is no difference between "won't execute a strategy" and "can't execute a strategy": they are the same thing. The point is that a strategy has to be executable or else it's not an actual strategy. And acting as if you can execute a strategy that you won't can lead you to take terrible decisions. You are like the cat who thinks before climbing a tree: "obviously, I will just climb back down", and who then proceeds to climb up and to not climb back down but mew piteously. Well, maybe you shouldn't've climbed up in the first place then...?

("arms race bros will srsly launch a global arms race by saying they'll use the decisive advantage from winning the arms race to conquer the world, and then will not conquer the world")

dagon on Is the mind a program?

Hmm, still not following, or maybe not agreeing. I think that "if the reasoning used to solve the problem is philosophical" then "correct solution" is not available. "useful", "consensus", or "applicable in current societal context" might be better evaluations of a philosophical reasoning.

sweenesm on Understanding Emergence in Large Language Models

Thanks for the post. I think it'd be helpful if you could add some links to references for some of the things you say, such as:

"For instance, between 10^10 and 10^11 parameters, models showed dramatic improvements in their ability to interpret emoji sequences representing movies."

joseph-miller on Joseph Miller's Shortform

There are two types of people in this world.

There are people who treat the lock on a public bathroom as a tool for communicating occupancy and a safeguard against accidental attempts to enter when the room is unavailable. For these people the standard protocol is to discern the likely state of engagement of the inner room and then tentatively proceed inside if they detect no signs of human activity.

And there are people who view the lock on a public bathroom as a physical barricade with which to temporarily defend possessed territory. They start by giving the door a hearty push to test the tensile strength of the barrier. On meeting resistance they engage with full force, wringing the handle up and down and slamming into the door with their full body weight. Only once their attempts are thwarted do they reluctantly retreat to find another stall.

cbiddulph on You should consider applying to PhDs (soon!)

Thanks, this post made me seriously consider applying to a PhD, and I strong-upvoted. I had vaguely assumed that PhDs take way too long and don't allow enough access to compute compared to industry AI labs. But considering the long lead time required for the application process and the reminder that you can always take new opportunities as they come up, I now think applying is worth it.

However, looking into it, putting together a high-quality application starting now and finishing by the deadline seems approximately impossible? If the deadline were December 15, that would give you two weeks; other schools like Berkeley have even earlier deadlines. I asked ChatGPT how long it would take to apply to just a single school, and it said it would take 43–59 hours of time spent working, or ~4–6 weeks in real time. Claude said 37–55 hours, or 4–6 weeks.

Not to discourage anyone from starting their application now if they think they can do it - I guess if you're sufficiently productive and agentic and maybe take some amphetamines, you can do anything. But this seems like a pretty crazy timeline. Just the thought of asking someone to write me a recommendation letter in a two-week timeframe makes me feel bad.

Your post does make me think "if I were going to be applying to a PhD next December, what would I want to do now?" That seems pretty clarifying, and would probably be a helpful frame even if it turns out that a better opportunity comes along and I never apply to a PhD.

I think it'd be a good idea for you to repost this in August or early September of next year!

nicholas-heather-kross on Why and When Interpretability Work is Dangerous

Kinda, my current mainline-doom-case is "some AI gets controlled --> powerful people use it to prop themselves up --> world gets worse until AI gets uncontrollably bad --> doom". I would call it a different yet also-important doom case of "perpetual low-grade-AI dictatorship where the AI is controlled by humans in a surveillance state".

sunwillrise on gwern's Shortform

All of these ideas sound awesome and exciting, and precisely the right kind of use of LLMs that I would like to see on LW!

sunwillrise on A shot at the diamond-alignment problem

"It's looking like the values of humans are far, far simpler than a lot of evopsych literature and Yudkowsky thought, and related to this, values are less fragile than people thought 15-20 years ago, in the sense that values generalize far better OOD than people used to think 15-20 years ago"

I'm not sure I like this argument very much, as it currently stands. It's not that I believe anything you wrote in this paragraph is wrong per se, but more like this misses the mark a bit in terms of framing.

Yudkowsky had (and, AFAICT, still has) a specific theory of human values in terms of what they mean in a reductionist framework, where it makes sense (and is rather natural) to think of (approximate) utility functions of humans and of Coherent Extrapolated Volition as things-that-exist-in-the-territory.

I think a lot of writing and analysis, summarized by me here, has cast a tremendous amount of doubt on the viability of this way of thinking and has revealed what seem to me to be impossible-to-patch holes at the core of these theories. I do not believe "human values" in the Yudkowskian sense ultimately make sense as a coherent concept that carves reality at the joints; I instead observe a tremendous number of unanswered questions and apparent contradictions that throw the entire edifice in disarray.

But supplementing this reorientation of thinking around what it means to satisfy human values has been "prosaic" alignment researchers pivoting more towards intent alignment as opposed to doomed-from-the-start paradigms like "learning the true human utility function" or ambitious value learning; a recognition that realism about (AGI) rationality is likely just straight-up false and that the very specific set of conclusions MIRI-clustered alignment researchers have reached about what AGI cognition will be like are entirely overconfident and seem contradicted by our modern observations of LLMs; and ultimately an increased focus on the basic observation that full value alignment simply is not required for a good AI outcome (or at the very least to prevent AI takeover). So it's not so much that human values (to the extent such a thing makes sense) are simpler, but more so that fulfilling those values is just not needed to nearly as high a degree as people used to think.

nadroj on Mechanistically Eliciting Latent Behaviors in Language Models

Couldn't you do something like fit a Gaussian to the model's activations, then restrict the steered activations to be high likelihood (low Mahalanobis distance)? Or (almost) equivalently, you could just do a whitening transformation to activation space before you constrain the L2 distance of the perturbation.

(If a gaussian isn't expressive enough you could model the manifold in some other way, eg. with a VAE anomaly detector or mixture of gaussians or whatever)
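A minimal sketch of the whitening variant of this suggestion, assuming activations are available as a NumPy array of shape (samples, dimensions). All function names are hypothetical, and the small ridge term added to the covariance is an assumption made for numerical stability. The L2 norm of a perturbation in whitened space equals its Mahalanobis norm under the fitted Gaussian, so capping one caps the other.

```python
import numpy as np

def fit_gaussian(acts, eps=1e-6):
    """Fit a mean and (ridge-regularized) covariance to activations."""
    mean = acts.mean(axis=0)
    cov = np.cov(acts, rowvar=False) + eps * np.eye(acts.shape[1])
    return mean, cov

def whitening_matrix(cov):
    """Return W with W.T @ W == inv(cov), via the Cholesky factor cov = L @ L.T."""
    L = np.linalg.cholesky(cov)
    return np.linalg.inv(L)

def mahalanobis_norm(delta, cov):
    """sqrt(delta^T cov^{-1} delta), computed without forming the inverse."""
    return float(np.sqrt(delta @ np.linalg.solve(cov, delta)))

def constrain_perturbation(delta, W, radius):
    """Rescale delta so that its whitened L2 norm is at most `radius`."""
    norm = np.linalg.norm(W @ delta)
    if norm <= radius:
        return delta
    # Whitening is linear, so scaling delta scales the whitened norm equally.
    return delta * (radius / norm)

def demo(seed=0):
    rng = np.random.default_rng(seed)
    # Synthetic correlated "activations" standing in for real model activations.
    acts = rng.normal(size=(500, 8)) @ rng.normal(size=(8, 8))
    _, cov = fit_gaussian(acts)
    W = whitening_matrix(cov)
    delta = rng.normal(size=8)
    clipped = constrain_perturbation(delta, W, radius=1.0)
    return delta, clipped, W, cov
```

This constrains the perturbation rather than the steered activation itself, which is the "(almost) equivalent" formulation in the comment; restricting the steered activation's distance from the fitted mean would need the mean as well.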