LessWrong 2.0 Reader

← previous page (newer posts) · next page (older posts) →

The Logistics of Distribution of Meaning: Against Epistemic Bureaucratization
Sahil · 2024-11-07T05:27:20.276Z · comments (1)

[link] Towards the Operationalization of Philosophy & Wisdom
Thane Ruthenis · 2024-10-28T19:45:07.571Z · comments (2)

Boston Secular Solstice 2024: Call for Singers and Musicians
jefftk (jkaufman) · 2024-11-15T13:50:07.827Z · comments (0)

[link] AI Model Registries: A Foundational Tool for AI Governance
Elliot Mckernon (elliot) · 2024-10-07T19:27:43.466Z · comments (1)

AI Can be “Gradient Aware” Without Doing Gradient hacking.
Sodium · 2024-10-20T21:02:10.754Z · comments (0)

Musings on Text Data Wall (Oct 2024)
Vladimir_Nesov · 2024-10-05T19:00:21.286Z · comments (2)

[link] Compression Moves for Prediction
adamShimi · 2024-09-14T17:51:12.004Z · comments (0)

[link] Does natural selection favor AIs over humans?
cdkg · 2024-10-03T18:47:43.517Z · comments (1)

[question] What is the alpha in one bit of evidence?
J Bostock (Jemist) · 2024-10-22T21:57:09.056Z · answers+comments (12)

A necessary Membrane formalism feature
ThomasCederborg · 2024-09-10T21:33:09.508Z · comments (6)

Simon DeDeo on Explore vs Exploit in Science
Elizabeth (pktechgirl) · 2024-09-10T03:40:08.311Z · comments (0)

Why Reflective Stability is Important
Johannes C. Mayer (johannes-c-mayer) · 2024-09-05T15:28:19.913Z · comments (2)

Review: “The Case Against Reality”
David Gross (David_Gross) · 2024-10-29T13:13:29.643Z · comments (9)

Announcing the PIBBSS Symposium '24!
DusanDNesic · 2024-09-03T11:19:47.568Z · comments (0)

[link] Fragile, Robust, and Antifragile Preference Satisfaction
adamShimi · 2024-11-02T17:25:55.986Z · comments (0)

D/acc AI Security Salon
Allison Duettmann (allison-duettmann) · 2024-10-19T22:17:57.067Z · comments (0)

[link] Update on the Mysterious Trump Buyers on Polymarket
Annapurna (jorge-velez) · 2024-11-04T19:22:06.540Z · comments (9)

Economics Roundup #4
Zvi · 2024-10-15T13:20:06.923Z · comments (4)

Looking for Goal Representations in an RL Agent - Update Post
CatGoddess · 2024-08-28T16:42:19.367Z · comments (0)

Lab governance reading list
Zach Stein-Perlman · 2024-10-25T18:00:28.346Z · comments (3)

[link] To Be Born in a Bag
Niko_McCarty (niko-2) · 2024-10-06T17:21:00.605Z · comments (1)

Why I'm bearish on mechanistic interpretability: the shards are not in the network
tailcalled · 2024-09-13T17:09:25.407Z · comments (40)

In the Name of All That Needs Saving
pleiotroth · 2024-11-07T15:26:12.252Z · comments (2)

[link] Should Sports Betting Be Banned?
Maxwell Tabarrok (maxwell-tabarrok) · 2024-09-21T14:13:35.404Z · comments (2)

Can Large Language Models effectively identify cybersecurity risks?
emile delcourt (emile-delcourt) · 2024-08-30T20:20:21.345Z · comments (0)

Advisors for Smaller Major Donors?
jefftk (jkaufman) · 2024-11-06T14:30:06.187Z · comments (2)

Avoiding the Bog of Moral Hazard for AI
Nathan Helm-Burger (nathan-helm-burger) · 2024-09-13T21:24:34.137Z · comments (12)

"Real AGI"
Seth Herd · 2024-09-13T14:13:24.124Z · comments (20)

Bridging the VLM and mech interp communities for multimodal interpretability
Sonia Joseph (redhat) · 2024-10-28T14:41:41.969Z · comments (5)

Word Spaghetti
Gordon Seidoh Worley (gworley) · 2024-10-23T05:39:20.105Z · comments (9)

[link] Why Swiss watches and Taylor Swift are AGI-proof
Kevin Kohler (KevinKohler) · 2024-09-05T13:23:27.033Z · comments (11)

OpenAI defected, but we can take honest actions
Remmelt (remmelt-ellen) · 2024-10-21T08:41:25.728Z · comments (15)

[link] Four Levels of Voting Methods
hive · 2024-09-26T18:15:00.565Z · comments (3)

Automating LLM Auditing with Developmental Interpretability
htlou · 2024-09-04T15:50:04.337Z · comments (0)

[question] Is there any rigorous work on using anthropic uncertainty to prevent situational awareness / deception?
David Scott Krueger (formerly: capybaralet) (capybaralet) · 2024-09-04T12:40:07.678Z · answers+comments (7)

[link] Instruction Following without Instruction Tuning
Bogdan Ionut Cirstea (bogdan-ionut-cirstea) · 2024-09-24T13:49:09.078Z · comments (0)

[link] GPT-4o Guardrails Gone: Data Poisoning & Jailbreak-Tuning
ChengCheng (ccstan99) · 2024-11-01T00:10:50.718Z · comments (0)

[link] College technical AI safety hackathon retrospective - Georgia Tech
yix (Yixiong Hao) · 2024-11-15T00:22:53.159Z · comments (0)

[link] AlignedCut: Visual Concepts Discovery on Brain-Guided Universal Feature Space
Bogdan Ionut Cirstea (bogdan-ionut-cirstea) · 2024-09-14T23:23:26.296Z · comments (1)

My career exploration: Tools for building confidence
lynettebye · 2024-09-13T11:37:55.843Z · comments (0)

[link] some questionable space launch guns
bhauth · 2024-10-13T22:52:26.418Z · comments (0)

Is Text Watermarking a lost cause?
egor.timatkov · 2024-10-01T16:20:51.113Z · comments (13)

[link] Jonathan Gorard: The territory is isomorphic to an equivalence class of its maps
Daniel C (harper-owen) · 2024-09-07T10:04:47.840Z · comments (18)

[question] Is this voting system strategy proof?
Donald Hobson (donald-hobson) · 2024-09-06T20:44:46.691Z · answers+comments (9)

Hiring a writer to co-author with me (Spencer Greenberg for ClearerThinking.org)
spencerg · 2024-10-27T17:34:50.479Z · comments (0)

Reducing global AI competition through the Commerce Control List and Immigration reform: a dual-pronged approach
Ben Smith (ben-smith) · 2024-09-03T05:28:24.549Z · comments (2)

[question] Is there a CFAR handbook audio option?
FinalFormal2 · 2024-10-26T17:08:36.480Z · answers+comments (0)

[link] Why good things often don’t lead to better outcomes
DMMF · 2024-09-19T16:37:07.778Z · comments (1)

Slave Morality: A place for every man and every man in his place
Martin Sustrik (sustrik) · 2024-09-19T04:20:04.491Z · comments (7)

Appealing to the Public
jefftk (jkaufman) · 2024-10-23T19:00:07.669Z · comments (0)

Recent comments

linch on The Median Researcher Problem

If the means are higher, the tails are usually higher as well.

Norm(μ=115, σ=15) distribution will have a much lower proportion of data points above 150 than Norm(μ=130, σ=15). Same argument for other realistic distributions.
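
The comparison above can be checked numerically. This is only a minimal sketch using Python's standard-library `statistics.NormalDist`; the helper name `tail_above` is hypothetical, and the threshold of 150 and the two distributions are taken from the comment. Under these assumptions the share above 150 comes out at roughly 1% for the lower mean and roughly 9% for the higher one.

```python
from statistics import NormalDist


def tail_above(mean: float, sd: float, threshold: float) -> float:
    """Return the fraction of a normal distribution that lies above threshold.

    Hypothetical helper for illustration; not part of the original comment.
    """
    return 1.0 - NormalDist(mu=mean, sigma=sd).cdf(threshold)


# Values taken from the comment: same sigma, different means, cutoff at 150.
low_mean_tail = tail_above(115, 15, 150)
high_mean_tail = tail_above(130, 15, 150)

print(f"P(X > 150 | mu=115, sigma=15) = {low_mean_tail:.4f}")
print(f"P(X > 150 | mu=130, sigma=15) = {high_mean_tail:.4f}")
print(f"ratio = {high_mean_tail / low_mean_tail:.1f}x")
```

A 15-point shift in the mean (one standard deviation) multiplies the share above the cutoff several times over, which is the sense in which a higher mean implies a heavier upper tail here.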

abramdemski on 5 ways to improve CoT faithfulness

Would you be interested in having a (perhaps brief) LW Dialogue about it where you start with a basic intro to your shoggoth/mask division-of-labor, and I then present my critique?

abramdemski on o1 is a bad idea

My point here is that at the capability level of GPT4, this distinction isn't very important. There's no way to know for sure until it is too late, of course, but it seems pretty plausible that GPT4 isn't cleverly scheming. It is merely human-level at deception, and doesn't pursue any coherent overarching goal with it. It clumsily muddles through with mildly-above-average-for-human convincingness. For most queries (it seems plausible to me) it isn't even adequately coherent to make a crisp distinction between whether it's honestly trying to answer the question vs deceptively trying to make an answer look good; at its level of capability, it's mostly the same thing one way or the other. The exceptions to this "mostly" aren't strategic enough that we expect them to route around obstacles cleverly.

It isn't much, but it is more than I naively expected.

johnswentworth on johnswentworth's Shortform

FYI, my update from this comment was:

Hmm, seems like a decent argument...
... except he said "we don't know that it doesn't work", which is an extremely strong update that it will clearly not work.

abramdemski on o1 is a bad idea

I think the crux is that the important part of LLMs re safety isn't their safety properties specifically, but rather the evidence they give about what alignment-relevant properties future AIs will have

[insert standard skepticism about these sorts of generalizations when generalizing to superintelligence]

But what lesson do you think you can generalize, and why do you think you can generalize that?

I think this is a crux, in that I don't buy o1 as progressing to a regime where we lose so much dense feedback that it's alignment relevant, because I think sparse-feedback RL will almost certainly be super-uncompetitive with every other AI architecture until well after AI automates all alignment research.

So, as a speculative example, further along in the direction of o1 you could have something like MCTS help train these things to solve very difficult math problems, with the sparse feedback being given for complete formal proofs.

Similarly, playing text-based video games, with the sparse feedback given for winning.

Similarly, training CoT to reason about code, with sparse feedback given for predictions of the code output.

Etc.

You think these sorts of things just won't work well enough to be relevant?

christiankl on The Online Sports Gambling Experiment Has Failed

Most people don't have a very fixed idea of how large a 28% overall increase in bankruptcies is.

If you asked most people without a libertarian outlook to rank different factors that lead to an increase in bankruptcy, I would not expect them to be able to compare those factors accurately and conclude that online sports gambling alone has such a strong influence.

otto-barten on Proposing the Conditional AI Safety Treaty (linkpost TIME)

I'm aware and I don't disagree. However, in xrisk, many (not all) of those who are most worried are also most bullish about capabilities. Conversely, many (not all) who are not worried are unimpressed with capabilities. Being aware of the concept of AGI, that it may be coming soon, and of how impactful it could be, is in practice often a first step towards becoming concerned about the risks, too. Unfortunately, this is not true for everyone. Still, I would say that at least for our chances to get an international treaty passed, it is perhaps hopeful that the power of AGI is on the radar of leading politicians (although this may also increase risk through other paths).

evhub on Sabotage Evaluations for Frontier Models

The usual plan for control as I understand it is that you use control techniques to ensure the safety of models that are good enough at doing alignment research themselves that you can then leverage your controlled human-ish-level models to help you align future superhuman models.

christiankl on Lao Mein's Shortform

I would add that convincing Musk to take action against Altman is the highest ROI thing I can think of in terms of decreasing AI extinction risk.

I would expect the issue isn't about convincing Musk to take action but about finding effective actions that Musk could take.

satron on Sabotage Evaluations for Frontier Models

I do get the point that you are making, but I think this is a little bit unfair to these organizations. Articles like Machines of Loving Grace, The Intelligence Age and Planning for AGI and Beyond are implicit public justifications for building AGI.

These labs have also released their plans on "safe development". I expect a big part of what they say to be the usual business marketing, but it's not like they are completely ignoring the issue. In fact, taking one example, Anthropic's research papers on safety are often discussed on this site as genuine improvements on this or that niche of AI Safety.

I don't think that money alone would've convinced CEOs of big companies to run this enterprise. Altman and Amodei both have families. If they don't care about their own families, then they at least care about themselves. After all, we are talking about scenarios where these guys would die the same deaths as the rest of us. No amount of hoarded money would save them. They would have little motivation to do any of this if they believed that they would die as a result of their own actions. And that's not mentioning all of the other researchers working at their labs. Just Anthropic and OpenAI together have almost 2000 employees. Do they all not care about their and their families' well-being?

I think the point about them not engaging with critics is also a bit too harsh. Here is the DeepMind alignment team's response to concerns raised by Yudkowsky. I am not saying that their response is flawless or even correct, but it is a response nonetheless. They are engaging with this work. DeepMind's alignment team also seemed to engage with concerns raised by critics in their (relatively) recent work.

EDIT: Another example would be Anthropic creating a dedicated team for stress testing their alignment proposals. And as far as I can see, this team is led by someone who has been actively engaged with the topic of AI safety on LessWrong, someone who you sort of praised a few days ago.