LessWrong 2.0 Reader

To Boldly Code
StrivingForLegibility · 2024-01-26T18:25:59.525Z · comments (4)

[link] Introduction to Super Powers (for kids!)
Shoshannah Tekofsky (DarkSym) · 2024-09-20T17:17:27.070Z · comments (0)

[link] Sticker Shortcut Fallacy — The Real Worst Argument in the World
ymeskhout · 2024-06-12T14:52:41.988Z · comments (15)

Virtually Rational - VRChat Meetup
Tomás B. (Bjartur Tómas) · 2024-01-28T05:52:36.934Z · comments (3)

Using an LLM perplexity filter to detect weight exfiltration
Adam Karvonen (karvonenadam) · 2024-07-21T18:18:05.612Z · comments (11)

[link] Was a Subway in New York City Inevitable?
Jeffrey Heninger (jeffrey-heninger) · 2024-03-30T00:53:21.314Z · comments (4)

[link] Structured Transparency: a framework for addressing use/mis-use trade-offs when sharing information
habryka (habryka4) · 2024-04-11T18:35:44.824Z · comments (0)

[link] Announcing Open Philanthropy's AI governance and policy RFP
Julian Hazell (julian-hazell) · 2024-07-17T02:02:39.933Z · comments (0)

The case for more Alignment Target Analysis (ATA)
Chi Nguyen · 2024-09-20T01:14:41.411Z · comments (13)

An experiment on hidden cognition
Olli Järviniemi (jarviniemi) · 2024-07-22T03:26:05.564Z · comments (2)

AXRP Episode 30 - AI Security with Jeffrey Ladish
DanielFilan · 2024-05-01T02:50:04.621Z · comments (0)

Economics Roundup #1
Zvi · 2024-03-26T14:00:06.332Z · comments (4)

[link] MIRI's July 2024 newsletter
Harlan · 2024-07-15T21:28:17.343Z · comments (2)

[link] A primer on the next generation of antibodies
Abhishaike Mahajan (abhishaike-mahajan) · 2024-09-01T22:37:59.207Z · comments (0)

Fun With The Tabula Muris (Senis)
sarahconstantin · 2024-09-20T18:20:01.901Z · comments (0)

Housing Roundup #9: Restricting Supply
Zvi · 2024-07-17T12:50:05.321Z · comments (8)

A Visual Task that's Hard for GPT-4o, but Doable for Primary Schoolers
Lennart Finke (l-f) · 2024-07-26T17:51:28.202Z · comments (4)

[link] An Intuitive Explanation of Sparse Autoencoders for Mechanistic Interpretability of LLMs
Adam Karvonen (karvonenadam) · 2024-06-25T15:57:16.872Z · comments (0)

Can Generalized Adversarial Testing Enable More Rigorous LLM Safety Evals?
scasper · 2024-07-30T14:57:06.807Z · comments (0)

Standard SAEs Might Be Incoherent: A Choosing Problem & A “Concise” Solution
Kola Ayonrinde (kola-ayonrinde) · 2024-10-30T22:50:45.642Z · comments (0)

[link] Truth is Universal: Robust Detection of Lies in LLMs
Lennart Buerger · 2024-07-19T14:07:25.162Z · comments (3)

[link] Robert Caro And Mechanistic Models In Biography
adamShimi · 2024-07-14T10:56:42.763Z · comments (5)

Scientific Method
Andrij “Androniq” Ghorbunov (andrij-androniq-ghorbunov) · 2024-02-18T21:06:45.228Z · comments (4)

[link] Scenario planning for AI x-risk
Corin Katzke (corin-katzke) · 2024-02-10T00:14:11.934Z · comments (12)

Language and Capabilities: Testing LLM Mathematical Abilities Across Languages
Ethan Edwards · 2024-04-04T13:18:54.909Z · comments (2)

[link] Was Partisanship Good for the Environmental Movement?
Jeffrey Heninger (jeffrey-heninger) · 2024-05-15T17:30:54.796Z · comments (0)

[link] Let's Design A School, Part 2.3 School as Education - The Curriculum (Phase 2, Specific)
Sable · 2024-05-15T20:58:50.981Z · comments (0)

UDT1.01: Local Affineness and Influence Measures (2/10)
Diffractor · 2024-03-31T07:35:52.831Z · comments (0)

[link] Tokyo AI Safety 2025: Call For Papers
Blaine (blaine-rogers) · 2024-10-21T08:43:38.467Z · comments (0)

[link] Altruism and Vitalism Aren't Fellow Travelers
Arjun Panickssery (arjun-panickssery) · 2024-08-09T02:01:11.361Z · comments (2)

[link] "25 Lessons from 25 Years of Marriage" by honorary rationalist Ferrett Steinmetz
CronoDAS · 2024-10-02T22:42:30.509Z · comments (2)

AI Safety University Organizing: Early Takeaways from Thirteen Groups
agucova · 2024-10-02T15:14:00.137Z · comments (0)

[link] Cellular respiration as a steam engine
dkl9 · 2024-02-25T20:17:38.788Z · comments (1)

[question] What percent of the sun would a Dyson Sphere cover?
Raemon · 2024-07-03T17:27:50.826Z · answers+comments (26)

How Congressional Offices Process Constituent Communication
Tristan Williams (tristan-williams) · 2024-07-02T12:38:41.472Z · comments (0)

SAEs you can See: Applying Sparse Autoencoders to Clustering
Robert_AIZI · 2024-10-28T14:48:16.744Z · comments (0)

A Basic Economics-Style Model of AI Existential Risk
Rubi J. Hudson (Rubi) · 2024-06-24T20:26:09.744Z · comments (3)

Best-of-n with misaligned reward models for Math reasoning
Fabien Roger (Fabien) · 2024-06-21T22:53:21.243Z · comments (0)

Evolution did a surprising good job at aligning humans...to social status
Eli Tyre (elityre) · 2024-03-10T19:34:52.544Z · comments (37)

Distinctions when Discussing Utility Functions
ozziegooen · 2024-03-09T20:14:03.592Z · comments (7)

[link] The Living Planet Index: A Case Study in Statistical Pitfalls
Jan_Kulveit · 2024-06-24T10:05:55.101Z · comments (0)

Population ethics and the value of variety
cousin_it · 2024-06-23T10:42:21.402Z · comments (11)

[link] Alignment work in anomalous worlds
Tamsin Leake (carado-1) · 2023-12-16T19:34:26.202Z · comments (4)

[link] The absence of self-rejection is self-acceptance
Chipmonk · 2023-12-21T21:54:52.116Z · comments (1)

Even if we lose, we win
Morphism (pi-rogers) · 2024-01-15T02:15:43.447Z · comments (17)

Defense Against The Dark Arts: An Introduction
Lyrongolem (david-xiao) · 2023-12-25T06:36:06.278Z · comments (36)

[question] Would you have a baby in 2024?
martinkunev · 2023-12-25T01:52:04.358Z · answers+comments (76)

A bet on critical periods in neural networks
kave · 2023-11-06T23:21:17.279Z · comments (1)

A conceptual precursor to today's language machines [Shannon]
Bill Benzon (bill-benzon) · 2023-11-15T13:50:51.226Z · comments (6)

[link] Compensating for Life Biases
Jonathan Moregård (JonathanMoregard) · 2024-01-09T14:39:14.229Z · comments (6)

Recent comments

evhub on evhub's Shortform

COI: I work at Anthropic and I ran this by Anthropic before posting, but all views are exclusively my own.

I got some private questions about Anthropic's partnership with Palantir using Claude for U.S. government intelligence analysis and whether I support it and think it's reasonable, so I figured I would just write a shortform here with my thoughts. First, I can say that Anthropic has been extremely forthright about this internally, and it didn't come as a surprise to me at all. Second, my personal take would be that I think it's actually good that Anthropic is doing this. If you take catastrophic risks from AI seriously, the U.S. government is an extremely important actor to engage with, and trying to just block the U.S. government out of using AI is not a viable strategy. I do think there are some lines that you'd want to think about very carefully before considering crossing, but using Claude for intelligence analysis seems definitely fine to me. Ezra Klein has a great article on "The Problem With Everything-Bagel Liberalism" and I sometimes worry about Everything-Bagel AI Safety where e.g. it's not enough to just focus on catastrophic risks, you also have to prevent any way that the government could possibly misuse your models. I think it's important to keep your eye on the ball and not become too susceptible to an Everything-Bagel failure mode.

ryan_greenblatt on Zach Stein-Perlman's Shortform

Why not think that sandbagging could reduce productivity by ~100%? I think it can do this in human research organizations.

I happen to think this is pretty doable to address with countermeasures, but certainly without countermeasures this could be quite brutal.

ryan_greenblatt on Zach Stein-Perlman's Shortform

As far as safety research sabotage, I'm often imagining careful sabotage rather than generic sandbagging.

As in, the AI intentionally inserts very subtle errors at adversarially determined times. It might need to maintain plausible deniability in case it is caught a small number of times, but you should generally imagine an error pattern that is quite different from just making more mistakes or being dumber.

I can imagine that this results in your automated safety research being worthless or of negative value.

TBC, it seems reasonably likely to me that in the relevant regime, basic measures like having humans spend a long time trying to understand an output suffice for catching bad behavior and/or training out bad behavior.

stephen-fowler on Thomas Kwa's Shortform

The lack of a robust, highly general paradigm for reasoning about AGI models is currently the greatest technical problem, although it is not what most people are working on.

What features of architecture of contemporary AI models will occur in future models that pose an existential risk?

What behavioral patterns of contemporary AI models will be shared with future models that pose an existential risk?

Is there a useful and general mathematical/physical framework that describes how agentic, macroscopic systems process information and interact with the environment?

Does terminology adopted by AI Safety researchers like "scheming", "inner alignment" or "agent" carve nature at the joints?

stephen-fowler on quila's Shortform

I upvoted because I imagine more people reading this would slightly nudge group norms in a direction that is positive.

But being cynical:

I'm sure you believe that this is true, but I doubt that it is literally true.
Signalling this position is very low risk when the community is already on board.
Trying to do good may be insufficient if your work on alignment ends up being dual use.

connor-kissane on SAEs are highly dataset dependent: a case study on the refusal direction

Thanks! I'm not sure. My guess is that if you go super narrow, it may be more likely to result in an inconvenient level of "feature splitting". Since there are only a few total concepts to learn, an SAE of equivalent width might exploit its greater relative capacity to learn niche combinations of features (to reduce sparsity loss).

stephen-fowler on Daniel Kokotajlo's Shortform

My reply definitely missed that you were talking about tunnel densities beyond what has been historically seen.

I'm inclined to agree with your argument that there is a phase shift, but it seems like it has less to do with the fact that there are tunnels, and more to do with the geography becoming less tunnel-like and more open.

I have a couple thoughts on your model that aren't direct refutations of anything you've said here:

I think the single term "density" is too crude a measure to get a good predictive model of how combat would play out. I'd expect there to be many parameters that describe a tunnel system and have a direct tactical impact. From your discussion of mines, I think "density" is referring to the number of edges in the network? I'd expect tunnel width, geometric layout, etc. would change how either side behaves.
I'm not sure about your background, but with zero hours of military combat under my belt, I doubt I can predict how modern subterranean combat plays out in tunnel systems with architectures that are beyond anything seen before in history.

l-rudolf-l on Survival without dignity

You have restored my faith in LessWrong! I was getting worried that despite 200+ karma and 20+ comments, no one had actually nitpicked the descriptions of what actually happens.

The zaps of light are diffraction limited.

In practice, if you want the atmospheric nanobots to zap stuff, you'll need to do some complicated mirroring because you need to divert sunlight. And it's not one contiguous mirror but lots of small ones. But I think we can still model this as basic diffraction with some circular mirror / lens.

Intensity $I = \frac{c_{e} E}{π r^{2}}$, where $E$ is the total power of sunlight falling on the mirror disk, $r$ is the radius of the Airy disk, and $c_{e}$ is an efficiency constant I've thrown in (because of things like atmospheric absorption (Claude says, somewhat surprisingly, this shouldn't be ridiculously large), and not all the energy in the diffraction pattern being in the Airy disk (about 84% is, says Claude), etc.).

Now, $E = π \left(\frac{D}{2}\right)^{2} L$, where $D$ is the diameter of the mirror configuration and $L$ is the solar irradiance. And $r = θ l$, where $l$ is the focal length (distance from mirror to target) and $θ \approx 1.22 λ / D$ is the angular size of the central spot.

So we have $I \approx \frac{c_{e} L D^{4}}{1.22^{2} \times 4 λ^{2} l^{2}}$, so the required mirror configuration diameter is $D = \sqrt[4]{\frac{1.22^{2} \times 4 I λ^{2} l^{2}}{c_{e} L}}$.

Plugging in some reasonable values like $λ \approx 5 \times 10^{- 7}$ m (average incoming sunlight - yes the concentration suffers a bit because it's not all this wavelength), $I = 10^{7}$ W/m^2 (the level of an industrial laser that can cut metal), $l = 10^{4}$ m (lower stratosphere), $L = 1361$ W/m^2 (solar irradiance), and a conservative guess that 99% of power is wasted so $c_{e} = 0.01$ , we get $D \approx 18$ m (and the resulting beam is about 3mm wide).
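As a cross-check, here is a minimal Python sketch that evaluates the formulas above with the listed inputs. The function and constant names are hypothetical, chosen only for this illustration. With exactly these inputs the expression for $D$ evaluates to roughly 3.2 m (with an Airy-disk radius of about 1.9 mm), which is smaller than the 18 m quoted, so the quoted figure may rest on different inputs or rounding.

```python
import math

# Inputs quoted in the comment (SI units).
WAVELENGTH = 5e-7      # λ, metres
TARGET_INTENSITY = 1e7 # I, W/m^2 (industrial cutting laser)
FOCAL_LENGTH = 1e4     # l, metres (mirror-to-target distance)
IRRADIANCE = 1361.0    # L, W/m^2 (solar irradiance)
EFFICIENCY = 0.01      # c_e, assumes 99% of power is wasted


def mirror_diameter(intensity, wavelength, focal_length, irradiance, efficiency):
    """D = (1.22^2 * 4 * I * λ^2 * l^2 / (c_e * L)) ** (1/4)."""
    numerator = 1.22 ** 2 * 4 * intensity * wavelength ** 2 * focal_length ** 2
    return (numerator / (efficiency * irradiance)) ** 0.25


def airy_radius(diameter, wavelength, focal_length):
    """r = θ * l, with θ ≈ 1.22 * λ / D."""
    return 1.22 * wavelength / diameter * focal_length


def spot_intensity(diameter, wavelength, focal_length, irradiance, efficiency):
    """I = c_e * E / (π r^2), with E = π (D/2)^2 * L."""
    power = math.pi * (diameter / 2) ** 2 * irradiance
    radius = airy_radius(diameter, wavelength, focal_length)
    return efficiency * power / (math.pi * radius ** 2)


D = mirror_diameter(TARGET_INTENSITY, WAVELENGTH, FOCAL_LENGTH, IRRADIANCE, EFFICIENCY)
r = airy_radius(D, WAVELENGTH, FOCAL_LENGTH)
print(f"mirror diameter D = {D:.2f} m")
print(f"Airy disk radius r = {r * 1000:.2f} mm")
```

Because intensity scales as $D^{4}$ in this model, any mirror larger than the computed $D$ comfortably exceeds the target intensity, so the conclusion that a modest mirror suffices does not depend on the exact figure.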

So a few dozen metres of upper atmosphere nanobots should actually give you a pretty ridiculous concentration of power!

(I did not know this when I wrote the story; I am quite surprised the required diameter is this ridiculously tiny. But I had heard of the concept of a "weather machine" like this from the book Where is my flying car?, which I've reviewed here, which suggests that this is possible.)

Partly because it's hard to tell between an actual animal and a bunch of nanobots pretending to be an animal. So you can't zap the nanobots on the ground without making the ground uninhabitable for humans.

I don't really buy this, why is it obvious the nanobots could pretend to be an animal so well that it's indistinguishable? Or why would targeted zaps have bad side-effects?

The "California red tape" thing implies some alignment strategy that stuck the AI to obey the law, and didn't go too insanely wrong despite a superintelligence looking for loopholes

Yeah, successful alignment to legal compliance was established without any real justification halfway through. (How to do this is currently an open technical problem, which, alas, I did not manage to solve for my satirical short story.)

Convince humans that dyson sphere are pretty and don't block the view?

This is a good point, especially since high levels of emotional manipulation were an established in-universe AI capability. (The issue described with the Dyson sphere was less that it itself would block the view, and more that building it would require dismantling the planets in a way that ruins the view - though now I'm realising that "if the sun on Earth is blocked, all Earthly views are gone" is a simpler reason and removes the need for building anything on the other planets at all.)

There is also no clear explanation of why someone somewhere doesn't make a non-red-taped AI.

Yep, this is a plot hole.

cstinesublime on Oxidize's Shortform

I can't speak for the community but after having glanced at your entire post I can't be sure just what it is about. The closest you come to explaining it is near the end you promise to present a "high-level theory on the functional realities" that seem to be related to everything from increased military spending to someone accidentally creating a virus in the lab that wipes out humanity to combating cognitive bias. But what is your theory?

Your post also makes a number of generalized assumptions about the reader and human nature and invokes the pronoun "we" far too many times. I'm a hypocrite for pointing that out, because I tend to do it as well - but the problem is that unless you have a very narrow audience in mind, especially a community that you are native to and know intimately, you often run the risk of making assumptions or statements that they will at best be confused by, and at worst get defensive about being included in.

Most of your assumptions aren't backed up by specific examples or citations to research. For example, in your first sentence you say that we subconsciously optimize for there being no major societal changes precipitated by technology. You don't back this up. I would assume that the existence of gold bugs alone shows there is a huge contingent of people who invest real money based precisely on the fact that they can't anticipate what major economic changes future technologies might bring. There are currently billions of dollars being spent by firms like Apple, Google, and even JP Morgan Chase on A.I. assistants, in anticipation of a major change.

I could go through all these general assumptions one by one, but there are too many for it to be worth my while. Not only that, most of the footnotes you use don't make reference to any concepts or observations which are particularly new or alien. The Pareto principle, Compound Effect, Rumsfeld's Epistemology... I would expect your average LessWrong reader is very familiar with these; they present no new insights.

danwil on The Median Researcher Problem

If an outsider's objective is to be taken seriously, they should write papers and submit them to peer review (e.g. conferences and journals).

Yann LeCun has gone so far as to say that independent work only counts as "science" if submitted to peer review:

"Without peer review and reproducibility, chances are your methodology was flawed and you fooled yourself into thinking you did something great." - https://x.com/ylecun/status/1795589846771147018?s=19.

In my experience, professors are very open to discussing ideas and their work with anyone who seems serious, interested, and knowledgeable. Even someone inside academia will face skepticism if their work uses completely different methods. They will have to prove very convincingly that the methods are valid.