LessWrong 2.0 Reader

[link] If-Then Commitments for AI Risk Reduction [by Holden Karnofsky]
habryka (habryka4) · 2024-09-13T19:38:53.194Z · comments (0)

[link] Predicting Influenza Abundance in Wastewater Metagenomic Sequencing Data
jefftk (jkaufman) · 2024-09-23T17:25:58.380Z · comments (0)

There aren't enough smart people in biology doing something boring
Abhishaike Mahajan (abhishaike-mahajan) · 2024-10-21T15:52:04.482Z · comments (13)

Interpretability of SAE Features Representing Check in ChessGPT
Jonathan Kutasov (jonathan-kutasov) · 2024-10-05T20:43:36.679Z · comments (2)

[question] What prevents SB-1047 from triggering on deep fake porn/voice cloning fraud?
ChristianKl · 2024-09-26T09:17:39.088Z · answers+comments (21)

Investigating Sensitive Directions in GPT-2: An Improved Baseline and Comparative Analysis of SAEs
Daniel Lee (daniel-lee) · 2024-09-06T02:28:41.954Z · comments (0)

European Progress Conference
Martin Sustrik (sustrik) · 2024-10-06T11:10:03.819Z · comments (11)

[link] Liquid vs Illiquid Careers
vaishnav92 · 2024-10-20T23:03:49.725Z · comments (5)

Superintelligence Can't Solve the Problem of Deciding What You'll Do
Vladimir_Nesov · 2024-09-15T21:03:28.077Z · comments (11)

Cheap Whiteboards!
Johannes C. Mayer (johannes-c-mayer) · 2024-08-08T13:52:59.627Z · comments (2)

Distinguishing ways AI can be "concentrated"
Matthew Barnett (matthew-barnett) · 2024-10-21T22:21:13.666Z · comments (2)

[link] Evaluating Synthetic Activations composed of SAE Latents in GPT-2
Giorgi Giglemiani (Rakh) · 2024-09-25T20:37:48.227Z · comments (0)

[question] Any real toeholds for making practical decisions regarding AI safety?
lukehmiles (lcmgcd) · 2024-09-29T12:03:08.084Z · answers+comments (6)

Domain-specific SAEs
jacob_drori (jacobcd52) · 2024-10-07T20:15:38.584Z · comments (0)

[question] Why do Minimal Bayes Nets often correspond to Causal Models of Reality?
Dalcy (Darcy) · 2024-08-03T12:39:44.085Z · answers+comments (1)

An AI crash is our best bet for restricting AI
Remmelt (remmelt-ellen) · 2024-10-11T02:12:03.491Z · comments (1)

Just because an LLM said it doesn't mean it's true: an illustrative example
dirk (abandon) · 2024-08-21T21:05:59.691Z · comments (12)

Optimizing Repeated Correlations
SatvikBeri · 2024-08-01T17:33:23.823Z · comments (1)

[question] Seeking AI Alignment Tutor/Advisor: $100–150/hr
MrThink (ViktorThink) · 2024-10-05T21:28:16.491Z · answers+comments (3)

Do Sparse Autoencoders (SAEs) transfer across base and finetuned language models?
Taras Kutsyk · 2024-09-29T19:37:30.465Z · comments (7)

The causal backbone conjecture
tailcalled · 2024-08-17T18:50:14.577Z · comments (0)

Why is there Nothing rather than Something?
Logan Zoellner (logan-zoellner) · 2024-10-26T12:37:50.204Z · comments (3)

SAE features for refusal and sycophancy steering vectors
neverix · 2024-10-12T14:54:48.022Z · comments (4)

[link] Arithmetic Models: Better Than You Think
kqr · 2024-10-26T09:42:07.185Z · comments (4)

Sleeping on Stage
jefftk (jkaufman) · 2024-10-22T00:50:07.994Z · comments (3)

[link] Can a Bayesian Oracle Prevent Harm from an Agent? (Bengio et al. 2024)
mattmacdermott · 2024-09-01T07:46:26.647Z · comments (0)

LessWrong email subscriptions?
Raemon · 2024-08-27T21:59:56.855Z · comments (6)

[link] Care Doesn't Scale
stavros · 2024-10-28T11:57:38.742Z · comments (1)

[link] Conventional footnotes considered harmful
dkl9 · 2024-10-01T14:54:01.732Z · comments (16)

[question] When can I be numerate?
FinalFormal2 · 2024-09-12T04:05:27.710Z · answers+comments (3)

Fun With The Tabula Muris (Senis)
sarahconstantin · 2024-09-20T18:20:01.901Z · comments (0)

You're Playing a Rough Game
jefftk (jkaufman) · 2024-10-17T19:20:06.251Z · comments (2)

[link] Beware the science fiction bias in predictions of the future
Nikita Sokolsky (nikita-sokolsky) · 2024-08-19T05:32:47.372Z · comments (20)

[link] Fictional parasites very different from our own
Abhishaike Mahajan (abhishaike-mahajan) · 2024-09-08T14:59:39.080Z · comments (0)

[link] Introduction to Super Powers (for kids!)
Shoshannah Tekofsky (DarkSym) · 2024-09-20T17:17:27.070Z · comments (0)

The case for more Alignment Target Analysis (ATA)
Chi Nguyen · 2024-09-20T01:14:41.411Z · comments (13)

[link] A primer on the next generation of antibodies
Abhishaike Mahajan (abhishaike-mahajan) · 2024-09-01T22:37:59.207Z · comments (0)

[link] SB 1047 gets vetoed
ryan_b · 2024-09-30T15:49:38.609Z · comments (1)

[link] Death notes - 7 thoughts on death
Nathan Young · 2024-10-28T15:01:13.532Z · comments (1)

Proving the Geometric Utilitarian Theorem
StrivingForLegibility · 2024-08-07T01:39:10.920Z · comments (0)

A Triple Decker for Elfland
jefftk (jkaufman) · 2024-10-11T01:50:02.332Z · comments (0)

AXRP Episode 36 - Adam Shai and Paul Riechers on Computational Mechanics
DanielFilan · 2024-09-29T05:50:02.531Z · comments (0)

[link] what becoming more secure did for me
Chipmonk · 2024-08-22T17:44:48.525Z · comments (5)

Improving Model-Written Evals for AI Safety Benchmarking
Sunishchal Dev (sunishchal-dev) · 2024-10-15T18:25:08.179Z · comments (0)

I didn't think I'd take the time to build this calibration training game, but with websim it took roughly 30 seconds, so here it is!
mako yass (MakoYass) · 2024-08-02T22:35:21.136Z · comments (2)

Trying to be rational for the wrong reasons
Viliam · 2024-08-20T16:18:06.385Z · comments (8)

Seeking Mechanism Designer for Research into Internalizing Catastrophic Externalities
c.trout (ctrout) · 2024-09-11T15:09:48.019Z · comments (2)

[link] Altruism and Vitalism Aren't Fellow Travelers
Arjun Panickssery (arjun-panickssery) · 2024-08-09T02:01:11.361Z · comments (2)

[link] "25 Lessons from 25 Years of Marriage" by honorary rationalist Ferrett Steinmetz
CronoDAS · 2024-10-02T22:42:30.509Z · comments (2)

the Daydication technique
chaosmage · 2024-10-18T21:47:46.448Z · comments (0)


Recent comments

johnswentworth on Three Notions of "Power"

Good guess, but that's not cruxy for me. Yes, LDT/FDT-style things are one possibility. But even if those fail, I still expect non-hierarchical coordination mechanisms among highly capable agents.

Gesturing more at where the intuition comes from: compare hierarchical management to markets, as a control mechanism. Markets require clean factorization - a production problem needs to be factored into production of standardized, verifiable intermediate goods in order for markets to handle the production pipeline well. If that can be done, then markets scale very well, they pass exactly the information and incentives people need (in the form of prices). Hierarchies, in contrast, scale very poorly. They provide basically-zero built-in mechanisms for passing the right information between agents, or for providing precise incentives to each agent. They're the sort of thing which can work ok at small scale, where the person at the top can track everything going on everywhere, but quickly become extremely bottlenecked on the top person as you scale up. And you can see this pretty clearly at real-world companies: past a very small size, companies are usually extremely bottlenecked on the attention of top executives, because lower-level people lack the incentives/information to coordinate on their own across different parts of the company.

(Now, you might think that an AI in charge of e.g. a company could make the big hierarchy work efficiently by just being capable enough to track everything itself. But at that point, I wouldn't expect to see a hierarchy at all; the AI can just do everything itself and not have multiple agents in the first place. Unlike humans, AIs will not be limited by their number of hands. If there is to be some arrangement involving multiple agents coordinating in the first place, then it shouldn't be possible for one mind to just do everything itself.)

On the other hand, while dominance relations scale very poorly as a coordination mechanism, they are algorithmically relatively simple. Thus my claim from the post that dominance seems like a hack for low-capability agents, and higher-capability agents will mostly rely on some other coordination mechanism.

warty on avturchin's Shortform

burning the dog defense commons 😔

zach-stein-perlman on Zach Stein-Perlman's Shortform

Some not-super-ambitious asks for labs (in progress):

Do evals on dangerous-capabilities-y and agency-y tasks; look at the score before releasing or internally deploying the model
Have a high-level capability threshold at which securing model weights is very important
Do safety research at all
Have a safety plan at all; talk publicly about safety strategy at all
- Have a post like Core Views on AI Safety
- Have a post like The Checklist
- On the website, clearly describe a worst-case plausible outcome from AI and state its credence in such outcomes (perhaps unconditionally, conditional on no major government action, and/or conditional on leading labs being unable or unwilling to pay a large alignment tax).
Whistleblower protections?
- Not sure what the ask is. Right to Warn is a starting point. In particular, an anonymous internal reporting pipeline for failures to comply with safety policies is clearly good (but likely inadequate).
Publicly explain the processes and governance structures that determine deployment decisions
- (And ideally make those processes and structures good)

sharmake-farah on Three Notions of "Power"

Maybe, but I'm not sure it's even necessary to invoke LDT/FDT/UDT. One could instead argue that coordinating through solely causal methods is cheap enough for AIs that coordination, and as a side effect interfaces, become much less of a bottleneck than they are today.

In essence, I think the diff between John's models and tailcalled's models is plausibly about how easy coordination, in a general sense, can ever be for AIs, and whether AIs can coordinate much better than humans can today. John thinks coordination is a taut constraint for humans but not for AIs, while tailcalled thinks coordination is hard for both AIs and humans due to fundamental limits.

tailcalled on Three Notions of "Power"

LDT/FDT is a central example of rationalist-Gnostic heresy.

d0themath on Three Notions of "Power"

I am going to guess that the diff between your and John's models here is that John thinks LDT/FDT solves this, and you don't.

sharmake-farah on How might we solve the alignment problem? (Part 1: Intro, summary, ontology)

The real answer to the Sorites paradox for intelligence is that memory is an issue: if a subject requires you to learn more information than can be stored in memory, you can't learn it no matter how much time you invest in the subject, and differing intelligences also usually differ in memory capacity.

However, assuming memory isn't an issue, then the answer to the question is that intelligence really is a continuum, and the better metric is, say, rate of learning per time step, since any difference in intelligence is solely a difference in time, so no hard barriers exist.

I think that memory will not be the bottleneck in an intelligence explosion for understanding AI, and instead time will be the bottleneck.

sharmake-farah on Noosphere89's Shortform

This also BTW explains why we cannot rely on economic arguments on AI to make the future go well.

avturchin on avturchin's Shortform

Yes. It is an important point.

sharmake-farah on Noosphere89's Shortform

Here's an underrated frame for why AI alignment is likely necessary for the future to go very well under human values, even though in our current society we don't need human-to-human alignment to make modern capitalism good and can rely on selfishness instead.

The reason is that there's a real likelihood that human labor, and more generally human existence, will not be economically valuable or will even have negative economic value, say where adding a human to an AI company makes that company worse in the near future.

The reason this matters is that once labor is much easier to scale than capital, as is likely in an AI future, it's now economically viable or even beneficial to break a lot of the rules that help humans survive, contra Matthew Barnett's view, and this is even more incentivized by the fact that an unaligned AI released into society would likely not be punishable/incentivizable by mere humans, solely due to controlling robotic armies and robotic workforces that allow it to dispense with societal constraints humans have to accept.

dr_s talks about an equilibrium that is totally valid under AI automation economics but very bad for humans. Avoiding this sort of equilibrium can't be done through economic forces, because the companies involved are too powerful for any real incentives to work on them, since they can either neutralize the attempted boycott/shopping around or turn it to their own benefit. Avoiding this outcome therefore requires alignment to your values, and can't work with selfishness:

Consider a scenario in which AGI and human-equivalent robotics are developed and end up owned (via e.g. controlling exclusively the infrastructure that runs it, and being closed source) by a group of, say, 10,000 people overall who have some share in this automation capital. If these people have exclusive access to it, a perfectly functional equilibrium is "they trade among peers goods produced by their automated workers and leave everyone else to fend for themselves".

This framing of the alignment problem, of how to get an AI that values humans such that this outcome is prevented, also has an important implication:

It's not enough to solve the technical problem of alignment absent modeling the social situation, because of suffering risk issues plus catastrophic risk issues. It also means the level of alignment of AI needs to be closer to that of fictional benevolent angels than to that of humans in relationship to other humans, so it motivates a more ambitious version of the alignment objectives than making AIs merely not break the law or steal from humans.

I'm actually reasonably hopeful the more ambitious versions of alignment are possible, and think there's a realistic chance we can actually do them.

But we actually need to do the work, and AI that automates everything might come in your lifetime, so we should prepare the foundations soon.