LessWrong 2.0 Reader

[question] What Other Lines of Work are Safe from AI Automation?
RogerDearnaley (roger-d-1) · 2024-07-11T10:01:12.616Z · answers+comments (35)

[link] An Interactive Shapley Value Explainer
James Stephen Brown (james-brown) · 2024-09-28T05:01:21.169Z · comments (2)

DIY RLHF: A simple implementation for hands on experience
Mike Vaiana (mike-vaiana) · 2024-07-10T12:07:03.047Z · comments (0)

[link] New blog: Expedition to the Far Lands
Connor Leahy (NPCollapse) · 2024-08-17T11:07:48.537Z · comments (3)

Reading More Each Day: A Simple $35 Tool
aysajan · 2024-07-24T13:54:04.290Z · comments (2)

Investigating Sensitive Directions in GPT-2: An Improved Baseline and Comparative Analysis of SAEs
Daniel Lee (daniel-lee) · 2024-09-06T02:28:41.954Z · comments (0)

[link] Video Intro to Guaranteed Safe AI
Mike Vaiana (mike-vaiana) · 2024-07-11T17:53:47.630Z · comments (0)

[question] Me & My Clone
SimonBaars (simonbaars) · 2024-07-18T16:25:40.770Z · answers+comments (22)

[Linkpost] Play with SAEs on Llama 3
Tom McGrath · 2024-09-25T22:35:44.824Z · comments (1)

[link] If-Then Commitments for AI Risk Reduction [by Holden Karnofsky]
habryka (habryka4) · 2024-09-13T19:38:53.194Z · comments (0)

[link] Predicting Influenza Abundance in Wastewater Metagenomic Sequencing Data
jefftk (jkaufman) · 2024-09-23T17:25:58.380Z · comments (0)

[link] AI Safety at the Frontier: Paper Highlights, August '24
gasteigerjo · 2024-09-03T19:17:24.850Z · comments (0)

Cheap Whiteboards!
Johannes C. Mayer (johannes-c-mayer) · 2024-08-08T13:52:59.627Z · comments (2)

Deceptive agents can collude to hide dangerous features in SAEs
Simon Lermen (dalasnoin) · 2024-07-15T17:07:33.283Z · comments (0)

[question] What prevents SB-1047 from triggering on deep fake porn/voice cloning fraud?
ChristianKl · 2024-09-26T09:17:39.088Z · answers+comments (18)

Response to Dileep George: AGI safety warrants planning ahead
Steven Byrnes (steve2152) · 2024-07-08T15:27:07.402Z · comments (7)

The case for more Alignment Target Analysis (ATA)
Chi Nguyen · 2024-09-20T01:14:41.411Z · comments (13)

[link] ML Safety Research Advice - GabeM
Gabe M (gabe-mukobi) · 2024-07-23T01:45:42.288Z · comments (2)

Links and brief musings for June
Kaj_Sotala · 2024-07-06T10:10:03.344Z · comments (0)

The causal backbone conjecture
tailcalled · 2024-08-17T18:50:14.577Z · comments (0)

Optimizing Repeated Correlations
SatvikBeri · 2024-08-01T17:33:23.823Z · comments (1)

LessWrong email subscriptions?
Raemon · 2024-08-27T21:59:56.855Z · comments (6)

Just because an LLM said it doesn't mean it's true: an illustrative example
dirk (abandon) · 2024-08-21T21:05:59.691Z · comments (12)

[link] Can a Bayesian Oracle Prevent Harm from an Agent? (Bengio et al. 2024)
mattmacdermott · 2024-09-01T07:46:26.647Z · comments (0)

[link] Positive visions for AI
L Rudolf L (LRudL) · 2024-07-23T20:15:26.064Z · comments (4)

[link] Introduction to Super Powers (for kids!)
Shoshannah Tekofsky (DarkSym) · 2024-09-20T17:17:27.070Z · comments (0)

Proving the Geometric Utilitarian Theorem
StrivingForLegibility · 2024-08-07T01:39:10.920Z · comments (0)

[link] Beware the science fiction bias in predictions of the future
Nikita Sokolsky (nikita-sokolsky) · 2024-08-19T05:32:47.372Z · comments (20)

Housing Roundup #9: Restricting Supply
Zvi · 2024-07-17T12:50:05.321Z · comments (8)

[link] what becoming more secure did for me
Chipmonk · 2024-08-22T17:44:48.525Z · comments (5)

[link] Announcing Open Philanthropy's AI governance and policy RFP
Julian Hazell (julian-hazell) · 2024-07-17T02:02:39.933Z · comments (0)

Evaluating Sparse Autoencoders with Board Game Models
Adam Karvonen (karvonenadam) · 2024-08-02T19:50:21.525Z · comments (1)

[link] MIRI's July 2024 newsletter
Harlan · 2024-07-15T21:28:17.343Z · comments (2)

[link] Fictional parasites very different from our own
Abhishaike Mahajan (abhishaike-mahajan) · 2024-09-08T14:59:39.080Z · comments (0)

Can Generalized Adversarial Testing Enable More Rigorous LLM Safety Evals?
scasper · 2024-07-30T14:57:06.807Z · comments (0)

[link] A primer on the next generation of antibodies
Abhishaike Mahajan (abhishaike-mahajan) · 2024-09-01T22:37:59.207Z · comments (0)

A Visual Task that's Hard for GPT-4o, but Doable for Primary Schoolers
Lennart Finke (l-f) · 2024-07-26T17:51:28.202Z · comments (4)

An experiment on hidden cognition
Olli Järviniemi (jarviniemi) · 2024-07-22T03:26:05.564Z · comments (2)

[question] What's the Deal with Logical Uncertainty?
Ape in the coat · 2024-09-16T08:11:43.588Z · answers+comments (21)

Using an LLM perplexity filter to detect weight exfiltration
Adam Karvonen (karvonenadam) · 2024-07-21T18:18:05.612Z · comments (11)

[link] Truth is Universal: Robust Detection of Lies in LLMs
Lennart Buerger · 2024-07-19T14:07:25.162Z · comments (3)

[question] What percent of the sun would a Dyson Sphere cover?
Raemon · 2024-07-03T17:27:50.826Z · answers+comments (26)

Fun With The Tabula Muris (Senis)
sarahconstantin · 2024-09-20T18:20:01.901Z · comments (0)

Interpretability as Compression: Reconsidering SAE Explanations of Neural Activations with MDL-SAEs
Kola Ayonrinde (kola-ayonrinde) · 2024-08-23T18:52:31.019Z · comments (3)

[link] Altruism and Vitalism Aren't Fellow Travelers
Arjun Panickssery (arjun-panickssery) · 2024-08-09T02:01:11.361Z · comments (2)

Alignment by default: the simulation hypothesis
gb (ghb) · 2024-09-25T16:26:00.552Z · comments (17)

I didn't think I'd take the time to build this calibration training game, but with websim it took roughly 30 seconds, so here it is!
mako yass (MakoYass) · 2024-08-02T22:35:21.136Z · comments (2)

Superintelligence Can't Solve the Problem of Deciding What You'll Do
Vladimir_Nesov · 2024-09-15T21:03:28.077Z · comments (11)

[link] Robert Caro And Mechanistic Models In Biography
adamShimi · 2024-07-14T10:56:42.763Z · comments (5)

Seeking Mechanism Designer for Research into Internalizing Catastrophic Externalities
c.trout (ctrout) · 2024-09-11T15:09:48.019Z · comments (2)


Recent comments

sharmake-farah on 'Empiricism!' as Anti-Epistemology

I basically endorse argument 1, and one other update you haven't mentioned but which is important is that the values of a human turn out to be less complicated and fragile, and more generalizable than people thought (this is because human values data is likely a small part of GPT-4, and yet it can correctly answer a lot of morality questions, and I think LLMs are genuinely learning new regularities here, so they can generalize from their training data).

Implications for AI risk of course abound.

dw2 on Cryonics is free

In this recent podcast interview, Jordan Sparks, the founder and executive director of Oregon Brain Preservation (OBP), gives more information about the low-cost services OBP provides: https://londonfuturists.buzzsprout.com/2028982/episodes/15517037-the-low-cost-future-of-preserving-brains-with-jordan-sparks

tailcalled on tailcalled's Shortform

Thesis: The motion of the planets is the strongest governing factor for life on Earth.

Reasoning: Time-series data often shows strong changes with the day and night cycle, and sometimes also with the seasons. The daily cycle and the seasonal cycle are governed by the relationship between the Earth and the sun. The Earth is a planet, and so its movement is part of the motion of the planets.

ryan_greenblatt on COT Scaling implies slower takeoff speeds

I don't think "continuous" is self-evident or consistently used to refer to "a longer gap from human-expert level AI to very superhuman AI". For instance, in the very essay you link, Tom argues that "continuous" (and fairly predictable) doesn't imply that this gap is long!

ryan_greenblatt on 'Empiricism!' as Anti-Epistemology

What are the close-by arguments that are actually reasonable? Here is a list of close-by arguments (not necessarily endorsed by me!):

On empirical updates from current systems: If current AI systems are broadly pretty easy to steer and there is good generalization of this steering, that should serve as some evidence that future more powerful AI systems will also be relatively easier to steer. This will help prevent concerns like scheming from arising in the first place or make these issues easier to remove.
- This argument holds to some extent regardless of whether current AIs are smart enough to think through and successfully execute scheming strategies. For instance, imagine we were in a world where steering current AIs was clearly extremely hard: AIs would quickly overfit and goodhart training processes, RLHF was finicky and had terrible sample efficiency, and AIs were much worse at sample efficiently updating on questions about human deontological constraints relative to questions about how to successfully accomplish other tasks. In such a world, I think we should justifiably be more worried about future systems.
- And in fact, people do argue about how hard it is to steer current systems and what this implies. For an example of a version of an argument like this, see here [LW(p) · GW(p)], though note that I disagree with various things.
- It's pretty unclear what predictions Eliezer made about the steerability of future AI systems and he should lose some credit for not making clear predictions. Further, my sense is his implied predictions don't look great. (Paul's predictions as of about 2016 seem pretty good from my understanding, though not that clearly laid out, and are consistent with his threat models.)
On unfalsifiability: It should be possible to empirically produce evidence for or against scheming prior to it being too late. The fact that MIRI-style doom views often don't discuss predictions about experimental evidence and also don't make reasonably convincing arguments that it will be very hard to produce evidence in test beds is concerning. It's a bad sign if advocates for a view don't try hard to make it falsifiable prior to that view implying aggressive action.
- My view is that we'll probably get a moderate amount of evidence on scheming prior to catastrophe (perhaps 3x update either way), with some chance that scheming will basically be confirmed in an earlier model. And, it is in principle possible to obtain certainty either way about scheming using experiments, though this might be very tricky and a huge amount of work for various reasons.
On empiricism: There isn't good empirical evidence for scheming and instead the case for scheming depends on dubious arguments. Conceptual arguments have a bad track record, so to estimate the probability of scheming we should mostly guess based on the most basic and simple conceptual arguments and weight more complex arguments very little. If you do this, scheming looks unlikely.
- I roughly half agree with this argument, but I'd note that you also have to discount conceptual arguments against scheming in the same way and that the basic and simple conceptual arguments seem to indicate that scheming isn't that unlikely. (I'd say around 25%.)

ryan_greenblatt on 'Empiricism!' as Anti-Epistemology

I find this essay interesting as a case study in discourse and argumentation norms. Particularly as a case study of issues with discourse around AI risk.

When this essay first came out, I skimmed it and thought it was ok, but mostly uninteresting or obvious. Then, on reading the comments and looking back at the body, I thought it did some pretty bad strawmanning.

I reread the essay yesterday and now I feel quite differently. Parts (i), (ii), and (iv) which don't directly talk about AI are actually great and many of the more subtle points are pretty well executed. The connection to AI risk in part (iii) is quite bad and notably degrades the essay as a whole. I think a well-executed connection to AI risk would have been good. Part (iii) seems likely to contribute to AI risk being problematically politicized and negatively polarized (e.g. low quality dunks and animosity). Further, I think this is characteristic of problems I have with the current AI risk discourse.

In parts (i), (ii), and (iv), it is mostly clear that the Spokesperson is an exaggerated straw person who doesn't correspond to any particular side of an issue. This seems like a reasonable rhetorical move to better explain a point. However, part (iii) has big issues in how it connects the argument to AI risk. Eliezer ends up defeating a specific and weak argument against AI risk. This is an argument that actually does get made, but unfortunately, he both associates this argument with the entire view of AI risk skepticism (in the essay, the "AI-permitting faction") and he fails to explain that the debunking doesn't apply to many common arguments which sound similar but are actually reasonable. Correspondingly, the section suffers from an ethnic tension style issue: in practice, it attacks an entire view by associating it with a bad argument for that view (but of course, reversed stupidity is not intelligence [LW · GW]). This issue is made worse because the argument Eliezer attacks is also very similar to various more reasonable arguments that can't be debunked in the same way and Eliezer doesn't clearly call this out. Thus, these more reasonable arguments are attacked in association. It seems reasonable for Eliezer to connect the discussion to AI risk directly, but I think the execution was poor.

I think my concerns are notably similar to the issues people had with The Sun is big, but superintelligences will not spare Earth a little sunlight [LW(p) · GW(p)] and I've encountered similar issues in many things written by Eliezer and Nate.

How could Eliezer avoid these issues? I think when debunking or arguing against bad arguments, you should explain that you're attacking bad arguments and that there exist other commonly made arguments which are better or at least harder to debunk. It also helps to disassociate the specific bad arguments from a general cause or view as much as possible. This essay seems to associate the bad "Empiricism!" argument with AI risk skepticism nearly as much as possible. Whenever you push back against an argument which is similar or similar-sounding to other arguments, but the push back doesn't apply in the same way to those other arguments, it's useful to explicitly spell out the limitations of the push back.^[1] One possible bar to try to reach is that people who disagree strongly on the topic due to other more reasonable arguments should feel happy to endorse the push back against those specific bad arguments.

There is perhaps a bit of a motte-and-bailey with this essay where Eliezer can strongly defend debunking a specific bad argument (the motte), but there is an implication that the argument also pushes against more complex and less clearly bad arguments (the bailey). (I'm not saying that Eliezer actively engages in motte-and-bailey in the essay, just that this essay probably has this property to some extent in practice.) That said, there is also perhaps a motte-and-bailey for many of the arguments that Eliezer argues against, where the motte is a more complex and narrow argument and the bailey is "Empiricism! is why AI risk is fake".

A thoughtful reader can recognize and avoid their own cognitive biases and correctly think through the exact implications of the arguments made here. But I wish Eliezer did this work for the reader to reduce negative polarization.

When debunking bad arguments, it's useful to be clear about what you aren't covering or implying.

[1] Part (iv) helps to explain the scope of the broader point, but doesn't explain the limitations specifically in the AI case.

raemon on Chapter 7: Reciprocation

This one.

kaj_sotala on "Slow" takeoff is a terrible term for "maybe even faster takeoff, actually"

IMO, soft/smooth/gradual still convey wrong impressions. They still sound like "slow takeoff", they sound like the progress would be steady enough that normal people would have time to orient to what's happening, keep track, and exert control.

That is exactly the meaning that I'd thought was standard for "soft takeoff" (and which I assumed was synonymous with "slow takeoff"), e.g. as I wrote in 2012:

Bugaj and Goertzel (2007) consider three kinds of AGI scenarios: capped intelligence, soft takeoff, and hard takeoff. In a capped intelligence scenario, all AGIs are prevented from exceeding a predetermined level of intelligence and remain at a level roughly comparable with humans. In a soft takeoff scenario, AGIs become far more powerful than humans, but on a timescale which permits ongoing human interaction during the ascent. Time is not of the essence, and learning proceeds at a relatively human-like pace. In a hard takeoff scenario, an AGI will undergo an extraordinarily fast increase in power, taking effective control of the world within a few years or less. [Footnote: Bugaj and Goertzel defined hard takeoff to refer to a period of months or less. We have chosen a somewhat longer time period, as even a few years might easily turn out to be too little time for society to properly react.] In this scenario, there is little time for error correction or a gradual tuning of the AGI’s goals.

(B&G didn't actually invent soft/hard takeoff, but it was the most formal-looking cite we could find.)

guido-bergman on Avoiding jailbreaks by discouraging their representation in activation space

Thanks, that is a great suggestion! I will add it to the potential solutions to the problem.

norimori1992 on you should probably eat oatmeal sometimes

Most people's experience with oatmeal has been from one of:
- packets of instant oatmeal that have low-quality cheap flavoring and might have gone stale
- quick-cooking rolled oats without any flavoring

Those are my only experiences with oats, but I like both of those experiences. I love instant oatmeal, and I love quick oats boiled on the stove, though of course in the latter case I have to supply my own flavouring. Even just sugar is enough to make it great. But adding raisins takes it to another level; especially adding them while the oats are still boiling so that they rehydrate a little. Dried cranberries are also great. Sometimes I like to add some apple juice to the pot while the oats are boiling. This narrow range of possibilities is already plenty of variety for my narrow palate; and if I'm already making oatmeal, those additions are basically zero added effort. If I didn't have crippling executive dysfunction, I'd eat quick oats much more often than I actually do. (Unfortunately, the activation energy required to make anything on the stove is more than I usually reach; and I find the microwave to be a poor substitute, not to mention that it doesn't even solve the part that I find aversive, which is measuring out the oats and water. Also, raisins and dried cranberries are expensive.)