LessWrong 2.0 Reader


Meta-level adversarial evaluation of oversight techniques might allow robust measurement of their adequacy
Buck · 2023-07-26T17:02:56.456Z · comments (18)
[link] Neuronpedia
Johnny Lin (hijohnnylin) · 2023-07-26T16:29:28.884Z · comments (51)
[link] Frontier Model Forum
Zach Stein-Perlman · 2023-07-26T14:30:02.018Z · comments (0)
[link] Podcasts: Future of Life Institute, Breakthrough Science Summit panel
jasoncrawford · 2023-07-26T14:28:04.462Z · comments (0)
Llama We Doing This Again?
Zvi · 2023-07-26T13:00:06.703Z · comments (3)
[link] Frontier Model Security
Vaniver · 2023-07-26T04:48:02.215Z · comments (1)
[link] The First Room-Temperature Ambient-Pressure Superconductor
Annapurna (jorge-velez) · 2023-07-26T02:27:51.760Z · comments (28)
Underwater Torture Chambers: The Horror Of Fish Farming
omnizoid · 2023-07-26T00:27:15.490Z · comments (49)
[link] Contra Alexander on the Bitter Lesson and IQ
Andrew Keenan Richardson (qemqemqem) · 2023-07-26T00:07:53.904Z · comments (1)
Overcoming the MWC
Mark Freed (mark-freed) · 2023-07-25T17:31:35.658Z · comments (0)
Russian parliamentarian: let's ban personal computers and the Internet
RomanS · 2023-07-25T17:30:20.871Z · comments (6)
[link] AISN #16: White House Secures Voluntary Commitments from Leading AI Labs and Lessons from Oppenheimer
Corin Katzke (corin-katzke) · 2023-07-25T16:58:44.528Z · comments (0)
"The Universe of Minds" - call for reviewers (Seeds of Science)
rogersbacon · 2023-07-25T16:53:44.775Z · comments (0)
Thoughts on Loss Landscapes and why Deep Learning works
beren · 2023-07-25T16:41:39.562Z · comments (4)
Should you work at a leading AI lab? (including in non-safety roles)
Benjamin Hilton (80000hours) · 2023-07-25T16:29:39.371Z · comments (0)
[link] Whisper's Word-Level Timestamps are Out
Varshul Gupta · 2023-07-25T14:32:28.671Z · comments (2)
[link] AIS 101: Task decomposition for scalable oversight
Charbel-Raphaël (charbel-raphael-segerie) · 2023-07-25T13:34:58.507Z · comments (0)
Anthropic Observations
Zvi · 2023-07-25T12:50:03.178Z · comments (1)
Autonomous Alignment Oversight Framework (AAOF)
Justausername · 2023-07-25T10:25:03.090Z · comments (0)
How LLMs are and are not myopic
janus · 2023-07-25T02:19:44.949Z · comments (14)
Secure Hand Holding
jefftk (jkaufman) · 2023-07-25T01:40:01.553Z · comments (43)
[link] Open problems in activation engineering
TurnTrout · 2023-07-24T19:46:08.733Z · comments (2)
Subdivisions for Useful Distillations?
Sharat Jacob Jacob (sharat-jacob-jacob) · 2023-07-24T18:55:05.801Z · comments (2)
[link] Optimizing For Approval And Disapproval
Thoth Hermes (thoth-hermes) · 2023-07-24T18:46:15.223Z · comments (0)
An Opinionated Guide to Computability and Complexity (Post #0)
Noosphere89 (sharmake-farah) · 2023-07-24T17:53:18.551Z · comments (10)
Slowing down AI progress is an underexplored alignment strategy
Norman Borlaug · 2023-07-24T16:56:25.604Z · comments (27)
Anticipation in LLMs
derek shiller (derek-shiller) · 2023-07-24T15:53:07.076Z · comments (0)
[link] The cone of freedom (or, freedom might only be instrumentally valuable)
dkl9 · 2023-07-24T15:38:54.687Z · comments (6)
A reformulation of Finite Factored Sets
Matthias G. Mayer (matthias-georg-mayer) · 2023-07-24T13:02:25.382Z · comments (1)
Brain Efficiency Cannell Prize Contest Award Ceremony
Alexander Gietelink Oldenziel (alexander-gietelink-oldenziel) · 2023-07-24T11:30:10.602Z · comments (12)
[link] [Crosspost] An AI Pause Is Humanity's Best Bet For Preventing Extinction (TIME)
otto.barten (otto-barten) · 2023-07-24T10:07:40.473Z · comments (0)
Cryonics and Regret
MvB (martin-von-berg) · 2023-07-24T09:16:01.456Z · comments (34)
Rationality !== Winning
Raemon · 2023-07-24T02:53:59.764Z · comments (49)
[question] Which rationality posts are begging for further practical development?
LoganStrohl (BrienneYudkowsky) · 2023-07-23T22:22:04.389Z · answers+comments (17)
[link] Please speak unpredictably
dkl9 · 2023-07-23T22:09:09.035Z · comments (16)
QAPR 5: grokking is maybe not *that* big a deal?
Quintin Pope (quintin-pope) · 2023-07-23T20:14:33.405Z · comments (15)
[link] My favorite AI governance research this year so far
Zach Stein-Perlman · 2023-07-23T16:30:00.558Z · comments (1)
"Justice, Cherryl."
Zack_M_Davis · 2023-07-23T16:16:40.835Z · comments (20)
Supplementary Alignment Insights Through a Highly Controlled Shutdown Incentive
Justausername · 2023-07-23T16:08:32.886Z · comments (1)
Autogynephilia discourse is so absurdly bad on all sides
tailcalled · 2023-07-23T13:12:07.982Z · comments (24)
Examples of Prompts that Make GPT-4 Output Falsehoods
scasper · 2023-07-22T20:21:39.730Z · comments (5)
Think like a consultant not a salesperson
Adam Zerner (adamzerner) · 2023-07-22T19:31:48.676Z · comments (5)
Optimization, loss set at variance in RL
Clairstan · 2023-07-22T18:25:31.773Z · comments (1)
Compute Thresholds: proposed rules to mitigate risk of a “lab leak” accident during AI training runs
davidad · 2023-07-22T18:09:03.816Z · comments (2)
Apollo Neuro Follow Up
Elizabeth (pktechgirl) · 2023-07-22T17:20:09.893Z · comments (0)
Expert trap – Ways out (Part 3 of 3)
Paweł Sysiak (pawel-sysiak) · 2023-07-22T13:06:14.617Z · comments (0)
GPTs' ability to keep a secret is weirdly prompt-dependent
Mateusz Bagiński (mateusz-baginski) · 2023-07-22T12:21:26.175Z · comments (0)
Replacing the Big Air Purifier
jefftk (jkaufman) · 2023-07-22T12:10:01.050Z · comments (0)
[question] I'm consistently overwhelmed by basic obligations. Are there any paradigm shifts or other rationality-based tips that would be helpful?
Benjamin Hendricks (benjamin-hendricks) · 2023-07-21T21:10:21.543Z · answers+comments (37)
Fundamentally Fuzzy Concepts Can't Have Crisp Definitions: Cooperation and Alignment vs Math and Physics
VojtaKovarik · 2023-07-21T21:03:21.501Z · comments (18)