LessWrong 2.0 Reader

AI Control: Improving Safety Despite Intentional Subversion
Buck · 2023-12-13T15:51:35.982Z · comments (7)
UDT shows that decision theory is more puzzling than ever
Wei Dai (Wei_Dai) · 2023-09-13T12:26:09.739Z · comments (51)
Thoughts on sharing information about language model capabilities
paulfchristiano · 2023-07-31T16:04:21.396Z · comments (34)
If interpretability research goes well, it may get dangerous
So8res · 2023-04-03T21:48:18.752Z · comments (10)
[link] Inference-Time Intervention: Eliciting Truthful Answers from a Language Model
likenneth · 2023-06-11T05:38:35.284Z · comments (4)
Evolution provides no evidence for the sharp left turn
Quintin Pope (quintin-pope) · 2023-04-11T18:43:07.776Z · comments (62)
[link] Sam Altman fired from OpenAI
LawrenceC (LawChan) · 2023-11-17T20:42:30.759Z · comments (75)
Refusal in LLMs is mediated by a single direction
Andy Arditi (andy-arditi) · 2024-04-27T11:13:06.235Z · comments (79)
Giant (In)scrutable Matrices: (Maybe) the Best of All Possible Worlds
1a3orn · 2023-04-04T17:39:39.720Z · comments (35)
Propaganda or Science: A Look at Open Source AI and Bioterrorism Risk
1a3orn · 2023-11-02T18:20:29.569Z · comments (79)
Consciousness as a conflationary alliance term for intrinsically valued internal experiences
Andrew_Critch · 2023-07-10T08:09:48.881Z · comments (46)
Thoughts on “AI is easy to control” by Pope & Belrose
Steven Byrnes (steve2152) · 2023-12-01T17:30:52.720Z · comments (55)
My Interview With Cade Metz on His Reporting About Slate Star Codex
Zack_M_Davis · 2024-03-26T17:18:05.114Z · comments (186)
Twiblings, four-parent babies and other reproductive technology
GeneSmith · 2023-05-20T17:11:23.726Z · comments (32)
Grant applications and grand narratives
Elizabeth (pktechgirl) · 2023-07-02T00:16:25.129Z · comments (20)
The basic reasons I expect AGI ruin
Rob Bensinger (RobbBB) · 2023-04-18T03:37:01.496Z · comments (73)
Updates and Reflections on Optimal Exercise after Nearly a Decade
romeostevensit · 2023-06-08T23:02:14.761Z · comments (55)
Labs should be explicit about why they are building AGI
peterbarnett · 2023-10-17T21:09:20.711Z · comments (16)
Transcript and Brief Response to Twitter Conversation between Yann LeCunn and Eliezer Yudkowsky
Zvi · 2023-04-26T13:10:01.233Z · comments (50)
Announcing Timaeus
Jesse Hoogland (jhoogland) · 2023-10-22T11:59:03.938Z · comments (15)
The other side of the tidal wave
KatjaGrace · 2023-11-03T05:40:05.363Z · comments (79)
Thinking By The Clock
Screwtape · 2023-11-08T07:40:59.936Z · comments (27)
[link] Contra Ngo et al. “Every ‘Every Bay Area House Party’ Bay Area House Party”
Ricki Heicklen (bayesshammai) · 2024-02-22T23:56:02.318Z · comments (5)
[link] Daniel Kahneman has died
DanielFilan · 2024-03-27T15:59:14.517Z · comments (11)
[link] Large Language Models will be Great for Censorship
Ethan Edwards · 2023-08-21T19:03:55.323Z · comments (14)
How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions
JanB (JanBrauner) · 2023-09-28T18:53:58.896Z · comments (37)
Toward A Mathematical Framework for Computation in Superposition
Dmitry Vaintrob (dmitry-vaintrob) · 2024-01-18T21:06:57.040Z · comments (17)
What will GPT-2030 look like?
jsteinhardt · 2023-06-07T23:40:02.925Z · comments (42)
A Golden Age of Building? Excerpts and lessons from Empire State, Pentagon, Skunk Works and SpaceX
jacobjacob · 2023-09-01T04:03:41.067Z · comments (23)
This might be the last AI Safety Camp
Remmelt (remmelt-ellen) · 2024-01-24T09:33:29.438Z · comments (33)
There should be more AI safety orgs
Marius Hobbhahn (marius-hobbhahn) · 2023-09-21T14:53:52.779Z · comments (25)
The impossible problem of due process
mingyuan · 2024-01-16T05:18:33.415Z · comments (63)
The ‘ petertodd’ phenomenon
mwatkins · 2023-04-15T00:59:47.142Z · comments (50)
My tentative best guess on how EAs and Rationalists sometimes turn crazy
habryka (habryka4) · 2023-06-21T04:11:28.518Z · comments (106)
Introducing Alignment Stress-Testing at Anthropic
evhub · 2024-01-12T23:51:25.875Z · comments (23)
re: Yudkowsky on biological materials
bhauth · 2023-12-11T13:28:10.639Z · comments (30)
"Humanity vs. AGI" Will Never Look Like "Humanity vs. AGI" to Humanity
Thane Ruthenis · 2023-12-16T20:08:39.375Z · comments (34)
[question] Examples of Highly Counterfactual Discoveries?
johnswentworth · 2024-04-23T22:19:19.399Z · answers+comments (98)
[link] OpenAI API base models are not sycophantic, at any size
nostalgebraist · 2023-08-29T00:58:29.007Z · comments (19)
OMMC Announces RIP
Adam Scholl (adam_scholl) · 2024-04-01T23:20:00.433Z · comments (5)
Another medical miracle
Dentin · 2023-06-25T20:43:45.493Z · comments (45)
A report about LessWrong karma volatility from a different universe
Ben Pace (Benito) · 2023-04-01T21:48:32.503Z · comments (7)
[link] I still think it's very unlikely we're observing alien aircraft
dynomight · 2023-06-15T13:01:27.734Z · comments (68)
LLMs Sometimes Generate Purely Negatively-Reinforced Text
Fabien Roger (Fabien) · 2023-06-16T16:31:32.848Z · comments (11)
Feedbackloop-first Rationality
Raemon · 2023-08-07T17:58:56.349Z · comments (65)
[link] Toward a Broader Conception of Adverse Selection
Ricki Heicklen (bayesshammai) · 2024-03-14T22:40:57.920Z · comments (61)
AI as a science, and three obstacles to alignment strategies
So8res · 2023-10-25T21:00:16.003Z · comments (79)
Architects of Our Own Demise: We Should Stop Developing AI
Roko · 2023-10-26T00:36:05.126Z · comments (74)
On Not Pulling The Ladder Up Behind You
Screwtape · 2024-04-26T21:58:29.455Z · comments (17)
[link] FHI (Future of Humanity Institute) has shut down (2005–2024)
gwern · 2024-04-17T13:54:16.791Z · comments (22)