LessWrong 2.0 Reader

[question] When can I be numerate?
FinalFormal2 · 2024-09-12T04:05:27.710Z · answers+comments (3)

The case for more Alignment Target Analysis (ATA)
Chi Nguyen · 2024-09-20T01:14:41.411Z · comments (13)

Fun With The Tabula Muris (Senis)
sarahconstantin · 2024-09-20T18:20:01.901Z · comments (0)

The new ruling philosophy regarding AI
Mitchell_Porter · 2024-11-11T13:28:24.476Z · comments (0)

[question] When engaging with a large amount of resources during a literature review, how do you prevent yourself from becoming overwhelmed?
corruptedCatapillar · 2024-11-01T07:29:49.262Z · answers+comments (2)

[link] A Theory of Equilibrium in the Offense-Defense Balance
Maxwell Tabarrok (maxwell-tabarrok) · 2024-11-15T13:51:33.376Z · comments (3)

[link] Death notes - 7 thoughts on death
Nathan Young · 2024-10-28T15:01:13.532Z · comments (1)

Improving Model-Written Evals for AI Safety Benchmarking
Sunishchal Dev (sunishchal-dev) · 2024-10-15T18:25:08.179Z · comments (0)

A Triple Decker for Elfland
jefftk (jkaufman) · 2024-10-11T01:50:02.332Z · comments (0)

Abstractions are not Natural
Alfred Harwood · 2024-11-04T11:10:09.023Z · comments (21)

AXRP Episode 36 - Adam Shai and Paul Riechers on Computational Mechanics
DanielFilan · 2024-09-29T05:50:02.531Z · comments (0)

[link] UK AISI: Early lessons from evaluating frontier AI systems
Zach Stein-Perlman · 2024-10-25T19:00:21.689Z · comments (0)

[link] Conventional footnotes considered harmful
dkl9 · 2024-10-01T14:54:01.732Z · comments (16)

Thoughts after the Wolfram and Yudkowsky discussion
Tahp · 2024-11-14T01:43:12.920Z · comments (13)

[link] SB 1047 gets vetoed
ryan_b · 2024-09-30T15:49:38.609Z · comments (1)

How to put California and Texas on the campaign trail!
Yair Halberstadt (yair-halberstadt) · 2024-11-06T06:08:25.673Z · comments (4)

[link] Linkpost: "Imagining and building wise machines: The centrality of AI metacognition" by Johnson, Karimi, Bengio, et al.
Chris_Leong · 2024-11-11T16:13:26.504Z · comments (6)

A suite of Vision Sparse Autoencoders
Louka Ewington-Pitsos (louka-ewington-pitsos) · 2024-10-27T04:05:20.377Z · comments (0)

You're Playing a Rough Game
jefftk (jkaufman) · 2024-10-17T19:20:06.251Z · comments (2)

Testing for consequence-blindness in LLMs using the HI-ADS unit test.
David Scott Krueger (formerly: capybaralet) (capybaralet) · 2023-11-24T23:35:29.560Z · comments (2)

Proving the Geometric Utilitarian Theorem
StrivingForLegibility · 2024-08-07T01:39:10.920Z · comments (0)

Clipboard Filtering
jefftk (jkaufman) · 2024-04-14T20:50:02.256Z · comments (1)

D&D.Sci Hypersphere Analysis Part 4: Fine-tuning and Wrapup
aphyer · 2024-01-18T03:06:39.344Z · comments (5)

The Wisdom of Living for 200 Years
Martin Sustrik (sustrik) · 2024-06-28T04:44:10.609Z · comments (3)

Decent plan prize announcement (1 paragraph, $1k)
lemonhope (lcmgcd) · 2024-01-12T06:27:44.495Z · comments (19)

[question] How to Model the Future of Open-Source LLMs?
Joel Burget (joel-burget) · 2024-04-19T14:28:00.175Z · answers+comments (9)

Can Generalized Adversarial Testing Enable More Rigorous LLM Safety Evals?
scasper · 2024-07-30T14:57:06.807Z · comments (0)

Housing Roundup #9: Restricting Supply
Zvi · 2024-07-17T12:50:05.321Z · comments (8)

Beta Tester Request: Rallypoint Bounties
lukemarks (marc/er) · 2024-05-25T09:11:11.446Z · comments (4)

[link] Announcing Open Philanthropy's AI governance and policy RFP
Julian Hazell (julian-hazell) · 2024-07-17T02:02:39.933Z · comments (0)

[link] Executive Dysfunction 101
DaystarEld · 2024-05-23T12:43:13.785Z · comments (1)

[link] In defence of Helen Toner, Adam D'Angelo, and Tasha McCauley
mrtreasure · 2023-12-06T02:02:32.004Z · comments (3)

A Review of In-Context Learning Hypotheses for Automated AI Alignment Research
alamerton · 2024-04-18T18:29:33.892Z · comments (4)

[link] what becoming more secure did for me
Chipmonk · 2024-08-22T17:44:48.525Z · comments (5)

[link] Structured Transparency: a framework for addressing use/mis-use trade-offs when sharing information
habryka (habryka4) · 2024-04-11T18:35:44.824Z · comments (0)

$250K in Prizes: SafeBench Competition Announcement
ozhang (oliver-zhang) · 2024-04-03T22:07:41.171Z · comments (0)

Best-of-n with misaligned reward models for Math reasoning
Fabien Roger (Fabien) · 2024-06-21T22:53:21.243Z · comments (0)

Foresight Institute: 2023 Progress & 2024 Plans for funding beneficial technology development
Allison Duettmann (allison-duettmann) · 2023-11-22T22:09:16.956Z · comments (1)

Population ethics and the value of variety
cousin_it · 2024-06-23T10:42:21.402Z · comments (11)

Evolution did a surprising good job at aligning humans...to social status
Eli Tyre (elityre) · 2024-03-10T19:34:52.544Z · comments (37)

[link] The Living Planet Index: A Case Study in Statistical Pitfalls
Jan_Kulveit · 2024-06-24T10:05:55.101Z · comments (0)

[link] Altruism and Vitalism Aren't Fellow Travelers
Arjun Panickssery (arjun-panickssery) · 2024-08-09T02:01:11.361Z · comments (2)

Paper Summary: The Koha Code - A Biological Theory of Memory
jakej (jake-jenks) · 2023-12-30T22:37:13.865Z · comments (2)

Distinctions when Discussing Utility Functions
ozziegooen · 2024-03-09T20:14:03.592Z · comments (7)

Building Trust in Strategic Settings
StrivingForLegibility · 2023-12-28T22:12:24.024Z · comments (0)

[link] Scenario planning for AI x-risk
Corin Katzke (corin-katzke) · 2024-02-10T00:14:11.934Z · comments (12)

[link] Secret US natsec project with intel revealed
Nathan Helm-Burger (nathan-helm-burger) · 2024-05-25T04:22:11.624Z · comments (0)

My Alignment "Plan": Avoid Strong Optimisation and Align Economy
VojtaKovarik · 2024-01-31T17:03:34.778Z · comments (9)

I didn't think I'd take the time to build this calibration training game, but with websim it took roughly 30 seconds, so here it is!
mako yass (MakoYass) · 2024-08-02T22:35:21.136Z · comments (2)

An evaluation of Helen Toner’s interview on the TED AI Show
PeterH · 2024-06-06T17:39:40.800Z · comments (2)


Recent comments

viliam on Alexander Gietelink Oldenziel's Shortform

Imagine that a magically powerful AI decides to set a new political system for humans and create a "Constitution of Earth" that will be perfectly enforced by local smaller AIs, while the greatest one travels away to explore other galaxies.

The AI decides that the most fair way to create the constitution is randomly. It will choose a length, for example 10000 words of English text. Then it will generate all possible combinations of 10000 English words. (It is magical, so let's not worry about how much compute that would actually take.) Out of the generated combinations, it will remove the ones that don't make any sense (an overwhelming majority of them) and the ones that could not be meaningfully interpreted as "a constitution" of a country (this is kinda subjective, but the AI does not mind reading them all, evaluating each of them patiently using the same criteria, and accepting only the ones that pass a certain threshold). Out of the remaining ones, the AI will choose the "Constitution of Earth" randomly, using a fair quantum randomness generator.

Shortly before the result is announced, how optimistic would you feel about your future life, as a citizen of Earth?

dakara on If we solve alignment, do we die anyway?

I also have a more meta-level layman concern (sorry if it sounds unusual). There seem to be a large number of jailbreaking strategies that all succeed against current models. To mitigate them, I can conceptually see 2 paths: 1) trying to come up with a different niche technical solution to each and every one of them individually or 2) trying to come up with a fundamentally new framework that happens to avoid all of them collectively.

Strategy 1 seems logistically impossible, as developers at leading labs (which are most likely to produce AGI) would have to be aware of all of them (and they are often reported in relatively unknown papers). Furthermore, even if they somehow managed to monitor all reported jailbreaks, they would have to come up with so many different solutions that success seems very unlikely.

Strategy 2 seems conceptually correct, but there seems to be no sign of it, as even newer models are getting jailbroken.

What do you think?

viliam on Neutrality

Saying the (hopefully) obvious, just to avoid potential misunderstanding: There is absolutely nothing wrong with writing something for a smaller group of people ("people working in this space"), but naturally such articles get less karma, because the number of people interested in the topic is smaller.

Karma is not a precise tool to measure the quality of content. If there are more than a handful of votes, the direction (positive or negative) usually means something, but the magnitude is more about how many people felt that the article was written for them (therefore the highest karma goes to well-written topics aimed at the general audience).

My suggestion is to mostly ignore these things. Positive karma is good, but bigger karma is not necessarily better.

mateusz-baginski on Why not tool AI?

https://gwern.net/doc/existential-risk/2011-05-10-givewell-holdenkarnofskyjaantallinn.doc

xpym on Making a conservative case for alignment

A more likely explanation, it seems to me, is that a large part of early LW/sequences was explicit militant atheism, with religion being the primary example of the low "sanity waterline", and this hasn't been explicitly disclaimed since, at best de-emphasized. So this space did its best to repel conservatives much earlier than pronouns and other trans issues entered the picture.

luigipagani on Will Orion/Gemini 2/Llama-4 outperform o1

I agree it’s not very clear. The focus of my question is on reasoning benchmarks (specifically in areas like mathematics, coding, and logical reasoning), disregarding aspects like agency. When it comes to the "next frontier" of models, I’d only consider entries like Orion, Claude 3.5 Opus (or Claude 4 Opus, depending on its eventual naming), Llama 4 (big), and Gemini 2. A good way to identify them would be by the price per million tokens: for example, the new Sonnet is much less expensive than o1 and Opus, so it doesn't count as a next-frontier model. Of course, the increasingly confusing naming conventions these companies adopt make it harder to define and categorize these "frontier models" clearly. I am editing the answer to make it clearer. Thanks a lot for the feedback!

viliam on Making a conservative case for alignment

I apologize. I spent some time digging for ancient evidence... and then decided against publishing it.

Short version is that someone said something that was kinda inappropriate back then, and would probably get an instant ban these days, with most people applauding.

richard_kennaway on Reducing x-risk might be actively harmful

What if keeping humanity alive and flourishing actually risks spreading suffering further and faster—through advanced technologies, colonization of space, or systems we can’t yet foresee? And what if our very efforts to safeguard the future have unintended consequences that exacerbate suffering in ways we can't predict?

It's up to those future people to solve their own problems. It is enough that we make a future for them to use as they please. Parents must let their children go, or what was the point of making them?

dakara on If we solve alignment, do we die anyway?

Looking more generally, there seem to be a ton of papers that develop sophisticated jailbreak attacks (that succeed against current models). Probably more than I can even list here. Are there any fundamentally new defense techniques that can protect LLMs against these attacks (since the existing ones seem to be insufficient)?

ejt on 4. Existing Writing on Corrigibility

Yes, that's a good summary. The one thing I'd say is that you can characterize preferences in terms of choices and get useful predictions about what the agent will do in other circumstances if you say something about the objects of preference. See my reply to Lucius above.