LessWrong 2.0 Reader

View: New · Old · Top

Restrict date range: Today · This week · This month · Last three months · This year · All time

← previous page (newer posts) · next page (older posts) →

The Third Fundamental Question
Screwtape · 2024-11-15T04:01:33.770Z · comments (7)

AI Craftsmanship
abramdemski · 2024-11-11T22:17:01.112Z · comments (7)

[Intuitive self-models] 8. Rooting Out Free Will Intuitions
Steven Byrnes (steve2152) · 2024-11-04T18:16:26.736Z · comments (16)

Perils of Generalizing from One's Social Group
localdeity · 2024-11-24T15:31:18.332Z · comments (1)

SAEs are highly dataset dependent: a case study on the refusal direction
Connor Kissane (ckkissane) · 2024-11-07T05:22:18.807Z · comments (4)

[link] Drexler's Nanotech Software
PeterMcCluskey · 2024-12-02T04:55:20.432Z · comments (6)

Training AI agents to solve hard problems could lead to Scheming
Marius Hobbhahn (marius-hobbhahn) · 2024-11-19T00:10:55.522Z · comments (12)

Why imperfect adversarial robustness doesn't doom AI control
Buck · 2024-11-18T16:05:06.763Z · comments (27)

A case for donating to AI risk reduction (including if you work in AI)
tlevin (trevor) · 2024-12-02T19:05:06.658Z · comments (2)

[question] Could orcas be (trained to be) smarter than humans? 
Towards_Keeperhood (Simon Skade) · 2024-11-04T23:29:26.677Z · answers+comments (20)

U.S.-China Economic and Security Review Commission pushes Manhattan Project-style AI initiative
Phib · 2024-11-19T18:42:43.296Z · comments (7)

[link] The Evals Gap
Marius Hobbhahn (marius-hobbhahn) · 2024-11-11T16:42:46.287Z · comments (7)

Neuroscience of human social instincts: a sketch
Steven Byrnes (steve2152) · 2024-11-22T16:16:52.552Z · comments (0)

Win/continue/lose scenarios and execute/replace/audit protocols
Buck · 2024-11-15T15:47:24.868Z · comments (2)

[link] a space habitat design
bhauth · 2024-11-25T17:28:48.481Z · comments (13)

A Qualitative Case for LTFF: Filling Critical Ecosystem Gaps
Linch · 2024-12-03T21:57:23.597Z · comments (2)

A Conflicted Linkspost
Screwtape · 2024-11-21T00:37:54.035Z · comments (0)

Metastatic Cancer Treatment Since 2010: The Success Stories
sarahconstantin · 2024-11-04T22:50:09.386Z · comments (2)

[link] Active Recall and Spaced Repetition are Different Things
Saul Munn (saul-munn) · 2024-11-08T20:14:56.092Z · comments (2)

An alternative approach to superbabies
Towards_Keeperhood (Simon Skade) · 2024-11-05T22:56:15.740Z · comments (19)

Looking back on the Future of Humanity Institute - Asterisk
jakeeaton · 2024-11-19T00:44:40.928Z · comments (0)

Dave Kasten's AGI-by-2027 vignette
davekasten · 2024-11-26T23:20:47.212Z · comments (8)

Secular Solstice Round Up 2024
dspeyer · 2024-11-21T10:49:36.682Z · comments (13)

On Targeted Manipulation and Deception when Optimizing LLMs for User Feedback
Marcus Williams · 2024-11-07T15:39:06.854Z · comments (6)

AI #91: Deep Thinking
Zvi · 2024-11-21T14:30:06.930Z · comments (9)

Which evals resources would be good?
Marius Hobbhahn (marius-hobbhahn) · 2024-11-16T14:24:48.012Z · comments (4)

The Shallow Bench
Karl Faulks (karl-faulks) · 2024-11-05T05:07:27.357Z · comments (5)

[link] What Ketamine Therapy Is Like
Sable · 2024-11-11T11:09:08.602Z · comments (8)

Estimates of GPU or equivalent resources of large AI players for 2024/5
CharlesD · 2024-11-28T23:01:58.522Z · comments (7)

[link] Epistemic status: poetry (and other poems)
Richard_Ngo (ricraz) · 2024-11-21T18:13:17.194Z · comments (5)

[link] Analyzing how SAE features evolve across a forward pass
bensenberner · 2024-11-07T22:07:02.827Z · comments (0)

[link] Dangerous capability tests should be harder
LucaRighetti (Error404Dinosaur) · 2024-11-21T17:20:50.610Z · comments (3)

[link] The Choice Transition
owencb · 2024-11-18T12:30:56.198Z · comments (4)

Cognitive Biases Contributing to AI X-risk — a deleted excerpt from my 2018 ARCHES draft
Andrew_Critch · 2024-12-03T09:29:49.745Z · comments (2)

[link] Literacy Rates Haven't Fallen By 20% Since the Department of Education Was Created
Maxwell Tabarrok (maxwell-tabarrok) · 2024-11-22T20:53:59.007Z · comments (0)

Reading RFK Jr so that you don’t have to
braces · 2024-11-22T00:59:19.583Z · comments (0)

Monthly Roundup #24: November 2024
Zvi · 2024-11-18T13:20:06.086Z · comments (14)

AI #89: Trump Card
Zvi · 2024-11-07T16:30:05.684Z · comments (12)

[link] Intrinsic Power-Seeking: AI Might Seek Power for Power’s Sake
TurnTrout · 2024-11-19T18:36:20.721Z · comments (5)

How to use bright light to improve your life.
Nat Martin (nat-martin) · 2024-11-18T19:32:10.667Z · comments (10)

[link] College technical AI safety hackathon retrospective - Georgia Tech
yix (Yixiong Hao) · 2024-11-15T00:22:53.159Z · comments (2)

Signaling with Small Orange Diamonds
jefftk (jkaufman) · 2024-11-07T20:20:08.026Z · comments (1)

[question] Are You More Real If You're Really Forgetful?
Thane Ruthenis · 2024-11-24T19:30:55.233Z · answers+comments (24)

[link] FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI
Tamay · 2024-11-14T06:13:22.042Z · comments (0)

ARENA 4.0 Impact Report
Chloe Li (chloe-li-1) · 2024-11-27T20:51:54.844Z · comments (2)

AXRP Episode 39 - Evan Hubinger on Model Organisms of Misalignment
DanielFilan · 2024-12-01T06:00:06.345Z · comments (0)

Book a Time to Chat about Interp Research
Logan Riggs (elriggs) · 2024-12-03T17:27:46.808Z · comments (2)

Doing Research Part-Time is Great
casualphysicsenjoyer (hatta_afiq) · 2024-11-22T19:01:15.542Z · comments (7)

Intent alignment as a stepping-stone to value alignment
Seth Herd · 2024-11-05T20:43:24.950Z · comments (4)

Cross-context abduction: LLMs make inferences about procedural training data leveraging declarative facts in earlier training data
Sohaib Imran (sohaib-imran) · 2024-11-16T23:22:21.857Z · comments (11)

← previous page (newer posts) · next page (older posts) →

Archive

Recent comments

geoffrey-wood on Accidentally Load Bearing

I think this adds depth to the Chesterton's Fence concept.

danielfilan on The 2023 LessWrong Review: The Basic Ask

Wait I'm a moron and the thing I checked was actually whether it was an exponential function, sorry.

habryka4 on The 2023 LessWrong Review: The Basic Ask

How are the triangle numbers not quadratic?

Sure looks quadratic to me.

sunwillrise on dirk's Shortform

The 3 most important paragraphs, extracted to save readers the trouble of clicking on a link:

The Anduril and OpenAI strategic partnership will focus on improving the nation’s counter-unmanned aircraft systems (CUAS) and their ability to detect, assess and respond to potentially lethal aerial threats in real-time.
[...]
The accelerating race between the United States and China to lead the world in advancing AI makes this a pivotal moment. If the United States cedes ground, we risk losing the technological edge that has underpinned our national security for decades.
[...]
These models, which will be trained on Anduril’s industry-leading library of data on CUAS threats and operations, will help protect U.S. and allied military personnel and ensure mission success.

kabir-kumar on Kabir Kumar's Shortform

ok, options.
- Review of 108 ai alignment plans
- write-up of Beyond Distribution - planned benchmark for alignment evals beyond a models distribution, send to the quant who just joined the team who wants to make it
- get familiar with the TPUs I just got access to
- run hhh and it's variants, testing the idea behind Beyond Distribution, maybe make a guide on itr
- continue improving site design

- fill out the form i said i was going to fill out and send today
- make progress on cross coders - would prob need to get familiar with those tpus
- writeup of ai-plans, the goal, the team, what we're doing, what we've done, etc
- writeup of the karma/voting system
- the video on how to do backprop by hand
- tutorial on how to train an sae

think Beyond Distribution writeup. he's waiting and i feel bad.

matthew-barnett on Evaluating the historical value misspecification argument

While I did agree that Linch's comment reasonably accurately summarized my post, I don't think a large part of my post was about the idea that we should now think that human values are much simpler than Yudkowsky portrayed them to be. Instead, I believe this section from Linch's comment does a better job at conveying what I intended to be the main point,

Suppose in 2000 you were told that a100-line Python program (that doesn't abuse any of the particular complexities embedded elsewhere in Python) can provide a perfect specification of human values. Then you should rationally conclude that human values aren't actually all that complex (more complex than the clean mathematical statement, but simpler than almost everything else).
In such a world, if inner alignment is solved, you can "just" train a superintelligent AI to "optimize for the results of that Python program" and you'd get a superintelligent AI with human values.
Notably, alignment isn't solved by itself. You still need to get the superintelligent AI to actually optimize for that Python program and not some random other thing that happens to have low predictive loss in training on that program.
Well, in 2023 we have that Python program, with a few relaxations:
The answer isn't embedded in 100 lines of Python, but in a subset of the weights of GPT-4
Notably the human value function (as expressed by GPT-4) is necessarily significantly simpler than the weights of GPT-4, as GPT-4 knows so much more than just human values.
What we have now isn't a perfect specification of human values, but instead roughly the level of understanding of human values that a 85th percentile human can come up with.

The primary point I intended to emphasize is not that human values are fundamentally simple, but rather that we now have something else important: an explicit, and cheaply computable representation of human values that can be directly utilized in AI development. This is a major step forward because it allows us to incorporate these values into programs in a way that provides clear and accurate feedback during processes like RLHF. This explicitness and legibility are critical for designing aligned AI systems, as they enable developers to work with a tangible and faithful specification of human values rather than relying on poor proxies that clearly do not track the full breadth and depth of what humans care about.

The fact that the underlying values may be relatively simple is less important than the fact that we can now operationalize them, in a way that reflects human judgement fairly well. Having a specification that is clear, structured, and usable means we are better equipped to train AI systems to share those values. This representation serves as a foundation for ensuring that the AI optimizes for what we actually care about, rather than inadvertently optimizing for proxies or unrelated objectives that merely correlate with training signals. In essence, the true significance lies in having a practical, actionable specification of human values that can actively guide the creation of future AI, not just in observing that these values may be less complex than previously assumed.

abandon on dirk's Shortform

OpenAI is partnering with Anduril to develop models for aerial defense: https://www.anduril.com/article/anduril-partners-with-openai-to-advance-u-s-artificial-intelligence-leadership-and-protect-u-s/

deluks917 on How can I convince my cryptobro friend that S&P500 is efficient?

The market isnt efficient. Which isn't to say it is easy to beat. Your friends strategies don't sound promising. It also seems strange to me he is obsessed with crypto and thinks it will do well but isn't a crypto investor. Sounds pretty inconsistent with his beliefs.

It's worth remembering many versions of ',,the market is efficient' are almost or totally unfalsifiable.

simon on Do simulacra dream of digital sheep?

You can also disambiguate between

a) computation that actually interacts in a comprehensible way with the real world and

b) computation that has the same internal structure at least momentarily but doesn't interact meaningfully with the real world.

I expect that (a) can usually be uniquely pinned down to a specific computation (probably in both senses (1) and (2)), while (b) can't.

But I also think it's possible that the interactions, while important for establishing the disambiguated computation that we interact with, are not actually crucial to internal experience, so that the multiple possible computations of type (b) may also be associated with internal experiences - similar to Boltzmann brains.

(I think I got this idea from "Good and Real" by Gary L. Drescher. Edit: yes, see sections "2.3 The Problematic Arbitrariness of Representation" and "7.2.3 Consciousness and Subjunctive Reciprocity")

thecookielab on How can I convince my cryptobro friend that S&P500 is efficient?

“Market efficiency” is a useful model but one shouldn’t confuse it with reality.