LessWrong 2.0 Reader

AI research assistants competition 2024Q3: Tie between Elicit and You.com
Elizabeth (pktechgirl) · 2024-10-12T15:10:05.417Z · comments (4)

Brief analysis of OP Technical AI Safety Funding
22tom (thomas-barnes) · 2024-10-25T19:37:41.674Z · comments (5)

Another argument against maximizer-centric alignment paradigms
Fiora from Rosebloom · 2024-09-22T07:28:27.856Z · comments (39)

What is "True Love"?
johnswentworth · 2024-08-18T16:05:47.358Z · comments (11)

SAEs are highly dataset dependent: a case study on the refusal direction
Connor Kissane (ckkissane) · 2024-11-07T05:22:18.807Z · comments (4)

[link] Pay-on-results personal growth: first success
Chipmonk · 2024-09-14T03:39:12.975Z · comments (6)

[question] Is cybercrime really costing trillions per year?
Fabien Roger (Fabien) · 2024-09-27T08:44:07.621Z · answers+comments (28)

[link] RL, but don't do anything I wouldn't do
Gunnar_Zarncke · 2024-12-07T22:54:50.714Z · comments (5)

[link] A primer on why computational predictive toxicology is hard
Abhishaike Mahajan (abhishaike-mahajan) · 2024-08-19T17:16:37.735Z · comments (2)

Book Review: On the Edge: The Future
Zvi · 2024-09-27T14:00:05.279Z · comments (1)

[Intuitive self-models] 6. Awakening / Enlightenment / PNSE
Steven Byrnes (steve2152) · 2024-10-22T13:23:08.836Z · comments (8)

[link] Dario Amodei — Machines of Loving Grace
Matrice Jacobine · 2024-10-11T21:43:31.448Z · comments (26)

[link] on bacteria, on teeth
bhauth · 2024-09-30T15:56:56.830Z · comments (9)

[link] Slightly More Than You Wanted To Know: Pregnancy Length Effects
JustisMills · 2024-10-21T01:26:02.030Z · comments (4)

[link] Anthropic leadership conversation
Zach Stein-Perlman · 2024-12-20T22:00:45.229Z · comments (16)

[link] Electrostatic Airships?
DaemonicSigil · 2024-10-27T04:32:34.852Z · comments (13)

MATS Alumni Impact Analysis
utilistrutil · 2024-09-30T02:35:57.273Z · comments (7)

Why imperfect adversarial robustness doesn't doom AI control
Buck · 2024-11-18T16:05:06.763Z · comments (26)

Cognitive Work and AI Safety: A Thermodynamic Perspective
Daniel Murfet (dmurfet) · 2024-12-08T21:42:17.023Z · comments (7)

[link] Zen and The Art of Semiconductor Manufacturing
Recurrented (rachel-farley) · 2024-12-09T17:19:35.236Z · comments (2)

Showing SAE Latents Are Not Atomic Using Meta-SAEs
Bart Bussmann (Stuckwork) · 2024-08-24T00:56:46.048Z · comments (9)

What is SB 1047 *for*?
Raemon · 2024-09-05T17:39:39.871Z · comments (8)

[link] electric turbofans
bhauth · 2024-11-02T22:50:59.807Z · comments (2)

A case for donating to AI risk reduction (including if you work in AI)
tlevin (trevor) · 2024-12-02T19:05:06.658Z · comments (2)

Training AI agents to solve hard problems could lead to Scheming
Marius Hobbhahn (marius-hobbhahn) · 2024-11-19T00:10:55.522Z · comments (12)

AI #78: Some Welcome Calm
Zvi · 2024-08-22T14:20:10.812Z · comments (15)

AI #95: o1 Joins the API
Zvi · 2024-12-19T15:10:05.196Z · comments (1)

Checking in on Scott's composition image bet with imagen 3
Dave Orr (dave-orr) · 2024-12-22T19:04:17.495Z · comments (0)

Base LLMs refuse too
Connor Kissane (ckkissane) · 2024-09-29T16:04:21.343Z · comments (20)

Toward Safety Cases For AI Scheming
Mikita Balesni (mykyta-baliesnyi) · 2024-10-31T17:20:06.019Z · comments (1)

Pollsters Should Publish Question Translations
jefftk (jkaufman) · 2024-09-08T22:10:04.932Z · comments (3)

AI Alignment via Slow Substrates: Early Empirical Results With StarCraft II
Lester Leong (lester-leong) · 2024-10-14T04:05:05.096Z · comments (9)

Interdictor Ship
lsusr · 2024-08-19T04:59:18.487Z · comments (9)

Against empathy-by-default
Steven Byrnes (steve2152) · 2024-10-16T16:38:49.926Z · comments (24)

Why our politicians aren't Median
Yair Halberstadt (yair-halberstadt) · 2024-11-03T14:03:33.779Z · comments (15)

[link] Linkpost: Memorandum on Advancing the United States’ Leadership in Artificial Intelligence
Nisan · 2024-10-25T04:37:00.828Z · comments (2)

AI #81: Alpha Proteo
Zvi · 2024-09-12T13:00:07.958Z · comments (3)

Intricacies of Feature Geometry in Large Language Models
7vik (satvik-golechha) · 2024-12-07T18:10:51.375Z · comments (0)

Rationalists are missing a core piece for agent-like structure (energy vs information overload)
tailcalled · 2024-08-17T09:57:19.370Z · comments (9)

How you can help pass important AI legislation with 10 minutes of effort
ThomasW · 2024-09-14T22:10:50.386Z · comments (2)

[Intuitive self-models] 5. Dissociative Identity (Multiple Personality) Disorder
Steven Byrnes (steve2152) · 2024-10-15T13:31:46.157Z · comments (7)

Mira Murati leaves OpenAI/ OpenAI to remove non-profit control
Sodium · 2024-09-25T21:15:17.315Z · comments (4)

The Geometry of Feelings and Nonsense in Large Language Models
7vik (satvik-golechha) · 2024-09-27T17:49:27.420Z · comments (10)

AI #86: Just Think of the Potential
Zvi · 2024-10-17T15:10:06.552Z · comments (8)

AI #87: Staying in Character
Zvi · 2024-10-29T07:10:08.212Z · comments (3)

Referendum Mechanics in a Marketplace of Ideas
Martin Sustrik (sustrik) · 2024-08-25T08:30:01.901Z · comments (2)

[link] Congressional Insider Trading
Maxwell Tabarrok (maxwell-tabarrok) · 2024-08-30T13:32:57.264Z · comments (6)

On the UBI Paper
Zvi · 2024-09-03T14:50:08.647Z · comments (6)

[link] The Alignment Trap: AI Safety as Path to Power
crispweed · 2024-10-29T15:21:26.545Z · comments (17)

[question] Could orcas be (trained to be) smarter than humans? 
Towards_Keeperhood (Simon Skade) · 2024-11-04T23:29:26.677Z · answers+comments (20)

Recent comments

gordon-seidoh-worley on Why don't we currently have AI agents?

I can't help but wonder if part of the answer is that they seem dangerous and people are selecting out of producing them.

Like, I'm not an expert, but creating AI agents seems extremely fun and appealing, and I'm intentionally not working on it at all because it seems safer not to build them. (Whether you think my contributions to trying to build them would matter or not is another question.)

mr-hire on A Three-Layer Model of LLM Psychology

This is great, matches my experience a lot

I think they often map onto three layers of training: first, the base layer trained by next-token prediction; then the RLHF/DPO etc.; finally, the rules put into the prompt.

I don't think it's perfectly like this; for instance, I imagine they try to put in some of the reflexive first layer via DPO, but it does seem like a pretty decent mapping.

avturchin on A Three-Layer Model of LLM Psychology

I observed similar effects when I experimented with my mind's model (sideload) running on an LLM. My sideload is a character and it claims, for example, that it has consciousness. But the same LLM without the sideload's prompt claims that it doesn't have consciousness.

sunwillrise on The Field of AI Alignment: A Postmortem, and What To Do About It

(Prefatory disclaimer that, admittedly as an outsider to this field, I absolutely disagree with the labeling of prosaic [LW · GW] AI work as useless streetlighting, for reasons building upon what many commenters wrote in response to the very posts you linked here as assumed background material. But in the spirit of your post, I shall ignore that moving forward.)

The "What to Do About It" [LW · GW] section dances around but doesn't explicitly name one of the core challenges of theoretical agent-foundations [LW · GW] work that aims to solve the "hard bits" [LW · GW] of the alignment challenge, namely the seeming lack of reliable feedback loops [LW · GW] that give you some indication that you are pushing towards something practically useful in the end instead of just a bunch of cool math that nonetheless resides alone in its separate magisterium. As Connor Leahy concisely put it [LW(p) · GW(p)]:

Humans are really, really bad at doing long chains of abstract reasoning without regular contact with reality, so in practice imo good philosophy has to have feedback loops with reality, otherwise you will get confused.

He was talking about philosophy in particular at that juncture, in response to Wei Dai's concerns [LW · GW] over metaphilosophical competence, but this point seems to me to generalize to a whole bunch of other areas as well. Indeed, I have talked about this before [LW(p) · GW(p)].

... and in my experience, there are people who can get traction on the core hard problems. Most notably physicists - when they grok the hard parts, they tend to immediately see footholds, rather than a blank impassable wall.

Do they get traction on "core hard problems" because of how Inherently Awesome they are as researchers, or do they do so because the types of physics problems we mostly care about currently are such that, while the generation of (worthwhile) grand mathematical theories is hard, verifying them is (comparatively) easier because we can run a bunch of experiments (or observe astronomical data etc., in the super-macro scale) to see if the answers they spit out comply with reality? I am aware of your general perspective [LW · GW] on this matter, but I just... still completely disagree, for reasons other people have pointed out [LW(p) · GW(p)] (see also Vanessa Kosoy's comment here [LW · GW]). Is this also supposed to be an implicitly assumed bit of background material?

And when we don't have those verifying experiments at hand, do we not get stuff like string theory, where the math is beautiful and exquisite (in the domains it has been extended to) but debate by "physics postdocs" over whether it's worthwhile to keep funding and pursuing it keeps raging on as a Theory of Everything keeps eluding our grasp? I'm sure people with more object-level expertise on this can correct my potential misconceptions if need be.

Idk man, some days I'm half-tempted to believe that all non-prosaic alignment work is a bunch of "streetlighting." Yeah, it doesn't result in the kind of flashy papers full of concrete examples about current models that typically get associated with the term-in-scare-quotes. But it sure seems to cover itself in a veneer of respectability by giving a (to me) entirely unjustified appearance of rigor and mathematical precision and robustness [LW(p) · GW(p)] to claims about what will happen [LW(p) · GW(p)] in the real world based on nothing more than a bunch of vibing about toy models that assume away the burdensome real-world details [LW(p) · GW(p)] serving as evidence whether the approaches are even on the right track [LW(p) · GW(p)]. A bunch of models that seem both woefully underpowered for the Wicked Problems [LW · GW] they must solve and also destined to underfit their target, for they (currently) all exist and supposedly apply independently of the particular architecture, algorithms, training data, scaffolding etc., that will result in the first patch of really powerful AIs. The contents and success stories of Vanessa Kosoy's desiderata [LW · GW], or of your own search for natural abstractions [LW · GW], or of Alex Altair's essence of agent foundations [LW · GW], or of Orthogonal's QACI [LW · GW], etc., seem entirely insensitive to the fact that we are currently dealing with multimodal LLMs combined with RL instead of some other paradigm, which in my mind [LW(p) · GW(p)] almost surely disqualifies them as useful-in-the-real-world when the endgame [LW · GW] hits.

There's a famous Eliezer quote about how for every correct answer to a precisely-stated problem, there are a million times more wrong answers one could have given instead. I would build on that to say that for every powerfully predictive, but lossy and reductive [LW(p) · GW(p)] mathematical model of a complex real-world system, there are a million times more similar-looking mathematical models that fail to capture the essence of the problem and ultimately don't generalize well at all. And it's only by grounding yourself to reality and hugging the query tight [LW · GW] by engaging with real-world empirics that you can figure out if the approach you've chosen is in the former category as opposed to the latter.

(I'm briefly noting that I don't fully endorse everything I said in the previous 2 paragraphs, and I realize that my framing is at least a bit confrontational and unfair. Separately, I acknowledge the existence of arguably-non-prosaic and mostly theoretical alignment approaches like davidad's Guaranteed Safe AI [LW · GW], CHAI's CIRL [LW · GW] and utility uncertainty [LW · GW], Steve Byrnes's work on brain-like AGI safety [? · GW], etc., that don't necessarily appear to fit this mold. I have varying opinions on the usefulness and viability of such approaches.)

matthew-a-clarke on Are Sparse Autoencoders a good idea for AI control?

I think that your post as currently written assumes too strongly that SAEs are and will remain good 'mind readers', since techniques to obfuscate latents and thus frustrate SAEs exist: https://arxiv.org/html/2412.09565v1

donatas-luciunas on Terminal goal vs Intelligence

How does this work with the fact that the future is unpredictable?

It seems you didn't try to answer this question.

The agent will reason:

Future is unpredictable
It is possible that my terminal goal will be different by the time I get outcomes of my actions
Should I take that into account when choosing actions?
1. If I don't take that into account, I'm not really intelligent, because I am aware of these risks and I ignore them.
2. If I take that into account, I'm not really aligned with my terminal goal.

tsvibt on The Field of AI Alignment: A Postmortem, and What To Do About It

Cf. https://www.lesswrong.com/posts/QzQQvGJYDeaDE4Cfg/talent-needs-of-technical-ai-safety-teams?commentId=BNkpTqwcgMjLhiC8L [LW(p) · GW(p)]

https://www.lesswrong.com/posts/unCG3rhyMJpGJpoLd/koan-divining-alien-datastructures-from-ram-activations?commentId=apD6dek5zmjaqeoGD [LW(p) · GW(p)]

https://www.lesswrong.com/posts/HbkNAyAoa4gCnuzwa/wei-dai-s-shortform?commentId=uMaQvtXErEqc67yLj [LW(p) · GW(p)]

mr-hire on Why don't we currently have AI agents?

When you start trying to make an agent, you realize how much your feedback, rerolls, etc. are making chat-based LLMs useful.

The error correction mechanism in a chat-based LLM is you, and in the absence of that, it's quite easy for agents to get off track.

You can of course add error correction mechanisms like multiple LLMs checking each other, multiple chains of thought, etc., but the cost can quickly get out of hand.

sharmake-farah on johnswentworth's Shortform

I don't think we know yet whether it will succeed in practice, or whether its training costs make it infeasible to do.

sharmake-farah on A shot at the diamond-alignment problem

I am willing to do a dialogue, if you are interested @sunwillrise [LW · GW].