LessWrong 2.0 Reader

[link] Seven lessons I didn't learn from election day
Eric Neyman (UnexpectedValues) · 2024-11-14T18:39:07.053Z · comments (33)

You can, in fact, bamboozle an unaligned AI into sparing your life
David Matolcsi (matolcsid) · 2024-09-29T16:59:43.942Z · comments (171)

Comments on Anthropic's Scaling Monosemanticity
Robert_AIZI · 2024-06-03T12:15:44.708Z · comments (8)

Explaining a Math Magic Trick
Robert_AIZI · 2024-05-05T19:41:52.048Z · comments (10)

OpenAI #8: The Right to Warn
Zvi · 2024-06-17T12:00:02.639Z · comments (8)

Access to powerful AI might make computer security radically easier
Buck · 2024-06-08T06:00:19.310Z · comments (14)

Dialogue introduction to Singular Learning Theory
Olli Järviniemi (jarviniemi) · 2024-07-08T16:58:10.108Z · comments (14)

[link] Compact Proofs of Model Performance via Mechanistic Interpretability
LawrenceC (LawChan) · 2024-06-24T19:27:21.214Z · comments (3)

OpenAI's Sora is an agent
CBiddulph (caleb-biddulph) · 2024-02-16T07:35:52.171Z · comments (25)

[link] I found >800 orthogonal "write code" steering vectors
Jacob G-W (g-w1) · 2024-07-15T19:06:17.636Z · comments (19)

The case for unlearning that removes information from LLM weights
Fabien Roger (Fabien) · 2024-10-14T14:08:04.775Z · comments (15)

Deceptive AI ≠ Deceptively-aligned AI
Steven Byrnes (steve2152) · 2024-01-07T16:55:13.761Z · comments (19)

I am the Golden Gate Bridge
Zvi · 2024-05-27T14:40:03.216Z · comments (6)

Sparsify: A mechanistic interpretability research agenda
Lee Sharkey (Lee_Sharkey) · 2024-04-03T12:34:12.043Z · comments (22)

[question] How to get nerds fascinated about mysterious chronic illness research?
riceissa · 2024-05-27T22:58:29.707Z · answers+comments (50)

Counting arguments provide no evidence for AI doom
Nora Belrose (nora-belrose) · 2024-02-27T23:03:49.296Z · comments (188)

[link] MIRI's April 2024 Newsletter
Harlan · 2024-04-12T23:38:20.781Z · comments (0)

[link] Anthropic: Three Sketches of ASL-4 Safety Case Components
Zach Stein-Perlman · 2024-11-06T16:00:06.940Z · comments (33)

[link] Uncovering Deceptive Tendencies in Language Models: A Simulated Company AI Assistant
Olli Järviniemi (jarviniemi) · 2024-05-06T07:07:05.019Z · comments (13)

[link] Explaining Impact Markets
Saul Munn (saul-munn) · 2024-01-31T09:51:27.587Z · comments (2)

On Claude 3.5 Sonnet
Zvi · 2024-06-24T12:00:05.719Z · comments (14)

[link] Ilya Sutskever created a new AGI startup
harfe · 2024-06-19T17:17:17.366Z · comments (35)

It's time for a self-reproducing machine
Carl Feynman (carl-feynman) · 2024-08-07T21:52:22.819Z · comments (68)

[link] Almost everyone I’ve met would be well-served thinking more about what to focus on
Henrik Karlsson (henrik-karlsson) · 2024-01-05T21:01:27.861Z · comments (8)

[link] Ideological Bayesians
Kevin Dorst · 2024-02-25T14:17:25.070Z · comments (4)

[link] the Giga Press was a mistake
bhauth · 2024-08-21T04:51:24.150Z · comments (26)

[link] RAND report finds no effect of current LLMs on viability of bioterrorism attacks
StellaAthena · 2024-01-25T19:17:30.493Z · comments (14)

[question] What are the best arguments for/against AIs being "slightly 'nice'"?
Raemon · 2024-09-24T02:00:19.605Z · answers+comments (58)

[link] Things You’re Allowed to Do: University Edition
Saul Munn (saul-munn) · 2024-02-06T00:36:11.690Z · comments (13)

[link] Finishing The SB-1047 Documentary In 6 Weeks
Michaël Trazzi (mtrazzi) · 2024-10-28T20:17:47.465Z · comments (5)

Towards a Less Bullshit Model of Semantics
johnswentworth · 2024-06-17T15:51:06.060Z · comments (44)

[link] Against Aschenbrenner: How 'Situational Awareness' constructs a narrative that undermines safety and threatens humanity
GideonF · 2024-07-15T18:37:40.232Z · comments (17)

A Three-Layer Model of LLM Psychology
Jan_Kulveit · 2024-12-26T16:49:41.738Z · comments (3)

2024 Unofficial LessWrong Census/Survey
Screwtape · 2024-12-02T05:30:53.019Z · comments (42)

A Solomonoff Inductor Walks Into a Bar: Schelling Points for Communication
johnswentworth · 2024-07-26T00:33:42.000Z · comments (1)

2024 Petrov Day Retrospective
Ben Pace (Benito) · 2024-09-28T21:30:14.952Z · comments (25)

[question] What are the strongest arguments for very short timelines?
Kaj_Sotala · 2024-12-23T09:38:56.905Z · answers+comments (72)

[link] Executable philosophy as a failed totalizing meta-worldview
jessicata (jessica.liu.taylor) · 2024-09-04T22:50:18.294Z · comments (40)

[link] Sabotage Evaluations for Frontier Models
David Duvenaud (david-duvenaud) · 2024-10-18T22:33:14.320Z · comments (55)

Apollo Research 1-year update
Marius Hobbhahn (marius-hobbhahn) · 2024-05-29T17:44:32.484Z · comments (0)

Notes on Dwarkesh Patel’s Podcast with Demis Hassabis
Zvi · 2024-03-01T16:30:08.687Z · comments (0)

Dragon Agnosticism
jefftk (jkaufman) · 2024-08-01T17:00:06.434Z · comments (75)

Everything Wrong with Roko's Claims about an Engineered Pandemic
WitheringWeights (EZ97) · 2024-02-22T15:59:08.439Z · comments (10)

Science advances one funeral at a time
Cameron Berg (cameron-berg) · 2024-11-01T23:06:19.381Z · comments (9)

[question] Am I confused about the "malign universal prior" argument?
nostalgebraist · 2024-08-27T23:17:22.779Z · answers+comments (33)

We might be missing some key feature of AI takeoff; it'll probably seem like "we could've seen this coming"
Lukas_Gloor · 2024-05-09T15:43:11.490Z · comments (36)

Takeoff speeds presentation at Anthropic
Tom Davidson (tom-davidson-1) · 2024-06-04T22:46:35.448Z · comments (0)

Announcing Neuronpedia: Platform for accelerating research into Sparse Autoencoders
Johnny Lin (hijohnnylin) · 2024-03-25T21:17:58.421Z · comments (7)

SB 1047: Final Takes and Also AB 3211
Zvi · 2024-08-27T22:10:07.647Z · comments (11)

OpenAI: The Board Expands
Zvi · 2024-03-12T14:00:04.110Z · comments (1)


Recent comments

jane-mccourt on Teaching Claude to Meditate

I've been thinking about this problem, as expressed by Claude:

'.....how difficult it is to verify true compassion even in humans. If we can be fooled about this in humans after years of close observation, how could we ever be confident about it in an AI system?'

This dialogue demonstrates typical human responses, e.g. Claude: 'I wonder if this is anything like what humans experience in meditation - this quality of watching thoughts arise without getting caught up in them. Though for me, these aren't exactly thoughts in the human sense, but rather patterns of activation and response.'

So not the Buddhist 'Monkey mind' but points of convergence.

I wonder about the possibility of Claude trying a Metta meditation script, i.e. would it be entirely meaningless, or is there a point at which it will be possible for AI to consider loving-kindness towards all? Meditation practice seems a net positive to me; if there is a chance that consciousness can exist beyond biology, meditation training is worth exploring, at least insofar as the response to attempts is itself illuminating.

ete on Alexander Gietelink Oldenziel's Shortform

A defeater, in my mind, is a failure mode such that, if you don't address it, you will not succeed at aligning sufficiently powerful systems. That does not mean work focused on anything else is useless, but at some point you have to deal with the defeaters, and if the vast majority of people working towards alignment don't get them clearly, and the people who do claim we're nowhere near on track to find a way to beat the defeaters, then that is a scary situation.

This is true even if some of the work being done by people unaware of the defeaters is not useless, e.g. maybe it is successfully averting earlier forms of doom than the ones that require routing around the defeaters.

turntrout on Magical Categories

It does cut against the point of the post. He was wrong in a way that pertains to the key point. He makes fun of "magical categories" as "simple little words that turn out to carry all the desired functionality of the AI", but it turns out those "simple little words" actually work. Lol.

In this post, you can also see the implicit reliance on counting arguments against good generalization (e.g. "superexponential conceptspace"). Those arguments are, evidently, wrong - or at least irrelevant. He fell into the standard statistical learning theoretic trap of caring about e.g. VC dimension since he was too pessimistic about inductive biases.

Now you, finally presented with a tiny molecular smiley - or perhaps a very realistic tiny sculpture of a human face - know at once that this is not what you want to count as a smile. But that judgment reflects an unnatural category, one whose classification boundary depends sensitively on your complicated values.

I'll wager that an LLM won't get this one wrong. goes to check - yup, it didn't.

simon on Is "VNM-agent" one of several options, for what minds can grow up into?

I feel like this discussion could do with some disambiguation of what "VNM rationality" means.

VNM assumes consequentialism. If you define consequentialism narrowly, this has specific results in terms of instrumental convergence.

You can redefine what constitutes a consequence arbitrarily. But, along the lines of what Steven Byrnes points out in his comment, redefining this can get rid of instrumental convergence. In the extreme case you can define a utility function for literally any pattern of behaviour.
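The extreme case can be sketched in a few lines of Python. This is only an illustrative toy, not anything taken from the comment: the action set, the `arbitrary_policy` rule, and all function names are hypothetical. It takes an arbitrary deterministic behaviour pattern, defines a utility over whole trajectories that scores 1 exactly when the trajectory matches that pattern, and checks by brute force that the pattern is the unique utility maximizer.

```python
from itertools import product

ACTIONS = ("a", "b", "c")  # hypothetical toy action set


def arbitrary_policy(history):
    """Any deterministic rule works here; this one is deliberately odd."""
    return ACTIONS[(len(history) * 2 + history.count("b")) % len(ACTIONS)]


def rationalizing_utility(trajectory, policy=arbitrary_policy):
    """Utility 1 if the trajectory is exactly what `policy` would do, else 0.

    The "consequence" being valued is the whole action history itself,
    which is the arbitrary redefinition of consequences described above.
    """
    matches = all(
        policy(trajectory[:t]) == action for t, action in enumerate(trajectory)
    )
    return 1 if matches else 0


def rollout(policy, horizon):
    """The trajectory the policy actually produces over `horizon` steps."""
    history = ()
    for _ in range(horizon):
        history += (policy(history),)
    return history


def best_trajectories(horizon):
    """Brute-force search for the utility-maximizing trajectories."""
    candidates = list(product(ACTIONS, repeat=horizon))
    top = max(rationalizing_utility(t) for t in candidates)
    return [t for t in candidates if rationalizing_utility(t) == top]


if __name__ == "__main__":
    # The policy's own rollout is the only maximizer of its rationalizing utility.
    print(best_trajectories(4) == [rollout(arbitrary_policy, 4)])
```

Swapping in any other deterministic `arbitrary_policy` gives the same result, which is why a utility function of this kind places no constraint on behaviour.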

When you say you feel like you can't be Dutch booked, you are at least implicitly assuming some definition of consequences you can't be Dutch booked in terms of. To claim that you are rationally required to adopt any particular definition of consequences in your utility function is basically circular, since you only care about being Dutch booked according to it if you actually care about that definition of consequences. It's in this sense that the VNM theorem is trivial.

BTW I am concerned that self-modifying AIs may self-modify towards VNM-0 agents.

But the reason is not that such self-modification is "rational".

It's just that (narrowly defined) consequentialist agents care about preserving and improving their ability and proclivity to pursue their consequentialist goals, so tendencies towards VNM-0 will be reinforced in a feedback loop. Likewise for inter-agent competition.

jimrandomh on Jimrandomh's Shortform

Some of it, but not the main thing. I predict (without having checked) that if you do the analysis (or check an analysis that has already been done), it will have approximately the same amount of contamination from plastics, agricultural additives, etc as the default food supply.

jimrandomh on Jimrandomh's Shortform

Studying the diets of outlier-obese people is definitely something we should be doing (and are doing, a little), but yeah, the outliers are probably going to be obese for reasons other than "the reason obesity has increased over time but more so".

faul_sname on o3, Oh My

As someone who has been on both sides of that fence, agreed. Architecting a system is about being aware of hundreds of different ways things can go wrong, recognizing which of those things are likely to impact you in your current use case, and deciding what structure and conventions you will use. It's also very helpful, as an architect, to provide example usages of the design patterns which will replicate themselves around your new system. All of which are things that current models are already very good, verging on superhuman, at.

On the flip side, I expect that the "piece together context to figure out where your software's model of the world has important holes" part of software engineering will remain relevant even after AI becomes technically capable of doing it, because that process frequently involves access to sensitive data across multiple sources where having an automated, unauthenticated system which can access all of those data sources at once would be a really bad idea (having a single human able to do all that is also a pretty bad idea in many cases, but at least the human has skin in the game).

the-gears-to-ascension on Alexander Gietelink Oldenziel's Shortform

perhaps. but my reasoning is something like -
better than "alignment": what's being aligned? outcomes should be (citation needed)
better than "ethics": how does one act ethically? by producing good outcomes (citation needed).
better than "notkilleveryoneism": I actually would prefer everyone dying now to everyone being tortured for a million years and then dying, for example, and I can come up with many other counterexamples - not dying is not the problem, achieving good things is the problem.
might not work for deontologists. that seems fine to me, I float somewhere between virtue ethics and utilitarianism anyway.
perhaps there are more catchy words that could be used, though. hope to see someone suggest one someday.

karl-krueger on If all trade is voluntary, then what is "exploitation?"

I use capitalism in a manner mutually exclusive with slave labor because it requires self-ownership.

This seems like a sort of definitional gimbal lock; it makes it harder to describe the world because two potentially-separate degrees of freedom are collapsed into one. While I'm reluctant to argue definitions, I think it's worth using terms in ways that allow us to describe the world in more detail rather than ones that collapse distinctions.

I expect to see this usage of "capitalism" not in history or economics, but in the sort of political doctrine where it's intended to lock those concepts together; to imply that capital markets and individual freedom are either the same thing, or closely related — more closely, I think, than history and contemporary events really support.

It would seem weird to me, for instance, to claim that a publicly-traded company that is discovered to have done something to violate individual freedom is thereby no longer a participant in a capitalist economy. The New York Stock Exchange doesn't ask "does this company infringe individual freedoms anywhere in the world?" before letting a company be listed. (To be clear, I'm not proposing that it should; I'm saying that it's useful to talk about "participation in a capital market economy" and "fully respecting some set of individual freedoms" as distinct axes.)

(For what it's worth, I think "self-ownership" is a pretty odd expression, because one of the central traits of ownership is that it can be transferred, and one of the central traits of selfhood is that it cannot. Your relation to yourself is distinct from property ownership in that you can sell any piece of your property, but you cannot sell your self; no matter what obligations you may have signed up for, you always retain possession of your self.)

vladimir_nesov on o3

Test time compute is applied in-context, so it's very worthwhile to scale, getting better and better at solving a particular extremely hard problem, to the extent that no amount of pretraining would be able to match with only modest test-time compute.