LessWrong 2.0 Reader

Picking Mentors For Research Programmes
Raymond D · 2023-11-10T13:01:14.197Z · comments (8)

Danger, AI Scientist, Danger
Zvi · 2024-08-15T22:40:06.715Z · comments (9)

The first future and the best future
KatjaGrace · 2024-04-25T06:40:04.510Z · comments (12)

Skills I'd like my collaborators to have
Raemon · 2024-02-09T08:20:37.686Z · comments (9)

Why I'm doing PauseAI
Joseph Miller (Josephm) · 2024-04-30T16:21:54.156Z · comments (16)

New LessWrong feature: Dialogue Matching
jacobjacob · 2023-11-16T21:27:16.763Z · comments (22)

[link] A Chess-GPT Linear Emergent World Representation
Adam Karvonen (karvonenadam) · 2024-02-08T04:25:15.222Z · comments (14)

Charbel-Raphaël and Lucius discuss interpretability
Mateusz Bagiński (mateusz-baginski) · 2023-10-30T05:50:34.589Z · comments (7)

On the future of language models
owencb · 2023-12-20T16:58:28.433Z · comments (17)

The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks
Lucius Bushnaq (Lblack) · 2024-05-20T17:53:25.985Z · comments (4)

In favour of exploring nagging doubts about x-risk
owencb · 2024-06-25T23:52:01.322Z · comments (2)

Scaling and evaluating sparse autoencoders
leogao · 2024-06-06T22:50:39.440Z · comments (6)

[question] What convincing warning shot could help prevent extinction from AI?
Charbel-Raphaël (charbel-raphael-segerie) · 2024-04-13T18:09:29.096Z · answers+comments (18)

New LessWrong review winner UI ("The LeastWrong" section and full-art post pages)
kave · 2024-02-28T02:42:05.801Z · comments (64)

SAE reconstruction errors are (empirically) pathological
wesg (wes-gurnee) · 2024-03-29T16:37:29.608Z · comments (16)

TOMORROW: the largest AI Safety protest ever!
Holly_Elmore · 2023-10-20T18:15:18.276Z · comments (26)

[link] A case for AI alignment being difficult
jessicata (jessica.liu.taylor) · 2023-12-31T19:55:26.130Z · comments (56)

[link] Transformer Circuit Faithfulness Metrics Are Not Robust
Joseph Miller (Josephm) · 2024-07-12T03:47:30.077Z · comments (5)

Deception Chess: Game #1
Zane · 2023-11-03T21:13:55.777Z · comments (19)

Nonlinear’s Evidence: Debunking False and Misleading Claims
KatWoods (ea247) · 2023-12-12T13:16:12.008Z · comments (171)

Apply for MATS Winter 2023-24!
utilistrutil · 2023-10-21T02:27:34.350Z · comments (6)

Backdoors as an analogy for deceptive alignment
Jacob_Hilton · 2024-09-06T15:30:06.172Z · comments (2)

[link] The Witness
Richard_Ngo (ricraz) · 2023-12-03T22:27:16.248Z · comments (5)

I turned decision theory problems into memes about trolleys
Tapatakt · 2024-10-30T20:13:29.589Z · comments (20)

Key takeaways from our EA and alignment research surveys
Cameron Berg (cameron-berg) · 2024-05-03T18:10:41.416Z · comments (10)

Dreams of AI alignment: The danger of suggestive names
TurnTrout · 2024-02-10T01:22:51.715Z · comments (59)

[link] Carl Sagan, nuking the moon, and not nuking the moon
eukaryote · 2024-04-13T04:08:50.166Z · comments (8)

Me, Myself, and AI: the Situational Awareness Dataset (SAD) for LLMs
L Rudolf L (LRudL) · 2024-07-08T22:24:38.441Z · comments (28)

[link] Poker is a bad game for teaching epistemics. Figgie is a better one.
rossry · 2024-07-08T06:05:20.459Z · comments (47)

LLMs can learn about themselves by introspection
Felix J Binder (fjb) · 2024-10-18T16:12:51.231Z · comments (38)

What happens if you present 500 people with an argument that AI is risky?
KatjaGrace · 2024-09-04T16:40:03.562Z · comments (7)

Refactoring cryonics as structural brain preservation
Andy_McKenzie · 2024-09-11T18:36:30.285Z · comments (14)

Value systematization: how values become coherent (and misaligned)
Richard_Ngo (ricraz) · 2023-10-27T19:06:26.928Z · comments (48)

LLM Applications I Want To See
sarahconstantin · 2024-08-19T21:10:03.101Z · comments (5)

Lsusr's Rationality Dojo
lsusr · 2024-02-13T05:52:03.757Z · comments (17)

Response to nostalgebraist: proudly waving my moral-antirealist battle flag
Steven Byrnes (steve2152) · 2024-05-29T16:48:29.408Z · comments (29)

Scissors Statements for President?
AnnaSalamon · 2024-11-06T10:38:21.230Z · comments (31)

On Dwarkesh’s Podcast with Leopold Aschenbrenner
Zvi · 2024-06-10T12:40:03.348Z · comments (7)

[link] Notes from a Prompt Factory
Richard_Ngo (ricraz) · 2024-03-10T05:13:39.384Z · comments (19)

Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small
Joseph Bloom (Jbloom) · 2024-02-02T06:54:53.392Z · comments (37)

General Thoughts on Secular Solstice
Jeffrey Heninger (jeffrey-heninger) · 2024-03-23T18:48:43.940Z · comments (60)

[link] Advice for Activists from the History of Environmentalism
Jeffrey Heninger (jeffrey-heninger) · 2024-05-16T18:40:02.064Z · comments (8)

A simple model of math skill
Alex_Altair · 2024-07-21T18:57:33.697Z · comments (16)

[link] Advice for journalists
Nathan Young · 2024-10-07T16:46:40.929Z · comments (53)

[link] LessOnline (May 31—June 2, Berkeley, CA)
Ben Pace (Benito) · 2024-03-26T02:34:00.000Z · comments (24)

Behavioral red-teaming is unlikely to produce clear, strong evidence that models aren't scheming
Buck · 2024-10-10T13:36:53.810Z · comments (4)

On the Executive Order
Zvi · 2023-11-01T14:20:01.657Z · comments (4)

[link] The Minority Coalition
Richard_Ngo (ricraz) · 2024-06-24T20:01:27.436Z · comments (7)

What's up with "Responsible Scaling Policies"?
habryka (habryka4) · 2023-10-29T04:17:07.839Z · comments (8)

Why comparative advantage does not help horses
Sherrinford · 2024-09-30T22:27:57.450Z · comments (10)

Recent comments

notfnofn on D0TheMath's Shortform

Let's back up here and clarify definitions before invoking any theorems. In the language of set theory, we have a countably infinite set of finite statements. Some statements imply other statements. A subset of these statements is said to be consistent if they can all be assigned the value true such that, when following the basic rules of logic, one does not arrive at a contradiction.

The compactness theorem is helpful when $A$ is an infinite set. ZFC is a finite set of axioms, so let's ignore everything about finite subsets of $A$ and the compactness theorem; it's not relevant.

I'll now rewrite your last sentence as:

ZFC + not Consistent(ZFC) has no model <-> not Consistent(ZFC + not Consistent(ZFC))

This is true but irrelevant. Assuming ZFC is consistent, ZFC will not be able to prove its own consistency, so [not Consistent(ZFC)] can be added as an axiom without affecting its consistency. This means that ZFC + [not Consistent(ZFC)] would indeed have a model; I forget how this goes but I think it's something like "start with a model of ZFC, throw in a $c$ that's treated as a natural number and corresponds to the contradiction found in ZFC, then close". I think $c$ is automatically treated as greater than every "actual" natural number, and I think the way to show that it can be added without issue involves the compactness theorem.
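
For readers who want the chain of reasoning spelled out, here is a sketch of the standard textbook argument. It may not be exactly the construction the comment has in mind. The notation $\mathrm{Con}(\cdot)$ for the arithmetized consistency statement and $M$ for a model are labels introduced here, not taken from the thread.

```latex
\begin{align*}
&\text{Assume ZFC is consistent.} \\
&\text{(G\"odel's second theorem)}
  && \mathrm{ZFC} \nvdash \mathrm{Con}(\mathrm{ZFC}) \\
&\text{(proof by contradiction)}
  && \mathrm{ZFC} + \neg\mathrm{Con}(\mathrm{ZFC}) \vdash \bot
     \;\Longrightarrow\; \mathrm{ZFC} \vdash \mathrm{Con}(\mathrm{ZFC}) \\
&\text{(hence)}
  && \mathrm{ZFC} + \neg\mathrm{Con}(\mathrm{ZFC}) \nvdash \bot \\
&\text{(completeness theorem)}
  && \exists M :\; M \models \mathrm{ZFC} + \neg\mathrm{Con}(\mathrm{ZFC})
\end{align*}
```

In any such $M$, the element $c$ that $M$ takes to code a ZFC-proof of a contradiction has to be a nonstandard natural number, since a standard code would decode to an actual proof of a contradiction. Nonstandard elements sit above every standard natural number, which matches the remark about $c$ being greater than every "actual" natural number.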

sharmake-farah on Lao Mein's Shortform

Maybe there's a case there, but I doubt it would get past a jury, let alone result in any guilty verdicts.

sharmake-farah on o1 is a bad idea

Oh, now I understand.

And AIs have already been superhuman at chess for a very long time, yet that domain gives very little incentive for very strong instrumental convergence.

I am claiming that for practical AIs, the results of training them in the real world with goals will give them instrumental convergence, but without further incentives, will not give them so much instrumental convergence that it leads to power-seeking to disempower humans by default.

jbash on OpenAI Email Archives (from Musk v. Altman)

I used AI assistance to generate this, which might have introduced errors.

Resulting in a strong downvote and, honestly, outright anger on my part.

Check the original source to make sure it's accurate before you quote it: https://www.courtlistener.com/docket/69013420/musk-v-altman/

If other people have to check it before they quote it, why is it OK for you not to check it before you post it?

bogdan-ionut-cirstea on johnswentworth's Shortform

Would the prediction also apply to inference scaling (laws) - and maybe more broadly various forms of scaling post-training, or only to pretraining scaling?

mondsemmel on Lao Mein's Shortform

What if whistleblowers and internal documents corroborated that they think what they're doing could destroy the world?

sharmake-farah on Lao Mein's Shortform

Notably, no law I know of allows you to take legal action on a hunch that they might destroy the world, based only on your assigning a high probability to it, when they have not taken any harmful actions (and no, building AI doesn't count here).

mondsemmel on Lao Mein's Shortform

Ilya is demonstrably not in on that mission, since his step immediately after leaving OpenAI was to found an additional AGI company and thus increase x-risk.

mondsemmel on Lao Mein's Shortform

I don't understand the reference to assassination. Presumably there are already laws on the books that outlaw trying to destroy the world (?), so it would be enough to apply those to AGI companies.

joe-rogero on What are Emotions?

What happens then when a non-thinking thing feels happy? Is that happiness valued? To whom? Or do you think this is impossible?

When a baby feels happy, it feels happy. Nothing else happens.

There are differences among wanting, liking, and endorsing something.

A happy blob may like feeling happy, and might even feel a desire to experience more of it, but it cannot endorse things if it doesn't have agency. Human fulfillment and wellbeing typically involve some element of all three.

An unthinking being cannot value even its own happiness, because the concept traditionally meant by "values" refers to the goals that an agent points itself at, and an unthinking being isn't agentic - it does not make plans to steer the world in any particular direction.

Then if you also say that happiness is good, and that good implies value, one must ask, who or what is valuing the happiness? The rock? The universe?

I am. When I say "happiness is good", this is isomorphic with "I value happiness". It is a statement about the directions in which I attempt to steer the world.

Like there must be some physical process by which happiness is valued. Maybe a dimension by which emotional value is expressed?

The physical process that implements "valuing happiness" is the firing of neurons in a brain. It could in theory be implemented in silicon as well, but it's near-certainly not implemented by literal rocks.

something that is challenging, and requires a certain kind of problem solving, where the solution is beautiful in some way

Yep, that makes sense. I notice, however, that these things do not appear to be emotions. And that's fine! It is okay to innately value things that are not emotions! Like "having a model of the world that is as accurate as possible", i.e. truth-seeking. Many people (especially here on LW) value knowledge for its own sake. There are emotions associated with this goal, but the emotions are ancillary. There are also instrumental reasons to seek truth, but they don't always apply. The actual goal is "improving one's world-model" or something similar. It bottoms out there. Emotions need not apply.

The key piece though is that regardless, as tslarm says, "emotions are accompanied by (or identical with, depending on definitions) valenced qualia". They always have some value.

First off, I'm not wholly convinced this is true. I think emotions are usually accompanied by valenced qualia, but (as with my comments about curiosity) not necessarily always. Sure, if you define "emotion" so that it excludes all possible counterexamples, then it will exclude all possible counterexamples, but also you will no longer be talking about the same concept as other people using the word "emotion".

Second, there is an important difference between "accompanied by valenced qualia" and "has value". There is no such thing as "inherent value", absent a thinking being to do the evaluation. Again, things like values and goals are properties of agents; they reflect the directions in which those agents steer.

Finally, more broadly, there's a serious problem with terminally valuing only the feeling of emotions. Imagine a future scenario: all feeling beings are wired to an enormous switchboard, which is in turn connected to their emotional processors. The switchboard causes them to feel an optimal mixture of emotions at all times (whatever you happen to think that means) and they experience nothing else. Is this a future you would endorse? Does something important seem to be missing?