LessWrong 2.0 Reader

View: New · Old · Top

Restrict date range: Today · This week · This month · Last three months · This year · All time

← previous page (newer posts) · next page (older posts) →

[link] Contra Ngo et al. “Every ‘Every Bay Area House Party’ Bay Area House Party”
Ricki Heicklen (bayesshammai) · 2024-02-22T23:56:02.318Z · comments (5)

Thinking By The Clock
Screwtape · 2023-11-08T07:40:59.936Z · comments (27)

Why Would Belief-States Have A Fractal Structure, And Why Would That Matter For Interpretability? An Explainer
johnswentworth · 2024-04-18T00:27:43.451Z · comments (21)

Humming is not a free $100 bill
Elizabeth (pktechgirl) · 2024-06-06T20:10:02.457Z · comments (6)

[link] Daniel Kahneman has died
DanielFilan · 2024-03-27T15:59:14.517Z · comments (11)

Introducing Alignment Stress-Testing at Anthropic
evhub · 2024-01-12T23:51:25.875Z · comments (23)

The Hopium Wars: the AGI Entente Delusion
Max Tegmark (MaxTegmark) · 2024-10-13T17:00:29.033Z · comments (53)

[link] Why I’m not a Bayesian
Richard_Ngo (ricraz) · 2024-10-06T15:22:45.644Z · comments (89)

Safety consultations for AI lab employees
Zach Stein-Perlman · 2024-07-27T15:00:27.276Z · comments (4)

"Humanity vs. AGI" Will Never Look Like "Humanity vs. AGI" to Humanity
Thane Ruthenis · 2023-12-16T20:08:39.375Z · comments (34)

re: Yudkowsky on biological materials
bhauth · 2023-12-11T13:28:10.639Z · comments (30)

Contra papers claiming superhuman AI forecasting
nikos (followtheargument) · 2024-09-12T18:10:50.582Z · comments (16)

Every "Every Bay Area House Party" Bay Area House Party
Richard_Ngo (ricraz) · 2024-02-16T18:53:28.567Z · comments (6)

[link] Toward a Broader Conception of Adverse Selection
Ricki Heicklen (bayesshammai) · 2024-03-14T22:40:57.920Z · comments (61)

Cryonics is free
Mati_Roy (MathieuRoy) · 2024-09-29T17:58:17.108Z · comments (35)

Skills from a year of Purposeful Rationality Practice
Raemon · 2024-09-18T02:05:58.726Z · comments (18)

[link] FHI (Future of Humanity Institute) has shut down (2005–2024)
gwern · 2024-04-17T13:54:16.791Z · comments (22)

[question] Why is o1 so deceptive?
abramdemski · 2024-09-27T17:27:35.439Z · answers+comments (23)

Struggling like a Shadowmoth
Raemon · 2024-09-24T00:47:05.030Z · comments (38)

WTH is Cerebrolysin, actually?
gsfitzgerald (neuroplume) · 2024-08-06T20:40:53.378Z · comments (23)

Effective Aspersions: How the Nonlinear Investigation Went Wrong
TracingWoodgrains (tracingwoodgrains) · 2023-12-19T12:00:23.529Z · comments (170)

Critical review of Christiano's disagreements with Yudkowsky
Vanessa Kosoy (vanessa-kosoy) · 2023-12-27T16:02:50.499Z · comments (40)

Timaeus's First Four Months
Jesse Hoogland (jhoogland) · 2024-02-28T17:01:53.437Z · comments (6)

This is already your second chance
Malmesbury (Elmer of Malmesbury) · 2024-07-28T17:13:57.680Z · comments (13)

Did Christopher Hitchens change his mind about waterboarding?
Isaac King (KingSupernova) · 2024-09-15T08:28:09.451Z · comments (22)

'Empiricism!' as Anti-Epistemology
Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2024-03-14T02:02:59.723Z · comments (90)

Three Subtle Examples of Data Leakage
abstractapplic · 2024-10-01T20:45:27.731Z · comments (16)

2023 Unofficial LessWrong Census/Survey
Screwtape · 2023-12-02T04:41:51.418Z · comments (81)

[link] Recommendation: reports on the search for missing hiker Bill Ewasko
eukaryote · 2024-07-31T22:15:03.174Z · comments (28)

Reconsider the anti-cavity bacteria if you are Asian
Lao Mein (derpherpize) · 2024-04-15T07:02:02.655Z · comments (43)

The 'Neglected Approaches' Approach: AE Studio's Alignment Agenda
Cameron Berg (cameron-berg) · 2023-12-18T20:35:01.569Z · comments (21)

Many arguments for AI x-risk are wrong
TurnTrout · 2024-03-05T02:31:00.990Z · comments (86)

[link] Boycott OpenAI
PeterMcCluskey · 2024-06-18T19:52:42.854Z · comments (26)

How useful is mechanistic interpretability?
ryan_greenblatt · 2023-12-01T02:54:53.488Z · comments (54)

Announcing ILIAD — Theoretical AI Alignment Conference
Nora_Ammann · 2024-06-05T09:37:39.546Z · comments (18)

You can remove GPT2’s LayerNorm by fine-tuning for an hour
StefanHex (Stefan42) · 2024-08-08T18:33:38.803Z · comments (11)

Without fundamental advances, misalignment and catastrophe are the default outcomes of training powerful AI
Jeremy Gillen (jeremy-gillen) · 2024-01-26T07:22:06.370Z · comments (60)

The likely first longevity drug is based on sketchy science. This is bad for science and bad for longevity.
BobBurgers · 2023-12-12T02:42:18.559Z · comments (34)

Is being sexy for your homies?
Valentine · 2023-12-13T20:37:02.043Z · comments (92)

[link] Sycophancy to subterfuge: Investigating reward tampering in large language models
Carson Denison (carson-denison) · 2024-06-17T18:41:31.090Z · comments (22)

[link] Connecting the Dots: LLMs can Infer & Verbalize Latent Structure from Training Data
Johannes Treutlein (Johannes_Treutlein) · 2024-06-21T15:54:41.430Z · comments (13)

And All the Shoggoths Merely Players
Zack_M_Davis · 2024-02-10T19:56:59.513Z · comments (57)

Vote on Interesting Disagreements
Ben Pace (Benito) · 2023-11-07T21:35:00.270Z · comments (129)

DeepMind's "Frontier Safety Framework" is weak and unambitious
Zach Stein-Perlman · 2024-05-18T03:00:13.541Z · comments (14)

[link] Making every researcher seek grants is a broken model
jasoncrawford · 2024-01-26T16:06:26.688Z · comments (41)

Most People Don't Realize We Have No Idea How Our AIs Work
Thane Ruthenis · 2023-12-21T20:02:00.360Z · comments (42)

EIS XIII: Reflections on Anthropic’s SAE Research Circa May 2024
scasper · 2024-05-21T20:15:36.502Z · comments (16)

[link] Masterpiece
Richard_Ngo (ricraz) · 2024-02-13T23:10:35.376Z · comments (21)

[link] Succession
Richard_Ngo (ricraz) · 2023-12-20T19:25:03.185Z · comments (48)

What’s up with LLMs representing XORs of arbitrary features?
Sam Marks (samuel-marks) · 2024-01-03T19:44:33.162Z · comments (61)

← previous page (newer posts) · next page (older posts) →

Archive

Recent comments

remmelt-ellen on If we had known the atmosphere would ignite

For an overview of why such a guarantee would turn out impossible, suggest taking a look at Will Petillo's post Lenses of Control [AF · GW].

remmelt-ellen on If we had known the atmosphere would ignite

Defining alignment (sufficiently rigorous so that a formal proof of (im)possibility of alignment is conceivable) is a hard thing!

It's less hard than you think, if you use a minimal-threshold definition of alignment:

That "AGI" continuing to exist, in some modified form, does not result eventually in changes to world conditions/contexts that fall outside the ranges that existing humans could survive under.

christiankl on Does the "ancient wisdom" argument have any validity? If a particular teaching or tradition is old, to what extent does this make it more trustworthy?

When it comes to Buddhist practice, it's worth noting that practicing techniques by the book is not how Buddhism was practiced for most of the time in the last 2500 years. It was mostly an oral tradition and as such the knowledge that's passed down from teacher to student evolves over time in various ways.

Many modern Buddhist tradition put much more emphasis on meditation in contrast to ritualized behavior.

In Buddhism (and in Christanity for that matter) for thousands of years meditation was largely done in monasteries and not by lay-people. In many Buddhist communities "lay-people aren't supposed to meditate" is something you could call "ancient wisdom".

In someone convinces you in a Western context that following some practice is ancient wisdom, they are likely doing a lot of picking and choosing in a way that does not make it clear how ancient the thing they are promoting actually happens to be.

gunnar_zarncke on [Intuitive self-models] 2. Conscious Awareness

I think your explanation in section 8.5.2 resolves our disagreement nicely. You refer to S(X) thoughts that "spawn up" successive thoughts that eventually lead to X (I'd say X') actions shortly after (or much later). While I was referring to S(X) that cannot give rise to X immediately. I think the difference was that you are more lenient with what X can be, such that S(X) can be about an X that is happening much later, which wouldn't work in my model of thoughts.

Explicit (self-reflective) desire
Statement: “I want to be inside.”
Intuitive model underlying that statement: There’s a frame (§2.2.3 [LW · GW]) “X wants Y” (§3.3.4 [LW · GW]). This frame is being invoked, with X as the homunculus, and Y as the concept of “inside” as a location / environment.
How I describe what’s happening using my framework: There’s a systematic pattern (in this particular context), call it P, where self-reflective thoughts concerning the inside, like “myself being inside” or “myself going inside”, tend to trigger positive valence. That positive valence is why such thoughts arise in the first place, and it’s also why those thoughts tend to lead to actual going-inside behavior.
In my framework, that’s really the whole story. There’s this pattern P. And we can talk about the upstream causes of P—something involving innate drives and learned heuristics in the brain. And we can likewise talk about the downstream effects of P—P tends to spawn behaviors like going inside, brainstorming how to get inside, etc. But “what’s really going on” (in the “territory” of my brain algorithm) is a story about the pattern P, not about the homunculus. The homunculus only arises secondarily, as the way that I perceive the pattern P (in the “map” of my intuitive self-model).

remmelt-ellen on If we had known the atmosphere would ignite

Yes, I think there is a more general proof available. This proof form would combine limits to predictability and so on, with a lethal dynamic that falls outside those limits.

remmelt-ellen on If we had known the atmosphere would ignite

The question is more if it can ever be truly proved at all, or if it doesn't turn out to be an undecidable problem.

Control limits can show that it is an undecidable problem.

A limited scope of control can in turn be used to prove that a dynamic convergent on human-lethality is uncontrollable. That would be a basis for an impossibility proof by contradiction (cannot control AGI effects to stay in line with human safety).

remmelt-ellen on If we had known the atmosphere would ignite

Awesome directions. I want to bump this up.

This might include AGI predicting its own future behaviour, which is kind of essential for it to stick to a reliably aligned course of action.

There is a simple way of representing this problem that already shows the limitations.

Assume that AGI continues to learn new code from observations (inputs from the world) – since learning is what allows the AGI to stay autonomous and adaptable in acting across changing domains of the world.

Then in order for AGI code to be run to make predictions about relevant functioning of its future code:

Current code has to predict what future code will be learned from future unknown inputs (there would be no point in learning then if the inputs were predictable and known ahead of time).
Also, current code has to predict how the future code will compute subsequent unknown inputs into outputs, presumably using some shortcut algorithm that can infer relevant behavioural properties across the span of possible computationally-complex code.
Further, current code would have to predict how the outputs would result in relevant outside effects (where relevant to sticking to a reliably human-aligned course of action)
- Where it is relevant how some of those effects could feed back into sensor inputs (and therefore could cause drifts in the learned code and the functioning of that code).
- Where other potential destabilising feedback loops are also relevant, particularly that of evolutionary selection.

yair-halberstadt on Could orcas be (trained to be) smarter than humans? 

I imagine that part of the difference is because Orcas are hunters, and need much more sophisticated sensors + controls.

I gigantic jellyfish wouldn't have the same number of neurons as a similarly sized whale, so it's not just about size, but how you use that size.

remmelt-ellen on If we had known the atmosphere would ignite

Just found your insightful comment. I've been thinking about this for three years. Some thoughts expanding on your ideas:

my idea is more about whether alignment could require that the AGI is able to predict its own results and effects on the world (or the results and effects of other AGIs like it, as well as humans)...

In other words, alignment requires sufficient control. Specifically, it requires AGI to have a control system with enough capacity to detect, model, simulate, evaluate, and correct outside effects propagated by the AGI's own components.

... and that proved generally impossible such that even an aligned AGI can only exist in an unstable equilibrium state in which there exist situations in which it will become unrecoverably misaligned, and we just don't know which.

For example, what if AGI is in some kind of convergence basin [LW · GW] where the changing situations/conditions tend to converge outside the ranges humans can survive under?

so we can assume that they will have to be somehow interpreted by the AGI itself who is supposed to hold them

There's a problem you are pointing of somehow mapping the various preferences – expressed over time by diverse humans from within their (perceived) contexts – onto reference values. This involves making (irreconcilable) normative assumptions of how to map the dimensionality of the raw expressions of preferences onto internal reference values. Basically, you're dealing with NP-complex combinatorics such as encountered with the knapsack problem.

Further, it raises the question of how to make comparisons across all the possible concrete outside effects of the machinery against the internal reference values, such to identify misalignments/errors to correct. Ie. just internalising and holding abstract values is not enough – there would have to be some robust implementation process that translates the values into concrete effects.

dusandnesic on Live Machinery: Interface Design Workshop for AI Safety @ EA Hotel

I'm very sad I cannot attend at that time, but I am hyped about this and believe it to be valuable, so I am writing this endorsement as a signal to others. I've also recommended this to some of my friends, but alas UK visa is hard to get on such short notice. When you run it in Serbia, we'll have more folks from the eastern bloc represented ;)