LessWrong 2.0 Reader


[link] Seven lessons I didn't learn from election day
Eric Neyman (UnexpectedValues) · 2024-11-14T18:39:07.053Z · comments (33)

Reasons for and against working on technical AI safety at a frontier AI lab
bilalchughtai (beelal) · 2025-01-05T14:49:53.529Z · comments (12)

The "Think It Faster" Exercise
Raemon · 2024-12-11T19:14:10.427Z · comments (13)

[question] What are the strongest arguments for very short timelines?
Kaj_Sotala · 2024-12-23T09:38:56.905Z · answers+comments (74)

[link] Steering Gemini with BiDPO
TurnTrout · 2025-01-31T02:37:55.839Z · comments (5)

Deep Causal Transcoding: A Framework for Mechanistically Eliciting Latent Behaviors in Language Models
Andrew Mack (andrew-mack) · 2024-12-03T21:19:42.333Z · comments (7)

[link] Anthropic: Three Sketches of ASL-4 Safety Case Components
Zach Stein-Perlman · 2024-11-06T16:00:06.940Z · comments (33)

Comment on "Death and the Gorgon"
Zack_M_Davis · 2025-01-01T05:47:30.730Z · comments (32)

We probably won't just play status games with each other after AGI
Matthew Barnett (matthew-barnett) · 2025-01-15T04:56:38.330Z · comments (20)

The subset parity learning problem: much more than you wanted to know
Dmitry Vaintrob (dmitry-vaintrob) · 2025-01-03T09:13:59.245Z · comments (18)

LLMs Look Increasingly Like General Reasoners
eggsyntax · 2024-11-08T23:47:28.886Z · comments (45)

Introducing Squiggle AI
ozziegooen · 2025-01-03T17:53:42.915Z · comments (15)

Zvi’s Thoughts on His 2nd Round of SFF
Zvi · 2024-11-20T13:40:08.092Z · comments (2)

Anvil Shortage
Screwtape · 2024-11-13T22:57:41.974Z · comments (16)

Thoughts on the conservative assumptions in AI control
Buck · 2025-01-17T19:23:38.575Z · comments (5)

A very strange probability paradox
notfnofn · 2024-11-22T14:01:36.587Z · comments (26)

The Rising Sea
Jesse Hoogland (jhoogland) · 2025-01-25T20:48:52.971Z · comments (2)

[link] Should you be worried about H5N1?
gw · 2024-12-05T21:11:06.996Z · comments (2)

Tips On Empirical Research Slides
James Chua (james-chua) · 2025-01-08T05:06:44.942Z · comments (4)

AIs Will Increasingly Fake Alignment
Zvi · 2024-12-24T13:00:07.770Z · comments (0)

Is "VNM-agent" one of several options, for what minds can grow up into?
AnnaSalamon · 2024-12-30T06:36:20.890Z · comments (54)

Agent Foundations 2025 at CMU
Alexander Gietelink Oldenziel (alexander-gietelink-oldenziel) · 2025-01-19T23:48:22.569Z · comments (10)

Matryoshka Sparse Autoencoders
Noa Nabeshima (noa-nabeshima) · 2024-12-14T02:52:32.017Z · comments (15)

(Salt) Water Gargling as an Antiviral
Elizabeth (pktechgirl) · 2024-11-22T18:00:02.765Z · comments (6)

[link] Five Recent AI Tutoring Studies
Arjun Panickssery (arjun-panickssery) · 2025-01-19T03:53:47.714Z · comments (0)

Parable of the vanilla ice cream curse (and how it would prevent a car from starting!)
Mati_Roy (MathieuRoy) · 2024-12-08T06:57:45.783Z · comments (21)

Circling as practice for “just be yourself”
Kaj_Sotala · 2024-12-16T07:40:04.482Z · comments (5)

[link] The Manhattan Trap: Why a Race to Artificial Superintelligence is Self-Defeating
Corin Katzke (corin-katzke) · 2025-01-21T16:57:00.998Z · comments (11)

Scaling Sparse Feature Circuit Finding to Gemma 9B
Diego Caples (diego-caples) · 2025-01-10T11:08:11.999Z · comments (10)

Remap your caps lock key
bilalchughtai (beelal) · 2024-12-15T14:03:33.623Z · comments (18)

Stargate AI-1
Zvi · 2025-01-24T15:20:18.752Z · comments (1)

🇫🇷 Announcing CeSIA: The French Center for AI Safety
Charbel-Raphaël (charbel-raphael-segerie) · 2024-12-20T14:17:13.104Z · comments (2)

[link] Is Deep Learning Actually Hitting a Wall? Evaluating Ilya Sutskever's Recent Claims
garrison · 2024-11-13T17:00:01.005Z · comments (14)

[link] On Eating the Sun
jessicata (jessica.liu.taylor) · 2025-01-08T04:57:20.457Z · comments (92)

Some arguments against a land value tax
Matthew Barnett (matthew-barnett) · 2024-12-29T15:17:00.740Z · comments (39)

[link] Gwern Branwen interview on Dwarkesh Patel’s podcast: “How an Anonymous Researcher Predicted AI's Trajectory”
Said Achmiz (SaidAchmiz) · 2024-11-14T23:53:34.922Z · comments (0)

[question] What are the good rationality films?
Ben Pace (Benito) · 2024-11-20T06:04:56.757Z · answers+comments (54)

AI #92: Behind the Curve
Zvi · 2024-11-28T14:40:05.448Z · comments (7)

Implications of the inference scaling paradigm for AI safety
Ryan Kidd (ryankidd44) · 2025-01-14T02:14:53.562Z · comments (63)

Effective Evil's AI Misalignment Plan
lsusr · 2024-12-15T07:39:34.046Z · comments (9)

[link] SAEBench: A Comprehensive Benchmark for Sparse Autoencoders
Can (Can Rager) · 2024-12-11T06:30:37.076Z · comments (6)

On the OpenAI Economic Blueprint
Zvi · 2025-01-15T14:30:06.773Z · comments (1)

Testing which LLM architectures can do hidden serial reasoning
Filip Sondej · 2024-12-16T13:48:34.204Z · comments (9)

[link] Best-of-N Jailbreaking
John Hughes (john-hughes) · 2024-12-14T04:58:48.974Z · comments (5)

Should there be just one western AGI project?
rosehadshar · 2024-12-03T10:11:17.914Z · comments (72)

I'm offering free math consultations!
Gurkenglas · 2025-01-14T16:30:40.115Z · comments (6)

Human study on AI spear phishing campaigns
Simon Lermen (dalasnoin) · 2025-01-03T15:11:14.765Z · comments (8)

The 2023 LessWrong Review: The Basic Ask
Raemon · 2024-12-04T19:52:40.435Z · comments (25)

2025 Prediction Thread
habryka (habryka4) · 2024-12-30T01:50:14.216Z · comments (19)

LLM chatbots have ~half of the kinds of "consciousness" that humans believe in. Humans should avoid going crazy about that.
Andrew_Critch · 2024-11-22T03:26:11.681Z · comments (53)


Recent comments

mateusz-baginski on Thread for Sense-Making on Recent Murders and How to Sanely Respond

Eli wrote:

There are at least two types of people that the term "Zizian" might refer to:
Someone who has read Sinceriously.fyi and is generally sympathetic to Ziz's philosophy.
A member of a relatively tightly-coordinated anarchist conspiracy, that has (allegedly) planned and carried out a series of violent crimes.
Octavia is a Zizian in the first sense, but is not (to my knowledge) a Zizian in the second sense. In fact, she seems unaware or disbelieving that a network of Zizians of the second sense exists. She appears to think that there are only 'people who have benefited from reading Ziz's blog', and no coordinated criminal network to speak of.

I would be very surprised if there was no "inner Ziz crew", as inner circles around leaders / prominent figures in a community seem like a default thing that forms in movements/cultural groups.

But is it true that you don't think this inner circle is a coordinated group responsible for the murders?

roman-leventov on Gradual Disempowerment, Shell Games and Flinches

My quick impression is that this is a brutal and highly significant limitation of this kind of research. It's just incredibly expensive for others to read and evaluate, so it's very common for it to get ignored.

I'd predict that if you improved the arguments by 50%, it would lead to little extra uptake.

I think this is wrong. The introduction of the GD paper takes no more than 10 minutes to read and no significant cognitive effort to grasp, really. I don't think there is more than 10% room to make it any clearer or more approachable.

mateusz-baginski on Thread for Sense-Making on Recent Murders and How to Sanely Respond

She's not, but to the extent that people put the AI labs in one bucket with LW/EA (TESCREAL or something), the Annie Altman incident may cause us additional reputational damage.

cstinesublime on Siebe's Shortform

If you want, it would help me learn to write better, for you to list off all the words (or sentences) that confused you.

I would love to render any assistance I can in that regard, but my fear is this is probably more of a me-problem than a general problem with your writing.

What I really need, though, is an all-encompassing, rigid definition of a 'terminal goal': what is and isn't a terminal goal. Because "it's a goal which is instrumental to no other goal" just makes it feel like the definition ends wherever you want it to. Consider a system which is capable of self-modification and of changing its own goals: now the difference between an instrumental goal and a terminal goal erodes.

Nevertheless, some of your formatting was confusing to me. For example, a few replies back you wrote:

As for the case of idealized terminal-goal-pursuers, any two terminal goals can be combined into one, e.g. {paperclip-amount×2 + stamp-amount} or {if can create a black hole with p>20%, do so, else maximize stamps}, etc.

The bits "{paperclip-amount×2 + stamp-amount}" and "{if can create a black hole with p>20%, do so, else maximize stamps}" were and are very hard for me to understand. If they were presented in plain English, I'm confident I'd understand them. But using computer-code-esque variables, especially when they are not assigned values, introduces a point of failure for my understanding. Because now I need to understand your formatting and the pseudo-code correctly (and as a non-coder, I struggle to read pseudo-code at the best of times) just to understand the allusion you're making.

Also, the phrase "idealized terminal-goal-pursuers" underspecifies what you mean by 'idealized'. I can think of at least five possible senses you might be gesturing to:

A. a terminal-goal-pursuer whose terminal goals are "simple" enough to lend themselves as good candidates for a thought experiment - therefore ideal from the point of view of a teacher and a student.

B. ideal as in extremely instrumentally effective in accomplishing their goals,

C. ideal as in they encapsulate the perfect undiluted 'ideal' of a terminal goal (and therefore it is possible to have pseudo-terminal goals) - i.e. a 'platonic ideal/essence' as opposed to a platonic appearance,

D. "idealized" as in that these are purely theoretical beings (at this point in time) - because while humans may have terminal goals, they are not particularly good or pure examples of terminal-goal-havers? The same for any extant system we may ascribe goals to?

E. "idealized" in a combination of A and B which is very specific to entities that have multiple terminal goals, which is unlikely, but for the sake of argument if they did have two or more terminal goals would display certain behaviors.

I'm not sure which you mean, but I suspect it's none of the above.

For the record, I know you absolutely don't mean "ideal" as in "moral ideal". Nor in an aesthetic or Freudian sense, like when a teenager "idealizes" their favourite pop star and raves on about how perfect they are in every way.

But going back to my confusion over terminal goals, and what is or isn't:

For example: "I value paperclips. I also value stamps, but one stamp is only half as valuable as a paperclip to me" → "I have the single value of maximizing this function over the world: {paperclip-amount×2 + stamp-amount}". (It's fine to think of it in either way)
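The quoted expression can also be read as an ordinary scoring function. Below is a minimal sketch of that reading, assuming a "world" is just a pair of counts; the function name `combined_utility` and the three example worlds are hypothetical and chosen only for illustration.

```python
def combined_utility(paperclips: int, stamps: int) -> int:
    """One objective built from two values: a paperclip counts twice as much as a stamp."""
    return 2 * paperclips + stamps


# Hypothetical candidate worlds, given as (paperclips, stamps).
worlds = {"A": (3, 0), "B": (0, 5), "C": (2, 3)}

# An agent with this single objective simply prefers the highest-scoring world.
best = max(worlds, key=lambda name: combined_utility(*worlds[name]))
```

Whether one describes this as "two goals with a 2:1 exchange rate" or "one goal with a weighted sum" makes no difference to which world is chosen, which seems to be the point of the quoted remark.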

I'm not sure what this statement is saying, because that describes a possibly very human attribute - that we may have two terminal goals in that they are not subservient or means of pursuing anything else. Which is what I understand a 'terminal' goal to mean. The examples in the video describe very "single-minded" entities that have a single terminal goal that they seek to optimize, like a stamp collecting machine.

There are a few assumptions I'm making here, such as that a terminal goal is "fixed" or permanent. You see, when I said sufficiently superintelligent entities would converge on certain values, I was assuming that they would have some kind of self-modification abilities. And therefore their terminal values would look a lot like the common convergent instrumental values of other, similarly self-adapting/improving/modifying entities.

However, if this is not a terminal goal, then what is a terminal goal? And for a system that is capable of adapting and improving itself, what would its terminal goals be?

Is terminal goal simply a term of convenience?

zy on Mikhail Samin's Shortform

I wonder if this is about:

funding - the company needs money to perform research on safety alignment (X-risks, assuming they do want to do this), and to get there they need to publish models so that they can 1) make profits from them and 2) attract more funding. A quick look at the funding sources shows Amazon, Google, some other ventures, and some other tech companies
empirical approach - they want to take an empirical approach to AI safety and would need some models of limited capability

But both of the points above are my own speculations

dr_s on The Clueless Sniper and the Principle of Indifference

I think for this specific example the superior is wrong, because realistically we can form an expectation of the distribution of those factors. Just because we don't know doesn't mean it's actually necessarily a Gaussian; some factors, like the Coriolis force, are systematic. If the distribution were "a ring of 1 m around the aimed point", then you would know for sure you won't hit the terrorist that way, but would have no clue whether you'll hit the kid.

Also, even if the distribution were Gaussian, if it's broad enough the difference in probability between hitting the terrorist and hitting the kid may simply be too small to matter.
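The "too broad to matter" point can be made concrete with a rough sketch. It assumes a one-dimensional Gaussian centred on the aim point and a second person standing 1 m away; both the geometry and the numbers are hypothetical, not taken from the original scenario. The ratio of the density at the offset to the density at the aim point is exp(-d² / (2σ²)), which approaches 1 as the spread σ grows.

```python
import math


def density_ratio(offset_m: float, sigma_m: float) -> float:
    """Gaussian density at `offset_m` from the aim point, relative to the density at the aim point."""
    return math.exp(-(offset_m ** 2) / (2 * sigma_m ** 2))


# Hypothetical spreads: a tight shot versus a very uncertain one.
tight = density_ratio(offset_m=1.0, sigma_m=0.5)   # well below 1: aiming matters a lot
broad = density_ratio(offset_m=1.0, sigma_m=10.0)  # close to 1: aiming barely matters
```

With a tight spread the aim point is strongly favoured; with a broad one the two locations are nearly equally likely, so where you aim changes little.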

knight-lee on Gradual Disempowerment, Shell Games and Flinches

Exactly! I've also noticed there are so many ideas and theories out there, relative to the available resources to evaluate them and find the best to work on.

A lot of good ideas which I feel deserve a ton of further investigation, seem to be barely talked about after they're introduced. E.g. off the top of my head,

Why Don't We Just... Shoggoth+Face+Paraphraser?
Applying superintelligence without collusion
Self-Other Overlap
Multi-objective homeostasis
MONA: Managed Myopia with Approval Feedback

My opinion is that there aren't enough funds and manpower. I have an idea for increasing that, which ironically also got ignored, yay!

jim-buhler on The Clueless Sniper and the Principle of Indifference

Interesting, thanks!

I guess one could object that in your even-more-clueless-sniper example, applying the POI between Hit and Not Hit is just as arbitrary as applying it between, e.g., Hit, Hit on his right, and Hit on his left. This is what Greaves (2016) -- and maybe others? -- called the "problem of multiple partitions". In my original scenario, people might argue that there isn't such a problem and that there is only one sensible way to apply POI. So it'd be ok to apply it in my case and not in yours.

I don't know what to make of this objection, though. I'm not sure it makes sense. It feels a bit arbitrary to say "we can apply POI but only when there is one way of applying it that clearly seems more sensible". Maybe this problem of multiple partitions is a reason to reject POI altogether (in situations of what Greaves calls "complex cluelessness" at least, like in my sniper example).
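The partition problem is easy to see in numbers. The sketch below assumes the POI means "assign equal probability to every cell of whichever partition you picked"; the helper name `indifference` and the outcome labels are illustrative only.

```python
from fractions import Fraction


def indifference(partition: list[str]) -> dict[str, Fraction]:
    """Apply the principle of indifference: equal probability for each cell of the partition."""
    p = Fraction(1, len(partition))
    return {outcome: p for outcome in partition}


# Two ways of carving up the same situation.
coarse = indifference(["hit", "not hit"])
fine = indifference(["hit", "hit on his right", "hit on his left"])

# The same event, "hit", gets probability 1/2 under one partition and 1/3 under the other.
```

Nothing about the world changed between the two calls; only the description did, which is why the choice of partition needs its own justification.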

zack_m_davis on Yudkowsky on The Trajectory podcast

No one with the money has offered to fund it yet. I'm not even sure they're aware this is happening.

Um, this seems bad. I feel like I should do something, but I don't personally have that kind of money to throw around. @habryka, is this the LTFF's job??

cstinesublime on CstineSublime's Shortform

Can you elaborate further on how Gato is proof that just supplementing the training data is sufficient? I looked on YouTube and can't find any videos of task switching.