LessWrong 2.0 Reader

Non-myopia stories
lberglund (brglnd) · 2023-11-13T17:52:31.933Z · comments (10)
Reviewing the Structure of Current AI Regulations
Deric Cheng (deric-cheng) · 2024-05-07T12:34:17.820Z · comments (0)
Impact stories for model internals: an exercise for interpretability researchers
jenny · 2023-09-25T23:15:29.189Z · comments (3)
[link] GDP per capita in 2050
Hauke Hillebrandt (hauke-hillebrandt) · 2024-05-06T15:14:30.934Z · comments (8)
Big-endian is better than little-endian
Menotim · 2024-04-29T02:30:48.053Z · comments (17)
[link] My MATS Summer 2023 experience
James Chua (james-chua) · 2024-03-20T11:26:14.944Z · comments (0)
[question] [link] Is Bjorn Lomborg roughly right about climate change policy?
yhoiseth · 2023-09-27T20:06:30.722Z · answers+comments (14)
AI #61: Meta Trouble
Zvi · 2024-05-02T18:40:03.242Z · comments (0)
[link] Debate helps supervise human experts [Paper]
habryka (habryka4) · 2023-11-17T05:25:17.030Z · comments (6)
[question] Weighing reputational and moral consequences of leaving Russia or staying
spza · 2024-02-18T19:36:40.676Z · answers+comments (24)
Weekly newsletter for AI safety events and training programs
Bryce Robertson (bryceerobertson) · 2024-05-03T00:33:29.418Z · comments (0)
On the 2nd CWT with Jonathan Haidt
Zvi · 2024-04-05T17:30:05.223Z · comments (3)
A Common-Sense Case For Mutually-Misaligned AGIs Allying Against Humans
Thane Ruthenis · 2023-12-17T20:28:57.854Z · comments (7)
The (partial) fallacy of dumb superintelligence
Seth Herd · 2023-10-18T21:25:16.893Z · comments (5)
[link] AI forecasting bots incoming
Dan H (dan-hendrycks) · 2024-09-09T19:14:31.050Z · comments (44)
[question] Where to find reliable reviews of AI products?
Elizabeth (pktechgirl) · 2024-09-17T23:48:25.899Z · answers+comments (4)
Paper Summary: Princes and Merchants: European City Growth Before the Industrial Revolution
Jeffrey Heninger (jeffrey-heninger) · 2024-07-15T21:30:04.043Z · comments (1)
Distinguish worst-case analysis from instrumental training-gaming
Olli Järviniemi (jarviniemi) · 2024-09-05T19:13:34.443Z · comments (0)
Scorable Functions: A Format for Algorithmic Forecasting
ozziegooen · 2024-05-21T04:14:11.749Z · comments (0)
Investigating Bias Representations in LLMs via Activation Steering
DawnLu · 2024-01-15T19:39:14.077Z · comments (4)
Representation Tuning
Christopher Ackerman (christopher-ackerman) · 2024-06-27T17:44:33.338Z · comments (4)
Throughput vs. Latency
alkjash · 2024-01-12T21:37:07.632Z · comments (2)
DPO/PPO-RLHF on LLMs incentivizes sycophancy, exaggeration and deceptive hallucination, but not misaligned powerseeking
tailcalled · 2024-06-10T21:20:11.938Z · comments (13)
Offering Completion
jefftk (jkaufman) · 2024-06-07T01:40:02.137Z · comments (6)
Wholesome Culture
owencb · 2024-03-01T12:08:17.877Z · comments (3)
Aggregative Principles of Social Justice
Cleo Nardo (strawberry calm) · 2024-06-05T13:44:47.499Z · comments (10)
Results from the Turing Seminar hackathon
Charbel-Raphaël (charbel-raphael-segerie) · 2023-12-07T14:50:38.377Z · comments (1)
[link] AI Safety Memes Wiki
plex (ete) · 2024-07-24T18:53:04.977Z · comments (1)
Quick Thoughts on Our First Sampling Run
jefftk (jkaufman) · 2024-05-23T00:20:02.050Z · comments (3)
D&D.Sci (Easy Mode): On The Construction Of Impossible Structures [Evaluation and Ruleset]
abstractapplic · 2024-05-20T09:38:55.228Z · comments (2)
[link] Anthropic: Reflections on our Responsible Scaling Policy
Zac Hatfield-Dodds (zac-hatfield-dodds) · 2024-05-20T04:14:44.435Z · comments (21)
“Clean” vs. “messy” goal-directedness (Section 2.2.3 of “Scheming AIs”)
Joe Carlsmith (joekc) · 2023-11-29T16:32:30.068Z · comments (1)
Deception Chess: Game #2
Zane · 2023-11-29T02:43:22.375Z · comments (17)
But Where do the Variables of my Causal Model come from?
Dalcy (Darcy) · 2024-08-09T22:07:57.395Z · comments (1)
[question] How does it feel to switch from earn-to-give?
Neil (neil-warren) · 2024-03-31T16:27:22.860Z · answers+comments (4)
[LDSL#4] Root cause analysis versus effect size estimation
tailcalled · 2024-08-11T16:12:14.604Z · comments (0)
End-to-end hacking with language models
tchauvin (timot.cool) · 2024-04-05T15:06:53.689Z · comments (0)
[link] The Poker Theory of Poker Night
omark · 2024-04-07T09:47:01.658Z · comments (13)
Glomarization FAQ
Zane · 2023-11-15T20:20:49.488Z · comments (5)
Monthly Roundup #19: June 2024
Zvi · 2024-06-25T12:00:03.333Z · comments (9)
AI #64: Feel the Mundane Utility
Zvi · 2024-05-16T15:20:02.956Z · comments (11)
An explanation of evil in an organized world
KatjaGrace · 2024-05-02T05:20:06.240Z · comments (9)
Aggregative principles approximate utilitarian principles
Cleo Nardo (strawberry calm) · 2024-06-12T16:27:22.179Z · comments (3)
Evaporation of improvements
Viliam · 2024-06-20T18:34:40.969Z · comments (27)
I played the AI box game as the Gatekeeper — and lost
datawitch · 2024-02-12T18:39:35.777Z · comments (52)
Childhood and Education Roundup #6: College Edition
Zvi · 2024-06-26T11:40:03.990Z · comments (8)
Reading More Each Day: A Simple $35 Tool
aysajan · 2024-07-24T13:54:04.290Z · comments (2)
Cicadas, Anthropic, and the bilateral alignment problem
kromem · 2024-05-22T11:09:56.469Z · comments (6)
AI #65: I Spy With My AI
Zvi · 2024-05-23T12:40:02.793Z · comments (7)
Employee Incentives Make AGI Lab Pauses More Costly
nikola (nikolaisalreadytaken) · 2023-12-22T05:04:15.598Z · comments (12)