LessWrong 2.0 Reader


[link] The Witness
Richard_Ngo (ricraz) · 2023-12-03T22:27:16.248Z · comments (5)
Short Remark on the (subjective) mathematical 'naturalness' of the Nanda--Lieberum addition modulo 113 algorithm
carboniferous_umbraculum (Spencer Becker-Kahn) · 2023-06-01T11:31:37.796Z · comments (12)
Slightly against aligning with neo-luddites
Matthew Barnett (matthew-barnett) · 2022-12-26T22:46:42.693Z · comments (31)
I Don’t Know How To Count That Low
Elizabeth (pktechgirl) · 2021-10-22T22:00:02.708Z · comments (10)
[link] Direct effects matter!
Aaron Bergman (aaronb50) · 2021-03-14T04:33:11.493Z · comments (28)
Takes on "Alignment Faking in Large Language Models"
Joe Carlsmith (joekc) · 2024-12-18T18:22:34.059Z · comments (7)
Toward A Bayesian Theory Of Willpower
Scott Alexander (Yvain) · 2021-03-26T02:33:55.056Z · comments (28)
Karate Kid and Realistic Expectations for Disagreement Resolution
Raemon · 2019-12-04T23:25:59.608Z · comments (23)
Rapid Increase of Highly Mutated B.1.1.529 Strain in South Africa
dawangy · 2021-11-26T01:05:49.516Z · comments (15)
A shortcoming of concrete demonstrations as AGI risk advocacy
Steven Byrnes (steve2152) · 2024-12-11T16:48:41.602Z · comments (27)
Yes, AI research will be substantially curtailed if a lab causes a major disaster
lc · 2022-06-14T22:17:01.273Z · comments (31)
Money Stuff
Jacob Falkovich (Jacobian) · 2021-11-01T16:08:02.700Z · comments (18)
I Really Don't Understand Eliezer Yudkowsky's Position on Consciousness
J Bostock (Jemist) · 2021-10-29T11:09:20.559Z · comments (120)
Announcing Encultured AI: Building a Video Game
Andrew_Critch · 2022-08-18T02:16:26.726Z · comments (26)
Human takeover might be worse than AI takeover
Tom Davidson (tom-davidson-1) · 2025-01-10T16:53:27.043Z · comments (51)
Frequent arguments about alignment
John Schulman (john-schulman) · 2021-06-23T00:46:38.568Z · comments (17)
[link] A review of Where Is My Flying Car? by J. Storrs Hall
jasoncrawford · 2020-11-06T20:01:55.074Z · comments (23)
The Long Long Covid Post
Zvi · 2022-02-10T13:10:01.452Z · comments (29)
Biosecurity Culture, Computer Security Culture
jefftk (jkaufman) · 2023-08-30T16:40:03.101Z · comments (11)
Testing PaLM prompts on GPT3
Yitz (yitz) · 2022-04-06T05:21:06.841Z · comments (14)
[link] Scaling Laws for Reward Model Overoptimization
leogao · 2022-10-20T00:20:06.920Z · comments (13)
[link] Reproducing ARC Evals' recent report on language model agents
Thomas Broadley (thomas-broadley) · 2023-09-01T16:52:17.147Z · comments (17)
[link] Carl Sagan, nuking the moon, and not nuking the moon
eukaryote · 2024-04-13T04:08:50.166Z · comments (8)
My take on Vanessa Kosoy's take on AGI safety
Steven Byrnes (steve2152) · 2021-09-30T12:23:58.329Z · comments (10)
The Credit Assignment Problem
abramdemski · 2019-11-08T02:50:30.412Z · comments (40)
Experimentally evaluating whether honesty generalizes
paulfchristiano · 2021-07-01T17:47:57.847Z · comments (24)
[link] Turning air into bread
jasoncrawford · 2019-10-21T17:50:00.117Z · comments (12)
[link] [Linkpost] The Story Of VaccinateCA
hath · 2022-12-09T23:54:48.703Z · comments (4)
Solving Math Problems by Relay
Ben Goldhaber (bgold) · 2020-07-17T15:32:00.985Z · comments (26)
Key takeaways from our EA and alignment research surveys
Cameron Berg (cameron-berg) · 2024-05-03T18:10:41.416Z · comments (10)
Applied Linear Algebra Lecture Series
johnswentworth · 2022-12-22T06:57:26.643Z · comments (8)
LW Beta Feature: Side-Comments
jimrandomh · 2022-11-24T01:55:31.578Z · comments (47)
Introducing Leap Labs, an AI interpretability startup
Jessica Rumbelow (jessica-cooper) · 2023-03-06T16:16:22.182Z · comments (12)
Response to nostalgebraist: proudly waving my moral-antirealist battle flag
Steven Byrnes (steve2152) · 2024-05-29T16:48:29.408Z · comments (29)
Dreams of AI alignment: The danger of suggestive names
TurnTrout · 2024-02-10T01:22:51.715Z · comments (59)
Final Version Perfected: An Underused Execution Algorithm
willbradshaw · 2020-11-27T10:43:02.796Z · comments (34)
Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small
Joseph Bloom (Jbloom) · 2024-02-02T06:54:53.392Z · comments (37)
2022 was the year AGI arrived (Just don't call it that)
Logan Zoellner (logan-zoellner) · 2023-01-04T15:19:55.009Z · comments (60)
LLM Applications I Want To See
sarahconstantin · 2024-08-19T21:10:03.101Z · comments (5)
Perishable Knowledge
lsusr · 2021-12-18T05:53:03.343Z · comments (6)
Value systematization: how values become coherent (and misaligned)
Richard_Ngo (ricraz) · 2023-10-27T19:06:26.928Z · comments (48)
Contra shard theory, in the context of the diamond maximizer problem
So8res · 2022-10-13T23:51:29.532Z · comments (19)
Analysis: US restricts GPU sales to China
aogara (Aidan O'Gara) · 2022-10-07T18:38:06.517Z · comments (58)
Omicron Post #5
Zvi · 2021-12-09T21:10:00.469Z · comments (18)
What happens if you present 500 people with an argument that AI is risky?
KatjaGrace · 2024-09-04T16:40:03.562Z · comments (7)
Vegan Nutrition Testing Project: Interim Report
Elizabeth (pktechgirl) · 2023-01-20T05:50:03.565Z · comments (37)
Against "blankfaces"
philh · 2021-08-08T23:00:04.126Z · comments (12)
Safety Implications of LeCun's path to machine intelligence
Ivan Vendrov (ivan-vendrov) · 2022-07-15T21:47:44.411Z · comments (18)
[link] Alignment 201 curriculum
Richard_Ngo (ricraz) · 2022-10-12T18:03:03.454Z · comments (3)
[question] Exercise: Solve "Thinking Physics"
Raemon · 2023-08-01T00:44:48.975Z · answers+comments (30)