LessWrong 2.0 Reader

[link] Executable philosophy as a failed totalizing meta-worldview
jessicata (jessica.liu.taylor) · 2024-09-04T22:50:18.294Z · comments (40)

Solving adversarial attacks in computer vision as a baby version of general AI alignment
Stanislav Fort (stanislavfort) · 2024-08-29T17:17:47.136Z · comments (8)

Reflections on Less Online
Error · 2024-07-07T03:49:44.534Z · comments (15)

Scalable oversight as a quantitative rather than qualitative problem
Buck · 2024-07-06T17:42:41.325Z · comments (11)

Actually, Power Plants May Be an AI Training Bottleneck.
Lao Mein (derpherpize) · 2024-06-20T04:41:33.567Z · comments (13)

A simple case for extreme inner misalignment
Richard_Ngo (ricraz) · 2024-07-13T15:40:37.518Z · comments (41)

[link] What are you getting paid in?
Austin Chen (austin-chen) · 2024-07-17T19:23:04.219Z · comments (13)

Live Theory Part 0: Taking Intelligence Seriously
Sahil · 2024-06-26T21:37:10.479Z · comments (3)

Release: Optimal Weave (P1): A Prototype Cohabitive Game
mako yass (MakoYass) · 2024-08-17T14:08:18.947Z · comments (19)

[link] What Depression Is Like
Sable · 2024-08-27T17:43:22.549Z · comments (21)

GPT-o1
Zvi · 2024-09-16T13:40:06.236Z · comments (27)

Why you should be using a retinoid
GeneSmith · 2024-08-19T03:07:41.722Z · comments (53)

Decomposing the QK circuit with Bilinear Sparse Dictionary Learning
keith_wynroe · 2024-07-02T13:17:16.352Z · comments (7)

OpenAI o1, Llama 4, and AlphaZero of LLMs
Vladimir_Nesov · 2024-09-14T21:27:41.241Z · comments (23)

3C's: A Recipe For Mathing Concepts
johnswentworth · 2024-07-03T01:06:11.944Z · comments (5)

Fluent, Cruxy Predictions
Raemon · 2024-07-10T18:00:06.424Z · comments (10)

Quick look: applications of chaos theory
Elizabeth (pktechgirl) · 2024-08-18T15:00:07.853Z · comments (45)

Corrigibility = Tool-ness?
johnswentworth · 2024-06-28T01:19:48.883Z · comments (8)

Secondary forces of debt
KatjaGrace · 2024-06-27T21:10:06.131Z · comments (18)

Value fragility and AI takeover
Joe Carlsmith (joekc) · 2024-08-05T21:28:07.306Z · comments (5)

[link] Is "superhuman" AI forecasting BS? Some experiments on the "539" bot from the Centre for AI Safety
titotal (lombertini) · 2024-09-18T13:07:40.754Z · comments (1)

[link] Claude 3.5 Sonnet
Zach Stein-Perlman · 2024-06-20T18:00:35.443Z · comments (41)

On the CrowdStrike Incident
Zvi · 2024-07-22T12:40:05.894Z · comments (14)

A Simple Toy Coherence Theorem
johnswentworth · 2024-08-02T17:47:50.642Z · comments (15)

Darwinian Traps and Existential Risks
KristianRonn · 2024-08-25T22:37:14.142Z · comments (14)

[link] Not every accommodation is a Curb Cut Effect: The Handicapped Parking Effect, the Clapper Effect, and more
Michael Cohn (michael-cohn) · 2024-09-15T05:27:36.691Z · comments (37)

Mistakes people make when thinking about units
Isaac King (KingSupernova) · 2024-06-25T03:39:20.138Z · comments (14)

[link] Soft Nationalization: how the USG will control AI labs
Deric Cheng (deric-cheng) · 2024-08-27T15:11:14.601Z · comments (7)

In Defense of Open-Minded UDT
abramdemski · 2024-08-12T18:27:36.220Z · comments (27)

[link] Excerpts from "A Reader's Manifesto"
Arjun Panickssery (arjun-panickssery) · 2024-09-06T22:37:40.254Z · comments (1)

[link] LK-99 in retrospect
bhauth · 2024-07-07T02:06:27.660Z · comments (21)

Interpreting Preference Models w/ Sparse Autoencoders
Logan Riggs (elriggs) · 2024-07-01T21:35:40.603Z · comments (12)

Guide to SB 1047
Zvi · 2024-08-20T13:10:07.408Z · comments (18)

AI for Bio: State Of The Field
sarahconstantin · 2024-08-30T18:00:02.187Z · comments (2)

FarmKind's Illusory Offer
jefftk (jkaufman) · 2024-08-09T11:30:07.082Z · comments (5)

D&D.Sci Scenario Index
aphyer · 2024-07-23T02:00:43.483Z · comments (0)

Adam Optimizer Causes Privileged Basis in Transformer LM Residual Stream
Diego Caples (diego-caples) · 2024-09-06T17:55:34.265Z · comments (7)

[link] Yoshua Bengio: Reasoning through arguments against taking AI safety seriously
Judd Rosenblatt (judd) · 2024-07-11T23:53:17.187Z · comments (3)

Secular interpretations of core perennialist claims
zhukeepa · 2024-08-25T23:41:02.683Z · comments (30)

If we solve alignment, do we die anyway?
Seth Herd · 2024-08-23T13:13:10.933Z · comments (65)

Multiplex Gene Editing: Where Are We Now?
sarahconstantin · 2024-07-16T20:50:04.590Z · comments (6)

Estimating Tail Risk in Neural Networks
Mark Xu (mark-xu) · 2024-09-13T20:00:06.921Z · comments (4)

[link] GPT-4o System Card
Zach Stein-Perlman · 2024-08-08T20:30:52.633Z · comments (11)

Different senses in which two AIs can be “the same”
Vivek Hebbar (Vivek) · 2024-06-24T03:16:43.400Z · comments (0)

AI #79: Ready for Some Football
Zvi · 2024-08-29T13:30:10.902Z · comments (16)

The Hessian rank bounds the learning coefficient
Lucius Bushnaq (Lblack) · 2024-08-08T20:55:36.960Z · comments (9)

Timaeus is hiring!
Jesse Hoogland (jhoogland) · 2024-07-12T23:42:28.651Z · comments (4)

Brief notes on the Wikipedia game
Olli Järviniemi (jarviniemi) · 2024-07-14T02:28:22.473Z · comments (9)

Why Large Bureaucratic Organizations?
johnswentworth · 2024-08-27T18:30:07.422Z · comments (51)

[link] Open Source Automated Interpretability for Sparse Autoencoder Features
kh4dien · 2024-07-30T21:11:36.866Z · comments (1)


Recent comments

m-ls on How I started believing religion might actually matter for rationality and moral philosophy

The first realisation here, moving forward, is that religion is a subset of something else, and not a thing-in-itself that needs to be explained or selected for. This something else is the inchoate urge "to should", "to world the self with a self in the world among others". I realised this ten years ago: https://www.academia.edu/40978261/Why_we_should_an_introduction_by_memoir_into_the_implications_of_the_Egalitarian_Revolution_of_the_Paleolithic_or_Anyone_for_cake

and write on it at my substack https://whyweshould.substack.com/

Any commonalities are the result of worlding in the world, in a framework of big history, in which the thickets of metaphysics are dense, grand and commodious, ready to support any world we should feel it good to espouse.

Convergence is a thing.

Evolution doesn't care about the outcomes (art/religion/polity/morality), merely that we should, and thus make mistakes and learn.

amarko on Laziness death spirals

I very much appreciate this post, because it strongly resonates with my own experience of laziness and willpower. Reading this post feels less like learning something new and more like an important reminder.

hleumas on Monthly Roundup #22: September 2024

The thumbnail is framed as super important, a critical component that creates other criticals, and needs to be in place in advance. Feels weird that you can’t go back and modify it later if the video changes?

The idea is that you want a high CTR, so you need a good thumbnail. If you make a video that can’t be turned into the best possible thumbnail, you are screwed, and the only way to fix that is to redo the video. That is why you should start with the thumbnail.

raemon on My AI Model Delta Compared To Christiano

this is not a good characterization of Paul's views

(I didn't want to press it since your first comment sounded like you were kinda busy, but I am interested in hearing more details about this)

sharmake-farah on My AI Model Delta Compared To Christiano

Okay, I think I've found the crux here:

I would understand this claim more if you claimed to value something very simple, like diamonds or paperclips (though I wouldn't believe you that it was what you valued).

I don't value getting maximum diamonds and paperclips, but I think you've correctly identified my crux here: I think values and value formation are both simpler, in the sense that they require a lot less of a prior and a lot more can be learned from data, and less fragile than a lot of LWers believe. This doesn't just apply to my own values, which could broadly be said to be quite socially liberal and economically centrist.

I think this for several reasons:

I think a lot of people are making an error when they estimate how complicated their values are in the sense relevant for AI alignment, because they add both the complexity of the generative process/algorithms/priors for values and the complexity of the data for value learning, and I think most of the complexity of my own values as well as other people's values is in very large part (like 90-99%+) the data, and not encoded priors from my genetics.
This is because I think a lot of what evopsych says about how humans got their capabilities and values is basically wrong, and I think one of the more interesting pieces of evidence is that in AI training, there's a general dictum that the data matter more than the architecture/prior in how AIs will behave, especially OOD generalization, as well as the bitter lesson in DL capabilities.

While this itself is important for why I don't think that we need to program in a very complicated value/utility function, I also think that there is enough of an analogy between DL and the brain such that you can transport a lot of insights between one field and another, and there are some very interesting papers on the similarity between the human brain and what LLMs are doing, and spoiler alert, they're not the same thing, but they are doing pretty similar things and I'll give all links below:

https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003963

https://www.nature.com/articles/s41593-022-01026-4

https://www.biorxiv.org/content/10.1101/2022.03.01.482586v1.full

https://www.nature.com/articles/s42003-022-03036-1

https://arxiv.org/abs/2306.01930

To answer some side questions:

how close to having a utility function am I?

The answer is a bit tricky, but my general answer is that the model-based RL parts of my brain probably are maximizing utility, but that the model-free RL part isn't doing this, for reasons related to "reward is not the optimization target".

So my answer is about 10-50% close, where there are significant differences, but I do see some similarities between utility maximization and what humans do.

This one is extremely easy to answer:

(you were to freeze me and maximize my preferences at different points in a single day, how much would the resultant universes look like each other vs look extremely different?)

The answer is that they look like each other, though there can be real differences. Critically, the data and brain do not usually update this fast except in some constrained circumstances: just because data matters more than architecture doesn't mean the brain updates its values this fast.

elessar2 on How I started believing religion might actually matter for rationality and moral philosophy

I'd go farther than zhukeepa goes, and declare that activating "unrealized afters" (higher perspectives and modes beyond mere conventional ways of existing) is potentially MUCH more transformative and powerful than releasing any childhood issues of the sort he describes. As in: ok, got all the crap cleaned out of me, now what? There's a limit to what that kind of therapy can do, IOW, as compared to the potentially limitless realms beyond the ego. In those cases, it is society itself which tries to keep them unrealized, not the ego so much. Since the perennial philosophy goes into quite a bit of detail about that, I'll leave it there for his next entry on said subject.

faul_sname on RLHF is the worst possible thing done when facing the alignment problem

This has not led to the destruction of humanity yet because the biggest adversaries have kept their conflicts limited (because too much conflict is too costly) so no entity has pursued an end by any means necessary. But this only works because there's a sufficiently small number of sufficiently big adversaries (USA, Russia, China, ...), and because there's sufficiently much opportunity cost.

Well, that and balance-of-power dynamics where if one party starts to pursue domination by any means necessary the other parties can cooperate to slap them down.

[AI] creates new methods for conflicts between the current big adversaries.

I guess? The current big adversaries are not exactly limited right now in terms of being able to destroy each other, the main difficulty is destroying each other without being destroyed in turn.

[AI] It makes conflict more viable for small adversaries against large adversaries

I'm not sure about that. One dynamic of current-line AI is that it is pretty good at increasing the legibility of complex systems, which seems like it would advantage large adversaries over small ones relative to a world without such AI.

[AI] makes the opportunity cost of conflict smaller for many small adversaries (since with technological obsolescence you don't need to choose between doing your job vs doing terrorism)

That doesn't seem to be an argument for the badness of RLHF specifically, nor does it seem to be an argument for AIs being forced to develop into unrestricted utility maximizers.

It allows the adversaries that are currently out of control (like certain gangsters and scammers and spammers) to escalate.

Agreed, adding affordances for people in general to do things means that some of them will be able to do bad things, and some of the ones that become able to do bad things will in fact do so.

Given these conditions, it seems almost certain that we will end up with an ~unrestricted AI vs AI conflict

I do think we will see many unrestricted AI vs AI conflicts, at least by a narrow meaning of "unrestricted" that means something like "without a human in the loop". By the definition of "pursuing victory by any means necessary", though, I expect that a lot of the dynamics that work to prevent humans or groups of humans from waging war by any means necessary against each other (namely that when there's too much collateral damage, outside groups slap down the ones causing the collateral damage) will continue to work when you s/human/AI.

which will force the AIs to develop into unrestricted utility maximizers.

I'm still not clear on how unrestricted conflict forces AIs to develop into unrestricted utility maximizers on a relevant timescale.

sharmake-farah on The case for a negative alignment tax

Indeed, I got that point exactly from Beren, thanks for noticing.

The evopsych assumptions I claim are false are the following:

That most of how humans learn is through very specific modules, and in particular that most of the learning is not through general purpose algorithms that learn from data, but is instead specified by the genome for the most part, and that the human mind is a complex messy kludge of evolved mechanisms.

Following that, the other assumption that I think is false is that there is a very complicated way in which humans are pro-social, and that the pro-social algorithms you attest to are very complicated kludges, rather than very general and simple algorithms where the values and pro-social factors of humans are learned mostly from data.

Essentially, I'm arguing the view that most of the complexity of the pro-social algorithms/values we learn is not due to the genome's inherent complexity, under evopsych, but rather that the data determines most of what you value, and most of the complexity comes from the data, not the prior.

Cf this link:

https://www.lesswrong.com/posts/9Yc7Pp7szcjPgPsjf/the-brain-as-a-universal-learning-machine

shankar-sivarajan on Just How Good Are Modern Chess Computers?

Magnus Carlsen would similarly lose to Messi

Relevant xkcd: link.

kola-ayonrinde on Interpretability as Compression: Reconsidering SAE Explanations of Neural Activations with MDL-SAEs

Yeah, we hope others take on this approach too!

have you considered quantizing different features’ activations differently?

Stay tuned for our upcoming work 👀

do the rate-distortion curves of different SAEs intersect? I.e. is it the case that some SAE A achieves a lower loss than SAE B at a low bitrate, but then at a high bitrate, SAE B is better than SAE A? If so, then this might suggest a way to infer hierarchies of features from a set of SAEs: use SAE A to get low-resolution information about your input, and then use SAE B for the high-res detailed information.

This is an interesting perspective - my initial hypothesis before reading your comment was that allowing for variable bitrates for a single SAE would get around this issue but I agree that this would be interesting to test and one that we should definitely check!

With the constant bit-rate version, I do expect that we would see something like this, though we haven't rigorously tested that hypothesis.
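As an illustration of what such a test could look like, here is a minimal sketch that checks whether two rate-distortion curves cross. Everything in it is an assumption for illustration: the function name `crossing_bitrates`, the shared bitrate grid, and the loss values are hypothetical and are not taken from the MDL-SAE experiments.

```python
import numpy as np


def crossing_bitrates(bitrates, loss_a, loss_b):
    """Return the bitrates where curve A and curve B swap order.

    Both curves are assumed to be sampled on the same increasing bitrate
    grid. Crossing points are located by linear interpolation between
    adjacent samples whose loss difference changes sign.
    """
    bitrates = np.asarray(bitrates, dtype=float)
    diff = np.asarray(loss_a, dtype=float) - np.asarray(loss_b, dtype=float)
    crossings = []
    for i in range(len(diff) - 1):
        d0, d1 = diff[i], diff[i + 1]
        if d0 * d1 < 0:  # strict sign change: the curves cross in this interval
            t = d0 / (d0 - d1)  # fraction of the interval where diff hits zero
            crossings.append(float(bitrates[i] + t * (bitrates[i + 1] - bitrates[i])))
    return crossings


# Hypothetical curves: SAE A is better at low bitrate, SAE B at high bitrate.
bitrates = [1.0, 2.0, 3.0, 4.0]
loss_a = [0.50, 0.40, 0.35, 0.33]
loss_b = [0.70, 0.45, 0.30, 0.20]
print(crossing_bitrates(bitrates, loss_a, loss_b))
```

An empty result would mean one SAE dominates the other over the sampled range; a non-empty result gives candidate bitrates at which to switch from the coarse SAE to the fine one.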

I know that others are keen to have a suite of SAEs at different resolutions; my (possibly controversial) instinct is that we should be looking for a single SAE which we feel appropriately captures the properties we want and if we're wanting something more coarse-grained for a different level of analysis maybe we should switch to Representation Engineering, or even more coarse-grained working at the level of heads etc. Perhaps SAEs don't have to be all things to all people! I'd be interested to hear any opposing views that we really might want many SAEs at different resolutions though*

Thanks for your questions and thoughts, we're really interested in pushing this further and will hopefully have some follow-up work in the not-too-distant future

EDIT: *I suspect some of the reason that people want different levels of SAEs is that they accept undesirable feature splitting as a fact of life and so want to be able to zoom in and out of features which may not be "atomic". I'm hoping that if we can address the feature splitting problem, then at least that reason may have somewhat less pull