LessWrong 2.0 Reader


The case for a negative alignment tax
Cameron Berg (cameron-berg) · 2024-09-18T18:33:18.491Z · comments (17)
[link] The economics of space tethers
harsimony · 2024-08-22T16:15:22.699Z · comments (22)
AI #78: Some Welcome Calm
Zvi · 2024-08-22T14:20:10.812Z · comments (15)
What is SB 1047 *for*?
Raemon · 2024-09-05T17:39:39.871Z · comments (8)
What is it to solve the alignment problem?
Joe Carlsmith (joekc) · 2024-08-24T21:19:34.280Z · comments (16)
Showing SAE Latents Are Not Atomic Using Meta-SAEs
Bart Bussmann (Stuckwork) · 2024-08-24T00:56:46.048Z · comments (7)
How you can help pass important AI legislation with 10 minutes of effort
ThomasW · 2024-09-14T22:10:50.386Z · comments (2)
[link] Congressional Insider Trading
Maxwell Tabarrok (maxwell-tabarrok) · 2024-08-30T13:32:57.264Z · comments (6)
Pollsters Should Publish Question Translations
jefftk (jkaufman) · 2024-09-08T22:10:04.932Z · comments (2)
Referendum Mechanics in a Marketplace of Ideas
Martin Sustrik (sustrik) · 2024-08-25T08:30:01.901Z · comments (2)
Evidence against Learned Search in a Chess-Playing Neural Network
p.b. · 2024-09-13T11:59:55.634Z · comments (3)
On the UBI Paper
Zvi · 2024-09-03T14:50:08.647Z · comments (6)
[link] Pay-on-results personal growth: first success
Chipmonk · 2024-09-14T03:39:12.975Z · comments (2)
Owain Evans on Situational Awareness and Out-of-Context Reasoning in LLMs
Michaël Trazzi (mtrazzi) · 2024-08-24T04:30:11.807Z · comments (0)
AI #81: Alpha Proteo
Zvi · 2024-09-12T13:00:07.958Z · comments (3)
Secret Collusion: Will We Know When to Unplug AI?
schroederdewitt · 2024-09-16T16:07:01.119Z · comments (7)
... Wait, our models of semantics should inform fluid mechanics?!?
johnswentworth · 2024-08-26T16:38:53.924Z · comments (12)
[link] AI, centralization, and the One Ring
owencb · 2024-09-13T14:00:16.126Z · comments (11)
[link] Book review: Xenosystems
jessicata (jessica.liu.taylor) · 2024-09-16T20:17:56.670Z · comments (18)
Thiel on AI & Racing with China
Ben Pace (Benito) · 2024-08-20T03:19:18.966Z · comments (10)
Measuring Structure Development in Algorithmic Transformers
Micurie (micurie) · 2024-08-22T08:38:02.140Z · comments (4)
Reformative Hypocrisy, and Paying Close Enough Attention to Selectively Reward It.
Andrew_Critch · 2024-09-11T04:41:24.872Z · comments (6)
Please do not use AI to write for you
Richard_Kennaway · 2024-08-21T09:53:34.425Z · comments (34)
AI and the Technological Richter Scale
Zvi · 2024-09-04T14:00:08.625Z · comments (8)
[link] Michael Dickens' Caffeine Tolerance Research
niplav · 2024-09-04T15:41:53.343Z · comments (3)
The Fragility of Life Hypothesis and the Evolution of Cooperation
KristianRonn · 2024-09-04T21:04:49.878Z · comments (6)
How to hire somebody better than yourself
lukehmiles (lcmgcd) · 2024-08-28T08:12:53.450Z · comments (5)
Forecasting One-Shot Games
Raemon · 2024-08-31T23:10:05.475Z · comments (0)
AI #80: Never Have I Ever
Zvi · 2024-09-10T17:50:08.074Z · comments (20)
[link] MIRI's September 2024 newsletter
Harlan · 2024-09-16T18:15:40.785Z · comments (0)
I finally got ChatGPT to sound like me
lsusr · 2024-09-17T09:39:59.415Z · comments (18)
In defense of technological unemployment as the main AI concern
tailcalled · 2024-08-27T17:58:01.992Z · comments (36)
[question] "Deception Genre" What Books are like Project Lawful?
Double · 2024-08-28T17:19:52.172Z · answers+comments (20)
Economics Roundup #3
Zvi · 2024-09-10T13:50:06.955Z · comments (5)
Conflating value alignment and intent alignment is causing confusion
Seth Herd · 2024-09-05T16:39:51.967Z · comments (17)
How difficult is AI Alignment?
Sammy Martin (SDM) · 2024-09-13T15:47:10.799Z · comments (6)
Unit economics of LLM APIs
dschwarz · 2024-08-27T16:51:22.692Z · comments (0)
Formalizing the Informal (event invite)
abramdemski · 2024-09-10T19:22:53.564Z · comments (0)
A Robust Natural Latent Over A Mixed Distribution Is Natural Over The Distributions Which Were Mixed
johnswentworth · 2024-08-22T19:19:28.940Z · comments (4)
[link] [Paper] Programming Refusal with Conditional Activation Steering
Bruce W. Lee (bruce-lee) · 2024-09-11T20:57:08.714Z · comments (0)
Interoperable High Level Structures: Early Thoughts on Adjectives
johnswentworth · 2024-08-22T21:12:38.223Z · comments (1)
instruction tuning and autoregressive distribution shift
nostalgebraist · 2024-09-05T16:53:41.497Z · comments (5)
[link] What's important in "AI for epistemics"?
Lukas Finnveden (Lanrian) · 2024-08-24T01:27:06.771Z · comments (0)
[link] Things I learned talking to the new breed of scientific institution
Abhishaike Mahajan (abhishaike-mahajan) · 2024-08-29T14:00:14.844Z · comments (6)
[link] you should probably eat oatmeal sometimes
bhauth · 2024-08-25T14:50:37.570Z · comments (29)
The Obliqueness Thesis
jessicata (jessica.liu.taylor) · 2024-09-19T00:26:30.677Z · comments (4)
Which LessWrong/Alignment topics would you like to be tutored in? [Poll]
Ruby · 2024-09-19T01:35:02.999Z · comments (10)
How to Give in to Threats (without incentivizing them)
Mikhail Samin (mikhail-samin) · 2024-09-12T15:55:50.384Z · comments (19)
[Intuitive self-models] 1. Preliminaries
Steven Byrnes (steve2152) · 2024-09-19T13:45:27.976Z · comments (0)
Free Will and Dodging Anvils: AIXI Off-Policy
Cole Wyeth (Amyr) · 2024-08-29T22:42:24.485Z · comments (12)