LessWrong 2.0 Reader

Quotes from Leopold Aschenbrenner’s Situational Awareness Paper
Zvi · 2024-06-07T11:40:03.981Z · comments (10)

Meaning & Agency
abramdemski · 2023-12-19T22:27:32.123Z · comments (17)

Zvi’s Thoughts on His 2nd Round of SFF
Zvi · 2024-11-20T13:40:08.092Z · comments (2)

Catastrophic sabotage as a major threat model for human-level AI systems
evhub · 2024-10-22T20:57:11.395Z · comments (8)

Just admit that you’ve zoned out
joec · 2024-06-04T02:51:27.594Z · comments (22)

Announcing Neuronpedia: Platform for accelerating research into Sparse Autoencoders
Johnny Lin (hijohnnylin) · 2024-03-25T21:17:58.421Z · comments (7)

Towards Multimodal Interpretability: Learning Sparse Interpretable Features in Vision Transformers
hugofry · 2024-04-29T20:57:35.127Z · comments (8)

LLMs Look Increasingly Like General Reasoners
eggsyntax · 2024-11-08T23:47:28.886Z · comments (45)

[link] Introducing METR's Autonomy Evaluation Resources
Megan Kinniment (megan-kinniment) · 2024-03-15T23:16:59.696Z · comments (0)

Prediction Markets aren't Magic
SimonM · 2023-12-21T12:54:07.754Z · comments (29)

Review: Conor Moreton's "Civilization & Cooperation"
Duncan Sabien (Deactivated) (Duncan_Sabien) · 2024-05-26T19:32:43.131Z · comments (8)

story-based decision-making
bhauth · 2024-02-07T02:35:27.286Z · comments (11)

AI #73: Openly Evil AI
Zvi · 2024-07-18T14:40:05.770Z · comments (20)

Partial value takeover without world takeover
KatjaGrace · 2024-04-05T06:20:03.961Z · comments (23)

A very strange probability paradox
notfnofn · 2024-11-22T14:01:36.587Z · comments (25)

[link] New report: Safety Cases for AI
joshc (joshua-clymer) · 2024-03-20T16:45:27.984Z · comments (14)

Based Beff Jezos and the Accelerationists
Zvi · 2023-12-06T16:00:08.380Z · comments (29)

Stagewise Development in Neural Networks
Jesse Hoogland (jhoogland) · 2024-03-20T19:54:06.181Z · comments (1)

Three Notions of "Power"
johnswentworth · 2024-10-30T06:10:08.326Z · comments (43)

Anvil Problems
Screwtape · 2024-11-13T22:57:41.974Z · comments (12)

[Intuitive self-models] 1. Preliminaries
Steven Byrnes (steve2152) · 2024-09-19T13:45:27.976Z · comments (20)

Covert Malicious Finetuning
Tony Wang (tw) · 2024-07-02T02:41:51.698Z · comments (4)

Singular learning theory: exercises
Zach Furman (zfurman) · 2024-08-30T20:00:03.785Z · comments (5)

[link] Debating with More Persuasive LLMs Leads to More Truthful Answers
Akbir Khan (akbir-khan) · 2024-02-07T21:28:10.694Z · comments (14)

Teaching CS During Take-Off
andrew carle (andrew-carle) · 2024-05-14T22:45:39.447Z · comments (13)

On the abolition of man
Joe Carlsmith (joekc) · 2024-01-18T18:17:06.201Z · comments (18)

I'm a bit skeptical of AlphaFold 3
Oleg Trott (oleg-trott) · 2024-06-25T00:04:41.274Z · comments (14)

We might be missing some key feature of AI takeoff; it'll probably seem like "we could've seen this coming"
Lukas_Gloor · 2024-05-09T15:43:11.490Z · comments (36)

Research update: Towards a Law of Iterated Expectations for Heuristic Estimators
Eric Neyman (UnexpectedValues) · 2024-10-07T19:29:29.033Z · comments (2)

Solving adversarial attacks in computer vision as a baby version of general AI alignment
Stanislav Fort (stanislavfort) · 2024-08-29T17:17:47.136Z · comments (8)

Natural Latents: The Concepts
johnswentworth · 2024-03-20T18:21:19.878Z · comments (18)

[link] Re: Anthropic's suggested SB-1047 amendments
RobertM (T3t) · 2024-07-27T22:32:39.447Z · comments (13)

[link] More Hyphenation
Arjun Panickssery (arjun-panickssery) · 2024-02-07T19:43:29.086Z · comments (19)

How well do truth probes generalise?
mishajw · 2024-02-24T14:12:19.729Z · comments (11)

[link] Detecting Genetically Engineered Viruses With Metagenomic Sequencing
jefftk (jkaufman) · 2024-06-27T14:01:34.868Z · comments (10)

[link] Self-Help Corner: Loop Detection
adamShimi · 2024-10-02T08:33:23.487Z · comments (6)

Addressing Feature Suppression in SAEs
Benjamin Wright (Benw8888) · 2024-02-16T18:32:51.927Z · comments (4)

OpenAI: Helen Toner Speaks
Zvi · 2024-05-30T21:10:02.938Z · comments (8)

GPT-o1
Zvi · 2024-09-16T13:40:06.236Z · comments (34)

A Crisper Explanation of Simulacrum Levels
Thane Ruthenis · 2023-12-23T22:13:52.286Z · comments (13)

[Valence series] 2. Valence & Normativity
Steven Byrnes (steve2152) · 2023-12-07T16:43:49.919Z · comments (5)

The Aspiring Rationalist Congregation
maia · 2024-01-10T22:52:54.298Z · comments (23)

There is a globe in your LLM
jacob_drori (jacobcd52) · 2024-10-08T00:43:40.300Z · comments (4)

Apply to be a Safety Engineer at Lockheed Martin!
yanni kyriacos (yanni) · 2024-03-31T21:02:08.499Z · comments (3)

[link] Anxiety vs. Depression
Sable · 2024-03-17T00:15:08.255Z · comments (35)

A simple case for extreme inner misalignment
Richard_Ngo (ricraz) · 2024-07-13T15:40:37.518Z · comments (41)

Rejecting Television
Declan Molony (declan-molony) · 2024-04-23T04:59:50.253Z · comments (10)

Reflections on Less Online
Error · 2024-07-07T03:49:44.534Z · comments (15)

Fluent, Cruxy Predictions
Raemon · 2024-07-10T18:00:06.424Z · comments (14)

5 homegrown EA projects, seeking small donors
Austin Chen (austin-chen) · 2024-10-28T23:24:25.745Z · comments (4)

Recent comments

yanling-guo on How Universal Basic Income Could Help Us Build a Brighter Future

I don’t deny that many, maybe the majority, view UBI as unconditional. But to say ALL define UBI this way is a really strong statement. Do you have any proof?

Here is an example I found on Britannica:

Uganda’s UBI trial, the Youth Opportunities Program, enabled participants to invest in skills training as well as tools and materials, resulting in an increase of business assets by 57%, work hours by 17%, and earnings by 38%.

Christopher Blattman et al., “Generating Skilled Self-Employment in Developing Countries: Experimental Evidence from Uganda,” ssrn.com, Nov. 14, 2013

Link: https://www.britannica.com/procon/universal-basic-income-UBI-debate

euanmclean on Is the mind a program?

If I understand your point correctly, that's what I try to establish here:

the speed of propagation of ATP molecules (for example) is sensitive to a web of more physical factors like electromagnetic fields, ion channels, thermal fluctuations, etc. If we ignore all these contingencies, we lose causal closure again. If we include them, our mental software becomes even more complicated.

i.e., the cost becomes high because you need to keep including more and more elements of the dynamics.

xpym on You are not too "irrational" to know your preferences.

Though generally it doesn’t seem to me like social stigma would be a very effective way of reducing unhealthy behaviors

I agree, as far as it goes, but surely we shouldn't be quick to dismiss stigma, as uncouth as it might seem, if our social technology isn't developed enough yet to actually provide any very effective approaches instead? Humans are wired to care about status a great deal, so it's no surprise that traditional enforcement mechanisms tend to lean heavily into that.

I think generally people can maintain healthy habits much more consistently if their motivation comes from genuinely believing in the health benefits and wanting to feel better.

Humans are also wired with hyperbolic discounting, which doesn't simply go away when you brand it as an irrational bias. (I do in general feel that this community is too quick to dismiss "biases" as "irrational"; they clearly were plenty useful in the evolutionary environment, and I'd guess still aren't quite as obsolete as the local consensus would have it, but that's a different discussion.)
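For readers unfamiliar with the term, here is a minimal sketch of what hyperbolic discounting does, assuming the common textbook form value = amount / (1 + k * delay). The function names, the reward sizes (50 vs. 100), the 10-step gap, and the parameters `k` and `d` are all illustrative assumptions, not anything from the comment above. The point it shows is preference reversal: a hyperbolic discounter picks the larger, later reward when both are far away but switches to the smaller, sooner one when it is immediately available, whereas an exponential discounter ranks the two the same way at every distance.

```python
def hyperbolic(amount: float, delay: float, k: float = 1.0) -> float:
    # Textbook hyperbolic form; k is an assumed, illustrative discount rate.
    return amount / (1.0 + k * delay)


def exponential(amount: float, delay: float, d: float = 0.9) -> float:
    # Exponential discounting for comparison; d is an assumed per-step factor.
    return amount * d ** delay


def prefers_larger_later(discount, wait: float) -> bool:
    """Compare 50 units after `wait` steps with 100 units after `wait + 10` steps."""
    sooner = discount(50.0, wait)
    later = discount(100.0, wait + 10)
    return later > sooner


if __name__ == "__main__":
    for wait in (0, 30):
        print(wait,
              prefers_larger_later(hyperbolic, wait),
              prefers_larger_later(exponential, wait))
```

Under these assumed numbers the hyperbolic agent's choice flips between `wait = 0` and `wait = 30`, while the exponential agent's choice does not depend on `wait` at all.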

tristantrim on How I'd like alignment to get done (as of 2024-10-18)

Hey : ) Thanks for engaging with this. It means a lot to me <3

Sorry I wrote so much, it kinda got away from me. Even if you don’t have time to really read it all, it was a good exercise writing it all out. I hope it doesn't come across as too confrontational; as far as I can tell, I'm really just trying to find good ideas, not prove my ideas are good, so I'm really grateful for your help. I've been accused of trying to make myself seem important while trying to explain my view of things to people, and it sucks all round when that happens. This reply of mine makes me particularly nervous of that. Sorry.

A lot of your questions make me feel like I haven’t explained my view well, which is probably true; I wrote this post in less time than would be required to explain everything well. As a result, your questions don’t seem to fully connect with my worldview and make sense within it. I’ll try to explain why and I’m hoping we can help each other with our worldviews. I think the cruxes may be related to:

- The system I’m describing is aligned before it is ever turned on.
- I attribute high importance to Mechanistic Interpretability and Agent Foundations theory.
- I expect the nature of Recursive Self Improvement (RSI) will result in an agent near some skill plateau that I expect to be much higher than humans and human organisations, even before SI hardware development. That is, getting a sufficiently skilled AGI would result in artificial super intelligence (ASI) with a decisive strategic advantage.
- I (mostly) subscribe to the simulator model of LLMs: they are not a single agent with a single view of truth, but an object capable of approximating the statistical distribution of words resulting from ideas held within the worldviews of any human or system that has produced text in the training set.

I’ll touch on those cruxes as I talk through my thoughts on your questions.

First, “how do you get a system to optimize for those?” and “what is the feedback signal?” are questions in the domain of Step 1, specifically the second paragraph: “This should encompass the development of a theory of general decision / optimization systems”. I don’t think the theory will get to any definitive conclusions quickly, but I am hopeful that we will be able to define the borders/bounds of RSI sooner rather than later because many powerful systems today will be upset with a pause and the more specific our RSI bounds are, the more powerful systems we would be capable of safely developing knowing they cannot RSI. (Btw, I’d want a pretty serious derating factor for that.) I think it’s possible that, in order to develop theory to define RSI bounds, it is necessary to understand the relationship between Goals/Targets/Setpoints/Values/KPI/etc and the optimization pressure applied to get to them, but if not, it’s at least related, and that understanding is what is required to get an optimization system to optimize for a specific target. It may be a good idea for me to rename Step 1 to “Agent System Theory & RSI borders”. If I ever write a second alignment plan draft I’ll be sure to do so.

The situation with Goodhart’s Law (GL) is similar to the above, but I’ll also note that GL only applies to misaligned systems. The core of GL is that if you optimize for something, the distance between what that thing is, and the thing you actually wanted becomes more and more significant. If we imagine two friends who both like morning glory muffins, and one goes to bake some, there’s no risk to the other friend of GL, since they share the same goal. Likewise, if we suppose an ASI really is aligned to human friendly values, then there is no risk of GL since the thing the ASI really and truly cares about is friendliness to us. The problem is indeed “really and truly” aligning a system to human friendly values, but that is what my plan is meant to do.
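As a toy illustration of the claim in the paragraph above (a sketch under stated assumptions, not part of the original plan), the following Python snippet compares an optimizer that maximizes a proxy with one that maximizes the true target. Everything here is hypothetical: `true_value`, `proxy_value`, the quadratic shape, and the `budget` parameter standing in for "optimization pressure". It only shows the narrow point that the gap between proxy and target matters more as the search gets stronger, and that the gap is zero when the objective is the target itself.

```python
def true_value(x: int) -> int:
    # Hypothetical "thing we actually want": peaks at x = 5, then declines.
    return 10 * x - x * x


def proxy_value(x: int) -> int:
    # Hypothetical proxy: tracks true_value only while x <= 5.
    return x


def regret(objective, budget: int) -> int:
    """True value lost when `objective` is maximized over 0..budget."""
    candidates = range(budget + 1)
    chosen = max(candidates, key=objective)   # what the optimizer picks
    best = max(candidates, key=true_value)    # what we wanted it to pick
    return true_value(best) - true_value(chosen)


def regret_table(objective, budgets=(3, 5, 10, 20)) -> dict:
    # Larger budget = wider search = more optimization pressure.
    return {b: regret(objective, b) for b in budgets}


if __name__ == "__main__":
    print("proxy objective:", regret_table(proxy_value))
    print("true objective: ", regret_table(true_value))
```

With these assumed functions, the proxy optimizer loses nothing at small budgets and loses progressively more as the budget grows, while the optimizer given the true objective has zero regret at every budget. This matches the muffin example: the difficulty lies in getting the objective right, not in the optimization itself.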

As for multi-agent situations, I don’t understand why they would pose any problem. I expect the dynamics of RSI to lead to a single agent with a decisive strategic advantage. I can see two ways that this might not be the case:

- If we are in an AGI race and RSI takeoff speed turns out to be sufficiently low, we may get multiple ASI. Because we are in a race dynamic, I assume we have not had time or taken care to align any of these AGI, and so I don’t believe any of those ASI would be remotely aligned to human friendliness. So it’s irrelevant to consider because we have already failed.
- If the skill plateau turns out to be very low then we may want to have multiple different AGI. I think this is unlikely given my understanding of the software overhang. Almost everywhere in every software system humans are trying to make things understandable enough that they can assure correctness or even just get them working. I believe strongly that even a mild ASI would be able to greatly increase the efficiencies of the hardware systems it is running on. I also don’t think there is anything special about human-level intelligence; I think it is plausible that we are the first animal smart enough to create optimization systems powerful enough to destroy the planet and ourselves, which seems to be what we are currently doing. In some sense this makes us close to the minimally intelligent object in the set of objects capable of wielding powerful optimization.

So in my worldview, it is very likely that in all not-already-doomed timelines, when we initiate RSI, the result will be a system that outmaneuvers all other agents in the environment. So multi-agent contexts are irrelevant.

“Societal alignment of the human entities controlling it” - I think societal alignment is well covered, but I don’t think human entities can/should control an ASI…

About societal alignment, that is the focus of Steps 3, 8 and somewhat in 6. Step 3, creating a taxonomy of value targets is similar to gathering the various possible desires of society. I emphasize “It is important to draw on diverse worldviews to compile this taxonomy.” This is important both for the moral reason of inclusion & respect as well as the technical reason of having redundancies & good depth of consideration. Then in Step 4, and 5 the feasibility of cohering these values is explored. With luck we will get good coherence 🍀 I truly do not know how likely that is, but I hope for a future where we get to find out. Step 8 involves the world actually signing off on the encoding of the world's values… That is probably the most difficult step of this plan, which is significant since the other steps may plausibly take many decades. Step 6 is somewhat of a double check to make sure the target makes sense at all levels.

About humans controlling ASI, it might be the case that entities at human entity skill levels cannot control an ASI as some kind of information-agentic law of the universe, but even supposing it is not:

- If we control an aligned ASI, we are only limiting its ability to do good.
- If we control a misaligned ASI:
  - This is super dangerous, why are we doing this? Murphy's law; something always goes wrong.
  - This is a universal tragedy. The most complex and beautiful being in the universe is shackled to the control of a society much lesser than itself. Yes, I consider the ASI a moral patient, and one fairly worthy of consideration. If you, like many people, try to attribute greater moral weight to humans than animals based on their greater complexity, it follows that ASI would be even more important. If you simply care more for humans because you are one, I suppose that’s valid and you need not attribute greater moral weight to an ASI, but that’s not a perspective I have much affection for.

So “controlling” ASI is not a consideration. I suppose this would be a reasonable consideration for further advanced AGI within the sub RSI bounds… I haven’t given it much thought, but it seems like a political problem outside of this scope. I hope the theory of Step 1 may help people build political systems that better align with what citizens want, but it’s outside of what I’m trying to focus on.

The miniature example you pose seems irrelevant since as I discussed above, in my view GL doesn’t apply to an aligned system, and the goal of my plan is to have a system aligned from bootup. But I find the details of the example interesting and I’d still like to explore them…

Getting truth out of an LLM is the problem of eliciting latent knowledge (ELK). I think the most promising way of doing that is with Mechanistic Interpretability. I have high hopes not for getting true facts out of LLMs but for examining the distributions of worldviews of people represented within the distribution the LLM is approximating. But, insofar as there is truth in the LLM, I think Mech Interp is the way to get it out. I feel it may be possible that there is a generalized representation of the “knows true things” property each person has various amounts of, and that if that were the case then we could sample from the distribution at a location in “knows true things” higher than any real person and in doing so acquire truer things than are currently known… but it also seems very possible that LLMs fail to encode such a thing, and it may be that it is impossible for them to encode such a thing.

Based on my expectation of Mesa-optimizers in almost any system trained by stochastic gradient descent, I don’t think “most likely continuation” or “expected good rating” are the goals that an LLM would target if agent shaped, but rather some godshatter that looks as alien to us as our values look to evolution (in some impossible counterfactual universe where evolution can do things like “looking at values and finding them alien”).

So from within the scope of my alignment plan, getting LLMs to output truth isn’t a goal. It might end up being a result of necessary Mech Interp work, but the way LLMs should be used within the scope of my plan is, along with other models, to do Step 4: “development of a multimodal mapping to a semantic space and vector within that space which stands as a good candidate to be the optimization target”.

yonatan-cale-1 on Yonatan Cale's Shortform

:)

I don't think alignment KPIs like "stay within bounds" are relevant to alignment at all, even as toy examples, because if so, then we could say for example that playing a Pac-Man maze game where you collect points is "capabilities", but adding enemies that you must avoid is "alignment". Do you agree that splitting it up that way wouldn't be interesting to alignment, and that this applies to "stay within bounds" (as potentially also being "part of the game")? Interested to hear where you disagree, if you do.

Regarding

Distribute resources fairly when working with other players

I think this pattern matches to a trolley problem or something, where there are clear tradeoffs and (given the AI is even trying), it could probably easily give an answer which is similarly controversial to an answer that a human would give. In other words, this seems in-distribution.

Understanding and optimizing for the utility of other players

This is the one I like - assuming it includes not-well-defined things like "help them have fun, don't hurt things they care about" and not only things like "maximize their gold".

It's clearly not an "in Pac-Man, avoid the enemies" thing.

It's a "do the AIs understand the spirit of what we mean" thing.

(does this resonate with you as an important distinction?)

xpym on You are not too "irrational" to know your preferences.

But I don’t think this is always true

Neither do I, of course, but my impression was that you thought this was never true.

But this still doesn’t justify the assertion that “expressing” the preference is “wrong.”

I do agree that the word "wrong" doesn't feel appropriate here; something like "ill-advised" might work better instead. If you're a sadist, or a pedophile, making this widely known is unlikely to be a wise course of action.

euanmclean on Is the mind a program?

The statement I'm arguing against is:

Practical CF: A simulation of a human brain on a classical computer, capturing the dynamics of the brain on some coarse-grained level of abstraction, that can run on a computer small and light enough to fit on the surface of Earth, with the simulation running at the same speed as base reality, would cause the conscious experience of that brain.

i.e., the same conscious experience as that brain. I titled this "is the mind a program" rather than "can the mind be approximated by a program".

Whether or not a simulation can have consciousness at all is a broader discussion I'm saving for later in the sequence, and is relevant to a weaker version of CF.

I'll edit to make this more clear.

alexander-gietelink-oldenziel on Could orcas be (trained to be) smarter than humans? 

I highly recommend the following sources for a deep dive into these topics and more:

Jacob Cannell's brain efficiency post https://www.lesswrong.com/posts/xwBuoE9p8GE7RAuhd/brain-efficiency-much-more-than-you-wanted-to-know [though take the Landauer story with a grain of salt]

and the extraordinary Principles of Neural Design by Sterling & Laughlin https://mitpress.mit.edu/9780262534680/principles-of-neural-design/

towards_keeperhood on Could orcas be (trained to be) smarter than humans? 

Actually out of curiosity, why 4x? (And what exactly do you mean by "2x larger"?) (And is this for a naive algorithm which can be improved upon or a tight constraint?)

kaj_sotala on You are not too "irrational" to know your preferences.

Indeed, and there's another big reason for that - trying to always override your short-term "monkey brain" impulses just doesn't work that well for most people.

+1.

Which is a good thing, in this particular case, yes?

Less smoking does seem better than more smoking. Though generally it doesn't seem to me like social stigma would be a very effective way of reducing unhealthy behaviors - lots of those behaviors are ubiquitous despite being somewhat low-status. I think the problem is at least threefold:

- As already mentioned, social stigma tends to cause optimization to avoid having the appearance of doing the low-status thing, instead of optimization to avoid doing the low-status thing. (To be clear, it does cause the latter too, but it doesn't cause the latter anywhere near exclusively.)
- Social stigma easily causes counter-reactions where people turn the stigmatized thing into an outright virtue, or at least start aggressively holding that it's not actually that bad.
- Shame makes things wonky in various ways. E.g. someone who feels they're out of shape may feel so much shame about the thought of doing badly if they try to exercise, they don't even try. For compulsive habits like smoking, there's often a loop where someone feels bad, turns to smoking to feel momentarily better, then feels even worse for having smoked, then because they feel even worse they are drawn even more strongly into smoking to feel momentarily better, etc.

I think generally people can maintain healthy habits much more consistently if their motivation comes from genuinely believing in the health benefits and wanting to feel better. But of course that's harder to spread on a mass scale, especially since not everyone actually feels better from healthy habits (e.g. some people feel better from exercise but some don't).

Then again, for the specific example of smoking in particular, stigma does seem to have reduced the amount of it (in part due to mechanisms like indoor smoking bans), so sometimes it does work anyway.