LessWrong 2.0 Reader

View: New · Old · Top

Restrict date range: Today · This week · This month · Last three months · This year · All time

← previous page (newer posts) · next page (older posts) →

AI #81: Alpha Proteo
Zvi · 2024-09-12T13:00:07.958Z · comments (3)

[link] Investigating an insurance-for-AI startup
L Rudolf L (LRudL) · 2024-09-21T15:29:10.083Z · comments (0)

My 10-year retrospective on trying SSRIs
Kaj_Sotala · 2024-09-22T20:30:02.483Z · comments (6)

How you can help pass important AI legislation with 10 minutes of effort
ThomasW · 2024-09-14T22:10:50.386Z · comments (2)

Referendum Mechanics in a Marketplace of Ideas
Martin Sustrik (sustrik) · 2024-08-25T08:30:01.901Z · comments (2)

[link] Congressional Insider Trading
Maxwell Tabarrok (maxwell-tabarrok) · 2024-08-30T13:32:57.264Z · comments (6)

On the UBI Paper
Zvi · 2024-09-03T14:50:08.647Z · comments (6)

Another argument against utility-centric alignment paradigms
Fiora from Rosebloom · 2024-09-22T07:28:27.856Z · comments (23)

Evidence against Learned Search in a Chess-Playing Neural Network
p.b. · 2024-09-13T11:59:55.634Z · comments (3)

... Wait, our models of semantics should inform fluid mechanics?!?
johnswentworth · 2024-08-26T16:38:53.924Z · comments (12)

[Intuitive self-models] 1. Preliminaries
Steven Byrnes (steve2152) · 2024-09-19T13:45:27.976Z · comments (10)

[link] Pay-on-results personal growth: first success
Chipmonk · 2024-09-14T03:39:12.975Z · comments (2)

Secret Collusion: Will We Know When to Unplug AI?
schroederdewitt · 2024-09-16T16:07:01.119Z · comments (7)

[link] Book review: Xenosystems
jessicata (jessica.liu.taylor) · 2024-09-16T20:17:56.670Z · comments (18)

[link] AI, centralization, and the One Ring
owencb · 2024-09-13T14:00:16.126Z · comments (11)

AI #82: The Governor Ponders
Zvi · 2024-09-19T13:30:04.863Z · comments (8)

Reformative Hypocrisy, and Paying Close Enough Attention to Selectively Reward It.
Andrew_Critch · 2024-09-11T04:41:24.872Z · comments (7)

[link] Making Eggs Without Ovaries
Niko_McCarty (niko-2) · 2024-09-22T17:44:46.733Z · comments (3)

AI and the Technological Richter Scale
Zvi · 2024-09-04T14:00:08.625Z · comments (8)

[link] On the Role of Proto-Languages
adamShimi · 2024-09-22T16:50:34.720Z · comments (1)

Applications of Chaos: Saying No (with Hastings Greer)
Elizabeth (pktechgirl) · 2024-09-21T16:30:07.415Z · comments (15)

I finally got ChatGPT to sound like me
lsusr · 2024-09-17T09:39:59.415Z · comments (18)

[link] Michael Dickens' Caffeine Tolerance Research
niplav · 2024-09-04T15:41:53.343Z · comments (3)

The Fragility of Life Hypothesis and the Evolution of Cooperation
KristianRonn · 2024-09-04T21:04:49.878Z · comments (6)

How to hire somebody better than yourself
lukehmiles (lcmgcd) · 2024-08-28T08:12:53.450Z · comments (5)

Forecasting One-Shot Games
Raemon · 2024-08-31T23:10:05.475Z · comments (0)

Interested in Cognitive Bootcamp?
Raemon · 2024-09-19T22:12:13.348Z · comments (0)

Conflating value alignment and intent alignment is causing confusion
Seth Herd · 2024-09-05T16:39:51.967Z · comments (17)

[link] MIRI's September 2024 newsletter
Harlan · 2024-09-16T18:15:40.785Z · comments (0)

AI #80: Never Have I Ever
Zvi · 2024-09-10T17:50:08.074Z · comments (20)

In defense of technological unemployment as the main AI concern
tailcalled · 2024-08-27T17:58:01.992Z · comments (36)

Economics Roundup #3
Zvi · 2024-09-10T13:50:06.955Z · comments (9)

[question] "Deception Genre" What Books are like Project Lawful?
Double · 2024-08-28T17:19:52.172Z · answers+comments (20)

Unit economics of LLM APIs
dschwarz · 2024-08-27T16:51:22.692Z · comments (0)

How difficult is AI Alignment?
Sammy Martin (SDM) · 2024-09-13T15:47:10.799Z · comments (6)

Which LessWrong/Alignment topics would you like to be tutored in? [Poll]
Ruby · 2024-09-19T01:35:02.999Z · comments (11)

Work with me on agent foundations: independent fellowship
Alex_Altair · 2024-09-21T13:59:16.706Z · comments (3)

Formalizing the Informal (event invite)
abramdemski · 2024-09-10T19:22:53.564Z · comments (0)

We Don't Know Our Own Values, but Reward Bridges The Is-Ought Gap
johnswentworth · 2024-09-19T22:22:05.307Z · comments (35)

[link] [Paper] Programming Refusal with Conditional Activation Steering
Bruce W. Lee (bruce-lee) · 2024-09-11T20:57:08.714Z · comments (0)

[link] Things I learned talking to the new breed of scientific institution
Abhishaike Mahajan (abhishaike-mahajan) · 2024-08-29T14:00:14.844Z · comments (6)

Struggling like a Shadowmoth
Raemon · 2024-09-24T00:47:05.030Z · comments (2)

instruction tuning and autoregressive distribution shift
nostalgebraist · 2024-09-05T16:53:41.497Z · comments (5)

[link] you should probably eat oatmeal sometimes
bhauth · 2024-08-25T14:50:37.570Z · comments (29)

[link] Generative ML in chemistry is bottlenecked by synthesis
Abhishaike Mahajan (abhishaike-mahajan) · 2024-09-16T16:31:34.801Z · comments (2)

Monthly Roundup #22: September 2024
Zvi · 2024-09-17T12:20:08.297Z · comments (9)

Free Will and Dodging Anvils: AIXI Off-Policy
Cole Wyeth (Amyr) · 2024-08-29T22:42:24.485Z · comments (12)

Model evals for dangerous capabilities
Zach Stein-Perlman · 2024-09-23T11:00:00.866Z · comments (4)

Book Review: On the Edge: The Fundamentals
Zvi · 2024-09-23T13:40:11.058Z · comments (2)

[question] What are the best arguments for/against AIs being "slightly 'nice'"?
Raemon · 2024-09-24T02:00:19.605Z · answers+comments (3)

← previous page (newer posts) · next page (older posts) →

Archive

Recent comments

faul_sname on What are the best arguments for/against AIs being "slightly 'nice'"?

[habryka] The way humans think about the question of "preferences for weak agents" and "kindness" feels like the kind of thing that will come apart under extreme optimization, in a similar way to how I expect the idea of "having a continuous stream of consciousness with a good past and good future is important" to come apart as humans can make copies of themselves and change their memories, and instantiate slightly changed versions of themselves, etc.

I think there will be options that are good under most of the things that "preferences for weak agents" would likely come apart into under close examination. If you're trying to fulfill the preferences of fish, you might argue about whether the exact thing you should care about is maximizing their hedonic state vs ensuring that they exist in an ecological environment which resembles their niche vs minimizing "boundary-crossing actions"... but you can probably find an action that is better than "kill the fish" by all of those possible metrics.

I think that some people have an intuition that any future agent must pick exactly one utility function over the physical configuration of matter in the universe, and that any agent that has a deontological constraint like "don't do any actions which are 0.00001% better under my current interpretation of my utility function but which are horrifyingly bad to every other agent " will be outcompeted in the long term. I personally don't see it, and particularly I don't see how there's an available slot for an arbitrary outcome-based utility function that is not "reproduce yourself at all costs" but there isn't an available slot for process-based preferences like "and don't be an asshole for miniscule gains while doing that".

nicholas-weininger on How to choose what to work on

Nice post. It prompts two questions, which you may or may not be the right person to answer:

How do you find good obsessions? Is it "just" a matter of being curious and widely-read? What is the combination of life practice and psychological orientation that leads a person to become obsessed with one or more ideas in the way that you became obsessed with progress studies and with Fieldbook?
On your path to world-class status, how do you avoid the "middle-competence trap" (analogy to the middle-income trap)? How do you handle having something you love that you've gotten damn good at, better than most people will ever get, but can't seem to break through to the level of the achievers who really make their mark on the field? Maybe this is more of an issue for me than for others-- maybe for example it is "just" a matter of being willing to burrow deep into something to the exclusion of your other interests in life, and I'm too much of a generalist to do that-- but it's been a problem for me twice now, and I really wonder if it might be a common failure mode of this kind of questing process.

t3t on ASIs will not leave just a little sunlight for Earth

I agree with your top-level comment but don't agree with this. I think the swipes at midwits are bad (particularly on LessWrong) but think it can be very valuable to reframe basic arguments in different ways, pedagogically. If you parse this post as "attempting to impart a basic intuition that might let people (new to AI x-risk arguments) avoid certain classes of errors" rather than "trying to argue with the bleeding-edge arguments on x-risk", this post seems good (if spiky, with easily trimmed downside).

And I do think "attempting to impart a basic intuition that might let people avoid certain classes of errors" is an appropriate shape of post for LessWrong, to the extent that it's validly argued.

raemon on Struggling like a Shadowmoth

At the beginning of the post, I wrote:

"This post is probably hazardous for one type of person in one particular growth stage, and necessary for people in a different growth stage, and I don't really know how to tell the difference in advance.

If you read it and feel like it kinda wrecked you send me a DM. I'll try to help bandage it. "

I thought about putting it again at the end, where people would actually be more likely to see it when they realized they needed it. But, also, it felt like it would undercut the poetic heft of the end of the post. But then I realized I could put it in the comments here, which feels like a reasonable middle ground.

t3t on What are the best arguments for/against AIs being "slightly 'nice'"?

as applied to current foundation models it appears to do so

I don't think the outputs of RLHF'd LLMs have the same mapping to the internal cognition which generated them that human behavior does to the human cognition which generated it. (That is to say, I do not think LLMs behave in ways that look kind because they have a preference to be kind, since right now I don't think they meaningfully have preferences in that sense at all.)

rogerdearnaley on What are the best arguments for/against AIs being "slightly 'nice'"?

Base model LLMs are trained off human data. So by default they generate a prompt-dependent distribution of simulated human behavior with about the same breadth of degrees of kindness as can be found on the Internet/in books/etc. Which is a pretty wide range.

For instruct-trained models, RLHF for helpfulness and harmlessness seems likely to increase kindness, and superficially as applied to current foundation models it appears to do so. RL with many other objectives could, generally, induce powerseeking and thus could reasonably be expected to decrease it. Prompting can of course have wide range of effects.

So if we build an AGI based around an agentified fine-tuned LLM, the default level of kindness is probably in the order-of-magnitude of that of humans (who, for example, build nature reserves). A range of known methods seem likely to modify that significantly, up or down.

zy on Non-human centric view of existence

I am not sure if this is the point or the focus of the topic as it is irrelevant to the question. They are saying well if that is the case that is case because human is just a species and if that day comes that is fine. Then it got to how people want to address climate change really just because we want to live but not necessarily for the good of other species. But the question itself is not specifically related to AI risks at all, but general EA theories and philosophical thoughts. How this question arise to me is something personal that is beyond the scope of discussion that I felt I don’t need to answer on, but here it is. I don't think you necessarily need to know that for this question.

Also curious - why are you interested in knowing how this question came up? Is that helpful to your answer of this question, and how would it change the answer? Curious to see how your answers would change depending on different ways this question may arise.

screwtape on Struggling like a Shadowmoth

I can't remember whether it's Jacen, or the woman, who goes on to say "you might sit with it, and let it know that it's not alone, and do what you can to comfort it."

"'And also, perhaps,' [the woman] went on, 'you might stop by from time to time, to let the struggling, desperate, suffering creature know that it is not alone. That someone cares. That its pain is in the service of its destiny.'"

It's the woman.

And also I guess I'd like to register another weird coincidence, Traitor was also one of my favourite stories growing up. I disagree with you about the second half of the book being forgettable though. Ganner Rhysode also felt important.

matthew-barnett on ASIs will not leave just a little sunlight for Earth

The claim at hand, that we have both read Eliezer repeatedly make^[1], is that there is a sufficient level of intelligence and a sufficient power of nanotechnology that within days or weeks a system could design and innocuously build a nanotechnology factory out of simple biological materials that goes on to build either a disease or a cellular-sized drones that would quickly cause an extinction event — perhaps a virus that spreads quickly around the world with a replication rate that allows it to spread globally before any symptoms are found, or a series of diamond-based machines that can enter the bloodstream and explode on a coordinated signal. This is such a situation where no response from human civilization would occur, and the argument that an AI ought to be worried about people with guns and bombs coming for its data centers has no relevance.

Sure, I have also read Eliezer repeatedly make that claim. On the meta level, I don't think the fact that he has written about this specific scenario fully makes up for the vagueness in his object-level essay above. But I'm also happy to briefly reply on the object level on this particular narrow point:

In short, I interpret Eliezer to be making a mistake by assuming that the world will not adapt to anticipated developments in nanotechnology and AI in order to protect against various attacks that we can easily see coming, prior to the time that AIs will be capable of accomplishing these incredible feats. By the time AIs are capable of developing such advanced molecular nanotech, I think the world will have already been dramatically transformed by prior waves of technologies, many of which by themselves could importantly change the gameboard, and change what it means for humans to have defenses against advanced nanotech to begin with.

As a concrete example, I think it's fairly plausible that, by the time artificial superintelligences can create fully functional nanobots that are on-par with or better than biological machines, we will have already developed uploading technology that allows humans to literally become non-biological, implying that we can't be killed by a virus in the first place. This would reduce the viability of using a virus to cause humanity to go extinct, increasing human robustness.

As a more general argument, and by comparison to Eliezer, I think that nanotechnology will probably be developed more incrementally and predictably, rather than suddenly upon the creation of a superintelligent AI, and the technology will be diffused across civilization, rather than existing solely in the hands of a small lab run by an AI. I also think Eliezer seems to be imagining that superintelligent AI will be created in a world that looks broadly similar to our current world, with defensive technologies that are only roughly as powerful as the ones that exist in 2024. However, I don't think that will be the case.

Given an incremental and diffuse development trajectory, and transformative precursor technologies to mature nanotech, I expect society will have time to make preparations as the technology is developed, allowing us to develop defenses to such dramatic nanotech attacks alongside the offensive nanotechnologies that will also eventually be developed. It therefore seems unlikely to me that society will be completely caught by surprise by fully-developed-molecular nanotechnology, without any effective defenses.

martin-randall on ASIs will not leave just a little sunlight for Earth

If you're not supposed to end up as a pet of the AI, ...

I don't see how it falls out of human values that humans should not end up as pets of the AIs, given the hypothesis that we can make AIs that care enough about human thriving to take humans as pets, but we don't know how to make AIs that care more than that. Looking at a couple of LessWrong theories of human value for illustrative purposes:

Godshatter

Yudkowsky's Godshatter [LW · GW] theory requires petness to be negative for reproductive fitness in the evolutionary environment to a sufficient degree to be DNA-encoded as aversive. There have not been evolutionary opportunities for humans to be pets of AIs, so this would need to come in via extrapolation from humans being "pets" of much more powerful humans. But while being Genghis Khan is great for reproductive fitness, rebelling against Genghis Khan is terrible for reproductive fitness. I guess that optimal strategy is something like: "be best leader is best, follow best leader is good, follow other leader is bad, be other leader is worst". When AIs block off "be best leader", following an AI executes that strategy.

Maybe there's a window where DNA can encode "be leader is good" but cannot encode the more complex strategy, and the simple strategy is on net good because of Genghis Khan and a few others. This seems unlikely to me, it's a small window. More probable to me is that DNA can't encode this stuff at all, and Godshatter theory is largely false outside of basic things like sweetness being sweet.

Maybe being an AI's pet is a badwrongfun superstimulus [LW · GW]. Yudkowsky argues that a superstimulus can be bad, despite super-satisfying a human value, because it conflicts with other values, including instrumental values. But that's an argument from consequences, not values. Just because donuts are unhealthy doesn't mean that I don't value sweet treats.

Shard Theory

Pope's Shard Theory [LW · GW] implies that different humans have different values around petness based on formative experiences. Most humans have formative experiences of being raised by powerful agents known as "parents". Therefore we expect a mixture of positive and negative shards around petness. Seems to me that positive shards should be more common, but experiences vary.

Then we experience the situation of superintelligent AIs taking human pets and our shards conflict and negotiate. I think it's pretty obvious that we're going to label the negative shards as maladaptive and choose the positive shards. What's the alternative? "I didn't like having my diaper changed as a baby, so now as an adult human I'm going to reject the superintelligent AI that wants to take me as its pet and instead...", instead what? Die of asphyxiation? Be a feral human in an AI-operated nature reserve?

What about before the point-of-no-return? Within this partial alignment hypothetical, there's a sub-hypothetical in which "an international treaty that goes hard on shutting down all ASI development anywhere" is instrumentally the right choice, given the alternative of becoming pets, because it allows for developing better alignment techniques and AIs that care more about human thriving and have more pets. There's a sub-hypothetical in which it's instrumentally the wrong choice, because it carries higher extinction risk, and it's infeasible to align AIs while going hard on shutting them down. But there's not really a sub-hypothetical where shards about petness make that decision rather than, eg, shards that don't want to die.