LessWrong 2.0 Reader


[link] Atoms to Agents Proto-Lectures
johnswentworth · 2023-09-22T06:22:05.456Z · comments (14)
Towards a Less Bullshit Model of Semantics
johnswentworth · 2024-06-17T15:51:06.060Z · comments (44)
[link] Uncovering Deceptive Tendencies in Language Models: A Simulated Company AI Assistant
Olli Järviniemi (jarviniemi) · 2024-05-06T07:07:05.019Z · comments (13)
Open Source Replication & Commentary on Anthropic's Dictionary Learning Paper
Neel Nanda (neel-nanda-1) · 2023-10-23T22:38:33.951Z · comments (12)
A Solomonoff Inductor Walks Into a Bar: Schelling Points for Communication
johnswentworth · 2024-07-26T00:33:42.000Z · comments (1)
Counting arguments provide no evidence for AI doom
Nora Belrose (nora-belrose) · 2024-02-27T23:03:49.296Z · comments (184)
[link] Against Aschenbrenner: How 'Situational Awareness' constructs a narrative that undermines safety and threatens humanity
GideonF · 2024-07-15T18:37:40.232Z · comments (17)
Refactoring cryonics as structural brain preservation
Andy_McKenzie · 2024-09-11T18:36:30.285Z · comments (14)
Notes on Dwarkesh Patel’s Podcast with Demis Hassabis
Zvi · 2024-03-01T16:30:08.687Z · comments (0)
[link] Things You’re Allowed to Do: University Edition
Saul Munn (saul-munn) · 2024-02-06T00:36:11.690Z · comments (13)
[link] The Soul Key
Richard_Ngo (ricraz) · 2023-11-04T17:51:53.176Z · comments (9)
On attunement
Joe Carlsmith (joekc) · 2024-03-25T12:47:34.856Z · comments (8)
SB 1047: Final Takes and Also AB 3211
Zvi · 2024-08-27T22:10:07.647Z · comments (11)
Apollo Research 1-year update
Marius Hobbhahn (marius-hobbhahn) · 2024-05-29T17:44:32.484Z · comments (0)
Takeoff speeds presentation at Anthropic
Tom Davidson (tom-davidson-1) · 2024-06-04T22:46:35.448Z · comments (0)
Trying to understand John Wentworth's research agenda
johnswentworth · 2023-10-20T00:05:40.929Z · comments (13)
[link] the Giga Press was a mistake
bhauth · 2024-08-21T04:51:24.150Z · comments (26)
OpenAI: The Board Expands
Zvi · 2024-03-12T14:00:04.110Z · comments (1)
[question] Am I confused about the "malign universal prior" argument?
nostalgebraist · 2024-08-27T23:17:22.779Z · answers+comments (33)
Announcing Neuronpedia: Platform for accelerating research into Sparse Autoencoders
Johnny Lin (hijohnnylin) · 2024-03-25T21:17:58.421Z · comments (7)
Circular Reasoning
abramdemski · 2024-08-05T18:10:32.736Z · comments (36)
How to train your own "Sleeper Agents"
evhub · 2024-02-07T00:31:42.653Z · comments (11)
It's time for a self-reproducing machine
Carl Feynman (carl-feynman) · 2024-08-07T21:52:22.819Z · comments (71)
New page: Integrity
Zach Stein-Perlman · 2024-07-10T15:00:41.050Z · comments (3)
Improving the Welfare of AIs: A Nearcasted Proposal
ryan_greenblatt · 2023-10-30T14:51:35.901Z · comments (5)
Just admit that you’ve zoned out
joec · 2024-06-04T02:51:27.594Z · comments (22)
Quotes from Leopold Aschenbrenner’s Situational Awareness Paper
Zvi · 2024-06-07T11:40:03.981Z · comments (10)
Everything Wrong with Roko's Claims about an Engineered Pandemic
EZ97 · 2024-02-22T15:59:08.439Z · comments (10)
Access to powerful AI might make computer security radically easier
Buck · 2024-06-08T06:00:19.310Z · comments (14)
Meaning & Agency
abramdemski · 2023-12-19T22:27:32.123Z · comments (17)
[link] Linkpost: They Studied Dishonesty. Was Their Work a Lie?
Linch · 2023-10-02T08:10:51.857Z · comments (12)
AI #31: It Can Do What Now?
Zvi · 2023-09-28T16:00:01.919Z · comments (6)
Review: Conor Moreton's "Civilization & Cooperation"
Duncan Sabien (Deactivated) (Duncan_Sabien) · 2024-05-26T19:32:43.131Z · comments (8)
Prediction Markets aren't Magic
SimonM · 2023-12-21T12:54:07.754Z · comments (29)
[link] Introducing METR's Autonomy Evaluation Resources
Megan Kinniment (megan-kinniment) · 2024-03-15T23:16:59.696Z · comments (0)
Based Beff Jezos and the Accelerationists
Zvi · 2023-12-06T16:00:08.380Z · comments (29)
[link] Linkpost: A Post Mortem on the Gino Case
Linch · 2023-10-24T06:50:42.896Z · comments (7)
[link] New report: Safety Cases for AI
joshc (joshua-clymer) · 2024-03-20T16:45:27.984Z · comments (13)
AI #73: Openly Evil AI
Zvi · 2024-07-18T14:40:05.770Z · comments (20)
Public Call for Interest in Mathematical Alignment
Davidmanheim · 2023-11-22T13:22:09.558Z · comments (9)
Partial value takeover without world takeover
KatjaGrace · 2024-04-05T06:20:03.961Z · comments (23)
story-based decision-making
bhauth · 2024-02-07T02:35:27.286Z · comments (11)
[link] Large Language Models can Strategically Deceive their Users when Put Under Pressure.
ReaderM · 2023-11-15T16:36:04.446Z · comments (8)
Singular learning theory: exercises
Zach Furman (zfurman) · 2024-08-30T20:00:03.785Z · comments (3)
[link] I found >800 orthogonal "write code" steering vectors
Jacob G-W (g-w1) · 2024-07-15T19:06:17.636Z · comments (19)
[link] Techno-humanism is techno-optimism for the 21st century
Richard_Ngo (ricraz) · 2023-10-27T18:37:39.776Z · comments (5)
Covert Malicious Finetuning
Tony Wang (tw) · 2024-07-02T02:41:51.698Z · comments (4)
On the abolition of man
Joe Carlsmith (joekc) · 2024-01-18T18:17:06.201Z · comments (18)
Stagewise Development in Neural Networks
Jesse Hoogland (jhoogland) · 2024-03-20T19:54:06.181Z · comments (1)
Defining alignment research
Richard_Ngo (ricraz) · 2024-08-19T20:42:29.279Z · comments (21)