LessWrong 2.0 Reader

Training AI agents to solve hard problems could lead to Scheming
Marius Hobbhahn (marius-hobbhahn) · 2024-11-19T00:10:55.522Z · comments (12)

Checking in on Scott's composition image bet with imagen 3
Dave Orr (dave-orr) · 2024-12-22T19:04:17.495Z · comments (0)

[link] Linkpost: Memorandum on Advancing the United States’ Leadership in Artificial Intelligence
Nisan · 2024-10-25T04:37:00.828Z · comments (2)

[link] The Intelligence Curse
lukedrago · 2025-01-03T19:07:43.493Z · comments (23)

Toward Safety Cases For AI Scheming
Mikita Balesni (mykyta-baliesnyi) · 2024-10-31T17:20:06.019Z · comments (1)

Against empathy-by-default
Steven Byrnes (steve2152) · 2024-10-16T16:38:49.926Z · comments (24)

Measuring whether AIs can statelessly strategize to subvert security measures
Alex Mallen (alex-mallen) · 2024-12-19T21:25:28.555Z · comments (0)

AI Alignment via Slow Substrates: Early Empirical Results With StarCraft II
Lester Leong (lester-leong) · 2024-10-14T04:05:05.096Z · comments (9)

Why our politicians aren't Median
Yair Halberstadt (yair-halberstadt) · 2024-11-03T14:03:33.779Z · comments (15)

Base LLMs refuse too
Connor Kissane (ckkissane) · 2024-09-29T16:04:21.343Z · comments (20)

o1 Turns Pro
Zvi · 2024-12-10T17:00:08.036Z · comments (3)

AI #86: Just Think of the Potential
Zvi · 2024-10-17T15:10:06.552Z · comments (8)

[Intuitive self-models] 5. Dissociative Identity (Multiple Personality) Disorder
Steven Byrnes (steve2152) · 2024-10-15T13:31:46.157Z · comments (7)

AI #95: o1 Joins the API
Zvi · 2024-12-19T15:10:05.196Z · comments (1)

AI #96: o3 But Not Yet For Thee
Zvi · 2024-12-26T20:30:06.722Z · comments (8)

[question] Could orcas be (trained to be) smarter than humans? 
Towards_Keeperhood (Simon Skade) · 2024-11-04T23:29:26.677Z · answers+comments (22)

[link] How much I'm paying for AI productivity software (and the future of AI use)
jacquesthibs (jacques-thibodeau) · 2024-10-11T17:11:27.025Z · comments (16)

Read The Sequences As If They Were Written Today
Peter Berggren (peter-berggren) · 2025-01-02T02:51:36.537Z · comments (3)

Seeking Collaborators
abramdemski · 2024-11-01T17:13:36.162Z · comments (15)

AI #87: Staying in Character
Zvi · 2024-10-29T07:10:08.212Z · comments (3)

[link] The Alignment Trap: AI Safety as Path to Power
crispweed · 2024-10-29T15:21:26.545Z · comments (17)

An Illustrated Summary of "Robust Agents Learn Causal World Model"
Dalcy (Darcy) · 2024-12-14T15:02:44.828Z · comments (2)

U.S.-China Economic and Security Review Commission pushes Manhattan Project-style AI initiative
Phib · 2024-11-19T18:42:43.296Z · comments (7)

[link] Parkinson's Law and the Ideology of Statistics
Benquo · 2025-01-04T15:49:21.247Z · comments (0)

Reading RFK Jr so that you don’t have to
braces · 2024-11-22T00:59:19.583Z · comments (1)

Human study on AI spear phishing campaigns
Simon Lermen (dalasnoin) · 2025-01-03T15:11:14.765Z · comments (7)

Win/continue/lose scenarios and execute/replace/audit protocols
Buck · 2024-11-15T15:47:24.868Z · comments (2)

AI #84: Better Than a Podcast
Zvi · 2024-10-03T15:00:07.128Z · comments (7)

Safe Predictive Agents with Joint Scoring Rules
Rubi J. Hudson (Rubi) · 2024-10-09T16:38:16.535Z · comments (10)

Toward Safety Case Inspired Basic Research
Lucas Teixeira · 2024-10-31T23:06:32.854Z · comments (2)

ReSolsticed vol I: "We're Not Going Quietly"
Raemon · 2024-12-26T17:52:33.727Z · comments (4)

[link] a space habitat design
bhauth · 2024-11-25T17:28:48.481Z · comments (13)

[link] The Evals Gap
Marius Hobbhahn (marius-hobbhahn) · 2024-11-11T16:42:46.287Z · comments (7)

Vegans need to eat just enough Meat - emperically evaluate the minimum ammount of meat that maximizes utility
Johannes C. Mayer (johannes-c-mayer) · 2024-12-22T22:08:31.971Z · comments (34)

[question] What Have Been Your Most Valuable Casual Conversations At Conferences?
johnswentworth · 2024-12-25T05:49:36.711Z · answers+comments (20)

AI Assistants Should Have a Direct Line to Their Developers
Jan_Kulveit · 2024-12-28T17:01:58.643Z · comments (6)

[link] new chinese stealth aircraft
bhauth · 2025-01-01T00:19:10.644Z · comments (0)

[link] How Likely Are Various Precursors of Existential Risk?
NunoSempere (Radamantis) · 2024-10-28T13:27:31.620Z · comments (4)

How might we solve the alignment problem? (Part 1: Intro, summary, ontology)
Joe Carlsmith (joekc) · 2024-10-28T21:57:12.063Z · comments (5)

[link] Ideas for benchmarking LLM creativity
gwern · 2024-12-16T05:18:55.631Z · comments (11)

o3, Oh My
Zvi · 2024-12-30T14:10:05.144Z · comments (16)

Luck Based Medicine: No Good Very Bad Winter Cured My Hypothyroidism
Elizabeth (pktechgirl) · 2024-12-08T20:10:02.651Z · comments (3)

[link] The Mysterious Trump Buyers on Polymarket
Annapurna (jorge-velez) · 2024-10-18T13:26:25.565Z · comments (10)

Estimates of GPU or equivalent resources of large AI players for 2024/5
CharlesD · 2024-11-28T23:01:58.522Z · comments (7)

Parental Writing Selection Bias
jefftk (jkaufman) · 2024-10-13T14:00:03.225Z · comments (3)

A Conflicted Linkspost
Screwtape · 2024-11-21T00:37:54.035Z · comments (0)

I Finally Worked Through Bayes' Theorem (Personal Achievement)
keltan · 2024-12-05T02:04:16.547Z · comments (6)

Correct my H5N1 research ($reward)
Elizabeth (pktechgirl) · 2024-12-09T19:07:03.277Z · comments (24)

Claude Sonnet 3.5.1 and Haiku 3.5
Zvi · 2024-10-24T14:50:06.286Z · comments (9)

[link] Anthropic's updated Responsible Scaling Policy
Zac Hatfield-Dodds (zac-hatfield-dodds) · 2024-10-15T16:46:48.727Z · comments (3)

Recent comments

rai on AI: Practical Advice for the Worried

update please!

ricraz on The Field of AI Alignment: A Postmortem, and What To Do About It

FWIW twitter search is ridiculously bad, it's often better to use google instead. In this case I had it as the second result when I googled "richardmcngo twitter safety fundamentals" (richardmcngo being my twitter handle).

quila on quila's Shortform

The incentive problem still remains, such that it's more effective to use the price system than to use a command economy to deal with incentive issues:

going by the linked tweet, does "incentive problem" mean "needing to incentivize individuals to share information about their preferences in some way, which is currently done through their economic behavior, in order for their preferences to be fulfilled"? and contrasted with a "command economy", where everything is planned out long in advance, and possibly on less information about the preferences of individual moral patients?

if so, those sound like abstractions which were relevant to the world so far, but can you not imagine any better way a superintelligence could elicit this information? it does not need to use prices or trade. as some basic examples:

it could let beings enter whatever they want into a computer in real time.
it could give them superintelligent nano-factory-devices (aligned to its own values) which create whatever they want locally^[1], in the most seamless way physically possible.
it could mind-scan those who are okay with this.

(these are just examples; i personally would expect something more complex and less thing-oriented, around moral patients who are okay with/desire it, where superintelligence imbues itself as computation throughout the lowest level of physics upon which this is possible, and so it is as if physics itself is contextually aware and benevolent)

A potentially large crux is I don't really think a utopia is possible, at least in the early years even by superintelligences, because I expect preferences in the new environment to grow unboundedly such that preferences are always dissatisfied

i interpret this to mean "some entities' values will want to use as much matter as they can for things, so not all values can be unboundedly fulfilled". this is true and not a crux for me. i want the world to be as good as possible, and if the moral patient who wants to make unboundedly many paperclips making unboundedly many paperclips would be less good than other things which could be done with the world, then an aligned agent would choose the best tradeoff.

superintelligence is context-aware in this way, it is not {a rigid system which fails to outliers it doesn't expect (e.g.: "tries to create utopia, but instead gives all the matter to whichever maximizer requests it all first"), and so which needs a somewhat less rigid but not-superintelligent system (an economy) to avoid this}.

^[1] (if morally acceptable, e.g. no creating hells)

lgs on Some arguments against a land value tax

LVT applies to all land, but not to the improvements on the land.

We do not care about disincentivizing an investment in land (by which I mean, just buying land). We do care about disincentivizing investments in improvements on the land (by which I include buying the improvement on the land, as well as building such improvements). A signal of LVT intent will not have negative consequences unless it is interpreted as a signal of broader confiscation.

matthew-barnett on Evaluating the historical value misspecification argument

Looking back on this post after a year, I haven't changed my mind about the content of the post, but I agree with Seth Herd when he said this post was "important but not well executed".

In hindsight I was too careless with my language in this post, and I should have spent more time making sure that every single paragraph of the post could not be misinterpreted. As a result of my carelessness, the post was misinterpreted in a predictable direction. And while I'm not sure how much I could have done to eliminate this misinterpretation, I do think that I could have reduced it a fair bit with more effort and attention.

If you're not sure what misinterpretation I'm referring to, I'll just try to restate the main point that I was trying to make below. To be clear, what I say below is not identical to the content of this post (as the post was narrowly trying to respond to the framing of this problem given by MIRI; and in hindsight, it was a mistake to reply in that way), but I think this is a much clearer presentation of one of the main ideas I was trying to convey by writing this post:

In my opinion, a common belief among people theorizing about AI safety around 2015, particularly on LessWrong, was that we would design a general AI system by assigning it a specific goal, and the AI would then follow that goal exactly. This strict adherence to the goal was considered dangerous because the goal itself would likely be subtly flawed or misspecified in a way we hadn’t anticipated. While the goal might appear to match what we want on the surface, in reality, it would be slightly different from what we anticipate, with edge cases that don't match our intentions. The idea was that the AI wouldn’t act in alignment with human intentions—it would rigidly pursue the given goal to its logical extreme, leading to unintended and potentially catastrophic consequences.

The goal in question could theoretically be anything, but it was often imagined as a formal utility function—a mathematical representation of a specific objective that we would directly program into the AI, potentially by hardcoding the goal in a programming language like Python or C++. The AI, acting as a powerful optimizer, would then work to maximize this utility function at any and all costs. However, other forms of goal specification were also considered for illustrative purposes. For example, a common hypothetical scenario was that an AI might be given an English-language instruction, such as "make as many paperclips as you can." In this example, the AI would misinterpret the instruction by interpreting it overly literally. It would focus exclusively on maximizing the number of paperclips, without regard for the broader intentions of the user, such as not harming humanity or destroying the environment in the process.

However, based on how current large language models operate, I don’t think this kind of failure mode is a good match for what we’re seeing in practice. LLMs typically do not misinterpret English-language instructions in the way that these older thought experiments imagined. This isn’t just because LLMs seem to "understand" English better than people expected—it's not that people expected superintelligences would not understand English. My point is not that LLMs possess natural language comprehension, so therefore the LessWrong community was mistaken.

Instead, my claim is that LLMs usually follow and execute user instructions in a manner that aligns with the user's actual intentions. In other words, the AI's actual behavior generally matches what the user meant for them to do, rather than leading to extreme, unintended outcomes caused by rigidly literal interpretations of instructions.

Because LLMs are capable of doing this, despite in my opinion being general AIs, I believe it’s fair to say that the concerns raised by the LessWrong community about AI systems rigidly following misspecified goals were, at least in this specific sense, misguided when applied to the behavior of current LLMs.

max-entropy on The Online Sports Gambling Experiment Has Failed

I feel like we're the blind men with the elephant more often than we'd like to admit. A lot of the time when two people make conflicting claims about society, really they're both right about their substrate of society and the world is just twice as big as either thought.

Another shocker for most people: 20 million people in the US live in trailer parks. People with similar life circumstances tend to accumulate in similar places, only see those places, and thus vastly underestimate the diversity of life experience. (This is also true of everything in https://www.lesswrong.com/posts/KpMNqA5BiCRozCwM3/social-dark-matter)

1a3orn on RohanS's Shortform

So the story goes like this: there are two ways people think of "general intelligence." Fuzzy frame upcoming that I do not fully endorse.

General Intelligence = (general learning algorithm) + (data)
General Intelligence = (learning algorithm) + (general data)

It's hard to describe all the differences here, so I'm just going to enumerate some ways people approach the world differently, depending on the frame.

Seminal text for the first is The Power of Intelligence, which attributes general problem solving entirely to the brain. Seminal text for the second is The Secret of Our Success, which points out that without the load of domain-specific culture, human problem solving is shit.
When the first think of the moon landing, they think "Man, look at that out-of-domain problem solving, that lets a man who evolved in Africa walk on the moon." When the second think of the moon landing, they think of how human problem solving is so situated that we needed to not just hire the Nazis who had experience with rockets but put them in charge.
The first thinks of geniuses as those with a particularly high dose of General Intelligence, which is why they solved multiple problems in multiple domains (like Einstein and Newton did). The second thinks of geniuses as slightly smarter-than-average people who probably crested a wave of things that many of their peers might have figured out... and who did so because they were more stubborn, such that eventually they would endorse dumb ideas with as much fervor as they did their good ones (like Einstein and Newton did).
First likes to make analogies of... intelligence to entire civilizations. Second thinks that's cool, but look -- civilization does lots of things brains empirically don't, so maybe civilization is the problem-solving unit generally? Like the humans who walked on the moon did not, in fact, get their training data from the savannah, and that seems pretty relevant.
First... expects LLMs to not make it, because they are bad at out-of-domain thinking, maybe. Second is like, sure, LLMs are bad at out-of-domain thinking. So are humans, so what? Spiky intelligence and so on. Science advances not in one mind, but with the funeral of each mind. LLMs lose plasticity as they train. Etc.

frisby on The subset parity learning problem: much more than you wanted to know

Confirming that efficiently finding a small circuit (you don't actually need further restrictions than size) based on its values on a fixed collection of test inputs is known to imply --- see this paper.

kabir-kumar on Why I'm Moving from Mechanistic to Prosaic Interpretability

Intelligence is computation. Its measure is success. General intelligence is more generally successful.

christiankl on The Online Sports Gambling Experiment Has Failed

It's a combination of the way LLMs work (they predict the most likely token, which is very similar to predicting common wisdom) and experience of interacting with LLMs.

Pattern matching also matters. After reading the answers from Claude and ChatGPT, you can ask yourself what you expect a person to tell you when you ask them for the top five reasons and how likely it is that they will tell you "sports online betting" as one of the top five reasons.