LessWrong 2.0 Reader

Housing Roundup #9: Restricting Supply
Zvi · 2024-07-17T12:50:05.321Z · comments (8)

Virtually Rational - VRChat Meetup
Tomás B. (Bjartur Tómas) · 2024-01-28T05:52:36.934Z · comments (3)

Beta Tester Request: Rallypoint Bounties
lukemarks (marc/er) · 2024-05-25T09:11:11.446Z · comments (4)

[link] Transformer Debugger
Henk Tillman (henk-tillman) · 2024-03-12T19:08:56.280Z · comments (0)

[link] Arrogance and People Pleasing
Jonathan Moregård (JonathanMoregard) · 2024-02-06T18:43:09.120Z · comments (7)

Using an LLM perplexity filter to detect weight exfiltration
Adam Karvonen (karvonenadam) · 2024-07-21T18:18:05.612Z · comments (11)

[link] Report: Evaluating an AI Chip Registration Policy
Deric Cheng (deric-cheng) · 2024-04-12T04:39:45.671Z · comments (0)

[link] The Best Essay (Paul Graham)
Chris_Leong · 2024-03-11T19:25:42.176Z · comments (2)

[link] Structured Transparency: a framework for addressing use/mis-use trade-offs when sharing information
habryka (habryka4) · 2024-04-11T18:35:44.824Z · comments (0)

[link] An Intuitive Explanation of Sparse Autoencoders for Mechanistic Interpretability of LLMs
Adam Karvonen (karvonenadam) · 2024-06-25T15:57:16.872Z · comments (0)

Useful starting code for interpretability
eggsyntax · 2024-02-13T23:13:47.940Z · comments (2)

[link] Sticker Shortcut Fallacy — The Real Worst Argument in the World
ymeskhout · 2024-06-12T14:52:41.988Z · comments (15)

An experiment on hidden cognition
Olli Järviniemi (jarviniemi) · 2024-07-22T03:26:05.564Z · comments (2)

$250K in Prizes: SafeBench Competition Announcement
ozhang (oliver-zhang) · 2024-04-03T22:07:41.171Z · comments (0)

[link] MIRI's July 2024 newsletter
Harlan · 2024-07-15T21:28:17.343Z · comments (2)

The Wisdom of Living for 200 Years
Martin Sustrik (sustrik) · 2024-06-28T04:44:10.609Z · comments (3)

Decent plan prize winner & highlights
lukehmiles (lcmgcd) · 2024-01-19T23:30:34.242Z · comments (2)

[link] Announcing Open Philanthropy's AI governance and policy RFP
Julian Hazell (julian-hazell) · 2024-07-17T02:02:39.933Z · comments (0)

Testing for consequence-blindness in LLMs using the HI-ADS unit test.
David Scott Krueger (formerly: capybaralet) (capybaralet) · 2023-11-24T23:35:29.560Z · comments (2)

[link] Beware the science fiction bias in predictions of the future
Nikita Sokolsky (nikita-sokolsky) · 2024-08-19T05:32:47.372Z · comments (20)

[question] What's the Deal with Logical Uncertainty?
Ape in the coat · 2024-09-16T08:11:43.588Z · answers+comments (21)

[link] A primer on the next generation of antibodies
Abhishaike Mahajan (abhishaike-mahajan) · 2024-09-01T22:37:59.207Z · comments (0)

Distinguish worst-case analysis from instrumental training-gaming
Olli Järviniemi (jarviniemi) · 2024-09-05T19:13:34.443Z · comments (0)

A Visual Task that's Hard for GPT-4o, but Doable for Primary Schoolers
Lennart Finke (l-f) · 2024-07-26T17:51:28.202Z · comments (4)

[link] Fictional parasites very different from our own
Abhishaike Mahajan (abhishaike-mahajan) · 2024-09-08T14:59:39.080Z · comments (0)

Seeking Mechanism Designer for Research into Internalizing Catastrophic Externalities
c.trout (ctrout) · 2024-09-11T15:09:48.019Z · comments (2)

Foresight Institute: 2023 Progress & 2024 Plans for funding beneficial technology development
Allison Duettmann (allison-duettmann) · 2023-11-22T22:09:16.956Z · comments (1)

[link] Clickbait Soapboxing
DaystarEld · 2024-03-13T14:09:29.890Z · comments (15)

A brief review of China's AI industry and regulations
Elliot_Mckernon (elliot) · 2024-03-14T12:19:00.775Z · comments (0)

Population ethics and the value of variety
cousin_it · 2024-06-23T10:42:21.402Z · comments (11)

[link] The Living Planet Index: A Case Study in Statistical Pitfalls
Jan_Kulveit · 2024-06-24T10:05:55.101Z · comments (0)

Even if we lose, we win
Morphism (pi-rogers) · 2024-01-15T02:15:43.447Z · comments (17)

Distillation of 'Do language models plan for future tokens'
TheManxLoiner · 2024-06-27T20:57:34.351Z · comments (2)

Best-of-n with misaligned reward models for Math reasoning
Fabien Roger (Fabien) · 2024-06-21T22:53:21.243Z · comments (0)

[link] The absence of self-rejection is self-acceptance
Chipmonk · 2023-12-21T21:54:52.116Z · comments (1)

[link] Secret US natsec project with intel revealed
Nathan Helm-Burger (nathan-helm-burger) · 2024-05-25T04:22:11.624Z · comments (0)

[link] Robert Caro And Mechanistic Models In Biography
adamShimi · 2024-07-14T10:56:42.763Z · comments (5)

[link] Truth is Universal: Robust Detection of Lies in LLMs
Lennart Buerger · 2024-07-19T14:07:25.162Z · comments (3)

[question] What percent of the sun would a Dyson Sphere cover?
Raemon · 2024-07-03T17:27:50.826Z · answers+comments (26)

aintelope project update
Gunnar_Zarncke · 2024-02-08T18:32:00.000Z · comments (2)

Weeping Agents
pleiotroth · 2024-06-06T12:18:54.978Z · comments (2)

An evaluation of Helen Toner’s interview on the TED AI Show
PeterH · 2024-06-06T17:39:40.800Z · comments (2)

A Basic Economics-Style Model of AI Existential Risk
Rubi J. Hudson (Rubi) · 2024-06-24T20:26:09.744Z · comments (3)

[question] Could there be "natural impact regularization" or "impact regularization by default"?
tailcalled · 2023-12-01T22:01:46.062Z · answers+comments (6)

[link] Alignment work in anomalous worlds
Tamsin Leake (carado-1) · 2023-12-16T19:34:26.202Z · comments (4)

Technology path dependence and evaluating expertise
bhauth · 2024-01-05T19:21:23.302Z · comments (2)

Building Trust in Strategic Settings
StrivingForLegibility · 2023-12-28T22:12:24.024Z · comments (0)

How Congressional Offices Process Constituent Communication
Tristan Williams (tristan-williams) · 2024-07-02T12:38:41.472Z · comments (0)

Scientific Method
Andrij “Androniq” Ghorbunov (andrij-androniq-ghorbunov) · 2024-02-18T21:06:45.228Z · comments (4)

[link] Review of Alignment Plan Critiques- December AI-Plans Critique-a-Thon Results
Iknownothing · 2024-01-15T19:37:07.984Z · comments (0)

Recent comments

niplav on shortplav

The question is, would it be better for companies than the current situation? Because it's the company who decides the form of the interview, so if the answer is negative, this is not going to happen.

Yeah, I don't think this is going to be adopted very soon. My best guess at how that could happen is if people try it in low-stakes contexts where the parties are ~symmetric in power, and this then spreads through e.g. people who do consulting for small startups, to salaries for high-value employees in small startups, to salaries for high-value employees in general etc.

Another way this could happen is if unions push for it, but I don't see that happening anytime soon.

(I'm going to see whether me putting this up as a way of determining rates can work, but probably not.)

niplav on shortplav

Yeah, the spherical cow system would be using the VCG mechanism with the Clarke pivot rule, but that would usually require some subsidy. There can be no spherical cow system which elicits truthful bids without subsidy, sadly :-/.
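To make the subsidy point concrete, here is a minimal sketch of VCG with the Clarke pivot rule for a single buyer and a single seller (for salaries, read "buyer" as the employer and "seller" as the employee). The names `Outcome` and `vcg_bilateral` and the example numbers are hypothetical, chosen only for illustration. Each side is charged the externality it imposes on the other, so the buyer pays the seller's ask while the seller receives the buyer's bid, and someone else has to cover the gap.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    trade: bool
    buyer_pays: float
    seller_receives: float

    @property
    def subsidy(self) -> float:
        # Money an outside party must add so that the payments balance.
        return self.seller_receives - self.buyer_pays


def vcg_bilateral(buyer_bid: float, seller_ask: float) -> Outcome:
    """VCG with the Clarke pivot rule for one buyer and one seller."""
    if buyer_bid < seller_ask:
        # No gains from trade, so nothing changes hands.
        return Outcome(False, 0.0, 0.0)
    # Clarke pivot: each party's transfer equals the other party's reported value,
    # so neither can improve its own transfer by misreporting.
    return Outcome(True, buyer_pays=seller_ask, seller_receives=buyer_bid)


outcome = vcg_bilateral(buyer_bid=120.0, seller_ask=100.0)
print(outcome.trade, outcome.buyer_pays, outcome.seller_receives, outcome.subsidy)
```

With these numbers the trade happens, the buyer pays 100, the seller receives 120, and the mechanism runs a deficit of 20, which is the subsidy mentioned above.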

niplav on shortplav

Does this require some sort of enforcement mechanism to ensure that neither party puts in a bad-faith bid as a discovery mechanism for what number to seek in their real negotiations?

Maybe there's a misunderstanding here: the mechanism I was writing about would be the "real negotiations" (whatever result falls out of the mechanism now is what's going to happen). As in, there can be a lot of talking about salaries before this two-sided sealed auction is performed, but the salary is decided through the auction.
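A rough sketch of what such a binding two-sided sealed auction could look like. The comment does not say how the final number is computed from the two bids, so the split-the-difference rule here is purely an assumption, and `sealed_salary_auction` is a hypothetical name.

```python
from typing import Optional


def sealed_salary_auction(employer_max: float, candidate_min: float) -> Optional[float]:
    """Return the binding salary, or None when the sealed bids do not overlap."""
    if employer_max < candidate_min:
        return None
    # Assumed rule for illustration only: split the difference between the bids.
    return (employer_max + candidate_min) / 2


print(sealed_salary_auction(employer_max=120_000, candidate_min=100_000))
print(sealed_salary_auction(employer_max=90_000, candidate_min=100_000))
```

Under this assumed rule each side can shift the result in its favour by shading its bid, which is one reason a rule like this does not elicit truthful bids.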

I know of some software engineers who have published their salary history online.

jenniferrm on ASIs will not leave just a little sunlight for Earth

I appreciate your desire for this clarity, but I think the counter argument might actually just be "the oversimplifying assumption that everyone's labor just ontologically goes on existing is only true if society (and/or laws and/or voters-or-strongmen) make it true on purpose (which they tended to do, for historically contingent reasons, in some parts of Earth, for humans, and some pets, between the late 1700s and now)".

You could ask: why is the holocene extinction occurring when Ricardo's Law of Comparative Advantage says that wooly mammoths (and many amphibian species) and cave men could have traded...

...but once you put it that way, it is clear that it really kinda was NOT in the narrow short term interests of cave men to pay the costs inherent in respecting the right to life and right to property of beasts that can't reason about natural law.

Turning land away from use by amphibians and towards agriculture was just... good for humans and bad for frogs. So we did it. Simple as.

The math of ecology says: life eats life, and every species goes extinct eventually. The math of economics says: the richer you are, the more you can afford to be linearly risk tolerant (which is sort of the definition of prudent sanity) for larger and larger choices, and the faster you'll get richer than everyone else, and so there's probably "one big rich entity" at the end of economic history.

Once humans close their heart to other humans and "just stop counting those humans over there as having interests worth calculating about at all" it really does seem plausible that genocide is simply "what many humans would choose to do, given those (evil) values".

Slavery is legal in the US, after all. And the CCP has Uighur Gulags. And my understanding is that Darfur is headed for famine?

I think this is sort of the "ecologically economic core" of Eliezer's position: kindness is simply not a globally instrumentally convergent tactic across all possible ecological and economic regimes... right now quite a few humans want there to not be genocide and slavery of other humans, but if history goes in a sad way in the next ~100 years, there's a decent chance the other kind of human (the ones that quite like the long term effects of the genocide and/or enslavement of other sapient beings) will eventually get their way and genocide a bunch of other humans.

If all of modern morality is a local optimum that is probably not the global optimum, then you might look out at the larger world and try and figure out what naturally occurs [LW · GW] when the powerful do as they will, and the weak cope as they can...

Once the billionaires like Putin and Xi and Trump and so on don't need human employees any more, it seems plausible they could aim for a global Earth population of humans of maybe 20,000 people, plus lots and lots of robot slaves?

It seems quite beautiful and nice to be here, now, with so many people having so many dreams, and so many of us caring about caring about other sapient beings... but unless we purposefully act to retain this moral shape, in ourselves and in our digital and human progeny, we (and they) will probably fall out of this shape in the long run.

And that would be sad. For quite a few philosophic reasons, and also for over 7 billion human reasons.

And personally, I think the only way to "keep the party going" even for a few more centuries or millennia is to become extremely wealthy.

I think we should be mining asteroids, and building fusion plants, and building new continents out of ice, and terraforming Venus and Mars, and I think we should build digital people who know how precious and rare humane values are, so they can enjoy the party with us, and keep it going for longer than we could plausibly hope to (since we tend to be pretty terrible at governing ourselves).

But we shouldn't believe good outcomes are inevitable or even likely, because they aren't. If something slightly smarter than us with a feasible doubling time of weeks instead of decades arrives, we could be the next frogs.

lil pepe (@pagcompliments) / X

tailcalled on We Don't Know Our Own Values, but Reward Bridges The Is-Ought Gap

But propositional calculus and first-order logic exist to support mathematics, which was developed before formal logic. What's your mathematics-of-value, rather than your logic-of-value?

tailcalled on ASIs will not leave just a little sunlight for Earth

It's also not enough for there to be a force that makes the AI care a little about human thriving. It's also necessary for this force to not make the AI care a lot about some extremely distorted version of you; as then we get into concepts like tiny molecular smiles, locking you in a pleasuredome, etc.

If you're not supposed to end up as a pet of the AI, then it seems like it needs to respect property rights, but that is easier said than done when considering massive differences in ability. Consider: would we even be able to have a society where we respected property rights of dogs? It seems like it would be difficult. How could we confirm a transaction without the dogs being defrauded of everything?

Probably an intermediate solution would be to just accept humans will be defrauded of everything very rapidly but then give us universal basic income or something so our failures aren't permanent setbacks. But it's unclear how to respect the freedom of funding while preventing people from funding terrorists and not encouraging people to get lost in junk. That's really where the issue of values becomes hard.

richard_kennaway on We Don't Know Our Own Values, but Reward Bridges The Is-Ought Gap

It's an interesting open problem.

Here is an analogy. Classical utility theory, as developed by VNM, Savage, and others, the theory of which Eliezer made the searchlight comment, is like propositional calculus. The propositional calculus exists, it's useful, you cannot ever go against it without falling into contradiction, but there's not enough there to do much mathematics. For that you need to invent at least first-order logic, and use that to axiomatise arithmetic and eventually all of mathematics, while fending off the paradoxes of self-reference. And all through that, there is the propositional calculus, as valid and necessary as ever, but mathematics requires a great deal more.

The theory that would deal with the "monsters" that I listed does not yet exist. The idea of expected utility may thread its way through all of that greater theory when we have it, but we do not have it. Until we do, talk of the utility function of a person or of an AI is at best sensing what Eliezer has called the rhythm of the situation [? · GW]. To place over-much reliance on its letter will fail.

faul_sname on ASIs will not leave just a little sunlight for Earth

You can imagine a version of Stockfish which does that -- a chessplayer which, if it's sure it can win anyways, will start letting you have a pawn or two -- but it's not simpler to build.

I think it sometimes is simpler to build? Simple RL game-playing agents sometimes exhibit exactly that sort of behavior, unless you make an explicit effort to train it out of them.

For example, HexHex is a vaguely-AlphaGo-shaped RL agent for the game of Hex. The reward function used to train the agent was "maximize the assessed probability of winning", not "maximize the assessed probability of winning, and also go hard even if that doesn't affect the assessed probability of winning". In their words:

We found it difficult to train the agent to quickly end a surely won game. When you play against the agent you'll notice that it will not pick the quickest path to victory. Some people even say it's playing mean ;-) Winning quickly simply wasn't part of the objective function! We found that penalizing long routes to victory either had no effect or degraded the performance of the agent, depending on the amount of penalization. Probably we haven't found the right balance there.
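A toy illustration of why an objective that only scores win probability is indifferent between a quick and a slow win. This is not HexHex's code; the `Move` type, the numbers, and the penalty value are all made up for the sketch, and it says nothing about how such a penalty behaves during training (the authors report it did not help).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Move:
    name: str
    win_prob: float    # assessed probability of winning after this move
    moves_to_win: int  # expected number of moves until the game ends


def pick_by_win_prob(moves):
    # Objective as described: win probability only. Ties are not broken by
    # game length; max() simply returns the first of the tied candidates.
    return max(moves, key=lambda m: m.win_prob)


def pick_with_length_penalty(moves, penalty=0.001):
    # Hypothetical variant that also penalizes long routes to victory.
    return max(moves, key=lambda m: m.win_prob - penalty * m.moves_to_win)


candidates = [
    Move("slow_win", win_prob=0.999, moves_to_win=30),
    Move("quick_win", win_prob=0.999, moves_to_win=3),
]

print(pick_by_win_prob(candidates).name)
print(pick_with_length_penalty(candidates).name)
```

The first selector has no reason to prefer the quick win, so it takes whichever tied move it sees first; only the second, with an explicit extra term, prefers ending the game sooner.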

Along similar lines, the first attack on KataGo found by Wang et al in Adversarial Policies Beat Superhuman Go AIs was the pass-adversary. The pass-adversary first sets up a losing board position where it controls a small amount of territory and KataGo has a large amount of territory it would end up controlling if the game was played out fully. However, KataGo chooses to pass, since it assesses that the probability of winning from that position is similar if it does or does not make a move, and then the pass-adversary also passes, ending the game and winning by a quirk of the scoring rules.

Similarly, there isn't an equally-simple version of GPT-o1 that answers difficult questions by trying and reflecting and backing up and trying again, but doesn't fight its way through a broken software service to win an "unwinnable" capture-the-flag challenge. It's all just general intelligence at work.

I suspect that a version of GPT-o1 that is tuned to answer difficult questions in ways that human raters would find unsurprising would work just fine. I think "it's all just general intelligence at work" is a semantic stop sign [LW · GW], and if you dig into what you mean by "general intelligence at work" you get to the fiddly implementation details of how the agent tries to solve the problem. So you may for example see an OODA-loop-like structure like

1. Assess the situation
2. Figure out what affordances there are for doing things
3. For each of the possible actions, figure out what you would expect the outcome of that action to be. Maybe figure out ways it could go wrong, if you're feeling super advanced.
4. Choose one of the actions, or choose to give up if no sufficiently good action is available
5. Do the action
6. Determine how closely the result matches what you expect

An agent which "goes hard", in this case, is one which leans very strongly against the "give up" action in step 4. However, I expect that if you have some runs where the raters would have hoped for a "give up" instead of the thing the agent actually did, it would be pretty easy to generate a reinforcement signal which makes the agent more likely to mash the "give up" button in analogous situations without harming performance very much in other situations. I also expect that would generalize.
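The loop above, and the "give up" lever in step 4, can be sketched roughly as follows. Everything here (`Action`, `choose`, `step`, the threshold, the example actions and their scores) is hypothetical scaffolding for illustration, not a description of how any real agent is implemented.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Action:
    name: str
    expected_value: float     # step 3: predicted quality of the outcome
    run: Callable[[], float]  # step 5: performs the action, returns observed quality


def choose(actions: Sequence[Action], give_up_threshold: float) -> Optional[Action]:
    """Step 4: pick the best action, or return None to give up."""
    best = max(actions, key=lambda a: a.expected_value, default=None)
    if best is None or best.expected_value < give_up_threshold:
        return None
    return best


def step(actions: Sequence[Action], give_up_threshold: float):
    chosen = choose(actions, give_up_threshold)
    if chosen is None:
        return "give_up", None
    observed = chosen.run()                            # step 5
    surprise = abs(observed - chosen.expected_value)   # step 6
    return chosen.name, surprise


actions = [
    Action("retry_request", expected_value=0.25, run=lambda: 0.0),
    Action("edit_service", expected_value=0.5, run=lambda: 0.25),
]

print(step(actions, give_up_threshold=0.0))   # an agent that "goes hard"
print(step(actions, give_up_threshold=0.75))  # an agent more willing to give up
```

In this framing, "going hard" corresponds to a very low threshold, and the reinforcement signal described above would amount to raising it in the situations where raters wanted a "give up".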

As a note, "you have to edit the service and then start the modified service" is the sort of thing I would be unsurprised to see in a CTF challenge, unless the rules of the challenge explicitly said not to do that.

But that is not the default outcome when OpenAI tries to train a smarter, more salesworthy AI.

I think you should break out "smarter" from "more salesworthy". In terms of "smarter", optimizing for task success at all costs is likely to train in patterns of bad behavior. In terms of "more salesworthy", businesses are going to care a lot about "will explain why the goal is not straightforwardly achievable rather than executing galaxy-brained evil-genie plans". As such, a modestly smart Do What I Mean and Check [LW · GW] agent is a much easier sell than a superintelligent evil genie agent.

If an AI is easygoing and therefore can't solve hard problems, then it's not the most profitable possible AI, and OpenAI will keep trying to build a more profitable one.

I expect the tails come apart along the "smart" and "profitable" axes.

tailcalled on ASIs will not leave just a little sunlight for Earth

Yes, I agree that this conditional statement is obvious. But while we're on the general topic of whether Earth will be kept alive, it would be nice to see some engagement with Paul Christiano's arguments (which Carl Shulman "agree[s] with [...] approximately in full") that superintelligences might care about what happens to you a little bit, articulated in a comment thread on Soares's "But Why Would the AI Kill Us?" and another thread on "Cosmopolitan Values Don't Come Free",

Nate Soares engaged extensively with this in reasonable-seeming ways that I'd thus expect Eliezer Yudkowsky to mostly agree with. Mostly it seems like a disagreement where Paul Christiano doesn't really have a model of what realistically causes good outcomes and so he's really uncertain, whereas Soares has a proper model and so is less uncertain.

But you can't really argue with someone whose main opinion is "I don't know", since "I don't know" is just garbage. He's gotta at least present some new powerful observable forces, or reject some of the forces presented, rather than postulating that maybe there's an unobserved kindness force that arbitrarily explains all the kindness that we see.

quetzal_rainbow on ASIs will not leave just a little sunlight for Earth

I think the simplest reply to "caring a little" is that there is a difference between "caring a little" and "caring enough". Let's say the AI is ready to pay $1 for your survival. If you live in an economy that is rapidly disassembling Earth into a Dyson swarm, then oxygen, a protected environment, and food are not just stuff lying around; they are complex, expensive artifacts. The AI is certainly not ready to pay for an O'Neill cylinder for you to be evacuated into, nor to pay the opportunity cost of not disassembling Earth, so you die.

The other case is the difference between "caring in general" and "caring ceteris paribus". It's possible for an AI to prefer, all else equal, a world with n+1 happy humans to a world with n happy humans. But what the AI really wants is to implement some particular neuromorphic computation from the human brain, and, given the ability to operate freely, it would tile the world with chips imitating part of a human brain.