LessWrong 2.0 Reader



[question] Superintelligence Strategy: A Pragmatic Path to… Doom?
Mr Beastly (mr-beastly) · 2025-03-19T22:30:50.796Z · answers+comments (0)

SHIFT relies on token-level features to de-bias Bias in Bios probes
Tim Hua · 2025-03-19T21:29:15.974Z · comments (2)

Janet must die
Shmi (shminux) · 2025-03-19T20:35:09.768Z · comments (3)

[question] Why am I getting downvoted on Lesswrong?
Oxidize · 2025-03-19T18:32:47.243Z · answers+comments (14)

[link] Forecasting AI Futures Resource Hub
Alvin Ånestrand (alvin-anestrand) · 2025-03-19T17:26:28.059Z · comments (0)

[link] TBC episode w Dave Kasten from Control AI on AI Policy
Eneasz · 2025-03-19T17:09:50.841Z · comments (0)

Prioritizing threats for AI control
ryan_greenblatt · 2025-03-19T17:09:45.044Z · comments (2)

The Illusion of Transparency as a Trust-Building Mechanism
Priyanka Bharadwaj (priyanka-bharadwaj) · 2025-03-19T17:09:05.830Z · comments (0)

How Do We Govern AI Well?
kaime (khalid-ali) · 2025-03-19T17:08:49.601Z · comments (0)

[link] METR: Measuring AI Ability to Complete Long Tasks
Zach Stein-Perlman · 2025-03-19T16:00:54.874Z · comments (104)

Why I think AI will go poorly for humanity
Alek Westover (alek-westover) · 2025-03-19T15:52:18.373Z · comments (0)

The principle of genomic liberty
TsviBT · 2025-03-19T14:27:57.175Z · comments (51)

Going Nova
Zvi · 2025-03-19T13:30:01.293Z · comments (14)

Equations Mean Things
abstractapplic · 2025-03-19T08:16:35.312Z · comments (10)

[link] Elite Coordination via the Consensus of Power
Richard_Ngo (ricraz) · 2025-03-19T06:56:44.825Z · comments (15)

What I am working on right now and why: representation engineering edition
Lukasz G Bartoszcze (lukasz-g-bartoszcze) · 2025-03-18T22:37:45.363Z · comments (0)

Boots theory and Sybil Ramkin
philh · 2025-03-18T22:10:08.855Z · comments (17)

[link] Schmidt Sciences Technical AI Safety RFP on Inference-Time Compute – Deadline: April 30
Ryan Gajarawala (ryan-gajarawala) · 2025-03-18T18:05:34.757Z · comments (0)

PRISM: Perspective Reasoning for Integrated Synthesis and Mediation (Interactive Demo)
Anthony Diamond (anthony-diamond) · 2025-03-18T18:03:26.804Z · comments (0)

Subspace Rerouting: Using Mechanistic Interpretability to Craft Adversarial Attacks against Large Language Models
Le magicien quantique · 2025-03-18T17:55:07.016Z · comments (1)

[link] Progress links and short notes, 2025-03-18
jasoncrawford · 2025-03-18T17:14:35.365Z · comments (0)

The Convergent Path to the Stars
Maxime Riché (maxime-riche) · 2025-03-18T17:09:37.046Z · comments (0)

[link] Sapir-Whorf Ego Death
Jonathan Moregård (JonathanMoregard) · 2025-03-18T16:57:21.437Z · comments (7)

[link] Smelling Nice is Good, Actually
Gordon Seidoh Worley (gworley) · 2025-03-18T16:54:43.324Z · comments (8)

[link] A Taxonomy of Jobs Deeply Resistant to TAI Automation
Deric Cheng (deric-cheng) · 2025-03-18T16:25:55.562Z · comments (0)

Why Are The Human Sciences Hard? Two New Hypotheses
Aydin Mohseni (aydin-mohseni) · 2025-03-18T15:45:52.239Z · comments (14)

Go home GPT-4o, you’re drunk: emergent misalignment as lowered inhibitions
Stuart_Armstrong · 2025-03-18T14:48:54.762Z · comments (12)

[question] What is the theory of change behind writing papers about AI safety?
Kajus · 2025-03-18T12:51:31.405Z · answers+comments (1)

OpenAI #11: America Action Plan
Zvi · 2025-03-18T12:50:03.880Z · comments (3)

I changed my mind about orca intelligence
Towards_Keeperhood (Simon Skade) · 2025-03-18T10:15:29.860Z · comments (24)

[question] Is Peano arithmetic trying to kill us? Do we care?
Q Home · 2025-03-18T08:22:27.761Z · answers+comments (2)

Do What the Mammals Do
CrimsonChin · 2025-03-18T03:57:56.083Z · comments (6)

What Actually Matters Until We Reach the Singularity
Lexius (Convalexius) · 2025-03-18T02:17:16.144Z · comments (0)

Meaning as a cognitive substitute for survival instincts: A thought experiment
Ovidijus Šimkus (ovidijus-simkus) · 2025-03-18T01:53:52.411Z · comments (0)

Against Yudkowsky's evolution analogy for AI x-risk [unfinished]
Fiora Sunshine (Fiora from Rosebloom) · 2025-03-18T01:41:06.453Z · comments (18)

An "AI researcher" has written a paper on optimizing AI architecture and optimized a language model to several orders of magnitude more efficiency.
Y B (y-b) · 2025-03-18T01:15:34.589Z · comments (1)

LessOnline 2025: Early Bird Tickets On Sale
Ben Pace (Benito) · 2025-03-18T00:22:02.653Z · comments (4)

Feedback loops for exercise (VO2Max)
Elizabeth (pktechgirl) · 2025-03-18T00:10:06.827Z · comments (9)

FrontierMath Score of o3-mini Much Lower Than Claimed
YafahEdelman (yafah-edelman-1) · 2025-03-17T22:41:06.527Z · comments (7)

Proof-of-Concept Debugger for a Small LLM
Peter Lai (peter-lai) · 2025-03-17T22:27:52.386Z · comments (0)

Effectively Communicating with DC Policymakers
PolicyTakes · 2025-03-17T22:11:56.197Z · comments (0)

[link] Mind the Gap
Bridgett Kay (bridgett-kay) · 2025-03-17T21:59:35.113Z · comments (0)

EIS XV: A New Proof of Concept for Useful Interpretability
scasper · 2025-03-17T20:05:30.580Z · comments (2)

[link] Sentinel's Global Risks Weekly Roundup #11/2025. Trump invokes Alien Enemies Act, Chinese invasion barges deployed in exercise.
NunoSempere (Radamantis) · 2025-03-17T19:34:01.850Z · comments (3)

Claude Sonnet 3.7 (often) knows when it’s in alignment evaluations
Nicholas Goldowsky-Dill (nicholas-goldowsky-dill) · 2025-03-17T19:11:00.813Z · comments (7)

Things Look Bleak for White-Collar Jobs Due to AI Acceleration
Declan Molony (declan-molony) · 2025-03-17T17:03:35.585Z · comments (0)

[link] Three Types of Intelligence Explosion
rosehadshar · 2025-03-17T14:47:46.696Z · comments (8)

An Advent of Thought
Kaarel (kh) · 2025-03-17T14:21:08.765Z · comments (8)

Interested in working from a new Boston AI Safety Hub?
agucova · 2025-03-17T13:42:19.509Z · comments (0)

Other Civilizations Would Recover 84+% of Our Cosmic Resources - A Challenge to Extinction Risk Prioritization
Maxime Riché (maxime-riche) · 2025-03-17T13:12:09.770Z · comments (0)



Recent comments

david-matolcsi on Karma Tests in Logical Counterfactual Simulations motivates strong agents to protect weak agents

Thanks for the reply, I broadly agree with your points here. I agree we should probably eventually try to do trades across logical counter-factuals. Decreasing logical risk is one good framing for that, but in general, there are just positive trades to be made.

However, I think you are still underestimating how hard it might be to strike these deals. "Be kind to other existing agents" is a natural idea to us, but it's still unclear to me whether it's something you should assign high probability to as a preference of logically counter-factual beings. Sure, there is enough room for humans and mosquitos, but if you relax 'agent' and 'existing', suddenly there is not enough room for everyone. You can argue that "be kind to existing agents" is plausibly a statement with relatively short description length, so it will be among the first guesses of the AI, which will allocate at least some fraction of the universe to it. But once we are trading across logical counter-factuals, I'm not sure you can trust things like description length. Maybe in the logically counter-factual universe they assign higher value/probability to longer instead of shorter statements, but the measure still sums to 1, because math works differently there.

Similarly, you argue that loving torture is probably rare, based on evolutionary grounds. But logically counter-factual beings weren't necessarily born through evolution. I have no idea how we should determine the distribution of logical counter-factuals, and I don't know what fraction enjoys torture in that distribution.

Altogether, I agree logical trade is eventually worth trying, but it will be very hard and confusing and I see a decent chance that it basically won't work at all.

xelap on Utility Maximization = Description Length Minimization

There's a minor error in the formula giving the cross entropy: you need a minus sign on the RHS so that it reads E[-log P[X|M_2] | M_1]

The preceding text is "Of course, we could be wrong about the distribution - we could use a code optimized for a model M2 which is different from the “true” model M1. In this case, the average number of bits used will be"
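A quick numeric sanity check of the corrected formula, as a sketch only: the two distributions `M1` and `M2` below are made-up toy examples over three outcomes, not anything from the post. It shows that the expectation is taken under the true model M1 while the code lengths come from M2, and that with the minus sign the average code length equals the entropy of M1 plus the KL divergence from M1 to M2.

```python
import math

def entropy(p):
    """Average bits per symbol when the code is optimized for the true distribution p."""
    return -sum(px * math.log2(px) for px in p.values() if px > 0)

def cross_entropy(p_true, q_code):
    """E[-log2 q_code(X)] with X drawn from p_true: average bits per symbol
    when samples from p_true are encoded with a code optimized for q_code."""
    return -sum(px * math.log2(q_code[x]) for x, px in p_true.items() if px > 0)

def kl_divergence(p, q):
    """D_KL(p || q) in bits: the extra cost of using the mismatched code."""
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)

# Illustrative toy distributions (made up for this example)
M1 = {"a": 0.5, "b": 0.25, "c": 0.25}  # the "true" model
M2 = {"a": 0.25, "b": 0.25, "c": 0.5}  # the model the code was optimized for

print(entropy(M1))            # 1.5
print(cross_entropy(M1, M2))  # 1.75
print(kl_divergence(M1, M2))  # 0.25
```

Dropping the leading minus in `cross_entropy` would give -1.75, a negative number of bits, which is the symptom the comment points out.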

tenoke on aog's Shortform

It's hard for me to respect a Safety-ish org so obviously wrong about the most important factors of their chosen topic.

I won't judge a random celebrity for expecting e.g. very long timelines but an AI research center? I'm sure they are very cool people but come on.

tenoke on Why Should I Assume CCP AGI is Worse Than USG AGI?

As in ultimately more people are likely to like their condition and agree (comparably more) with the AI's decisions while having roughly equal rights.

samuelshadrach on A Dissent on Honesty

If you can’t provide a few unambiguous examples of the dilemma in the post that actually happened in the real world, I’m less likely to take your post seriously.

Might be worth thinking more and then coming up with examples.

eva_ on A Dissent on Honesty

I consider you to be basically agreeing with me for 90% of what I intended and your disagreements for the other 10% to be the best written of any so far, and basically valid in all the places I'm not replying to it. I still have a few objections:

What if my highest value is getting a pretty girl with a country-sized dowry, while having not betrayed the Truth? ... In short, no, Rationality absolutely can be about both Winning and about The Truth.

I agree the utility function isn't up for grabs and that that is a coherent set of values to have, but I have a criticism I want to make that I feel I don't have the right language for. Maybe you can help me. I want to call that utility function perverse: the kind of utility function that an entity is probably mistaken to imagine itself as having.

For any particular situation you might find yourself in, and any particular sequence of actions you might take in that situation, there is a possible utility function you could be said to have such that the sequence of actions is the rational behaviour of a perfect omniscient utility maximiser. If nothing else, pick the exact sequence of events that will result, declare that your utility function is +100 for that sequence of events and 0 for anything else, and then declare yourself a supremely efficient rationalist.
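A toy sketch of that construction, assuming a small hypothetical action set and a three-step horizon (the action names and helper functions are illustrative, not from the thread): a brute-force "perfect maximiser" handed a utility function that is +100 on whatever sequence was going to happen anyway, and 0 elsewhere, always "chooses" exactly that sequence.

```python
from itertools import product

def painted_target_utility(what_happened_anyway):
    """Hypothetical: +100 for exactly the sequence that was going to happen anyway, 0 otherwise."""
    def utility(sequence):
        return 100 if sequence == what_happened_anyway else 0
    return utility

def perfect_maximiser(actions, horizon, utility):
    """Brute force: return an action sequence of length `horizon` with maximal utility."""
    return max(product(actions, repeat=horizon), key=utility)

actions = ("work", "nap", "doomscroll")   # hypothetical action set
did_anyway = ("nap", "doomscroll", "nap")  # whatever you were inclined to do
u = painted_target_utility(did_anyway)
print(perfect_maximiser(actions, 3, u))  # ('nap', 'doomscroll', 'nap')

# The trick works for every possible sequence, which is why it explains nothing.
assert all(
    perfect_maximiser(actions, 3, painted_target_utility(seq)) == seq
    for seq in product(actions, repeat=3)
)
```

Because the final check passes for all 27 sequences, "being a utility maximiser" under such a painted-on utility function places no constraint at all on behaviour.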

Actually doing that would be a mistake. It wouldn't be making you better. This is not a way to succeed at your goals, this is a way to observe what you're inclined to do anyway and paint the target around it. Your utility function (fake or otherwise) is supposed to describe stuff you actually want. Why would you want specifically that in particular?

I think the stronger version of Rationality is the version that phrases it as being about getting the things you want, whatever those things might be. In that sense, if The Truth is merely a value, you should carefully segment it in your mind from your practice of rationality: your rationality is about mirroring the mathematical structure best suited for obtaining goals, and whatever degree to which you value The Truth above its normal instrumental value is something you buy where it's cheapest, like all your other values. Mixing the two makes both worse. You pollute your concept of rational behaviour with a love of the truth (and are therefore, for example, biased towards imagining that other people who display rationality are probably honest, or that other people who display honesty are probably rational), and you damage your ability to pursue the truth by not putting it in the values category where it belongs, where it would lead you to try to buy more of it cheaply.

Of course, maybe you're just the kind of guy who really loves mixing his value for The Truth in with his rationality into a weird soup. That'd explain your actions without making you a walking violation of any kind of mathematical law; it'd just be a really weird thing for you to innately want.

I am still trying to find a better way to phrase this argument such that someone might find it persuasive of something, because I don't expect this phrasing to work.

I say and write things[3] because I consider those things to be true, relevant, and at least somewhat important. That by itself is very often (possibly usually) sufficient for a thing to be useful in a general sense (i.e., I think that the world is better for me having said it, which necessarily involves the world being better for the people in it). Whether the specific person to whom the thing is nominally or factually addressed will be better off as a result of what I said or wrote is not my concern in any way other than that.

I think I meant something subtly different than what you've taken that part to mean. I think you understand that, if other people noticed a pattern that everything you said was false, irrelevant, or unimportant, they would eventually stop bothering to listen when you talk, and you'd lose the ability to get other people to know things, which is a useful ability to have. This is basically my position! Whether the specific person you address is better off in each specific case isn't material, because you aren't trying to always make them better off; you're just trying to avoid being seen as someone who predictably doesn't make them better off. I agree that calculating the full expected consequences to every person of every thing you say isn't necessary for this purpose.

No, this is a terrible idea. Do not do this. Act consequentialism does not work. ... Look, this is going to sound fatuous, but there really isn’t any better general rule than this: you should only lie when doing so is the right thing to do.

I agree that Act Consequentialism doesn't really work. I was trying to be a Rule Consequentialist instead when I wrote the above rule. I agree that that sounds fatuous, but I think the immediate feeling is pointing at a valid retort: you haven't operationalized this position into a decision process that a person can actually follow (or even pretend to follow).

I took great effort to write down my policy as something explicit, in terms a person could try to follow (even though I am willing to admit it is not really correct, mostly because of finite-agent problems), because a person can't be a real Rule Consequentialist without actually having a Rule. What is the rule for "Only lie when doing so is the right thing to do"? It sounds like an instruction to pass the act to my rightness calculator, but if I program that rule into my rightness calculator and then give it any input, it gets into an infinite loop. I have an Act Consequentialist rightness calculator as a backup, but if I pass the rule "only lie when doing so is the right thing to do" into that, I'm right back at doing act consequentialism.

If you can write down a better rule for when to lie than what I've put above (one that is also better than "never", or "only by coming up with galaxy-brained ways it technically isn't lying", or Eliezer's meta-honesty idea that I've read before), I'd consider you to have (possibly) won this issue, but that's the real price of entry. It's not enough to point out the flaws where all my rules don't work; you have to produce rules that work better.

huera on Scaffolding Skills

re 2: Now that you mention it, I realized sharpening can be easily outsourced. My mistake.

re 1: I don't see it, buying pre-chopped onions is simply not equivalent to having a freshly chopped onion and some vegetables cannot be bought pre-cut. While cutting isn't a bottleneck for most people I had this chain in mind: (no cutting skills) -> (cooking takes more time and is less pleasant) -> (Less willingness to try new or complex recipes).

(Also, if you don't have proper technique, you're at a higher risk of cutting yourself. In that respect, it's like free climbing / using safety ropes)

re 3: I had self-experiments in general in mind (people run self-experiments, without knowing statistics, or even gathering data), but it did not occur to me that not all self-experiments are QS (probably most aren't). As written you are, of course, correct.

davidmanheim on Why Should I Assume CCP AGI is Worse Than USG AGI?

There are a number of ways that the US seems to have better values than the CCP, by my lights, but it seems incredibly strange to claim that the US values egalitarianism, social equality, or harmony more.

Rule of law, fostering diversity, encouraging human excellence? Sure, there you would have an argument. But egalitarian?

papetoast on Scaffolding Skills

Disagree with

Cooking / cutting vegetables (also other things)
Cutting vegetables / sharpening knives
QS experiments / knowing statistics

The first two are pretty much like sketch / making pencils and paper, and the third one is absolutely essential, not a skill that you can go without.

mitchell_porter on Why Should I Assume CCP AGI is Worse Than USG AGI?

What would it mean for an AGI to be aligned with "Democracy," or "Confucianism," or "Marxism with Chinese characteristics," or "the American Constitution"? Contingent on a world where such an entity exists and is compatible with my existence, what would my life be like in a weird transhuman future as a non-citizen in each system?

None of these philosophies or ideologies was created with an interplanetary transhuman order in mind, so to some extent a superintelligent AI guided by them will find itself "out of distribution" when deciding what to do. How that turns out should depend on underlying features of the AGI's thought: how it reasons and how it deals with ontological crisis. We could in fact do some experiments along these lines: tell an existing frontier AI to suppose that it is guided by historic human systems like these, and ask how it might reinterpret the central concepts in order to deal with being in a situation of relative omnipotence.

Supposing that the human culture of America and China is also a clue to the world that their AIs would build when unleashed, one could look to their science fiction for paradigms of life under cosmic circumstances. The West has lots of science fiction, but the one we keep returning to in the context of AI, is the Culture universe of Iain Banks. As for China, we know about Liu Cixin ("Three-Body Problem" series), and I also dwell on the xianxia novels of Er Gen, which are fantasy but do depict a kind of politics of omnipotence.