LessWrong 2.0 Reader


[question] Should I fundraise for open source search engine?
samuelshadrach (xpostah) · 2025-03-23T13:04:16.149Z · answers+comments (2)
[link] Privateers Reborn: Cyber Letters of Marque
arealsociety (shane-zabel) · 2025-03-23T03:39:25.990Z · comments (2)
Beware nerfing AI with opinionated human-centric sensors
Haotian (haotian-huang) · 2025-03-23T01:09:16.770Z · comments (0)
Reframing AI Safety as a Neverending Institutional Challenge
scasper · 2025-03-23T00:13:48.614Z · comments (12)
The Dangerous Illusion of AI Deterrence: Why MAIM Isn’t Rational
mc1soft · 2025-03-22T22:55:02.355Z · comments (0)
Dayton, Ohio, ACX Meetup
Lunawarrior · 2025-03-22T19:45:55.510Z · comments (0)
[Replication] Crosscoder-based Stage-Wise Model Diffing
annas (annasoli) · 2025-03-22T18:35:19.003Z · comments (0)
The Principle of Satisfying Foreknowledge
Randall Reams (randall-reams) · 2025-03-22T18:20:27.998Z · comments (0)
[question] Urgency in the ITN framework
Shaïman · 2025-03-22T18:16:07.900Z · answers+comments (2)
Transhumanism and AI: Toward Prosperity or Extinction?
Shaïman · 2025-03-22T18:16:07.868Z · comments (2)
Tied Crosscoders: Explaining Chat Behavior from Base Model
Santiago Aranguri (aranguri) · 2025-03-22T18:07:21.751Z · comments (0)
Dusty Hands and Geo-arbitrage
Tomás B. (Bjartur Tómas) · 2025-03-22T16:05:30.364Z · comments (3)
100+ concrete projects and open problems in evals
Marius Hobbhahn (marius-hobbhahn) · 2025-03-22T15:21:40.970Z · comments (1)
Do models say what they learn?
Andy Arditi (andy-arditi) · 2025-03-22T15:19:18.800Z · comments (12)
AGI Morality and Why It Is Unlikely to Emerge as a Feature of Superintelligence
funnyfranco · 2025-03-22T12:06:55.723Z · comments (9)
2025 Q3 Pivotal Research Fellowship: Applications Open
Tobias H (clearthis) · 2025-03-22T10:54:35.492Z · comments (0)
[link] Good Research Takes are Not Sufficient for Good Strategic Takes
Neel Nanda (neel-nanda-1) · 2025-03-22T10:13:38.257Z · comments (28)
Grammatical Roles and Social Roles: A Structural Analogy
Lucien (lucien) · 2025-03-22T07:44:07.383Z · comments (0)
Legibility
lsusr · 2025-03-22T06:54:35.259Z · comments (22)
Why Were We Wrong About China and AI? A Case Study in Failed Rationality
thedudeabides · 2025-03-22T05:13:52.181Z · comments (38)
A Short Diatribe on Hidden Assertions.
Eggs (donald-sampson) · 2025-03-22T03:14:37.577Z · comments (2)
Transformer Attention’s High School Math Mistake
Max Ma (max-ma) · 2025-03-22T00:16:34.082Z · comments (1)
[link] Making Sense of President Trump’s Annexation Obsession
Annapurna (jorge-velez) · 2025-03-21T21:10:30.872Z · comments (3)
How I force LLMs to generate correct code
claudio · 2025-03-21T14:40:19.211Z · comments (7)
Prospects for Alignment Automation: Interpretability Case Study
Jacob Pfau (jacob-pfau) · 2025-03-21T14:05:51.528Z · comments (4)
[link] Epoch AI released a GATE Scenario Explorer
Lee.aao (leonid-artamonov) · 2025-03-21T13:57:09.172Z · comments (0)
They Took MY Job?
Zvi · 2025-03-21T13:30:38.507Z · comments (4)
Silly Time
jefftk (jkaufman) · 2025-03-21T12:30:08.560Z · comments (2)
[link] Towards a scale-free theory of intelligent agency
Richard_Ngo (ricraz) · 2025-03-21T01:39:42.251Z · comments (24)
[question] Any mistakes in my understanding of Transformers?
Kallistos · 2025-03-21T00:34:30.667Z · answers+comments (7)
[link] A Critique of “Utility”
Zero Contradictions · 2025-03-20T23:21:41.900Z · comments (10)
Intention to Treat
Alicorn · 2025-03-20T20:01:19.456Z · comments (4)
[link] Anthropic: Progress from our Frontier Red Team
UnofficialLinkpostBot (LinkpostBot) · 2025-03-20T19:12:02.151Z · comments (3)
Everything's An Emergency
omnizoid · 2025-03-20T17:12:23.006Z · comments (0)
Non-Consensual Consent: The Performance of Choice in a Coercive World
Alex_Steiner · 2025-03-20T17:12:16.302Z · comments (4)
Minor interpretability exploration #4: LayerNorm and the learning coefficient
Rareș Baron · 2025-03-20T16:18:04.801Z · comments (0)
[question] How far along Metr's law can AI start automating or helping with alignment research?
Christopher King (christopher-king) · 2025-03-20T15:58:08.369Z · answers+comments (21)
Human alignment
Lucien (lucien) · 2025-03-20T15:52:22.081Z · comments (2)
[question] Seeking: more Sci Fi micro reviews
Yair Halberstadt (yair-halberstadt) · 2025-03-20T14:31:37.597Z · answers+comments (0)
AI #108: Straight Line on a Graph
Zvi · 2025-03-20T13:50:00.983Z · comments (5)
[link] What is an alignment tax?
Vishakha (vishakha-agrawal) · 2025-03-20T13:06:58.087Z · comments (0)
Longtermist Implications of the Existence Neutrality Hypothesis
Maxime Riché (maxime-riche) · 2025-03-20T12:20:40.661Z · comments (2)
You don't have to be "into EA" to attend EAG(x) Conferences
gergogaspar (gergo-gaspar) · 2025-03-20T10:44:12.271Z · comments (0)
Defense Against The Super-Worms
viemccoy · 2025-03-20T07:24:56.975Z · comments (1)
Socially Graceful Degradation
Screwtape · 2025-03-20T04:03:41.213Z · comments (9)
Daniel Dennett, the Unity of Consciousness, and Animal Minds
stormykat (sandypawbs) · 2025-03-20T03:43:38.108Z · comments (0)
Apply to MATS 8.0!
Ryan Kidd (ryankidd44) · 2025-03-20T02:17:58.018Z · comments (4)
Improved visualizations of METR Time Horizons paper.
LDJ (luigi-d) · 2025-03-19T23:36:52.771Z · comments (4)
Is CCP authoritarianism good for building safe AI?
Hruss (henry-russell) · 2025-03-19T23:13:36.397Z · comments (0)
The case against "The case against AI alignment"
KvmanThinking (avery-liu) · 2025-03-19T22:40:33.812Z · comments (0)