LessWrong 2.0 Reader



[link] Linguistic Imperialism in AI: Enforcing Human-Readable Chain-of-Thought
Lukas Petersson (lukas-petersson-1) · 2025-02-21T15:45:00.146Z · comments (0)
Safe Distillation With a Powerful Untrusted AI
Alek Westover (alek-westover) · 2025-02-20T03:14:04.893Z · comments (1)
[link] Teaching AI to reason: this year's most important story
Benjamin_Todd · 2025-02-13T17:40:02.869Z · comments (0)
Information throughput of biological humans and frontier LLMs
benwr · 2025-02-22T07:15:45.457Z · comments (0)
Closed-ended questions aren't as hard as you think
electroswing · 2025-02-19T03:53:11.855Z · comments (0)
[link] Neural Scaling Laws Rooted in the Data Distribution
aribrill (Particleman) · 2025-02-20T21:22:10.306Z · comments (0)
Are current LLMs safe for psychotherapy?
PaperBike · 2025-02-12T19:16:34.452Z · comments (4)
[link] AISN #48: Utility Engineering and EnigmaEval
Corin Katzke (corin-katzke) · 2025-02-18T19:15:16.751Z · comments (0)
A fable on AI x-risk
bgaesop · 2025-02-18T20:15:24.933Z · comments (0)
Permanent properties of things are a self-fulfilling prophecy
YanLyutnev (YanLutnev) · 2025-02-19T00:08:20.776Z · comments (0)
Claude 3.5 Sonnet (New)'s AGI scenario
Nathan Young · 2025-02-17T18:47:04.669Z · comments (2)
Response to the US Govt's Request for Information Concerning Its AI Action Plan
Davey Morse (davey-morse) · 2025-02-14T06:14:08.673Z · comments (0)
[link] Demonstrating specification gaming in reasoning models
Matrice Jacobine · 2025-02-20T19:26:20.563Z · comments (0)
OpenAI’s NSFW policy: user safety, harm reduction, and AI consent
8e9 · 2025-02-13T13:59:22.911Z · comments (3)
Build a Metaculus Forecasting Bot in 30 Minutes: A Practical Guide
ChristianWilliams · 2025-02-22T03:52:14.753Z · comments (0)
Sparse Autoencoder Feature Ablation for Unlearning
aludert · 2025-02-13T19:13:48.388Z · comments (0)
Undesirable Conclusions and Origin Adjustment
Jerdle (daniel-amdurer) · 2025-02-19T18:35:23.732Z · comments (0)
Intrinsic Dimension of Prompts in LLMs
Karthik Viswanathan (vkarthik095) · 2025-02-14T19:02:49.464Z · comments (0)
[link] Probability of AI-Caused Disaster
Alvin Ånestrand (alvin-anestrand) · 2025-02-12T19:40:11.121Z · comments (2)
Artificial Static Place Intelligence: Guaranteed Alignment
ank · 2025-02-15T11:08:50.226Z · comments (2)
arch-anarchist reading list
Peter lawless · 2025-02-16T22:47:00.273Z · comments (1)
[question] Why do we have the NATO logo?
KvmanThinking (avery-liu) · 2025-02-19T22:59:41.755Z · answers+comments (4)
Preference for uncertainty and impact overestimation bias in altruistic systems.
Luck (luck-1) · 2025-02-15T12:27:05.474Z · comments (0)
[link] Modularity and assembly: AI safety via thinking smaller
D Wong (d-nell) · 2025-02-20T00:58:39.714Z · comments (0)
Workshop: Interpretability in LLMs Using Geometric and Statistical Methods
Karthik Viswanathan (vkarthik095) · 2025-02-22T09:39:26.446Z · comments (0)
Biological humans collectively exert at most 400 gigabits/s of control over the world.
benwr · 2025-02-20T23:44:06.509Z · comments (1)
[link] New LLM Scaling Law
wrmedford · 2025-02-19T20:21:17.475Z · comments (0)
Fun, endless art debates v. morally charged art debates that are intrinsically endless
danielechlin · 2025-02-21T04:44:22.712Z · comments (0)
Gradient Anatomy's - Hallucination Robustness in Medical Q&A
DieSab (diego-sabajo) · 2025-02-12T19:16:58.949Z · comments (0)
On Static Space-Like Nature of Intelligence & Superintelligence
ank · 2025-02-22T00:12:36.263Z · comments (0)
[question] Programming Language Early Funding?
J Thomas Moros (J_Thomas_Moros) · 2025-02-16T17:34:06.058Z · answers+comments (5)
Quantifying the Qualitative: Towards a Bayesian Approach to Personal Insight
Pruthvi Kumar (pruthvi-kumar) · 2025-02-15T19:50:42.550Z · comments (0)
Static Place AI Makes AGI Redundant: Multiversal AI Alignment & Rational Utopia
ank · 2025-02-13T22:35:28.300Z · comments (2)
[link] LLMs can teach themselves to better predict the future
Ben Turtel (ben-turtel) · 2025-02-13T01:01:12.175Z · comments (1)
The Newbie's Guide to Navigating AI Futures
keithjmenezes · 2025-02-19T20:37:06.272Z · comments (0)
Places of Loving Grace [Story]
ank · 2025-02-18T23:49:18.580Z · comments (0)
the dumbest theory of everything
lostinwilliamsburg · 2025-02-13T07:57:38.842Z · comments (0)
[link] Sea Change
Charlie Sanders (charlie-sanders) · 2025-02-18T06:03:06.961Z · comments (2)
[link] Humans are Just Self Aware Intelligent Biological Machines
asksathvik · 2025-02-21T01:03:59.950Z · comments (3)
CyberEconomy. The Limits to Growth
Timur Sadekov (timur-sadekov) · 2025-02-16T21:02:34.040Z · comments (0)
Paranoia, Cognitive Biases, and Catastrophic Thought Patterns.
Spiritus Dei (spiritus-dei) · 2025-02-14T00:13:56.300Z · comments (1)
[link] Several Arguments Against the Mathematical Universe Hypothesis
Vittu Perkele · 2025-02-19T22:13:59.425Z · comments (6)
[link] Against Unlimited Genius for Baby-Killers
ggggg · 2025-02-19T20:33:27.188Z · comments (0)
Objective Realism: A Perspective Beyond Human Constructs
Apatheos · 2025-02-14T19:02:33.865Z · comments (1)
The Unearned Privilege We Rarely Discuss: Cognitive Capability
DiegoRojas · 2025-02-18T20:06:45.970Z · comments (7)
[link] On the Rebirth of Aristocracy in the American Regime
shawkisukkar · 2025-02-17T16:18:17.117Z · comments (3)
AI Alignment and the Financial War Against Narcissistic Manipulation
henophilia · 2025-02-19T20:42:10.918Z · comments (2)
Born on Third Base: The Case for Inheriting Nothing and Building Everything
charlieoneill (kingchucky211) · 2025-02-18T00:47:02.298Z · comments (16)