LessWrong 2.0 Reader

Mentorship in AGI Safety (MAGIS) call for mentors
Valentin2026 (Just Learning) · 2024-05-23T18:28:03.173Z · comments (3)

Music in the AI World
Martin Sustrik (sustrik) · 2024-08-16T04:20:01.706Z · comments (8)

RLHF is the worst possible thing done when facing the alignment problem
tailcalled · 2024-09-19T18:56:27.676Z · comments (10)

[LDSL#1] Performance optimization as a metaphor for life
tailcalled · 2024-08-08T16:16:27.349Z · comments (4)

Extracting SAE task features for in-context learning
Dmitrii Kharlapenko (dmitrii-kharlapenko) · 2024-08-12T20:34:13.747Z · comments (1)

[link] [Linkpost] Statement from Scarlett Johansson on OpenAI's use of the "Sky" voice, that was shockingly similar to her own voice.
Linch · 2024-05-20T23:50:28.138Z · comments (8)

Inference-Only Debate Experiments Using Math Problems
Arjun Panickssery (arjun-panickssery) · 2024-08-06T17:44:27.293Z · comments (0)

A Case for Superhuman Governance, using AI
ozziegooen · 2024-06-07T00:10:10.902Z · comments (0)

[link] Baking vs Patissing vs Cooking, the HPS explanation
adamShimi · 2024-07-17T20:29:09.645Z · comments (16)

Some comments on intelligence
Viliam · 2024-08-01T15:17:07.215Z · comments (5)

[question] What are things you're allowed to do as a startup?
Elizabeth (pktechgirl) · 2024-06-20T00:01:59.257Z · answers+comments (9)

Fun With CellxGene
sarahconstantin · 2024-09-06T22:00:03.461Z · comments (2)

[link] Managing AI Risks in an Era of Rapid Progress
Algon · 2023-10-28T15:48:25.029Z · comments (3)

Verifiable private execution of machine learning models with Risc0?
mako yass (MakoYass) · 2023-10-25T00:44:48.643Z · comments (1)

Sparse MLP Distillation
slavachalnev · 2024-01-15T19:39:02.926Z · comments (3)

AI Alignment Breakthroughs this week (10/08/23)
Logan Zoellner (logan-zoellner) · 2023-10-08T23:30:54.924Z · comments (14)

RA Bounty: Looking for feedback on screenplay about AI Risk
Writer · 2023-10-26T13:23:02.806Z · comments (6)

Understanding Subjective Probabilities
Isaac King (KingSupernova) · 2023-12-10T06:03:27.958Z · comments (16)

[link] AISN #28: Center for AI Safety 2023 Year in Review
aogara (Aidan O'Gara) · 2023-12-23T21:31:40.767Z · comments (1)

[link] Evaluating Stability of Unreflective Alignment
james.lucassen · 2024-02-01T22:15:40.902Z · comments (3)

Some additional SAE thoughts
Hoagy · 2024-01-13T19:31:40.089Z · comments (4)

Information-Theoretic Boxing of Superintelligences
JustinShovelain · 2023-11-30T14:31:11.798Z · comments (0)

Running the Numbers on a Heat Pump
jefftk (jkaufman) · 2024-02-09T03:00:04.920Z · comments (12)

[question] Current AI safety techniques?
Zach Stein-Perlman · 2023-10-03T19:30:54.481Z · answers+comments (2)

[link] The origins of the steam engine: An essay with interactive animated diagrams
jasoncrawford · 2023-11-29T18:30:36.315Z · comments (1)

Interpreting Quantum Mechanics in Infra-Bayesian Physicalism
Yegreg · 2024-02-12T18:56:03.967Z · comments (6)

[question] What's your standard for good work performance?
Chi Nguyen · 2023-09-27T16:58:16.114Z · answers+comments (3)

Adversarial Robustness Could Help Prevent Catastrophic Misuse
aogara (Aidan O'Gara) · 2023-12-11T19:12:26.956Z · comments (18)

[link] There is no IQ for AI
Gabriel Alfour (gabriel-alfour-1) · 2023-11-27T18:21:26.196Z · comments (10)

Differential Optimization Reframes and Generalizes Utility-Maximization
J Bostock (Jemist) · 2023-12-27T01:54:22.731Z · comments (2)

[link] When scientists consider whether their research will end the world
Harlan · 2023-12-19T03:47:06.645Z · comments (4)

AI Safety 101 : Reward Misspecification
markov (markovial) · 2023-10-18T20:39:34.538Z · comments (4)

Interpreting the Learning of Deceit
RogerDearnaley (roger-d-1) · 2023-12-18T08:12:39.682Z · comments (11)

Book Review: On the Edge: The Gamblers
Zvi · 2024-09-24T11:50:06.065Z · comments (1)

AI #62: Too Soon to Tell
Zvi · 2024-05-02T15:40:04.364Z · comments (8)

Investigating the Ability of LLMs to Recognize Their Own Writing
Christopher Ackerman (christopher-ackerman) · 2024-07-30T15:41:44.017Z · comments (0)

[link] 2024 State of the AI Regulatory Landscape
Deric Cheng (deric-cheng) · 2024-05-28T11:59:06.582Z · comments (0)

Protestants Trading Acausally
Martin Sustrik (sustrik) · 2024-04-01T14:46:26.374Z · comments (4)

AI #74: GPT-4o Mini Me and Llama 3
Zvi · 2024-07-25T13:50:06.528Z · comments (6)

AI Constitutions are a tool to reduce societal scale risk
Sammy Martin (SDM) · 2024-07-25T11:18:17.826Z · comments (2)

Against "argument from overhang risk"
RobertM (T3t) · 2024-05-16T04:44:00.318Z · comments (11)

AIS terminology proposal: standardize terms for probability ranges
eggsyntax · 2024-08-30T15:43:39.857Z · comments (12)

The Third Gemini
Zvi · 2024-02-20T19:50:05.195Z · comments (2)

The Intentional Stance, LLMs Edition
Eleni Angelou (ea-1) · 2024-04-30T17:12:29.005Z · comments (3)

[link] Epistemic states as a potential benign prior
Tamsin Leake (carado-1) · 2024-08-31T18:26:14.093Z · comments (2)

AI #59: Model Updates
Zvi · 2024-04-11T14:20:06.339Z · comments (2)

"Full Automation" is a Slippery Metric
ozziegooen · 2024-06-11T19:56:49.855Z · comments (1)

The Math of Suspicious Coincidences
Roko · 2024-02-07T13:32:35.513Z · comments (3)

Announcing SPAR Summer 2024!
laurenmarie12 · 2024-04-16T08:30:31.339Z · comments (2)

Putting multimodal LLMs to the Tetris test
Lovre · 2024-02-01T16:02:12.367Z · comments (5)


Recent comments

linch on Linch's Shortform

AI News so far this week.
1. Mira Murati (CTO) leaving OpenAI

2. OpenAI restructuring to be a full for-profit company (what?)

3. Ivanka Trump calls Leopold's Situational Awareness article "excellent and important read"

4. More OpenAI leadership departing, unclear why.
4a. Apparently sama only learned about Mira's departure the same day she announced it on Twitter? "Move fast" indeed!
4b. WSJ reports some internals of what went down at OpenAI after the Nov board kerfuffle.

5. California Federation of Labor Unions (2 million+ members) spoke out in favor of SB 1047.

programcrafter on What is Randomness?

How do probabilities arise in the Everett interpretation of quantum mechanics?
After a series of quantum experiments, there will be a version of the observer who sees a sequence of very unlikely outcomes. How to make sense of that?

Well, in MWI there is the space of worlds, each associated with a certain complex number ("amplitude"). Some worlds can be uniform hydrogen over all the universe, some contain humans, a certain subspace contains me (I mean, collection of particles moving in a way introspectively-identical to me) writing a LessWrong comment, etc.

It so happens that the larger the magnitude of said complex number, the more often we see such a world; IIRC, that inequality allows one to prove that the likelihood of seeing any world is proportional to the squared modulus of its amplitude, which is Born's rule.

The space of worlds is presumably infinite-dimensional, and also expands over time (though not at an exponential rate as is widely said, because "branches" must be merging all the time as well as splitting). That means that the probability distribution assigns a very low likelihood to pretty much any world... but why do we get any outcomes then?

I'm not attempting to answer the question of why we experience things in the first place (preferring instead to seal it even for myself), but as for why we continue to do so conditional on experiencing something before: because of the "conditional on". Conditional probability is a non-unitary operation over our observations of phase space, retaining some parts while zeroing all others, which are "incompatible with our observations"; also, as its formula is $P(A \mid B) = P(A \cap B) / P(B)$, it can amplify likelihoods for small values of $P(B)$. That doesn't totally fix the issue, but I believe the right thing to do in improbable worlds is to continue updating on evidence and choosing best actions as usual.
(To demonstrate that the issue with small probabilities is not fixed, let's divide the likelihood of any single world by $2^{256}$; here's a 256-bit random string: e697c6dfb32cf132805d38cf85a60c832247449749293054704ad56209d2440e).
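
As a rough illustration of the conditioning step (a toy sketch, not anything from the original discussion): the standard rule $P(A \mid B) = P(A \cap B) / P(B)$ zeroes out the worlds incompatible with an observation and divides the rest by $P(B)$, so worlds with a tiny prior can end up with ordinary-sized posteriors. The function name `condition`, the world labels, and the prior values below are all hypothetical, chosen only to make the arithmetic exact.

```python
from fractions import Fraction


def condition(prior, compatible):
    """Return P(world | B), where B is the set of worlds in `compatible`.

    Worlds outside B get probability zero; the rest are divided by P(B).
    """
    p_b = sum(p for world, p in prior.items() if world in compatible)
    if p_b == 0:
        raise ValueError("cannot condition on an event of probability zero")
    return {
        world: (p / p_b if world in compatible else Fraction(0))
        for world, p in prior.items()
    }


# Toy prior (hypothetical numbers): two very unlikely worlds and a catch-all.
tiny = Fraction(1, 2**256)
prior = {"a": tiny, "b": 3 * tiny, "everything_else": 1 - 4 * tiny}

# Observation B: we find ourselves in either world "a" or world "b".
posterior = condition(prior, {"a", "b"})
print(posterior["a"], posterior["b"])  # 1/4 3/4
```

Dividing by $P(B) = 4 \cdot 2^{-256}$ is what lifts the two surviving worlds from negligible priors to posteriors of 1/4 and 3/4; exact fractions are used so that the tiny values are not lost to floating-point rounding.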

james-stephen-brown on The Other Existential Crisis

Part of the answer is to note that mixtures of indeterminism and determinism are possible, so that libertarian free will is not just pure randomness, where any action is equally likely.

This is really interesting, because I agree with this, but also agree with what Seth's saying. I think this disagreement might actually be largely a semantic one. As such, I'm going to (try to) avoid using the terms 'libertarian' or 'compatibilist' free will. First of all I agree with the use of "indeterminism" to mean non-uniform randomness. I agree that there is a way that determinism and indeterminism can be mixed in such a way as to give rise to an emergent property that is not present in either purely determined or purely random systems. I understand this in relation to the idea of evolutionary "design" which emerges from a system that necessarily has both determined and indeterminate properties (indeterminate at least at the level of the genes; they might not be ultimately indeterminate).

I'm going to employ a decision-making map that seeks to clarify my understanding of how we make decisions and where we might get "what we want" from.

As I see it, the items in white are largely set, and change only gradually, and with no sense of control involved. I don't believe we have any control over our genes, our intentions or desires, what results our actions will have, or the world. I also don't think we have any control over our model of ourselves or the world; those are formed subconsciously. But our effort (in the green areas) allows for deliberative decision making, following an evolutionary selection process, in which our conscious awareness is involved.

In this way we are not beholden to the first action available to us, we can, instead of taking an action in the world, make a series of simulated actions in our head, consciously experiencing the predicted outcome of those actions, until we find a satisfactory one. So, you don't end up with a determined or a random solution, you end up with an option based on your conscious experience of your simulated options. This process satisfies my wants in terms of my sense that I have some control (when I make the effort) over my decisions. At the same time, I'm agnostic about whether true indeterminism exists at all, but, like with evolution, with randomness at the level of the cell (that may not be ultimately random), I think even in an entirely determined universe, we exist on a level that is subject to, at least, some apparent indeterminism. And even if that apparent indeterminism turns out to be determined, our (eternal) inability to calculate what is determined, still means we have no grounds to act in any way other than as if we have the control we feel we have.

I'm actually not sure if this makes me a compatibilist or not.

Determinists are always telling each other to act like libertarians. That's a clue that libertarianism is worth wanting.

So, I don't think my position is the same as asserting that we should act like libertarians. As I have (now) described my conception of the situation, I just think I should act consistently with this conception. By analogy, there are still people who say atheists often act according to "religious" moral values, but in fact they're not; it's just that morality is a mode of behaviour that has all the same functions regardless of one's belief system.

ryan_greenblatt on What are the best arguments for/against AIs being "slightly 'nice'"?

As in, I care about the long-run power of values-which-are-similar-to-my-values-on-reflection. Which includes me (on reflection) by definition, but I think probably also includes lots of other humans.

ryan_greenblatt on What are the best arguments for/against AIs being "slightly 'nice'"?

Then shouldn't such systems (which can surely recognize this argument) just take care of short term survival instrumentally? Maybe you're making a claim about irrationality being likely or a claim that systems that care about long run benefit act in apparently myopic ways.

(Note that historically it was much harder to keep value stability/lock-in than it will be for AIs.)

I'm not going to engage in detail FYI.

mako-yass on Eye contact is effortless when you’re no longer emotionally blocked on it

So, again, you did guess that you'd be able to do that for everyone, and I disagree with that.

I think most of the people who have difficulty making eye contact and want to overcome themselves on it are not in a good place to judge whether they should.

chipmonk on Eye contact is effortless when you’re no longer emotionally blocked on it

i see

hm that's why i put "safely" werp

alex_altair on Work with me on agent foundations: independent fellowship

Nice! Yeah I'd be happy to chat about that, and also happy to get referrals of any other researchers who might be interested in receiving this funding to work on it.

mako-yass on Eye contact is effortless when you’re no longer emotionally blocked on it

I'm aware that you have a nuanced perspective on this which is part of the reason I'm raising this.

nathan-helm-burger on Model evals for dangerous capabilities

Addendum:

I think it'd be great if we sorted dangerous capabilities evals into two categories, mitigated and unmitigated hazards.

Mitigated means the hazards you measure with enforceable mitigations in place, as in behind a secure API. This includes:

- API-based fine-tuning, where you filter the data that customers are allowed to fine-tune on and/or put various restrictions in place on the fine-tuning process.
- Limited prompt-only jailbreaks (including long context many-example jailbreaks). This could include requiring that the jail-breaking needs to evade a filter which is trying to catch and block users that try to jailbreak.
- Note that in the context of dangerous capabilities evals, jail-breaking can look like 'refusal dodging'. Refusal dodging is when you try to justify your question about dangerous technology by placing it in a reasonable context, such as a student studying for an exam, or analyzing an academic paper. If the model will summarize and extract key information from hazardous academic papers, that should be a red flag on a capability eval.
- Red-teamers may try to sneak past filters by, for example, fine-tuning the model to communicate in code, using purely innocent statements in plain-text. It's fair game for the developers to put filters in place to try to catch red-teamers attempting to do this. See https://arxiv.org/html/2406.20053v1

This sounds like it's relatively easy-mode, and indeed it should be. But I still want to see a separate report on this, since it reassures me that the developer in question is taking reasonable precautions and has implemented them competently.

Unmitigated means the hazards you measure as if the weights had been stolen, or you deliberately released the weights (looking at you Meta). Unmitigated includes:

- Unlimited unfiltered jail-breaking attempts.
- White-box jail-breaking, where you get to use an optimization process working against the activations of the model to avoid refusals. (White-box attacks are generally harder to resist than black-box attacks.)
- Activation steering for non-refusal or for capabilities elicitation.
- Fine-tuning on task-specific domain knowledge and examples of alignment to terrorist agendas.
- Merging the model with other models or architectures.

It's fair game for the company to do anti-fine-tuning modifications of the model before doing the unmitigated testing. It's not fair game for the company to restrict what the red-teamers do to try to undo such measures. See https://arxiv.org/abs/2405.14577

I think this distinction is important, since I don't think that any companies so far have been good at publishing cleanly separated scores for mitigated and unmitigated. What OpenAI called 'unmitigated' is what I would call 'first pass mitigated, before the red teamers showed us how many holes our mitigations still had and we fixed those specific issues'. That's also an interesting measure, but not nearly as informative as a true unmitigated eval score.

@Zach Stein-Perlman