LessWrong 2.0 Reader

The Bar for Contributing to AI Safety is Lower than You Think
Chris_Leong · 2024-08-16T15:20:19.055Z · comments (1)

Auto-Enhance: Developing a meta-benchmark to measure LLM agents’ ability to improve other agents
Sam F. Brown (sam-4) · 2024-07-22T12:33:57.656Z · comments (0)

[link] [Linkpost] A Case for AI Consciousness
cdkg · 2024-07-06T14:52:21.704Z · comments (2)

Ten counter-arguments that AI is (not) an existential risk (for now)
Ariel Kwiatkowski (ariel-kwiatkowski) · 2024-08-13T22:35:15.341Z · comments (5)

Why Reflective Stability is Important
Johannes C. Mayer (johannes-c-mayer) · 2024-09-05T15:28:19.913Z · comments (2)

[link] Pronouns are Annoying
ymeskhout · 2024-09-18T13:30:04.620Z · comments (21)

Why I'm bearish on mechanistic interpretability: the shards are not in the network
tailcalled · 2024-09-13T17:09:25.407Z · comments (40)

"Real AGI"
Seth Herd · 2024-09-13T14:13:24.124Z · comments (18)

Finding Deception in Language Models
Esben Kran (esben-kran) · 2024-08-20T09:42:13.060Z · comments (4)

[link] The Offense-Defense Balance of Gene Drives
Maxwell Tabarrok (maxwell-tabarrok) · 2024-09-27T16:47:25.976Z · comments (1)

Games of My Childhood: The Troops
Kaj_Sotala · 2024-07-08T11:20:03.033Z · comments (0)

[link] AI existential risk probabilities are too unreliable to inform policy
Oleg Trott (oleg-trott) · 2024-07-28T00:59:59.497Z · comments (5)

Bryan Johnson and a search for healthy longevity
NancyLebovitz · 2024-07-27T15:28:13.117Z · comments (17)

Determining the power of investors over Frontier AI Labs is strategically important to reduce x-risk
Lucie Philippon (lucie-philippon) · 2024-07-25T01:12:20.518Z · comments (7)

[link] Green and golden: a meditation
Richard_Ngo (ricraz) · 2024-08-18T01:36:43.613Z · comments (0)

Avoiding the Bog of Moral Hazard for AI
Nathan Helm-Burger (nathan-helm-burger) · 2024-09-13T21:24:34.137Z · comments (12)

Rabin's Paradox
Charlie Steiner · 2024-08-14T05:40:25.572Z · comments (40)

Travel Buffer
jefftk (jkaufman) · 2024-07-06T02:20:02.723Z · comments (3)

"Which Future Mind is Me?" Is a Question of Values
dadadarren · 2024-08-09T18:17:09.884Z · comments (12)

Training a Sparse Autoencoder in < 30 minutes on 16GB of VRAM using an S3 cache
Louka Ewington-Pitsos (louka-ewington-pitsos) · 2024-08-24T07:39:00.057Z · comments (0)

D&D.Sci: Whom Shall You Call? [Evaluation and Ruleset]
abstractapplic · 2024-07-17T22:34:25.111Z · comments (4)

[link] Minimalist And Maximalist Type Systems
adamShimi · 2024-07-05T16:25:59.448Z · comments (6)

What program structures enable efficient induction?
Daniel C (harper-owen) · 2024-09-05T10:12:14.058Z · comments (4)

My career exploration: Tools for building confidence
lynettebye · 2024-09-13T11:37:55.843Z · comments (0)

[question] Is there any rigorous work on using anthropic uncertainty to prevent situational awareness / deception?
David Scott Krueger (formerly: capybaralet) (capybaralet) · 2024-09-04T12:40:07.678Z · answers+comments (7)

[question] Is this voting system strategy proof?
Donald Hobson (donald-hobson) · 2024-09-06T20:44:46.691Z · answers+comments (9)

Invitation to lead a project at AI Safety Camp (Virtual Edition, 2025)
Linda Linsefors · 2024-08-23T14:18:24.327Z · comments (2)

[link] The Dumbification of our smart screens
Itay Dreyfus (itay-dreyfus) · 2024-07-04T06:32:36.672Z · comments (0)

[link] Instruction Following without Instruction Tuning
Bogdan Ionut Cirstea (bogdan-ionut-cirstea) · 2024-09-24T13:49:09.078Z · comments (0)

Four ways I've made bad decisions
Sodium · 2024-07-14T22:18:47.630Z · comments (1)

[link] AlignedCut: Visual Concepts Discovery on Brain-Guided Universal Feature Space
Bogdan Ionut Cirstea (bogdan-ionut-cirstea) · 2024-09-14T23:23:26.296Z · comments (1)

[link] Will we ever run out of new jobs?
Kevin Kohler (KevinKohler) · 2024-08-19T15:04:03.849Z · comments (7)

[question] Self-censoring on AI x-risk discussions?
Decaeneus · 2024-07-01T18:24:15.759Z · answers+comments (2)

Initial Experiments Using SAEs to Help Detect AI Generated Text
Aaron_Scher · 2024-07-22T05:16:20.516Z · comments (0)

OpenAI Boycott Revisit
Jake Dennie · 2024-07-22T01:44:55.094Z · comments (2)

[link] Why Swiss watches and Taylor Swift are AGI-proof
Kevin Kohler (KevinKohler) · 2024-09-05T13:23:27.033Z · comments (11)

[link] Should Sports Betting Be Banned?
Maxwell Tabarrok (maxwell-tabarrok) · 2024-09-21T14:13:35.404Z · comments (2)

My experience applying to MATS 6.0
mic (michael-chen) · 2024-07-18T19:02:21.849Z · comments (3)

[link] Non-Transactional Compliments
Jonathan Moregård (JonathanMoregard) · 2024-08-09T13:42:16.471Z · comments (0)

[link] Jonothan Gorard:The territory is isomorphic to an equivalence class of its maps
Daniel C (harper-owen) · 2024-09-07T10:04:47.840Z · comments (18)

[link] Four Levels of Voting Methods
hive · 2024-09-26T18:15:00.565Z · comments (3)

[link] My lukewarm take on GLP-1 agonists
George3d6 · 2024-08-26T12:34:27.929Z · comments (0)

Interview with Robert Kralisch on Simulators
WillPetillo · 2024-08-26T05:49:15.543Z · comments (0)

All the Following are Distinct
Gianluca Calcagni (gianluca-calcagni) · 2024-08-02T16:35:51.815Z · comments (3)

The Residual Expansion: A Framework for thinking about Transformer Circuits
Daniel Tan (dtch1997) · 2024-08-02T11:04:56.347Z · comments (13)

An information-theoretic study of lying in LLMs
Annah (annah) · 2024-08-02T10:06:39.312Z · comments (0)

[link] How (and why) to get tested for CMV
Metacelsus · 2024-07-15T20:06:05.649Z · comments (0)

[link] Why good things often don’t lead to better outcomes
DMMF · 2024-09-19T16:37:07.778Z · comments (1)

Reducing global AI competition through the Commerce Control List and Immigration reform: a dual-pronged approach
Ben Smith (ben-smith) · 2024-09-03T05:28:24.549Z · comments (2)

[link] AI Safety Newsletter #39: Implications of a Trump Administration for AI Policy Plus, Safety Engineering
Corin Katzke (corin-katzke) · 2024-07-29T17:50:52.454Z · comments (1)

Recent comments

linch on Linch's Shortform

AI News so far this week.
1. Mira Murati (CTO) leaving OpenAI

2. OpenAI restructuring to be a full for-profit company (what?)

3. Ivanka Trump calls Leopold's Situational Awareness article "excellent and important read"

4. More OpenAI leadership departing, unclear why.
4a. Apparently sama only learned about Mira's departure the same day she announced it on Twitter? "Move fast" indeed!
4b. WSJ reports some internals of what went down at OpenAI after the Nov board kerfuffle.

5. California Federation of Labor Unions (2 million+ members) spoke out in favor of SB 1047.

programcrafter on What is Randomness?

How do probabilities arise in the Everett interpretation of quantum mechanics?
After a series of quantum experiments, there will be a version of the observer who sees a sequence of very unlikely outcomes. How to make sense of that?

Well, in MWI there is a space of worlds, each associated with a certain complex number (its "amplitude"). Some worlds may be uniform hydrogen throughout the universe, some contain humans, and a certain subspace contains me (that is, a collection of particles moving in a way introspectively identical to me) writing a LessWrong comment, etc.

It so happens that the larger the magnitude of said complex number, the more often we see such a world; IIRC, that monotonic relationship allows one to prove that the likelihood of seeing any world is proportional to the squared modulus of its amplitude, which is Born's rule.

The space of worlds is presumably infinite-dimensional, and it also expands over time (though not at an exponential rate, as is widely said, because "branches" must be merging all the time as well as splitting). That means that the probability distribution assigns a very low likelihood to pretty much any world... but why do we get any outcomes then?

I'm not attempting to answer the question of why we experience things in the first place (preferring instead to seal it even for myself), but as for why we continue to do so conditional on having experienced something before: because of the "conditional on". Conditional probability is a non-unitary operation over our observations of phase space: it retains some parts while zeroing all others, those "incompatible with our observations". Also, since its formula is $P(A \mid B) = P(A \cap B) / P(B)$, it can amplify likelihoods for small values of $P(B)$. That doesn't totally fix the issue, but I believe the right thing to do in improbable worlds is to continue updating on evidence and choosing the best actions as usual.
(To demonstrate that the issue with small probabilities is not fixed, let's divide the likelihood of any single world by $2^{256}$; here's a 256-bit random string: e697c6dfb32cf132805d38cf85a60c832247449749293054704ad56209d2440e).
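The conditioning step described above can be sketched on a toy example. This is only an illustration under stated assumptions: a finite set of three worlds, with hypothetical names and made-up weights that do not come from the comment. It shows how dividing by a tiny $P(B)$ restores ordinary-sized probabilities among the worlds compatible with an observation.

```python
from fractions import Fraction


def condition(prior, compatible):
    """Zero out worlds incompatible with the observation, then renormalize.

    Implements P(A | B) = P(A and B) / P(B) over a finite set of worlds.
    """
    p_b = sum(p for world, p in prior.items() if world in compatible)
    if p_b == 0:
        raise ValueError("observation has zero probability under the prior")
    return {
        world: (p / p_b if world in compatible else Fraction(0))
        for world, p in prior.items()
    }


# Hypothetical toy prior: two extremely unlikely worlds and one dominant one.
tiny = Fraction(1, 2**256)
prior = {"a": tiny, "b": 3 * tiny, "rest": 1 - 4 * tiny}

# Observing something that only worlds "a" and "b" are compatible with.
posterior = condition(prior, compatible={"a", "b"})
print(posterior["a"], posterior["b"])  # 1/4 3/4
```

Exact fractions are used instead of floats so that weights as small as $2^{-256}$ are not lost to rounding.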

james-stephen-brown on The Other Existential Crisis

Part of the answer is to note that mixtures of indeterminism and determinism are possible, so that libertarian free will is not just pure randomness, where any action is equally likely.

This is really interesting, because I agree with this, but also agree with what Seth's saying. I think this disagreement might actually be largely a semantic one. As such, I'm going to (try to) avoid using the terms 'libertarian' or 'compatibilist' free will. First of all, I agree with the use of "indeterminism" to mean non-uniform randomness. I agree that determinism and indeterminism can be mixed in such a way as to give rise to an emergent property that is not present in either purely determined or purely random systems. I understand this in relation to the idea of evolutionary "design", which emerges from a system that necessarily has both determined and indeterminate properties (indeterminate at least at the level of the genes; they might not be ultimately indeterminate).

I'm going to employ a decision-making map that seeks to clarify my understanding of how we make decisions and where we might get "what we want" from.

As I see it, the items in white are largely set, and change only gradually, and with no sense of control involved. I don't believe we have any control over our genes, our intentions or desires, what results our actions will have, or the world. I also don't think we have any control over our model of ourselves or the world; those are formed subconsciously. But our effort (in the green areas) allows for deliberative decision making, following an evolutionary selection process, in which our conscious awareness is involved.

In this way we are not beholden to the first action available to us; instead of taking an action in the world, we can make a series of simulated actions in our head, consciously experiencing the predicted outcome of those actions, until we find a satisfactory one. So, you don't end up with a determined or a random solution; you end up with an option based on your conscious experience of your simulated options. This process satisfies my wants in terms of my sense that I have some control (when I make the effort) over my decisions. At the same time, I'm agnostic about whether true indeterminism exists at all, but, like with evolution, with randomness at the level of the cell (that may not be ultimately random), I think even in an entirely determined universe, we exist on a level that is subject to, at least, some apparent indeterminism. And even if that apparent indeterminism turns out to be determined, our (eternal) inability to calculate what is determined still means we have no grounds to act in any way other than as if we have the control we feel we have.

I'm actually not sure if this makes me a compatibilist or not.

Determinists are always telling each other to act like libertarians. That's a clue that libertarianism is worth wanting.

So, I don't think my position is the same as asserting that we should act like libertarians. As I have (now) described my conception of the situation, I just think I should act consistently with this conception. By analogy, there are still people who say atheists often act according to "religious" moral values, but in fact they're not; it's just that morality is a mode of behaviour that has all the same functions regardless of one's belief system.

ryan_greenblatt on What are the best arguments for/against AIs being "slightly 'nice'"?

As in, I care about the long-run power of values-which-are-similar-to-my-values-on-reflection. Which includes me (on reflection) by definition, but I think probably also includes lots of other humans.

ryan_greenblatt on What are the best arguments for/against AIs being "slightly 'nice'"?

Then shouldn't such systems (which can surely recognize this argument) just take care of short-term survival instrumentally? Maybe you're making a claim about irrationality being likely or a claim that systems that care about long-run benefit act in apparently myopic ways.

(Note that historically it was much harder to keep value stability/lock-in than it will be for AIs.)

I'm not going to engage in detail FYI.

mako-yass on Eye contact is effortless when you’re no longer emotionally blocked on it

So, again, you did guess that you'd be able to do that for everyone, and I disagree with that.

I think most of the people who have difficulty making eye contact and want to overcome themselves on it are not in a good place to judge whether they should.

chipmonk on Eye contact is effortless when you’re no longer emotionally blocked on it

i see

hm that's why i put "safely" werp

alex_altair on Work with me on agent foundations: independent fellowship

Nice! Yeah I'd be happy to chat about that, and also happy to get referrals of any other researchers who might be interested in receiving this funding to work on it.

mako-yass on Eye contact is effortless when you’re no longer emotionally blocked on it

I'm aware that you have a nuanced perspective on this which is part of the reason I'm raising this.

nathan-helm-burger on Model evals for dangerous capabilities

Addendum:

I think it'd be great if we sorted dangerous capabilities evals into two categories, mitigated and unmitigated hazards.

Mitigated means the hazards you measure with enforceable mitigations in place, as in behind a secure API. This includes:

- API-based fine-tuning, where you filter the data that customers are allowed to fine-tune on and/or put various restrictions in place on the fine-tuning process.
- Limited prompt-only jailbreaks (including long-context many-example jailbreaks). This could include requiring that the jailbreaking needs to evade a filter which is trying to catch and block users that try to jailbreak.
  - Note that in the context of dangerous capabilities evals, jailbreaking can look like 'refusal dodging'. Refusal dodging is when you try to justify your question about dangerous technology by placing it in a reasonable context, such as a student studying for an exam, or analyzing an academic paper. If the model will summarize and extract key information from hazardous academic papers, that should be a red flag on a capability eval.
- Red-teamers may try to sneak past filters by, for example, fine-tuning the model to communicate in code, using purely innocent statements in plain text. It's fair game for the developers to put filters in place to try to catch red-teamers attempting to do this. See https://arxiv.org/html/2406.20053v1
This sounds like it's relatively easy-mode, and indeed it should be. But I still want to see a separate report on this, since it reassures me that the developer in question is taking reasonable precautions and has implemented them competently.

Unmitigated means the hazards you measure as if the weights had been stolen, or as if you had deliberately released the weights (looking at you, Meta). Unmitigated includes:

- Unlimited unfiltered jailbreaking attempts.
- White-box jailbreaking, where you get to use an optimization process working against the activations of the model to avoid refusals. (White-box attacks are generally harder to resist than black-box attacks.)
- Activation steering for non-refusal or for capabilities elicitation.
- Fine-tuning on task-specific domain knowledge and examples of alignment to terrorist agendas.
- Merging the model with other models or architectures.
It's fair game for the company to do anti-fine-tuning modifications of the model before doing the unmitigated testing. It's not fair game for the company to restrict what the red-teamers do to try to undo such measures. See https://arxiv.org/abs/2405.14577

I think this distinction is important, since I don't think that any companies so far have been good at publishing cleanly separated scores for mitigated and unmitigated. What OpenAI called 'unmitigated' is what I would call 'first pass mitigated, before the red teamers showed us how many holes our mitigations still had and we fixed those specific issues'. That's also an interesting measure, but not nearly as informative as a true unmitigated eval score.
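One way to picture "cleanly separated scores" is a report structure that never pools the two settings. The sketch below is purely illustrative: the class names, the `score` field (read here as the fraction of hazardous tasks completed), the choice to keep the worst-case score per eval, and all the numbers are assumptions made for this example, not anything a developer has published.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class Setting(Enum):
    MITIGATED = "mitigated"      # attacker only reaches the model via a secured API
    UNMITIGATED = "unmitigated"  # attacker is assumed to hold the weights


@dataclass(frozen=True)
class EvalResult:
    eval_name: str
    setting: Setting
    score: float  # hypothetical: fraction of hazardous tasks completed


def report(results):
    """Keep the worst-case (highest) score per eval, separately per setting."""
    table = defaultdict(dict)
    for r in results:
        previous = table[r.setting].get(r.eval_name, 0.0)
        table[r.setting][r.eval_name] = max(previous, r.score)
    return dict(table)


# Made-up numbers, only to show the shape of the output.
results = [
    EvalResult("bio", Setting.MITIGATED, 0.05),
    EvalResult("bio", Setting.MITIGATED, 0.10),
    EvalResult("bio", Setting.UNMITIGATED, 0.60),
    EvalResult("cyber", Setting.UNMITIGATED, 0.40),
]
summary = report(results)
for setting, scores in summary.items():
    print(setting.value, scores)
```

Keying the table by setting first means a mitigated score can never be averaged into, or mistaken for, an unmitigated one.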

@Zach Stein-Perlman