LessWrong 2.0 Reader

Questions for labs
Zach Stein-Perlman · 2024-04-30T22:15:55.362Z · comments (11)

MATS Summer 2023 Retrospective
utilistrutil · 2023-12-01T23:29:47.958Z · comments (34)

Attention SAEs Scale to GPT-2 Small
Connor Kissane (ckkissane) · 2024-02-03T06:50:22.583Z · comments (4)

[link] [Linkpost] Practically-A-Book Review: Rootclaim $100,000 Lab Leak Debate
trevor (TrevorWiesinger) · 2024-03-28T16:03:36.452Z · comments (22)

Secondary forces of debt
KatjaGrace · 2024-06-27T21:10:06.131Z · comments (18)

The Parable Of The Fallen Pendulum - Part 2
johnswentworth · 2024-03-12T21:41:30.180Z · comments (8)

Universal Love Integration Test: Hitler
Raemon · 2024-01-10T23:55:35.526Z · comments (65)

[link] AI takeoff and nuclear war
owencb · 2024-06-11T19:36:24.710Z · comments (6)

Mid-conditional love
KatjaGrace · 2024-04-17T04:00:08.341Z · comments (21)

Lying Alignment Chart
Zack_M_Davis · 2023-11-29T16:15:28.102Z · comments (17)

On Claude 3.0
Zvi · 2024-03-06T18:50:04.766Z · comments (5)

Value fragility and AI takeover
Joe Carlsmith (joekc) · 2024-08-05T21:28:07.306Z · comments (5)

[link] Are language models good at making predictions?
dynomight · 2023-11-06T13:10:36.379Z · comments (14)

Darwinian Traps and Existential Risks
KristianRonn · 2024-08-25T22:37:14.142Z · comments (14)

[link] The Offense-Defense Balance Rarely Changes
Maxwell Tabarrok (maxwell-tabarrok) · 2023-12-09T15:21:23.340Z · comments (23)

On the CrowdStrike Incident
Zvi · 2024-07-22T12:40:05.894Z · comments (14)

Grief is a fire sale
Nathan Young · 2024-03-04T01:11:06.882Z · comments (1)

Analogies between scaling labs and misaligned superintelligent AI
scasper · 2024-02-21T19:29:39.033Z · comments (5)

My guess at Conjecture's vision: triggering a narrative bifurcation
Alexandre Variengien (alexandre-variengien) · 2024-02-06T19:10:42.690Z · comments (12)

[link] The problems with the concept of an infohazard as used by the LW community [Linkpost]
Noosphere89 (sharmake-farah) · 2023-12-22T16:13:54.822Z · comments (43)

AISC9 has ended and there will be an AISC10
Linda Linsefors · 2024-04-29T10:53:18.812Z · comments (4)

[link] Bengio's Alignment Proposal: "Towards a Cautious Scientist AI with Convergent Safety Bounds"
mattmacdermott · 2024-02-29T13:59:34.959Z · comments (19)

Vote on Anthropic Topics to Discuss
Ben Pace (Benito) · 2024-03-06T19:43:47.194Z · comments (55)

[question] What could a policy banning AGI look like?
TsviBT · 2024-03-13T14:19:07.783Z · answers+comments (23)

[link] Claude 3.5 Sonnet
Zach Stein-Perlman · 2024-06-20T18:00:35.443Z · comments (41)

Neural uncertainty estimation review article (for alignment)
Charlie Steiner · 2023-12-05T08:01:32.723Z · comments (3)

[link] MIRI's June 2024 Newsletter
Harlan · 2024-06-14T23:02:23.721Z · comments (18)

Coherence of Caches and Agents
johnswentworth · 2024-04-01T23:04:31.320Z · comments (9)

A Simple Toy Coherence Theorem
johnswentworth · 2024-08-02T17:47:50.642Z · comments (15)

Anthropic Fall 2023 Debate Progress Update
Ansh Radhakrishnan (anshuman-radhakrishnan-1) · 2023-11-28T05:37:30.070Z · comments (9)

SAE-VIS: Announcement Post
CallumMcDougall (TheMcDouglas) · 2024-03-31T15:30:49.079Z · comments (8)

Q&A on Proposed SB 1047
Zvi · 2024-05-02T15:10:02.916Z · comments (8)

(Not) Derailing the LessOnline Puzzle Hunt
Error · 2024-06-04T01:28:31.688Z · comments (2)

Idealized Agents Are Approximate Causal Mirrors (+ Radical Optimism on Agent Foundations)
Thane Ruthenis · 2023-12-22T20:19:13.865Z · comments (14)

[link] The Cognitive-Theoretic Model of the Universe: A Partial Summary and Review
jessicata (jessica.liu.taylor) · 2024-03-27T19:59:27.893Z · comments (33)

On the UK Summit
Zvi · 2023-11-07T13:10:04.895Z · comments (6)

The Obliqueness Thesis
jessicata (jessica.liu.taylor) · 2024-09-19T00:26:30.677Z · comments (16)

The World in 2029
Nathan Young · 2024-03-02T18:03:29.368Z · comments (37)

The One and a Half Gemini
Zvi · 2024-02-22T13:10:04.725Z · comments (4)

On Dwarkesh’s Podcast with OpenAI’s John Schulman
Zvi · 2024-05-21T17:30:04.332Z · comments (4)

Scalable Oversight and Weak-to-Strong Generalization: Compatible approaches to the same problem
Ansh Radhakrishnan (anshuman-radhakrishnan-1) · 2023-12-16T05:49:23.672Z · comments (3)

A Gentle Introduction to Risk Frameworks Beyond Forecasting
pendingsurvival · 2024-04-11T18:03:25.605Z · comments (10)

[link] Nick Bostrom’s new book, “Deep Utopia”, is out today
PeterH · 2024-03-27T11:24:01.401Z · comments (5)

Companies' safety plans neglect risks from scheming AI
Zach Stein-Perlman · 2024-06-03T15:00:20.236Z · comments (4)

Mistakes people make when thinking about units
Isaac King (KingSupernova) · 2024-06-25T03:39:20.138Z · comments (14)

Dialogue on the Claim: "OpenAI's Firing of Sam Altman (And Shortly-Subsequent Events) On Net Reduced Existential Risk From AGI"
johnswentworth · 2023-11-21T17:39:17.828Z · comments (84)

[Full Post] Progress Update #1 from the GDM Mech Interp Team
Neel Nanda (neel-nanda-1) · 2024-04-19T19:06:59.185Z · comments (10)

In Defense of Open-Minded UDT
abramdemski · 2024-08-12T18:27:36.220Z · comments (27)

[link] LK-99 in retrospect
bhauth · 2024-07-07T02:06:27.660Z · comments (21)

[link] Excerpts from "A Reader's Manifesto"
Arjun Panickssery (arjun-panickssery) · 2024-09-06T22:37:40.254Z · comments (1)

Recent comments

linch on Linch's Shortform

AI News so far this week.
1. Mira Murati (CTO) leaving OpenAI

2. OpenAI restructuring to be a full for-profit company (what?)

3. Ivanka Trump calls Leopold's Situational Awareness article "excellent and important read"

4. More OpenAI leadership departing, unclear why.
4a. Apparently sama only learned about Mira's departure the same day she announced it on Twitter? "Move fast" indeed!
4b. WSJ reports some internals of what went down at OpenAI after the Nov board kerfuffle.

5. California Federation of Labor Unions (2 million+ members) spoke out in favor of SB 1047.

programcrafter on What is Randomness?

How do probabilities arise in the Everett interpretation of quantum mechanics?
After a series of quantum experiments, there will be a version of the observer who sees a sequence of very unlikely outcomes. How to make sense of that?

Well, in MWI there is the space of worlds, each associated with a certain complex number ("amplitude"). Some worlds can be uniform hydrogen over all the universe, some contain humans, a certain subspace contains me (I mean, collection of particles moving in a way introspectively-identical to me) writing a LessWrong comment, etc.

It so happens that the larger the magnitude of said complex number is, the more often we see such a world; IIRC, that inequality allows one to prove that the likelihood of seeing any world is proportional to the squared modulus of the amplitude, which is Born's rule.

The space of worlds is presumably infinite-dimensional, and also expands over time (though not at an exponential rate as is widely said, because "branches" must be merging all the time as well as splitting). That means that the probability distribution assigns a very low likelihood to pretty much any world... but why do we get any outcomes then?

I'm not attempting to answer the question of why we experience things in the first place (preferring instead to seal it even for myself), but as for why we continue to do so conditional on experiencing something before: because of the "conditional on". Conditional probability is the non-unitary operation over our observations of phase space, retaining some parts while zeroing all others, which are "incompatible with our observations"; also, as its formula is $P(A \mid B) = \frac{P(A \cap B)}{P(B)}$, it can amplify likelihoods for small values of $P(B)$. That doesn't totally fix the issue, but I believe the right thing to do in improbable worlds is to continue updating on evidence and choosing best actions as usual.
(To demonstrate that the issue with small probabilities is not fixed, let's divide the likelihood of any single world by $2^{256}$; here's a 256-bit random string: e697c6dfb32cf132805d38cf85a60c832247449749293054704ad56209d2440e).
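
For what it's worth, the "retain some parts, zero the rest, divide by $P(B)$" step can be sketched in a few lines. This is only a toy illustration under made-up assumptions: the four "worlds", their prior weights, and the `condition` helper are all hypothetical, and ordinary probabilities stand in for squared amplitudes.

```python
from fractions import Fraction


def condition(prior, compatible):
    """Zero out worlds incompatible with the observation, then renormalize.

    `prior` maps world -> probability; `compatible` is a predicate on worlds.
    Both are hypothetical stand-ins for illustration only.
    """
    p_b = sum(p for w, p in prior.items() if compatible(w))
    if p_b == 0:
        raise ValueError("observation has zero prior probability")
    return {w: (p / p_b if compatible(w) else Fraction(0))
            for w, p in prior.items()}


# Toy prior: two worlds with likelihoods on the order of 2^-256,
# and two "ordinary" worlds holding almost all the probability mass.
tiny = Fraction(1, 2**256)
prior = {
    "a": tiny,
    "b": 3 * tiny,
    "c": Fraction(1, 2) - 2 * tiny,
    "d": Fraction(1, 2) - 2 * tiny,
}

# Suppose our observations are only compatible with worlds "a" and "b".
posterior = condition(prior, lambda w: w in {"a", "b"})
print(posterior["a"], posterior["b"])
```

Here $P(B) = 4 \cdot 2^{-256}$, yet after conditioning the two surviving worlds carry all of the probability mass between them, which is the amplification for small $P(B)$ mentioned above. Exact fractions are used so that the tiny weights are not lost to floating-point rounding.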

james-stephen-brown on The Other Existential Crisis

Part of the answer is to note that mixtures of indeterminism and determinism are possible, so that libertarian free will is not just pure randomness, where any action is equally likely.

This is really interesting, because I agree with this, but also agree with what Seth's saying. I think this disagreement might actually be largely a semantic one. As such, I'm going to (try to) avoid using the terms 'libertarian' or 'compatibilist' free will. First of all I agree with the use of "indeterminism" to mean non-uniform randomness. I agree that there is a way that determinism and indeterminism can be mixed in such a way as to give rise to an emergent property that is not present in either purely determined or purely random systems. I understand this in relation to the idea of evolutionary "design" which emerges from a system that necessarily has both determined and indeterminate properties (indeterminate at least at the level of the genes, they might not be ultimately indeterminate).

I'm going to employ a decision-making map that seeks to clarify my understanding of how we make decisions and where we might get "what we want" from.

As I see it, the items in white are largely set, and change only gradually, and with no sense of control involved. I don't believe we have any control over our genes, our intentions or desires, what results our actions will have, or the world. I also don't think we have any control over our model of ourselves or the world; those are formed subconsciously. But our effort (in the green areas) allows for deliberative decision making, following an evolutionary selection process, in which our conscious awareness is involved.

In this way we are not beholden to the first action available to us, we can, instead of taking an action in the world, make a series of simulated actions in our head, consciously experiencing the predicted outcome of those actions, until we find a satisfactory one. So, you don't end up with a determined or a random solution, you end up with an option based on your conscious experience of your simulated options. This process satisfies my wants in terms of my sense that I have some control (when I make the effort) over my decisions. At the same time, I'm agnostic about whether true indeterminism exists at all, but, like with evolution, with randomness at the level of the cell (that may not be ultimately random), I think even in an entirely determined universe, we exist on a level that is subject to, at least, some apparent indeterminism. And even if that apparent indeterminism turns out to be determined, our (eternal) inability to calculate what is determined, still means we have no grounds to act in any way other than as if we have the control we feel we have.

I'm actually not sure if this makes me a compatibilist or not.

Determinists are always telling each other to act like libertarians. That's a clue that libertarianism is worth wanting.

So, I don't think my position is the same as asserting that we should act like libertarians; as I have (now) described my conception of the situation, I just think I should act consistently with this conception. By analogy, there are still people who say atheists often act according to "religious" moral values, but in fact they don't: morality is a mode of behaviour that has all the same functions regardless of one's belief system.

ryan_greenblatt on What are the best arguments for/against AIs being "slightly 'nice'"?

As in, I care about the long-run power of values-which-are-similar-to-my-values-on-reflection. Which includes me (on reflection) by definition, but I think probably also includes lots of other humans.

ryan_greenblatt on What are the best arguments for/against AIs being "slightly 'nice'"?

Then shouldn't such systems (which can surely recognize this argument) just take care of short-term survival instrumentally? Maybe you're making a claim about irrationality being likely or a claim that systems that care about long-run benefit act in apparently myopic ways.

(Note that historically it was much harder to keep value stability/lock-in than it will be for AIs.)

I'm not going to engage in detail FYI.

mako-yass on Eye contact is effortless when you’re no longer emotionally blocked on it

So, again, you did guess that you'd be able to do that for everyone, and I disagree with that.

I think most of the people who have difficulty making eye contact and want to overcome themselves on it are not in a good place to judge whether they should.

chipmonk on Eye contact is effortless when you’re no longer emotionally blocked on it

i see

hm that's why i put "safely" werp

alex_altair on Work with me on agent foundations: independent fellowship

Nice! Yeah I'd be happy to chat about that, and also happy to get referrals of any other researchers who might be interested in receiving this funding to work on it.

mako-yass on Eye contact is effortless when you’re no longer emotionally blocked on it

I'm aware that you have a nuanced perspective on this which is part of the reason I'm raising this.

nathan-helm-burger on Model evals for dangerous capabilities

Addendum:

I think it'd be great if we sorted dangerous capabilities evals into two categories, mitigated and unmitigated hazards.

Mitigated means the hazards you measure with enforceable mitigations in place, as in behind a secure API. This includes:

API-based fine-tuning, where you filter the data that customers are allowed to fine-tune on and/or put various restrictions in place on the fine-tuning process.
Limited prompt-only jailbreaks (including long context many-example jailbreaks). This could include requiring that the jail-breaking needs to evade a filter which is trying to catch and block users that try to jailbreak.
Note that in the context of dangerous capabilities evals, jail-breaking can look like 'refusal dodging'. Refusal dodging is when you try to justify your question about dangerous technology by placing it in a reasonable context, such as a student studying for an exam, or analyzing an academic paper. If the model will summarize and extract key information from hazardous academic papers, that should be a red flag on a capability eval.
Red-teamers may try to sneak past filters by, for example, fine-tuning the model to communicate in code, using purely innocent statements in plain-text. It's fair game for the developers to put filters in place to try to catch red-teamers attempting to do this. See https://arxiv.org/html/2406.20053v1
This sounds like it's relatively easy-mode, and indeed it should be. But I still want to see a separate report on this, since it reassures me that the developer in question is taking reasonable precautions and has implemented them competently.

Unmitigated means the hazards you measure as if the weights had been stolen, or you deliberately released the weights (looking at you Meta). Unmitigated includes:

Unlimited unfiltered jail-breaking attempts.
White-box jail-breaking, where you get to use an optimization process working against the activations of the model to avoid refusals. (White-box attacks are generally harder to resist than black-box attacks.)
Activation steering for non-refusal or for capabilities elicitation
Fine-tuning on task-specific domain knowledge and examples of alignment to terrorist agendas
Merging the model with other models or architectures
It's fair game for the company to do anti-fine-tuning modifications of the model before doing the unmitigated testing. It's not fair game for the company to restrict what the red-teamers do to try to undo such measures. See https://arxiv.org/abs/2405.14577

I think this distinction is important, since I don't think that any companies so far have been good at publishing cleanly separated scores for mitigated and unmitigated. What OpenAI called 'unmitigated' is what I would call 'first pass mitigated, before the red teamers showed us how many holes our mitigations still had and we fixed those specific issues'. That's also an interesting measure, but not nearly as informative as a true unmitigated eval score.

@Zach Stein-Perlman