LessWrong 2.0 Reader

The Residual Expansion: A Framework for thinking about Transformer Circuits
Daniel Tan (dtch1997) · 2024-08-02T11:04:56.347Z · comments (13)

An information-theoretic study of lying in LLMs
Annah (annah) · 2024-08-02T10:06:39.312Z · comments (0)

[link] Announcing The Techno-Humanist Manifesto: A new philosophy of progress for the 21st century
jasoncrawford · 2024-07-08T16:33:02.194Z · comments (4)

[link] AI Safety Newsletter #39: Implications of a Trump Administration for AI Policy Plus, Safety Engineering
Corin Katzke (corin-katzke) · 2024-07-29T17:50:52.454Z · comments (1)

[link] Jonathan Gorard: The territory is isomorphic to an equivalence class of its maps
Daniel C (harper-owen) · 2024-09-07T10:04:47.840Z · comments (18)

Interview with Robert Kralisch on Simulators
WillPetillo · 2024-08-26T05:49:15.543Z · comments (0)

[link] My lukewarm take on GLP-1 agonists
George3d6 · 2024-08-26T12:34:27.929Z · comments (0)

[link] Why good things often don’t lead to better outcomes
DMMF · 2024-09-19T16:37:07.778Z · comments (1)

[link] CultFrisbee
Gauraventh (aryangauravyadav) · 2024-08-11T21:36:36.550Z · comments (3)

My experience applying to MATS 6.0
mic (michael-chen) · 2024-07-18T19:02:21.849Z · comments (3)

Simulation-aware causal decision theory: A case for one-boxing in CDT
kongus_bongus · 2024-08-09T18:09:20.013Z · comments (11)

Activation Pattern SVD: A proposal for SAE Interpretability
Daniel Tan (dtch1997) · 2024-06-28T22:12:48.789Z · comments (2)

Probabilistic Logic <=> Oracles?
Yudhister Kumar (randomwalks) · 2024-07-01T05:36:35.350Z · comments (0)

Physical Therapy Sucks (but have you tried hiding it in some peanut butter?)
Declan Molony (declan-molony) · 2024-09-10T05:54:47.000Z · comments (12)

[link] Announcing the AI Forecasting Benchmark Series | July 8, $120k in Prizes
ChristianWilliams · 2024-07-02T22:33:23.512Z · comments (0)

AI labs can boost external safety research
Zach Stein-Perlman · 2024-07-31T19:30:16.207Z · comments (0)

The new UK government's stance on AI safety
Elliot_Mckernon (elliot) · 2024-07-31T15:23:59.235Z · comments (0)

Automating LLM Auditing with Developmental Interpretability
htlou · 2024-09-04T15:50:04.337Z · comments (0)

Room Available in Boston Group House
NoSignalNoNoise (AspiringRationalist) · 2024-07-23T02:55:59.602Z · comments (1)

[link] Holomorphic surjection theorem (Picard's little theorem)
dkl9 · 2024-07-21T13:24:18.300Z · comments (0)

[question] If AI is in a bubble and the bubble bursts, what would you do?
Remmelt (remmelt-ellen) · 2024-08-19T10:56:03.948Z · answers+comments (5)

Announcing the Ultimate Jailbreaking Championship
InnerHufflepuff (grayswan) · 2024-09-04T00:35:31.234Z · comments (1)

Are LLMs on the Path to AGI?
Davidmanheim · 2024-08-30T03:14:04.710Z · comments (2)

[link] AI x Human Flourishing: Introducing the Cosmos Institute
Brendan McCord (brendan-mccord) · 2024-09-05T18:23:32.690Z · comments (5)

Against Explosive Growth
c.trout (ctrout) · 2024-09-04T21:45:03.120Z · comments (1)

[link] The Ap Distribution
criticalpoints · 2024-08-24T21:45:35.029Z · comments (3)

[question] Looking to interview AI Safety researchers for a book
jeffreycaruso · 2024-08-24T19:57:33.119Z · answers+comments (0)

Stacked Laptop Monitor Update
jefftk (jkaufman) · 2024-07-15T09:40:06.075Z · comments (3)

Cat Sustenance Fortification
jefftk (jkaufman) · 2024-07-31T02:30:04.898Z · comments (7)

[link] Does robustness improve with scale?
ChengCheng (ccstan99) · 2024-07-25T20:55:53.359Z · comments (0)

Longevity: A critical look at "Loss of epigenetic information as a cause of mammalian aging"
Anna Crow · 2024-07-24T01:40:57.634Z · comments (2)

Primary Perceptive Systems
ChristianKl · 2024-08-15T11:26:01.667Z · comments (2)

Funding for work that builds capacity to address risks from transformative AI
abergal · 2024-08-14T23:52:09.922Z · comments (0)

Antitrust as Controlled Creative Destruction
Martin Sustrik (sustrik) · 2024-07-10T16:40:01.503Z · comments (2)

[link] Benefits of Psyllium Dietary Fiber in Particular
Brendan Long (korin43) · 2024-08-28T18:13:23.891Z · comments (7)

Emergence, The Blind Spot of GenAI Interpretability?
Quentin FEUILLADE--MONTIXI (quentin-feuillade-montixi) · 2024-08-10T10:07:53.654Z · comments (7)

[link] Verification methods for international AI agreements
Akash (akash-wasil) · 2024-08-31T14:58:10.986Z · comments (1)

A bet for Samo Burja
Nathan Helm-Burger (nathan-helm-burger) · 2024-09-05T16:01:35.440Z · comments (2)

WTF is with the Infancy Gospel of Thomas?!? A deep dive into satire, philosophy, and more
kromem · 2024-07-09T09:29:44.482Z · comments (1)

Chevy Bolt Review
jefftk (jkaufman) · 2024-09-26T13:40:05.456Z · comments (2)

[question] Looking for intuitions to extend bargaining notions
ProgramCrafter (programcrafter) · 2024-08-24T05:00:13.995Z · answers+comments (0)

Beyond Biomarkers: Understanding Multiscale Causality
Matěj Nekoranec (matej-nekoranec) · 2024-07-07T09:56:22.554Z · comments (0)

[question] Building an Inexpensive, Aesthetic, Private Forum
Aaron Graifman (aaron-graifman) · 2024-09-09T17:10:42.677Z · answers+comments (15)

My Experience Using Gamification
Wyatt S (wyatt-s) · 2024-07-26T23:06:53.392Z · comments (4)

[question] How great is the utility of "saving" endangered languages?
SpectrumDT · 2024-08-20T13:14:32.895Z · answers+comments (29)

[link] The Best Bits From Build, Baby, Build
Maxwell Tabarrok (maxwell-tabarrok) · 2024-07-11T14:09:10.131Z · comments (0)

[link] GPT-2 Sometimes Fails at IOI
Ronak_Mehta · 2024-08-14T23:24:39.268Z · comments (0)

[link] Five toy worlds to think about heritability
David Hugh-Jones (david-hugh-jones) · 2024-06-28T13:11:55.041Z · comments (0)

Ball Sq Pathways
jefftk (jkaufman) · 2024-07-21T02:20:06.607Z · comments (1)

Something Is Lost When AI Makes Art
utilistrutil · 2024-08-18T22:53:46.951Z · comments (0)

Recent comments

linch on Linch's Shortform

AI News so far this week.
1. Mira Murati (CTO) leaving OpenAI

2. OpenAI restructuring to be a full for-profit company (what?)

3. Ivanka Trump calls Leopold's Situational Awareness article "excellent and important read"

4. More OpenAI leadership departing, unclear why.
4a. Apparently sama only learned about Mira's departure the same day she announced it on Twitter? "Move fast" indeed!
4b. WSJ reports some internals of what went down at OpenAI after the Nov board kerfuffle.

5. California Federation of Labor Unions (2 million+ members) spoke out in favor of SB 1047.

programcrafter on What is Randomness?

How do probabilities arise in the Everett interpretation of quantum mechanics?
After a series of quantum experiments, there will be a version of the observer who sees a sequence of very unlikely outcomes. How to make sense of that?

Well, in MWI there is the space of worlds, each associated with a certain complex number ("amplitude"). Some worlds can be uniform hydrogen over all the universe, some contain humans, a certain subspace contains me (I mean, a collection of particles moving in a way introspectively identical to me) writing a LessWrong comment, etc.

It so happens that the larger the magnitude of said complex number is, the more often we see such a world; IIRC, that inequality allows one to prove that the likelihood of seeing any world is proportional to the squared modulus of its amplitude, which is Born's rule.
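
In symbols, as a rough sketch: the notation here is an assumption for illustration, not something fixed by the comment above. Write the overall state as a superposition of worlds $w_i$ with amplitudes $\alpha_i$; Born's rule then reads

$$
|\psi\rangle = \sum_i \alpha_i \, |w_i\rangle, \qquad \sum_i |\alpha_i|^2 = 1, \qquad P(w_i) = |\alpha_i|^2 .
$$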

The space of worlds is presumably infinite-dimensional, and also expands over time (though not at an exponential rate as is widely said, because "branches" must be merging all the time as well as splitting). That means that the probability distribution assigns a very low likelihood to pretty much any world... but why do we get any outcomes then?

I'm not attempting to answer the question of why we experience things in the first place (preferring instead to seal it even for myself), but as for why we continue to do so conditional on experiencing something before: because of the "conditional on". Conditional probability is the non-unitary operation over our observations of phase space, retaining some parts while zeroing all others, which are "incompatible with our observations"; also, as its formula is $P(A \mid B) = \frac{P(A \cap B)}{P(B)}$, it can amplify likelihoods for small values of $P(B)$. That doesn't totally fix the issue, but I believe the right thing to do in improbable worlds is to continue updating on evidence and choosing the best actions as usual.
(To demonstrate that the issue with small probabilities is not fixed, let's divide the likelihood of any single world by $2^{256}$; here's a 256-bit random string: e697c6dfb32cf132805d38cf85a60c832247449749293054704ad56209d2440e).
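
A minimal sketch of that amplification, under assumed toy numbers: conditioning applies $P(A \mid B) = P(A \cap B) / P(B)$, so worlds incompatible with the observation drop to zero and the surviving ones are scaled up by $1 / P(B)$. The world names, the prior, and the helper `condition` are all hypothetical and chosen only for illustration.

```python
from fractions import Fraction


def condition(prior, compatible):
    """Condition a distribution over worlds on an observation B.

    prior: dict mapping world name -> probability (assumed to sum to 1).
    compatible: set of world names consistent with the observation.
    """
    # P(B): total prior weight of the worlds compatible with the observation.
    p_b = sum(p for w, p in prior.items() if w in compatible)
    if p_b == 0:
        raise ValueError("cannot condition on a zero-probability observation")
    # P(A | B) = P(A and B) / P(B): incompatible worlds are zeroed,
    # compatible ones are scaled up by 1 / P(B).
    return {
        w: (p / p_b if w in compatible else Fraction(0))
        for w, p in prior.items()
    }


# Toy prior: two extremely improbable worlds plus everything else.
tiny = Fraction(1, 2**256)
prior = {"a": tiny / 4, "b": tiny * 3 / 4, "rest": 1 - tiny}

# Observing something only compatible with "a" or "b".
posterior = condition(prior, {"a", "b"})
print(posterior["a"], posterior["b"], posterior["rest"])
```

Exact fractions are used here because a prior of order $2^{-256}$ would be swamped by rounding error when subtracted from 1 in floating point. Within the improbable branch the relative weights come out as ordinary-sized numbers again, although the branch as a whole stays as unlikely as before.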

james-stephen-brown on The Other Existential Crisis

Part of the answer is to note that mixtures of indeterminism and determinism are possible, so that libertarian free will is not just pure randomness, where any action is equally likely.

This is really interesting, because I agree with this, but also agree with what Seth's saying. I think this disagreement might actually be largely a semantic one. As such, I'm going to (try to) avoid using the terms 'libertarian' or 'compatibilist' free will. First of all I agree with the use of "indeterminism" to mean non-uniform randomness. I agree that there is a way that determinism and indeterminism can be mixed in such a way as to give rise to an emergent property that is not present in either purely determined or purely random systems. I understand this in relation to the idea of evolutionary "design" which emerges from a system that necessarily has both determined and indeterminate properties (indeterminate at least at the level of the genes, they might not be ultimately indeterminate).

I'm going to employ a decision-making map that seeks to clarify my understanding of how we make decisions and where we might get "what we want" from.

As I see it, the items in white are largely set, and change only gradually, and with no sense of control involved. I don't believe we have any control over our genes, our intentions or desires, what results our actions will have, or the world; I also don't think we have any control over our model of ourselves or the world, since those are formed subconsciously. But our effort (in the green areas) allows for deliberative decision making, following an evolutionary selection process, in which our conscious awareness is involved.

In this way we are not beholden to the first action available to us, we can, instead of taking an action in the world, make a series of simulated actions in our head, consciously experiencing the predicted outcome of those actions, until we find a satisfactory one. So, you don't end up with a determined or a random solution, you end up with an option based on your conscious experience of your simulated options. This process satisfies my wants in terms of my sense that I have some control (when I make the effort) over my decisions. At the same time, I'm agnostic about whether true indeterminism exists at all, but, like with evolution, with randomness at the level of the cell (that may not be ultimately random), I think even in an entirely determined universe, we exist on a level that is subject to, at least, some apparent indeterminism. And even if that apparent indeterminism turns out to be determined, our (eternal) inability to calculate what is determined, still means we have no grounds to act in any way other than as if we have the control we feel we have.

I'm actually not sure if this makes me a compatibilist or not.

Determinists are always telling each other to act like libertarians. That's a clue that libertarianism is worth wanting.

So, I don't think my position is the same as asserting that we should act like libertarians. As I have (now) described my conception of the situation, I just think I should act consistently with this conception. By analogy, there are still people who say atheists often act according to "religious" moral values, but in fact they're not; it's just that morality is a mode of behaviour that has all the same functions regardless of one's belief system.

ryan_greenblatt on What are the best arguments for/against AIs being "slightly 'nice'"?

As in, I care about the long-run power of values-which-are-similar-to-my-values-on-reflection. Which includes me (on reflection) by definition, but I think probably also includes lots of other humans.

ryan_greenblatt on What are the best arguments for/against AIs being "slightly 'nice'"?

Then shouldn't such systems (which can surely recognize this argument) just take care of short term survival instrumentally? Maybe you're making a claim about irrationality being likely or a claim that systems that care about long run benefit act in apparently myopic ways.

(Note that historically it was much harder to keep value stability/lock-in than it will be for AIs.)

I'm not going to engage in detail FYI.

mako-yass on Eye contact is effortless when you’re no longer emotionally blocked on it

So, again, you did guess that you'd be able to do that for everyone, and I disagree with that.

I think most of the people who have difficulty making eye contact and want to overcome themselves on it are not in a good place to judge whether they should.

chipmonk on Eye contact is effortless when you’re no longer emotionally blocked on it

i see

hm that's why i put "safely" werp

alex_altair on Work with me on agent foundations: independent fellowship

Nice! Yeah I'd be happy to chat about that, and also happy to get referrals of any other researchers who might be interested in receiving this funding to work on it.

mako-yass on Eye contact is effortless when you’re no longer emotionally blocked on it

I'm aware that you have a nuanced perspective on this which is part of the reason I'm raising this.

nathan-helm-burger on Model evals for dangerous capabilities

Addendum:

I think it'd be great if we sorted dangerous capabilities evals into two categories, mitigated and unmitigated hazards.

Mitigated means the hazards you measure with enforceable mitigations in place, as in behind a secure API. This includes:

API-based fine-tuning, where you filter the data that customers are allowed to fine-tune on and/or put various restrictions in place on the fine-tuning process.
Limited prompt-only jailbreaks (including long context many-example jailbreaks). This could include requiring the jail-breaking to evade a filter which is trying to catch and block users who try to jailbreak.
Note that in the context of dangerous capabilities evals, jail-breaking can look like 'refusal dodging'. Refusal dodging is when you try to justify your question about dangerous technology by placing it in a reasonable context, such as a student studying for an exam, or analyzing an academic paper. If the model will summarize and extract key information from hazardous academic papers, that should be a red flag on a capability eval.
Red-teamers may try to sneak past filters by, for example, fine-tuning the model to communicate in code, using purely innocent statements in plain-text. It's fair game for the developers to put filters in place to try to catch red-teamers attempting to do this. See https://arxiv.org/html/2406.20053v1
This sounds like it's relatively easy-mode, and indeed it should be. But I still want to see a separate report on this, since it reassures me that the developer in question is taking reasonable precautions and has implemented them competently.

Unmitigated means the hazards you measure as if the weights had been stolen, or you deliberately released the weights (looking at you Meta). Unmitigated includes:

Unlimited unfiltered jail-breaking attempts.
White-box jail-breaking, where you get to use an optimization process working against the activations of the model to avoid refusals (white-box attacks are generally harder to resist than black-box attacks).
Activation steering for non-refusal or for capabilities elicitation.
Fine-tuning on task-specific domain knowledge and examples of alignment to terrorist agendas.
Merging the model with other models or architectures.
It's fair game for the company to do anti-fine-tuning modifications of the model before doing the unmitigated testing. It's not fair game for the company to restrict what the red-teamers do to try to undo such measures. See https://arxiv.org/abs/2405.14577

I think this distinction is important, since I don't think that any companies so far have been good at publishing cleanly separated scores for mitigated and unmitigated. What OpenAI called 'unmitigated' is what I would call 'first pass mitigated, before the red teamers showed us how many holes our mitigations still had and we fixed those specific issues'. That's also an interesting measure, but not nearly as informative as a true unmitigated eval score.

@Zach Stein-Perlman