joseph-miller

Posts

Gradient Routing: Masking Gradients to Localize Computation in Neural Networks 2024-12-06T22:19:26.717Z

Transformer Circuit Faithfulness Metrics Are Not Robust 2024-07-12T03:47:30.077Z

Joseph Miller's Shortform 2024-05-21T20:50:31.757Z

How To Do Patching Fast 2024-05-11T20:13:52.424Z

Why I'm doing PauseAI 2024-04-30T16:21:54.156Z

Global Pause AI Protest 10/21 2023-10-14T03:20:27.937Z

The International PauseAI Protest: Activism under uncertainty 2023-10-12T17:36:15.716Z

Even Superhuman Go AIs Have Surprising Failure Modes 2023-07-20T17:31:35.814Z

We Found An Neuron in GPT-2 2023-02-11T18:27:29.410Z

Comments

Comment by Joseph Miller (Josephm) on What Makes an AI Startup "Net Positive" for Safety? · 2025-04-19T10:22:38.254Z · LW · GW

I think almost all startups are really great! I think there really is a very small set of startups that end up harmful for the world

I think you're kind of avoiding the question. What startups are really great for AI safety?

Comment by Joseph Miller (Josephm) on British and American Connotations · 2025-04-18T19:24:15.062Z · LW · GW

In American English (AE), "quite" is an intensifier, while in British English (BE) it's a mild deintensifier.

This does depend on context. In formal or old-fashioned British English, "quite" is also an intensifier. For example:

"Sir, you quite misunderstand me," said Mrs. Bennet, alarmed.

from Pride and Prejudice by Jane Austen.

"Graft" implies corruption in AE but hard work in BE.

I think "graft" also often implies corruption in British English.

Comment by Joseph Miller (Josephm) on Joseph Miller's Shortform · 2025-04-18T12:54:21.093Z · LW · GW

Rationalist twitter rage-bait recipe:

Rationalist: *reasonable, highly decoupling point about the holocaust*

Everyone: *highly coupling rage*

Rationalist: *shocked pikachu face*

Comment by Joseph Miller (Josephm) on johnswentworth's Shortform · 2025-04-15T00:51:03.664Z · LW · GW

Then there’s the AI regulation activists and lobbyists. They lobby and protest and stuff, pretending like they’re pushing for regulations on AI, but really they’re mostly networking and trying to improve their social status with DC People.

The activists and the lobbyists are two very different groups. The activists are not trying to network with the DC people (yet). Unless you mean Encode, who I would call lobbyists, not activists.

Comment by Joseph Miller (Josephm) on Lucius Bushnaq's Shortform · 2025-04-15T00:46:13.967Z · LW · GW

If the animal specific features form an overcomplete basis, isn't the set of animals + attributes just an even more overcomplete basis?

Comment by Joseph Miller (Josephm) on Alexander Gietelink Oldenziel's Shortform · 2025-04-13T21:29:33.135Z · LW · GW

From the Caro biography, it's pretty clear Lyndon Johnson had extraordinary political talent.

Comment by Joseph Miller (Josephm) on Legibility · 2025-04-08T23:25:08.996Z · LW · GW

You accidentally touch a hot stove and don't feel any pain. It's been months since your sensory inputs have congealed into pain.

Is this something you have achieved? Could you give more details about what this means?

If you touch a hot stove will you reflexively remove your hand?
If I inflict on you what to most people would be extreme physical pain (that is not physically damaging) (capsaicin?) would this be at worst a mild annoyance to you?
Do you ever take painkillers? Would you in an extreme situation like a medical operation?

Comment by Joseph Miller (Josephm) on How Gay is the Vatican? · 2025-04-07T01:53:28.338Z · LW · GW

This is the sort of content I come to LessWrong for.

Comment by Joseph Miller (Josephm) on Wei Dai's Shortform · 2025-03-29T11:55:42.809Z · LW · GW

Does this still seem wrong to you?

Yes. I plan to write down my views properly at some point. But roughly I subscribe to non-cognitivism.

Moral questions are not well defined because they are written in ambiguous natural language, so they are not truth apt. Now you could argue that many reasonable questions are also ambiguous in this sense. Eg the question "how many people live in Sweden" is ultimately ambiguous because it is not written in a formal system (ie. the borders of Sweden are not defined down to the atomic level).

But you could in theory define the Sweden question in formal terms. You could define arbitrarily at how many nanoseconds after conception a fetus becomes a person and resolve all other ambiguities until the only work left would be empirical measurement of a well defined quantity.

And technically you could do the same for any moral question. But unlike the Sweden question, it would be hard to pick formal definitions that everyone can agree are reasonable. You could try to formally define the terms in "what should our values be?". Then the philosophical question becomes "what is the formal definition of 'should'?". But this suffers the same ambiguity. So then you must define that question. And so on in an endless recursion. It seems to me that there cannot be any One True resolution to this. At some point you just have to arbitrarily pick some definitions.

The underlying philosophy here is that I think for a question to be one on which you can make progress, it must be one in which some answers can be shown to be correct and others incorrect. ie. questions where two people who disagree in good faith will reliably converge by understanding each other's view. Questions where two aliens from different civilizations can reliably give the same answer without communicating. And the only questions like this seem to be those defined in formal systems.

Choosing definitions does not seem like such a set of questions. So resolving the ambiguities in moral questions is not something on which progress can be made. So we will never finally arrive at the One True answer to moral questions.

Comment by Joseph Miller (Josephm) on Explaining British Naval Dominance During the Age of Sail · 2025-03-29T01:39:40.919Z · LW · GW

The unemployment pool that resulted from this efficiency wage made it easier to discipline officers by moving them back to the captains list.

I don't understand this point or how it explains captains' willingness to fight.

Comment by Joseph Miller (Josephm) on Wei Dai's Shortform · 2025-03-29T00:35:55.854Z · LW · GW

the One True Form of Moral Progress

Have you written about this? This sounds very wrong to me.

Comment by Joseph Miller (Josephm) on Tracing the Thoughts of a Large Language Model · 2025-03-28T01:48:09.580Z · LW · GW

DeepMind says boo SAEs, now Anthropic says yay SAEs!^[1]

Reading this paper pushed me a fair amount in the yay direction. We may still be at the unsatisfying level where we can only say "this cluster of features seems to roughly correlate with this type of thing" and "the interaction between this cluster and this cluster seems to mostly explain this loose group of behaviors". But it looks like we're actually pointing at real things in the model. And therefore we are beginning to be able to decompose the computation of LLMs in meaningful ways. The Addition Case Study is seriously cool and feels like a true insight into the model's internal algorithms.

Maybe we will further decompose these explanations until we can get down to satisfying low-level descriptions like "this mathematical object is computed by this function and is used in this algorithm". Even if we could still interpret circuits at this level of abstraction, humans probably couldn't hold in their heads all the relevant parts of a single forward pass at once. But AIs could, or maybe that won't be required for useful applications.

The prominent error terms and simplifying assumptions are worrying, but maybe throwing enough compute and hill-climbing research at the problem will eventually shrink them to acceptable sizes. It's notable that this paper contains very few novel conceptual ideas and is mostly just a triumph of engineering schlep, massive compute and painstaking manual analysis.

^[1] This is obviously a straw man of both sides. They seem to be thinking about it from pretty different perspectives. DeepMind is roughly judging them by their immediate usefulness in applications, while Anthropic is looking at them as a stepping stone towards ambitious moonshot interp.

Comment by Joseph Miller (Josephm) on Joseph Miller's Shortform · 2025-03-26T22:32:51.237Z · LW · GW

Claude 3.7's annoying personality is the first example of accidentally misaligned AI making my life worse. Claude 3.5/3.6 was renowned for its superior personality that made it more pleasant to interact with than ChatGPT.

3.7 has an annoying tendency to do what it thinks you should do, rather than following instructions. I've run into this frequently in two coding scenarios:

In Cursor, I ask it to implement some function in a particular file. Even when explicitly instructed not to, it guesses what I want to do next and changes other parts of the code as well.
I'm trying to fix part of my code and I ask it to diagnose a problem and suggest debugging steps. Even when explicitly instructed not to, it will suggest alternative approaches that circumvent the issue, rather than trying to fix the current approach.

I call this misalignment, rather than a capabilities failure, because it seems a step back from previous models and I suspect it is a side effect of training the model to be good at autonomous coding tasks, which may be overriding its compliance with instructions.

Comment by Joseph Miller (Josephm) on Will Jesus Christ return in an election year? · 2025-03-24T20:32:51.206Z · LW · GW

This means that the Jesus Christ market is quite interesting! You could make it even more interesting by replacing it with "This Market Will Resolve No At The End Of 2025": then it would be purely a market on how much Polymarket traders will want money later in the year.

It's unclear how this market would resolve. I think you meant something more like a market on "2+2=5"?

Comment by Joseph Miller (Josephm) on trevor's Shortform · 2025-03-20T19:55:48.117Z · LW · GW

I read this and still don't understand what an acceptable target slot is.

Comment by Joseph Miller (Josephm) on Joseph Miller's Shortform · 2025-03-19T17:20:54.477Z · LW · GW

Then it will often confabulate a reason why the correct thing it said was actually wrong. So you can never really trust it, you have to think about what makes sense and test your model against reality.

But to some extent that's true for any source of information. LLMs are correct about a lot of things and you can usually guess which things they're likely to get wrong.

Comment by Joseph Miller (Josephm) on Joseph Miller's Shortform · 2025-03-19T00:01:16.574Z · LW · GW

LLM hallucination is good epistemic training. When I code, I'm constantly asking Claude how things work and what things are possible. It often gets things wrong, but it's still helpful. You just have to use it to help you build up a gears level model of the system you are working with. Then, when it confabulates some explanation you can say "wait, what?? that makes no sense" and it will say "You're right to question these points - I wasn't fully accurate" and give you better information.

Comment by Joseph Miller (Josephm) on Against Yudkowsky's evolution analogy for AI x-risk [unfinished] · 2025-03-18T20:17:45.491Z · LW · GW

See No convincing evidence for gradient descent in activation space

Comment by Joseph Miller (Josephm) on kave's Shortform · 2025-03-18T11:09:10.828Z · LW · GW

It's not really feasible for the feature to rely on people reading this PSA to work well. The correct usage needs to be obvious.

Comment by Joseph Miller (Josephm) on Joseph Miller's Shortform · 2025-03-16T20:29:32.215Z · LW · GW

When I go on LessWrong, I generally just look at the quick takes and then close the tab. Quick takes cause me to spend more time on LessWrong but spend less time reading actual posts.

On the other hand, sometimes quick takes are very high quality and I read them and get value from them when I may not have read the same content as a full post.

Comment by Joseph Miller (Josephm) on leogao's Shortform · 2025-03-03T19:17:13.249Z · LW · GW

I find it very annoying that standard reference culture seems to often imply giving extremely positive references unless someone was truly awful, since it makes it much harder to get real info from references

Agreed, but also most of the world does operate in this reference culture. If you choose to take a stand against it, you might screw over a decent candidate by providing only a quite positive recommendation.

Comment by Joseph Miller (Josephm) on How To Do Patching Fast · 2025-03-01T16:18:51.213Z · LW · GW

Hey, long time no see! Thanks, I've corrected it:

$= \frac{\partial F (e_{α})}{\partial e_{α}} \frac{\partial [e_{clean} + α \times (e_{corr} - e_{clean})]}{\partial α}$
$= \frac{\partial F (e_{α})}{\partial e_{α}} (e_{corr} - e_{clean})$
Set $α = 0$, i.e. $e_{α} = e_{clean}$
$= (e_{corr} - e_{clean}) \frac{\partial F (e_{clean})}{\partial e_{clean}}$
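
A quick numerical sanity check of the last line, as a minimal sketch rather than anything from the post itself. The toy function `F`, its hand-derived gradient `grad_F`, and the example vectors are all assumptions made up for illustration; the only thing taken from the derivation is that the derivative of $F(e_{α})$ with respect to $α$ at $α = 0$ should equal $(e_{corr} - e_{clean})$ dotted with the gradient of $F$ at $e_{clean}$.

```python
import math


def F(e):
    # Hypothetical toy metric on an activation vector (stand-in for a model output).
    return sum(math.tanh(x) for x in e) ** 2


def grad_F(e):
    # Analytic gradient of the toy F: with S = sum_j tanh(x_j),
    # dF/dx_i = 2 * S * (1 - tanh(x_i)^2).
    s = sum(math.tanh(x) for x in e)
    return [2.0 * s * (1.0 - math.tanh(x) ** 2) for x in e]


def interpolate(e_clean, e_corr, alpha):
    # e_alpha = e_clean + alpha * (e_corr - e_clean)
    return [c + alpha * (k - c) for c, k in zip(e_clean, e_corr)]


def attribution_estimate(e_clean, e_corr):
    # (e_corr - e_clean) . dF/de evaluated at e_clean, i.e. the alpha = 0 case.
    g = grad_F(e_clean)
    return sum((k - c) * gi for c, k, gi in zip(e_clean, e_corr, g))


def finite_difference(e_clean, e_corr, h=1e-6):
    # Central-difference estimate of dF(e_alpha)/d(alpha) at alpha = 0.
    f_plus = F(interpolate(e_clean, e_corr, h))
    f_minus = F(interpolate(e_clean, e_corr, -h))
    return (f_plus - f_minus) / (2.0 * h)


# Made-up example activations.
e_clean = [0.2, -0.5, 1.0]
e_corr = [0.3, 0.1, 0.4]

estimate = attribution_estimate(e_clean, e_corr)
numeric = finite_difference(e_clean, e_corr)
print(estimate, numeric)
```

The two printed numbers should agree to several decimal places. This only checks the derivative at $α = 0$; it says nothing about how well the linear approximation holds all the way out to $α = 1$.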

Comment by Joseph Miller (Josephm) on Campbell Hutcheson's Shortform · 2025-02-26T23:24:51.314Z · LW · GW

It's surprising he bought the gun so long in advance. I think there should be footage of him buying it, as required by California law.

Comment by Joseph Miller (Josephm) on Campbell Hutcheson's Shortform · 2025-02-26T23:23:36.756Z · LW · GW

You can see what he's referring to in the pictures Webb published of the scene.

Comment by Joseph Miller (Josephm) on LoganStrohl's Shortform · 2025-02-26T23:20:12.335Z · LW · GW

What is prospective memory training?

Comment by Joseph Miller (Josephm) on leogao's Shortform · 2025-02-24T09:12:23.315Z · LW · GW

I think there's a spectrum between great man theory and structural forces theory and I would classify your view as much closer to the structural forces view, rather than a combination of the two.

The strongest counter-example might be Mao. It seems like one man's idiosyncratic whims really did set the trajectory for hundreds of millions of people. Of course, as soon as he died most of that power vanished, but surely China and the world would be extremely different today without him.

Comment by Joseph Miller (Josephm) on leogao's Shortform · 2025-02-24T09:06:14.218Z · LW · GW

The Duke of Wellington said that Napoleon's presence on a battlefield “was worth forty thousand men”.

This would be about 4% of France's military size in 1812.

Comment by Joseph Miller (Josephm) on Joseph Miller's Shortform · 2025-02-22T22:09:23.923Z · LW · GW

I first encountered it in chapter 18 of The Looming Tower by Lawrence Wright.

But here's an easily linkable online source: https://ctc.westpoint.edu/revisiting-al-qaidas-anthrax-program/

Comment by Joseph Miller (Josephm) on Joseph Miller's Shortform · 2025-02-22T17:27:27.441Z · LW · GW

"Despite their extreme danger, we only became aware of them when the enemy drew our attention to them by repeatedly expressing concerns that they can be produced simply with easily available materials."

Ayman al-Zawahiri, former leader of Al-Qaeda, on chemical/biological weapons.

I don't think this is a knock-down argument against discussing CBRN risks from AI, but it seems worth considering.

Comment by Joseph Miller (Josephm) on Literature Review of Text AutoEncoders · 2025-02-21T14:25:22.542Z · LW · GW

This is great, thanks. I think these could be very helpful for interpretability.

Comment by Joseph Miller (Josephm) on A History of the Future, 2025-2040 · 2025-02-19T09:32:10.728Z · LW · GW

Thanks I enjoyed this.

The main thing that seems wrong to me, similar to some of your other recent posts, is that AI progress seems to mysteriously decelerate around 2030. I predict that things will look much more sci-fi after that point than in your story (if we're still alive).

Comment by Joseph Miller (Josephm) on Joseph Miller's Shortform · 2025-02-18T11:09:39.958Z · LW · GW

xAI claims to have a cluster of 200k GPUs, presumably H100s, online for long enough to train Grok 3.

I think this is faster datacenter scaling than any predictions I've heard.

Source: https://x.com/xai/status/1891699715298730482

Comment by Joseph Miller (Josephm) on Murder plots are infohazards · 2025-02-16T16:47:15.577Z · LW · GW

DM'd

Comment by Joseph Miller (Josephm) on Murder plots are infohazards · 2025-02-15T12:02:32.002Z · LW · GW

In that case I would consider applying for EA funds if you are willing to do the work professionally or set up a charity to do it. I think you could make a strong case that it meets the highest bar for important, neglected and tractable work.

Comment by Joseph Miller (Josephm) on Murder plots are infohazards · 2025-02-15T09:57:15.249Z · LW · GW

How long does it take you to save one life on average? GiveWell's top charities save a life for about $5000. If you can get close to that there should be many EA philanthropists willing to fund you or a charity you create.

And I think they should be willing to go up to like $10-20k at least because murders are probably especially bad deaths in terms of their effects on the world.

Comment by Joseph Miller (Josephm) on interpreting GPT: the logit lens · 2025-02-13T20:51:42.637Z · LW · GW

I just found the paper BERT's output layer recognizes all hidden layers? Some Intriguing Phenomena and a simple way to boost BERT, which precedes this post by a few months and invents essentially the same technique as the logit lens.

So consider also citing that paper when citing this post.

As an aside, I would guess that this is the most cited LessWrong post in the academic literature, but it would be cool if anyone had stats on that.

Comment by Joseph Miller (Josephm) on Viliam's Shortform · 2025-02-03T14:59:42.877Z · LW · GW

Yeah I guess, but actually the more I think about it, the more impractical it seems.

Comment by Joseph Miller (Josephm) on Viliam's Shortform · 2025-02-03T11:55:35.010Z · LW · GW

I think the solution would be something like adopting a security mindset with respect to preventing community members going off the rails.

The costs would be great because then everyone would be under suspicion by default, but maybe it would be worth it.

Comment by Joseph Miller (Josephm) on Joseph Miller's Shortform · 2025-02-02T23:29:26.100Z · LW · GW

The next international PauseAI protest is taking place in one week in London, New York, Stockholm (Sunday 9th Feb), Paris (Mon 10 Feb) and many other cities around the world.

We are calling for AI Safety to be the focus of the upcoming Paris AI Action Summit. If you're on the fence, take a look at Why I'm doing PauseAI.

Comment by Joseph Miller (Josephm) on TsviBT's Shortform · 2025-01-28T12:20:23.914Z · LW · GW

For those in Europe, Tomorrow Biostasis makes the process a lot easier and they have people who will talk you through step by step.

Comment by Joseph Miller (Josephm) on Reality has a surprising amount of detail · 2025-01-25T07:14:44.021Z · LW · GW

A good example of surprising detail I just read.

It turns out that the UI for a simple handheld calculator is a large design space with no easy solutions.

https://lcamtuf.substack.com/p/ui-is-hell-four-function-calculators

Comment by Joseph Miller (Josephm) on Thane Ruthenis's Shortform · 2025-01-20T22:51:32.255Z · LW · GW

Following OpenAI Twitter freakouts is a colossal, utterly pointless waste of your time and you shouldn't do it ever.

I feel like for the same reasons, this shortform is kind of an engaging waste of my time. One reason I read LessWrong is to avoid twitter garbage.

Comment by Joseph Miller (Josephm) on Leon Lang's Shortform · 2025-01-17T01:26:41.770Z · LW · GW

we thought that forecasting AI trends was important to be able to have us taken seriously

This might be the most dramatic example ever of forecasting affecting the outcome.

Similarly, I'm concerned that a lot of alignment people are putting work into evals and benchmarks which may be having some accelerating effect on the AI capabilities which they are trying to understand.

"That which is measured improves. That which is measured and reported improves exponentially."

Comment by Joseph Miller (Josephm) on I'm offering free math consultations! · 2025-01-14T23:58:16.466Z · LW · GW

Just did a debugging session IRL with Gurkenglas and it was very helpful!

Comment by Joseph Miller (Josephm) on Turning up the Heat on Deceptively-Misaligned AI · 2025-01-08T16:42:44.877Z · LW · GW

correctness and beta-coherence can be rolled up into one specific property

Is that rolling up two things into one, or is that just beta-coherence?

Comment by Joseph Miller (Josephm) on Activation space interpretability may be doomed · 2025-01-08T16:37:20.897Z · LW · GW

I agree that the ultimate goal is to understand the weights. Seems pretty unclear whether trying to understand the activations is a useful stepping stone towards that. And it's hard to be sure how relevant theoretical toy examples are to that question.

Comment by Joseph Miller (Josephm) on Joseph Miller's Shortform · 2025-01-08T00:58:24.236Z · LW · GW

Ilya Sutskever had two armed bodyguards with him at NeurIPS.

Some people are asking for a source on this. I'm pretty sure I've heard it from multiple people who were there in person but I can't find a written source. Can anyone confirm or deny?

Comment by Joseph Miller (Josephm) on Joseph Miller's Shortform · 2025-01-08T00:40:59.654Z · LW · GW

Well, it seems quite important whether the DROS registration could possibly have been staged.

That would be difficult. To purchase a gun in California you have to provide photo ID^[1], proof of address^[2] and a thumbprint^[3]. Also it looks like the payment must be trackable^[4] and gun stores have to maintain video surveillance footage for up to a year.^[5]

My guess is that the police haven't actually investigated this as a potential homicide, but if they did, there should be very strong evidence that Balaji bought a gun. Potentially a very sophisticated actor could fake this evidence but it seems challenging (I can't find any historical examples of this happening). It would probably be easier to corrupt the investigation. Or the perpetrators might just hope that there would be no investigation.

There is a 10-day waiting period to purchase guns in California^[5], so Balaji would probably have started planning his suicide before his hiking trip (I doubt someone like him would own a gun for recreational purposes?).

Is the interview with the NYT going to be published?

I think it's this piece that was published before his death.

Is any of the police behavior actually out of the ordinary?

Epistemic status: highly uncertain: my impressions from searching with LLMs for a few minutes.

It's fairly common for victims' families to contest official suicide rulings. In cases with lots of public attention police generally try to justify their conclusions. So we might expect the police to publicly state if there is footage of Balaji purchasing the gun shortly before his death. It could be that this will still happen with more time or public pressure.

Comment by Joseph Miller (Josephm) on Nina Panickssery's Shortform · 2025-01-07T12:10:22.205Z · LW · GW

land in space will be less valuable than land on earth until humans settle outside of earth (which I don't believe will happen in the next few decades).

Why would it take so long? Is this assuming no ASI?

Comment by Joseph Miller (Josephm) on Review: Planecrash · 2025-01-07T01:26:47.240Z · LW · GW

Wow that's great, thanks. @L Rudolf L you should link this in this post.
