Posts

Gradient Routing: Masking Gradients to Localize Computation in Neural Networks 2024-12-06T22:19:26.717Z
Transformer Circuit Faithfulness Metrics Are Not Robust 2024-07-12T03:47:30.077Z
Joseph Miller's Shortform 2024-05-21T20:50:31.757Z
How To Do Patching Fast 2024-05-11T20:13:52.424Z
Why I'm doing PauseAI 2024-04-30T16:21:54.156Z
Global Pause AI Protest 10/21 2023-10-14T03:20:27.937Z
The International PauseAI Protest: Activism under uncertainty 2023-10-12T17:36:15.716Z
Even Superhuman Go AIs Have Surprising Failure Modes 2023-07-20T17:31:35.814Z
We Found An Neuron in GPT-2 2023-02-11T18:27:29.410Z

Comments

Comment by Joseph Miller (Josephm) on Viliam's Shortform · 2025-02-03T14:59:42.877Z · LW · GW

Yeah I guess, but actually the more I think about it, the more impractical it seems.

Comment by Joseph Miller (Josephm) on Viliam's Shortform · 2025-02-03T11:55:35.010Z · LW · GW

I think the solution would be something like adopting a security mindset with respect to preventing community members going off the rails.

The costs would be great because then everyone would be under suspicion by default, but maybe it would be worth it.

Comment by Joseph Miller (Josephm) on Joseph Miller's Shortform · 2025-02-02T23:29:26.100Z · LW · GW

The next international PauseAI protest is taking place in one week in London, New York, Stockholm (Sunday 9th Feb), Paris (Mon 10 Feb) and many other cities around the world.

We are calling for AI Safety to be the focus of the upcoming Paris AI Action Summit. If you're on the fence, take a look at Why I'm doing PauseAI.

Comment by Joseph Miller (Josephm) on TsviBT's Shortform · 2025-01-28T12:20:23.914Z · LW · GW

For those in Europe, Tomorrow Biostasis makes the process a lot easier and they have people who will talk you through step by step.

Comment by Joseph Miller (Josephm) on Reality has a surprising amount of detail · 2025-01-25T07:14:44.021Z · LW · GW

A good example of surprising detail I just read.

It turns out that the UI for a simple handheld calculator is a large design space with no easy solutions.

https://lcamtuf.substack.com/p/ui-is-hell-four-function-calculators

Comment by Joseph Miller (Josephm) on Thane Ruthenis's Shortform · 2025-01-20T22:51:32.255Z · LW · GW
  • Following OpenAI Twitter freakouts is a colossal, utterly pointless waste of your time and you shouldn't do it ever.

I feel like for the same reasons, this shortform is kind of an engaging waste of my time. One reason I read LessWrong is to avoid twitter garbage.

Comment by Joseph Miller (Josephm) on Leon Lang's Shortform · 2025-01-17T01:26:41.770Z · LW · GW

we thought that forecasting AI trends was important to be able to have us taken seriously

This might be the most dramatic example ever of forecasting affecting the outcome.

Similarly, I'm concerned that a lot of alignment people are putting work into evals and benchmarks which may be having some accelerating effect on the AI capabilities they are trying to understand.

"That which is measured improves. That which is measured and reported improves exponentially."

Comment by Joseph Miller (Josephm) on I'm offering free math consultations! · 2025-01-14T23:58:16.466Z · LW · GW

Just did a debugging session IRL with Gurkenglas and it was very helpful!

Comment by Joseph Miller (Josephm) on Turning up the Heat on Deceptively-Misaligned AI · 2025-01-08T16:42:44.877Z · LW · GW

correctness and beta-coherence can be rolled up into one specific property

Is that rolling up two things into one, or is that just beta-coherence?

Comment by Joseph Miller (Josephm) on Activation space interpretability may be doomed · 2025-01-08T16:37:20.897Z · LW · GW

I agree that the ultimate goal is to understand the weights. Seems pretty unclear whether trying to understand the activations is a useful stepping stone towards that. And it's hard to be sure how relevant theoretical toy examples are to that question.

Comment by Joseph Miller (Josephm) on Joseph Miller's Shortform · 2025-01-08T00:58:24.236Z · LW · GW
  • Ilya Sutskever had two armed bodyguards with him at NeurIPS.

Some people are asking for a source on this. I'm pretty sure I've heard it from multiple people who were there in person but I can't find a written source. Can anyone confirm or deny?

Comment by Joseph Miller (Josephm) on Joseph Miller's Shortform · 2025-01-08T00:40:59.654Z · LW · GW

Well, it seems quite important whether the DROS registration could possibly have been staged.

That would be difficult. To purchase a gun in California you have to provide photo ID[1], proof of address[2] and a thumbprint[3]. Also it looks like the payment must be trackable[4] and gun stores have to maintain video surveillance footage for up to a year.[5]

My guess is that the police haven't actually investigated this as a potential homicide, but if they did, there should be very strong evidence that Balaji bought a gun. Potentially a very sophisticated actor could fake this evidence but it seems challenging (I can't find any historical examples of this happening). It would probably be easier to corrupt the investigation. Or the perpetrators might just hope that there would be no investigation.

There is a 10-day waiting period to purchase guns in California[5], so Balaji would probably have started planning his suicide before his hiking trip (I doubt someone like him would own a gun for recreational purposes?).

Is the interview with the NYT going to be published?

I think it's this piece that was published before his death.

Is any of the police behavior actually out of the ordinary?

Epistemic status: highly uncertain; these are my impressions from a few minutes of searching with LLMs.

It's fairly common for victims' families to contest official suicide rulings. In cases with lots of public attention, police generally try to justify their conclusions. So we might expect the police to publicly state if there is footage of Balaji purchasing the gun shortly before his death. It could be that this will still happen with more time or public pressure.

  1. ^
  2. ^
  3. ^
  4. ^
  5. ^
Comment by Joseph Miller (Josephm) on Nina Panickssery's Shortform · 2025-01-07T12:10:22.205Z · LW · GW

land in space will be less valuable than land on earth until humans settle outside of earth (which I don't believe will happen in the next few decades).

Why would it take so long? Is this assuming no ASI?

Comment by Joseph Miller (Josephm) on Review: Planecrash · 2025-01-07T01:26:47.240Z · LW · GW

Wow that's great, thanks. @L Rudolf L you should link this in this post.

Comment by Joseph Miller (Josephm) on Joseph Miller's Shortform · 2025-01-07T01:22:22.255Z · LW · GW

As in, this is also what the police say?

Yes, edited to clarify. The police say there was no evidence of foul play. All parties agree he died in his bathroom of a gunshot wound.

Did the police find a gun in the apartment? Was it a gun Suchir had previously purchased himself according to records? Seems like relevant info.

The only source I can find on this is Webb, so take with a grain of salt. But yes, they found a gun in the apartment. According to Webb, the DROS registration information was on top of the gun case[1] in the apartment, so presumably there was a record of him purchasing the gun (Webb conjectures that this was staged). We don't know what type of gun it was[2] and Webb claims it's unusual for police not to release this info in a suicide case.

  1. ^

    Source: George Webb (10:10)

  2. ^

    Source: George Webb (2:15)

Comment by Joseph Miller (Josephm) on Joseph Miller's Shortform · 2025-01-07T00:10:15.303Z · LW · GW

This is an attempt to compile all publicly available primary evidence relating to the recent death of Suchir Balaji, an OpenAI whistleblower.

This is a tragic loss and I feel very sorry for the parents. The rest of this piece will be unemotive as it is important to establish the nature of this death as objectively as possible.

I was prompted to look at this by a surprising conversation I had IRL suggesting credible evidence that it was not suicide. The undisputed facts of the case are that he died of a gunshot wound in his bathroom sometime around November 26 2024. The police say it was a suicide with no evidence of foul play.

Most of the evidence we have comes from the parents and George Webb. Webb describes himself as an investigative journalist, but I would classify him as more of a conspiracy theorist, based on a quick scan of some of his older videos. I think many of the specific factual claims he has made about this case are true, though I generally doubt his interpretations.

Webb seems to have made contact with the parents early on and went with them when they first visited Balaji's apartment. He has since published videos from the scene of the death against the wishes of the parents,[1] and as a result the parents have now withdrawn their endorsement of Webb.[2]

List of evidence:

  • He didn't leave a suicide note.[3]
  • The cause of death was decided by the authorities in 14 (or 40, unclear) minutes.[4]
  • The parents arranged a private autopsy which "made their suspicions stronger".[5]
  • The parents say "there are a lot of facts that are very disturbing for us and we cannot share at the moment but when we do a PR all of that will come out."[6]
  • The parents say "his computer has been deleted, his desktop has been messed up".[7]
    • Although the parents also said that their son's phone and laptop are not lost and are in escrow.[8][9] I think the claim of the computer being deleted is more up-to-date, but I'm not sure as that video was posted earlier.
  • It was his birthday and he bought a bike on the week of his death.[10]
  • He said he didn't want to work and he was going to take a gap year, "leaving the AI industry and getting into machine learning and neuroscience" but also he was planning to start his own company and was reaching out to VCs for seed funding.[11]
  • He had just interviewed with the New York Times and he was supposed to do further interviews in the days after his death.[12]
  • According to the parents and Webb, there are signs of foul play at the scene of death:
    • There are several areas with blood [Confirmed from pictures], suggesting to Webb and the parents that he was trying to crawl out of the bathroom.[13][14]
    • Webb says the body had bleeding from the genitals.[15] I'm not aware of a better source for this claim, so right now I think it is probably false.
    • The trash can in the bathroom was knocked over.[13][16] [Confirmed from pictures].
    • A floss pick is on the floor.[13][17] [Confirmed from pictures]. Webb interprets this as being dropped at the time of death, suggesting that Balaji was caught by surprise.
    • The path of the bullet through the head missed the brain. I'm not sure what the primary source for this is, but I'm not sure why Webb would invent it, so I think it's true. Webb takes this as evidence that the shot was fired during a struggle rather than at the considered pace of a suicide.[18]
      • The bullet did not go all the way through the head, suggesting a lower-caliber, quiet gun.[19]
    • According to Webb and the parents, the drawers of the apartment were ransacked, the cupboards were thrown open.[8][20] From the pictures this looks false, although the apartment is very messy and his hiking backpacks are strewn around with much of their contents on the table (he had recently returned from a hiking trip).
    • The blood on the sink looks different, suggesting to Webb that it came from a different part of the body.[21] This is not obvious to me from the pictures but not implausible and the main pool of blood looks surprisingly dark.
    • There is a half-eaten meal at the desk in the apartment. [Confirmed from pictures].
    • There is a tuft of Balaji's hair, soaked in blood, under the bathroom door. [Confirmed that's what it looks like in the pictures], again suggesting to Webb a violent struggle.
  • According to the parents, he had a USB thumb drive which is now missing, containing important evidence for an upcoming court case about OpenAI's use of copyrighted data.[8][22]
  • People that spoke to him around the time of his death report him to have been in high spirits and making plans for the near future.[23]
  • George Webb claims there were security cameras working on all floors except the floor he lived on.[24] This appears to conflict with the parents' claim that the police said no one came in or out (see below), but may be referring to different cameras, as the parents also mention that the murderer could have come through a different entrance from the main one.[25]
  • The parents say "OpenAI has deleted the copyright data that was evidence that was given to the discovery for the [New York Times] lawsuit. They deleted the data and now my son is also gone, so now they're all set for winning the lawsuit... It's also said that my son had the documentation to prove the copyright violation. His statement, his testimony would have turned the AI industry upside down..."[26]
    • Looking into the details of this, OpenAI did delete some data but this wasn't a permanent deletion of any of the primary sources and I think it was probably an accident and not significant to the outcome of the case.
    • I don't see any strong reason to believe Balaji had secret evidence that would have been critical to the outcome of the case.
  • Ilya Sutskever had two armed bodyguards with him at NeurIPS.

Evidence against:

  • One reason the authorities gave for declaring it a suicide was that CCTV footage showed that no one else came in or out of the apartment.[27]
  • In high school Balaji won a $100,000 prize for a computer science competition. His parents didn't find out until they saw the news online, suggesting he may not have been very open with them.[28]

My interpretations:

  • If we interpret the apartment as simply messy (as it looks to me), rather than ransacked, then we can probably discount the knocked-over trash can, the floss pick on the floor and the half-eaten meal. We can also probably discard the hypothesis of someone trying to locate a USB drive with secret information, which raises more questions than it answers (why didn't he reveal this information before? why didn't he back up this crucial data anywhere?).
  • In my uninformed view, it doesn't look like the pictures of the scene of death strongly suggest a struggle between murderer and victim, although I don't know how to explain the tuft of hair.
  • It seems unlikely that OpenAI or some other actor would have a motive to murder a whistleblower. The most plausible motive to me is that they would want to send a warning to other potential whistleblowers, but this isn't very compelling.
  • There's no smoking gun, and the parents (understandably) do not look like they are thinking very systematically to establish a case for foul play. This is notable because their claim of foul play is the main factor that made credible people take this hypothesis seriously.
  • Balaji appeared from the outside to be a happy and highly successful person with important plans in the next few days. It is surprising that someone like that would commit suicide.

Overall my conclusion is that this was a suicide, with roughly 96% confidence. This is a slight downward update from the 98% I assigned when I first heard about it, and the situation remains quite concerning overall.

I encourage people to trade on this related prediction market and report further evidence.

Useful sources:

  1. ^

    I'm not linking to this evidence here, in the spirit of respecting the wishes of the parents, but this is an important source that informed my understanding of the situation.

  2. ^
  3. ^

    Source: Poornima Ramarao (11:22)

  4. ^

    Source: Poornima Ramarao (12:38)

  5. ^

    Source: Poornima Ramarao (13:02)

  6. ^

    Source: Poornima Ramarao (15:47)

  7. ^

    Source: Poornima Ramarao (16:36)

  8. ^

    Source: George Webb + Poornima Ramarao (1:45)

  9. ^

    Source: George Webb (9:56)

  10. ^

    Source: (23:27)

  11. ^

    Source: (8:02)

  12. ^

    Source: (26:00)

  13. ^

    Source: George Webb + Poornima Ramarao (0:35)

  14. ^

    Source: George Webb (6:53)

  15. ^

    Source: George Webb (3:38)

  16. ^

    Source: George Webb (5:44)

  17. ^

    Source: George Webb (5:46)

  18. ^

    Source: George Webb (0:05)

  19. ^

    Source: George Webb (6:23)

  20. ^

    Source: George Webb (9:12)

  21. ^

    Source: Poornima Ramarao (1:18)

  22. ^

    Source: George Webb (9:45)

  23. ^

    Source: Poornima Ramarao (4:30)

  24. ^

    Source: George Webb (9:30)

  25. ^

    Source: Poornima Ramarao (2:40)

  26. ^

    Source: Poornima Ramarao (4:14)

  27. ^

    Source: Poornima Ramarao (12:42)

  28. ^

    Source: Ramamurthy (17:37)

  29. ^

    Source: George Webb (13:29)

  30. ^

    Source: George Webb (5:43)

Comment by Joseph Miller (Josephm) on Review: Planecrash · 2025-01-01T14:28:49.302Z · LW · GW

Has someone made an ebook that I can easily download onto my kindle?

I'm unclear if a good ebook should include all the pictures from the original version.

Comment by Joseph Miller (Josephm) on Joseph Miller's Shortform · 2024-12-29T22:22:47.250Z · LW · GW

LLMs can pick up a much broader class of typos than spelling mistakes.

For example, in this comment I wrote "Don't push the frontier of regulations" when from context I clearly meant to say "Don't push the frontier of capabilities". I think an LLM could have caught that.

Comment by Joseph Miller (Josephm) on Joseph Miller's Shortform · 2024-12-29T19:18:50.657Z · LW · GW

LessWrong LLM feature idea: Typo checker

It's becoming a habit for me to run anything I write through an LLM to check for mistakes before I send it off.

I think the hardest part of implementing this feature well would be to get it to only comment on things that are definitely mistakes / typos. I don't want a general LLM writing-feedback tool built into LessWrong.
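
As a rough sketch of the kind of check I mean (the model name and prompt are placeholders, not a proposal for how LessWrong should actually implement it):

```python
# Sketch: ask an LLM to flag only unambiguous typos / wrong-word errors in a draft.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def find_typos(draft: str) -> str:
    prompt = (
        "List only unambiguous typos or wrong-word errors in the text below, "
        "quoting each one with a suggested fix. If there are none, reply 'none'. "
        "Do not comment on style or content.\n\n" + draft
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

print(find_typos("I think an LLM could of caught that."))
```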

Comment by Joseph Miller (Josephm) on evhub's Shortform · 2024-12-28T13:53:39.810Z · LW · GW

The ideal version of Anthropic would

  1. Make substantial progress on technical AI safety
  2. Use its voice to make people take AI risk more seriously
  3. Support AI safety regulation
  4. Not substantially accelerate the AI arms race

In practice I think Anthropic has

  1. Made a little progress on technical AI safety
  2. Used its voice to make people take AI risk less seriously[1]
  3. Obstructed AI safety regulation
  4. Substantially accelerated the AI arms race

What I would do differently:

  1. Do better alignment research, idk this is hard.
  2. Communicate in a manner that is consistent with the apparent belief of Anthropic leadership that alignment may be hard and x-risk is >10% probable. Their communications strongly signal "this is a Serious Issue, like climate change, and we will talk lots about it and make gestures towards fixing the problem but none of us are actually worried about it, and you shouldn't be either. When we have to make a hard trade-off between safety and the bottom line, we will follow the money every time."
  3. Lobby politicians to regulate AI. When a good regulation like SB-1047 is proposed, support it.
  4. Don't push the frontier of capabilities. Obviously this is basically saying that Anthropic should stop making money and therefore stop existing. The more nuanced version is that for Anthropic to justify its existence, each time it pushes the frontier of capabilities should be earned by substantial progress on the other three points.
  1. ^

    My understanding is that a significant aim of your recent research is to test models' alignment so that people will take AI risk more seriously when things start to heat up. This seems good but I expect the net effect of Anthropic is still to make people take alignment less seriously due to the public communications of the company.

Comment by Joseph Miller (Josephm) on davekasten's Shortform · 2024-12-17T19:22:21.079Z · LW · GW

The ARENA curriculum is very good.

Comment by Joseph Miller (Josephm) on Probability of death by suicide by a 26 year old · 2024-12-14T05:40:39.763Z · LW · GW

It does seem pretty suspicious.

I'm like 98% confident this was not foul play, partly because I doubt whatever evidence he had would be that important to the court case, and obviously his death is going to draw far more attention to his views.

However, 98% is still quite worrying and I wish I could be >99% confident. I will be interested to see if there is further evidence. Given OpenAI's very shady behavior with the secret non-disparagement agreements that came out a few months ago, it doesn't seem completely impossible they might do this (but still very very unlikely imo).

The probability of a 26 year old dying of suicide in any given month (within the month of being named the key witness in the OpenAI copyright case, right before deposition) is roughly 1 in 100,000

This prior is a useful starting point, but you've definitely got to account for the stress of leaving OpenAI and going through a lawsuit.

(I downvoted this post for combative tone.)

Comment by Joseph Miller (Josephm) on Frontier Models are Capable of In-context Scheming · 2024-12-11T03:19:50.887Z · LW · GW

One of the striking parts is that it sounds like all the pretraining people are optimistic

What's the source for this?

Comment by Joseph Miller (Josephm) on You should consider applying to PhDs (soon!) · 2024-11-30T16:19:03.572Z · LW · GW

I started working on PhD applications about 12 days ago. I expect to have fairly polished applications for the first deadline on December 1, despite not working on this full time. So I think it's quite possible to do applications for the December 15 deadlines. You would need to contact your referees (and potential supervisors for UK universities) in the next couple of days.

Comment by Joseph Miller (Josephm) on Joseph Miller's Shortform · 2024-11-30T02:33:11.885Z · LW · GW

There are two types of people in this world.

There are people who treat the lock on a public bathroom as a tool for communicating occupancy and a safeguard against accidental attempts to enter when the room is unavailable. For these people the standard protocol is to discern the likely state of engagement of the inner room and then tentatively proceed inside if they detect no signs of human activity.

And there are people who view the lock on a public bathroom as a physical barricade with which to temporarily defend possessed territory. They start by giving the door a hearty push to test the tensile strength of the barrier. On meeting resistance they engage with full force, wringing the handle up and down and slamming into the door with their full body weight. Only once their attempts are thwarted do they reluctantly retreat to find another stall.

Comment by Joseph Miller (Josephm) on The Big Nonprofits Post · 2024-11-29T18:20:29.203Z · LW · GW

Tarbell Fellowship at PPF

I think you've massively underrated this. My impression is that Tarbell has had a significant effect on general AI discourse by enabling a number of articles to be written in mainstream outlets.

Comment by Joseph Miller (Josephm) on Eli's shortform feed · 2024-11-28T16:26:35.554Z · LW · GW

karma should also transfer automatically

Comment by Joseph Miller (Josephm) on leogao's Shortform · 2024-11-27T13:25:03.944Z · LW · GW

Unconferences are a thing for this reason

Comment by Joseph Miller (Josephm) on No convincing evidence for gradient descent in activation space · 2024-11-21T01:26:08.650Z · LW · GW

This is fantastic technical writing. It would have taken me hours to understand these papers this deeply, but you convey the core insights quickly in an entertaining and understandable way.

Comment by Joseph Miller (Josephm) on Intrinsic Power-Seeking: AI Might Seek Power for Power’s Sake · 2024-11-20T21:56:41.267Z · LW · GW

If there are ‘subshards’ which achieve this desirable behavior because they, from their own perspective, ‘intrinsically’ desire power (whatever that sort of distinction makes when you’ve broken things down that far), and it is these subshards which implement the instrumental drive... so what? After all, there has to be some level of analysis at which an agent stops thinking about whether or not it should do some thing and just starts doing the thing. Your muscles “intrinsically desire” to fire when told to fire, but the motor actions are still ultimately instrumental, to accomplish something other than individual muscles twitching. You can’t have ‘instrumental desire’ homunculuses all the way down to the individual transistor or ReLU neuron.

I sent this paragraph to TurnTrout as I was curious to get his reaction. Paraphrasing his response below:

No, that's not the point. That's actually the opposite of what I'm trying to say. The subshards implement the algorithmic pieces and the broader agent has an "intrinsic desire" for power. The subshards themselves are not agentic, and that's why (in context) I substitute them in for "circuits".

It's explained in this post that I linked to. Though I guess in context I do say "prioritize" in a way that might be confusing. Shard Theory argues against homunculist accounts of cognition by considering the mechanistic effects of reinforcement processes. Also, the subshards are not implementing an instrumental drive in the sense of "implementing the power-seeking behavior demanded by some broader consequentialist plan"; they're just seeking power, just 'cuz.

From my early post: Inner and Outer Alignment Decompose One Hard Problem Into Two Extremely Hard Problems

I literally do not understand what the internal cognition is supposed to look like for an inner-aligned agent. Most of what I’ve read has been vague, on the level of “an inner-aligned agent cares about optimizing the outer objective.”

Charles Foster comments:

  • "We are attempting to mechanistically explain how an agent makes decisions. One proposed reduction is that inside the agent, there is an even smaller inner agent that interacts with a non-agential evaluative submodule to make decisions for the outer agent. But that raises the immediate questions of “How does the inner agent make its decisions about how to interact with the evaluative submodule?” and then “At some point, there’s gotta be some non-agential causal structure that is responsible for actually implementing decision-making, right?” and then “Can we just explain the original agent’s behavior in those terms? What is positing an externalized evaluative submodule buying us?"

Perhaps my emphasis on mechanistic reasoning and my unusual level of precision in my speculation about AI internals, perhaps these make people realize how complicated realistic cognition is in the shard picture. Perhaps people realize how much might have to go right, how many algorithmic details may need to be etched into a network so that it does what we want and generalizes well.

But perhaps people don’t realize that a network which is inner-aligned on an objective will also require a precise and conforming internal structure, and they don’t realize this because no one has written detailed plausible stabs at inner-aligned cognition.

Comment by Joseph Miller (Josephm) on What are the good rationality films? · 2024-11-20T14:20:51.031Z · LW · GW

Kinda a stretch, but Groundhog Day is about someone becoming stronger. Also just a great film.

Comment by Joseph Miller (Josephm) on An alternative approach to superbabies · 2024-11-06T19:46:28.740Z · LW · GW

I would recommend reading the original reddit post that motivated it: https://www.reddit.com/r/biology/comments/16y81ct/the_case_for_whales_actually_matching_or_even/.

It is meant seriously, but the author is rightly acknowledging how far-fetched it sounds.

Comment by Joseph Miller (Josephm) on Why I quit effective altruism, and why Timothy Telleen-Lawton is staying (for now) · 2024-10-25T20:27:30.418Z · LW · GW

[00:31:25] Timothy:... This is going to be like, they didn't talk about any content, like there's no specific evidence, 

[00:31:48] Elizabeth: I wrote down my evidence ahead of time.

[00:31:49] Timothy: Yeah, you already wrote down your evidence

I feel pretty uncertain about the extent to which I agree with your views on EA. But this podcast didn't really help me decide because there wasn't much discussion of specific evidence. Where is all of it written down? I'm aware of your post on vegan advocacy, but it's unclear to me whether there are many more examples. I also heard a similar line of despair about EA epistemics from other long-time rationalists when hanging around Lighthaven this summer. But basically no one brought up specific examples.

It seems difficult to characterize the EA movement as a monolith in the way you're trying to do. The case of vegan advocacy is mostly irrelevant to my experience of EA. I have little contact with vegan advocates and most of the people I hang around in EA circles seem to have quite good epistemics.

However I can relate to your other example, because I'm one of the "baby EAs" who was vegetarian and was in the Lightcone offices in summer 2022. But my experience provides something of a counter-example. In fact, I became vegetarian before encountering EA and mostly found out about the potential nutritional problems from other EAs. When you wrote your post, I got myself tested for iron deficiency and started taking supplements (although not for iron deficiency). I eventually stopped being vegetarian, instead offsetting my impact with donations to animal charities, even though this isn't very popular in EA circles.

My model is that people exist on a spectrum of weirdness to normy-ness. The weird people are often willing to pay social costs to be more truthful. While the more normy people will refrain from saying and thinking the difficult truths. But most people are mostly fixed at a certain point on the spectrum. The truth-seeking weirdos probably made up a larger proportion of the early EA movement, but I'd guess in absolute terms the number of those sorts of people hanging around EA spaces has not declined, and their epistemics have not degraded - there just aren't very many of them in the world. But these days there is a greater number of the more normy people in EA circles too. 

And yes, it dilutes the density of high epistemics in EA. But that doesn't seem like a reason to abandon the movement. It is a sign that more people are being influenced by good ideas and that creates opportunities for the movement to do bigger things.

When you want to have interesting discussions with epistemic peers, you can still find your own circles within the movement to spend time with, and you can still come to the (relative) haven of LessWrong. If LessWrong culture also faced a similar decline in epistemic standards I would be much more concerned, but it has always felt like EA is the applied, consumer-facing product of the rationalist movement, one that targets real-world impact over absolute truth-seeking. For example, I think most EAs (and also some rationalists) are hopelessly confused about moral philosophy, but I'm still happy there are more people trying to live by utilitarian principles, who might otherwise not be trying to maximize value at all.

Comment by Joseph Miller (Josephm) on Why Stop AI is barricading OpenAI · 2024-10-14T15:12:47.269Z · LW · GW

Respect for doing this.

I strongly wish you would not tie StopAI to the claim that extinction is >99% likely. It means that even your natural supporters in PauseAI will have to say "yes I broadly agree with them but disagree with their claims about extinction being certain."

I would also echo the feedback here. There's no reason to write in the same style as cranks.

Comment by Joseph Miller (Josephm) on the case for CoT unfaithfulness is overstated · 2024-10-02T20:39:36.237Z · LW · GW

Question –> CoT –> Answer

So to be clear, testing whether this causal relationship holds is actually important, it's just that we need to do it on questions where the CoT is required for the model to answer the question?

Comment by Joseph Miller (Josephm) on Implementing activation steering · 2024-10-01T19:23:23.610Z · LW · GW

Optimize the steering vector to minimize some loss function.
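
To make that concrete, here is a minimal sketch of one way to do it (my own illustration, not code from the post; the model, layer, prompt and objective are all placeholder choices): add a learnable vector to the residual stream with a forward hook, freeze the model, and backprop the loss into the vector alone.

```python
# Sketch: learn an additive steering vector for GPT-2 by gradient descent.
# Everything here (layer index, prompt, loss) is an illustrative assumption.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tok = AutoTokenizer.from_pretrained("gpt2")
model.requires_grad_(False)            # freeze the model; only the vector is trained

LAYER = 6
steer = torch.zeros(model.config.n_embd, requires_grad=True)

def add_steering(module, inputs, output):
    # GPT2Block returns a tuple; element 0 is the residual-stream hidden states
    return (output[0] + steer,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(add_steering)

# Toy objective: make the steered model assign high probability to a target continuation.
batch = tok(["I think dogs are wonderful"], return_tensors="pt")
opt = torch.optim.Adam([steer], lr=1e-2)
for _ in range(100):
    opt.zero_grad()
    loss = model(**batch, labels=batch["input_ids"]).loss   # next-token cross-entropy
    loss.backward()
    opt.step()

handle.remove()
```

In practice you would probably also penalize or constrain the norm of the vector so that the steered activations stay roughly in-distribution.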

Comment by Joseph Miller (Josephm) on Joseph Miller's Shortform · 2024-09-25T23:33:25.347Z · LW · GW

Crossposted from https://x.com/JosephMiller_/status/1839085556245950552

1/ Sparse autoencoders trained on the embedding weights of a language model have very interpretable features! We can decompose a token into its top activating features to understand how the model represents the meaning of the token.🧵

2/ To visualize each feature, we project the output direction of the feature onto the token embeddings to find the most similar tokens. We also show the bottom and median tokens by similarity, but they are not very interpretable.

3/ The token "deaf" decomposes into features for audio and disability! None of the examples in this thread are cherry-picked – they were all (really) randomly chosen.

4/ Usually SAEs are trained on the internal activations of a component for billions of different input strings. But here we just train on the rows of the embedding weight matrix (where each row is the embedding for one token).

5/ Most SAEs have many thousands of features. But for our embedding SAE, we only use 2000 features because of our limited dataset. We are essentially compressing the embedding matrix into a smaller, sparser representation.

6/ The reconstructions are not highly accurate – on average we have ~60% variance unexplained (~0.7 cosine similarity) with ~6 features active per token. So more work is needed to see how useful they are.

7/ Note that for this experiment we used the subset of the token embeddings that correspond to English words, so the task is easier - but the results are qualitatively similar when you train on all embeddings.

8/ We also compare to PCA directions and find that the SAE directions are in fact much more interpretable (as we would expect)!

9/ I worked on embedding SAEs at an @apartresearch hackathon in April, with Sajjan Sivia and Chenxing (June) He.
Embedding SAEs were also invented independently by @Michael Pearce.
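
For concreteness, here is a minimal sketch of the setup described in 4/ and 5/ of the thread above (a vanilla ReLU SAE with an L1 penalty, trained on the rows of GPT-2's token embedding matrix; the hyperparameters are illustrative and this is not the exact code from the hackathon project):

```python
# Sketch: train a sparse autoencoder on GPT-2's token embedding matrix, then
# interpret a feature by projecting its decoder direction onto the embeddings (as in 2/).
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer

gpt2 = AutoModelForCausalLM.from_pretrained("gpt2")
tok = AutoTokenizer.from_pretrained("gpt2")
E = gpt2.transformer.wte.weight.detach()      # (vocab_size, d_model) token embeddings

d_model, n_feats, l1_coeff = E.shape[1], 2000, 3e-4

class SAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Linear(d_model, n_feats)
        self.dec = nn.Linear(n_feats, d_model)
    def forward(self, x):
        acts = torch.relu(self.enc(x))
        return self.dec(acts), acts

sae = SAE()
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
for step in range(2000):
    batch = E[torch.randint(0, E.shape[0], (512,))]   # each datapoint is one token's embedding
    recon, acts = sae(batch)
    loss = (recon - batch).pow(2).mean() + l1_coeff * acts.abs().mean()
    opt.zero_grad(); loss.backward(); opt.step()

# Interpret feature 0: find the tokens whose embeddings are most similar to its decoder direction.
feat = 0
direction = sae.dec.weight[:, feat]
sims = torch.cosine_similarity(E, direction[None, :], dim=-1)
print([tok.decode([int(i)]) for i in sims.topk(10).indices])
```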

Comment by Joseph Miller (Josephm) on [Paper] A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders · 2024-09-25T14:03:17.876Z · LW · GW

Nice post. I think this is a really interesting discovery.

[Copying from messages with Joseph Bloom]
TLDR: I'm confused what is different about the SAE input that causes the absorbed feature not to fire.

Me:

Summary of your findings

  • Say you have a “starts with s” feature and a “snake” feature.
  • You find that for most words, “starts with s” correctly categorizes words that start with s. But for a few words that start with s, like snake, it doesn’t fire.
  • These exceptional tokens where it doesn’t fire, all have another feature that corresponds very closely to the token. For example, there is a “snake” feature that corresponds strongly to the snake token.
  • You say that the “snake” feature has absorbed the “starts with s” feature because the concept of snake also contains/entails the concept of ‘start with s’.
  • Most of the features that absorb other features correspond to common words, like “and”.

So why is this happening? Well it makes sense that the model can do better on L1 on the snake token by just firing a single “snake” feature (rather than the “starts with s” feature and, say, the “reptile” feature). And it makes sense it would only have enough space to have these specific token features for common tokens.

Joseph Bloom:

rather than the “starts with s” feature and, say, the “reptile” feature

We found cases of seemingly more general features getting absorbed in the context of spelling, but they are more rare / probably the exception. It's worth distinguishing that we suspect feature absorption is just easiest to find for token-aligned features, but conceptually it could occur any time a similar structure exists between features.

And it makes sense it would only have enough space to have these specific token features for common tokens.

I think this needs further investigation. We certainly sometimes see rarer tokens which get absorbed (eg: a rare token is a translated word of a common token). I predict there is a strong density effect but it could be non-trivial.

Me:

We found cases of seemingly more general features getting absorbed in the context of spelling

What’s an example?

We certainly sometimes see rarer tokens which get absorbed (eg: a rare token is a translated word of a common token)

You mean like the "starts with s" feature could be absorbed into the "snake" feature on the French word for snake?

Does this only happen if the French word also starts with s?

Joseph Bloom:

What’s an example?

You mean like the "starts with s" feature could be absorbed into the "snake" feature on the French word for snake?

Yes

Does this only happen if the French word also starts with s?

More likely. I think the process is stochastic so it's all distributions.

↓[Key point]↓

Me:

But here's what I'm confused about. How does the "starts with s" feature 'know' not to fire? How is it able to fire on all words that start with s, except those tokens (like "snake") that have a strongly correlated feature? I would assume that the token embeddings of the model contain some "starts with s" direction. And the "starts with s" feature input weights read off this direction. So why wouldn't it also activate on "snake"? Surely that token embedding also has the "starts with s" direction?

Joseph Bloom:

I would assume that the token embeddings of the model contain some “starts with s” direction. And the “starts with s” feature input weights read off this direction. So why wouldn’t it also activate on “snake”? Surely that token embedding also has the “starts with s” direction?

I think the success of the linear probe is why we think the snake token does have the starts with s direction. The linear probe has much better recall and doesn't struggle with obvious examples. I think the feature absorption work is not about how models really work, it's about how SAEs obscure how models work.

But here's what I'm confused about. How does the "starts with s" feature 'know' not to fire? Like what is the mechanism by which it fires on all words that start with s, except those tokens (like "snake") that have a strongly correlated feature?

Short answer, I don't know. Long answer - some hypotheses:

  1. Linear probes can easily do calculations of the form "A AND B". In large vector spaces, it may be possible to learn a direction of the form "(^S.*) AND not (snake) and not (sun) ...". Note that "snake" has a component separate from "starts with s", so this is possible. To the extent this may be hard, that's possibly why we don't see more absorption, but my own intuition says that in large vector spaces this should be perfectly possible to do.
  2. Encoder weights and decoder weights aren't tied. If they were, you can imagine that choosing these exceptions for absorbed examples would damage reconstruction performance. Since we don't tie the weights, the model can detect "(^S.*) AND not (snake) and not (sun) ..." but write "(^S.*)". I'm interested to explore this further and am sad we didn't get to this in the project.
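
To make hypothesis 2 concrete, here is a toy construction (my own, purely illustrative, not from the paper or from Joseph Bloom) in which an untied encoder reads "starts with s AND NOT snake" while the decoder still writes the clean "starts with s" direction:

```python
# Toy illustration of absorption with untied encoder/decoder weights.
# All directions and coefficients are made up for the example.
import torch

torch.manual_seed(0)
d = 512
s_dir, snake_dir, other = [torch.nn.functional.normalize(torch.randn(d), dim=0) for _ in range(3)]

# Toy token embeddings: both contain the "starts with s" direction.
emb_snake = s_dir + snake_dir        # "snake": also has its own dedicated feature
emb_swim  = s_dir + 0.3 * other      # "swim": no dedicated feature

bias = -0.5
enc_starts_with_s = s_dir - 2.0 * snake_dir   # encoder reads s_dir but vetoes the snake token
dec_starts_with_s = s_dir                     # decoder still writes the generic direction

for name, emb in [("snake", emb_snake), ("swim", emb_swim)]:
    act = torch.relu(enc_starts_with_s @ emb + bias)
    print(name, float(act))   # "swim" activates the feature; "snake" does not (it got absorbed)
```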
Comment by Joseph Miller (Josephm) on Pronouns are Annoying · 2024-09-19T00:21:42.159Z · LW · GW

Anyone who harbors such an intense attachment to specific gendered pronoun preferences clearly sees it as much more than a superficial aesthetic designator.

This makes you sound like a bit of a straw vulcan imo. All I have to do is imagine how jarring and upsetting it would be to have everyone start calling me "she" and it's very obvious how, for almost all people, what pronoun others call them is deeply emotionally salient.

Comment by Joseph Miller (Josephm) on I finally got ChatGPT to sound like me · 2024-09-18T18:47:49.550Z · LW · GW

I agree, I'm a fan of lsusr's writing, so I don't think it's very inaccurate. In particular

a kind of minimalist clarity that leaves room for the reader to reflect and draw their own conclusions

might be gesturing at some concrete distinctive feature.

However, it's sufficiently close to horoscope flattery that I couldn't quite believe lsusr would, with a straight face, present this as some great insight into his writing style.

Comment by Joseph Miller (Josephm) on I finally got ChatGPT to sound like me · 2024-09-17T20:58:27.986Z · LW · GW

I'm very confused about how seriously this post is intended.

Today, ChatGPT-4o explained to my satisfaction what makes me different from other writers on this website.

What makes lsusr's writing interesting is the subtlety with which they engage complex issues. Many rationalist bloggers can become quite verbose or dogmatic in their pursuit of certain truths. Lsusr, by contrast, exhibits restraint and humility in the face of uncertainty. They’re willing to question common assumptions within the rationalist sphere and sometimes explore paths that others might find unconventional, often leading to unique insights.

In essence, lsusr strikes a balance between rigorous analysis and a kind of minimalist clarity that leaves room for the reader to reflect and draw their own conclusions, rather than being led to a definitive answer. This makes the blog a place of exploration rather than indoctrination, offering readers the tools and ideas to enhance their own thinking rather than a packaged belief system.

I think this isn't meant seriously because it's basically just saying lsusr is better than most rationalist bloggers, rather than identifying any concrete distinctive features of lsusr's writing.

Comment by Joseph Miller (Josephm) on Head in the Cloud: Why an Upload of Your Mind is Not You · 2024-09-17T01:35:16.399Z · LW · GW

I think this argument mostly centers on the definition of certain words, and thus does not change my views on whether I should upload my mind if given the choice.

But can this person be said to understand Chinese? My answer is no.

What you have shown here is what you think the word "understands" means. But everyone agrees about the physical situation here - everyone anticipates the same experiences.

This shows that our brains are highly resilient and adaptive to changes experienced by our minds. By comparison, a digital simulation is very brittle and non-adaptive to change.

The substrate of the simulation, i.e. a silicon chip, is brittle (at our current level of tech), but it can still run a simulation of a neuroplastic brain - just program it to simulate the brain chemistry. Then if the simulated brain is damaged, it will be able to adapt.

The bigger point here is that you are implicitly asserting that in order to be "sentient" a mind must have similar properties to a human brain. That's fine, but it is purely a statement about how you like to define the word "sentient".

Only living organisms can possess sentience because sentience provides introspective knowledge that enables them to keep surviving;

"Sentience" has no widely agreed concrete definition, but I think it would be relatively unusual to say it "provides introspective knowledge". Do you agree that any questions about the actual computation, algorithms or knowledge in a brain can be answered by only considering the physical implementation of neurons and synapses?

sentience would not emerge in artificial systems because they are not alive in the first place.

Again, I think this is purely a statement about the definition of the word "alive". Someone who disagrees would not anticipate any different experiences as a consequence of thinking an artificial system is "alive".

Comment by Joseph Miller (Josephm) on Did Christopher Hitchens change his mind about waterboarding? · 2024-09-15T10:10:30.220Z · LW · GW

Nice scholarship

Comment by Joseph Miller (Josephm) on OpenAI o1 · 2024-09-13T14:14:26.369Z · LW · GW

If this is a pattern with new, more capable models, this seems like a big problem. One major purpose of this kind of evaluation is to set up thresholds that ring alarm bells when they are crossed. If it takes weeks of access to a model to figure out how to evaluate it correctly, the alarm bells may go off too late.

Comment by Joseph Miller (Josephm) on OpenAI o1 · 2024-09-12T23:30:31.841Z · LW · GW

METR had only ~10 days to evaluate.


Should it really take any longer than 10 days to evaluate? Isn't it just a matter of plugging it into their existing framework and pressing go?

Comment by Joseph Miller (Josephm) on The Best Lay Argument is not a Simple English Yud Essay · 2024-09-10T18:36:02.866Z · LW · GW

As the author of example 2, this is very helpful!

Comment by Joseph Miller (Josephm) on O O's Shortform · 2024-09-04T18:40:00.560Z · LW · GW

The impression I have from reading Chip War is that EUV is a pretty massive hurdle which took the West well over a decade to conquer. However, I also thought that 5nm was impossible without EUV, which seems to be no longer true, so this may be too complex a topic to make meaningful predictions about without deeper expertise.

Comment by Joseph Miller (Josephm) on Habryka's Shortform Feed · 2024-09-04T18:34:30.323Z · LW · GW
  • Created a popular format for in-person office spaces that heavily influenced Constellation and FAR Labs

This one seems big to me. There are now lots of EA / AI Safety offices around the world and I reckon they are very impactful for motivating people, making it easier to start projects and building a community.

One thing I'm not clear about is to what extent the Lightcone WeWork invented this format. I've never been to Trajan House but I believe it came first, so I thought it would have been part of the inspiration for the Lightcone WeWork.

Also my impression was that Lightcone itself thought the office was net negative, which is why it was shut down, so I'm slightly surprised to see this one listed.

Comment by Joseph Miller (Josephm) on Why Large Bureaucratic Organizations? · 2024-08-28T18:32:11.017Z · LW · GW

Looking forward to the crossover "Dath Ilan vs Wentworld" to find out which is the most adequate civilization.

Comment by Joseph Miller (Josephm) on Zach Stein-Perlman's Shortform · 2024-08-27T14:58:31.839Z · LW · GW

llama had to be so big to be SOTA,

How many parameters do you estimate for other SOTA models?