Posts

Detecting Strategic Deception Using Linear Probes 2025-02-06T15:46:53.024Z
Paper: Open Problems in Mechanistic Interpretability 2025-01-29T10:25:54.727Z
Activation space interpretability may be doomed 2025-01-08T12:49:38.421Z
Reasons for and against working on technical AI safety at a frontier AI lab 2025-01-05T14:49:53.529Z
Book Summary: Zero to One 2024-12-29T16:13:52.922Z
Remap your caps lock key 2024-12-15T14:03:33.623Z
You should consider applying to PhDs (soon!) 2024-11-29T20:33:12.462Z
bilalchughtai's Shortform 2024-07-29T18:57:51.169Z
Understanding Positional Features in Layer 0 SAEs 2024-07-29T09:36:40.701Z
Unlearning via RMU is mostly shallow 2024-07-23T16:07:52.223Z
Transformer Circuit Faithfulness Metrics Are Not Robust 2024-07-12T03:47:30.077Z
Me, Myself, and AI: the Situational Awareness Dataset (SAD) for LLMs 2024-07-08T22:24:38.441Z

Comments

Comment by bilalchughtai (beelal) on How do you deal w/ Super Stimuli? · 2025-01-15T12:12:25.960Z · LW · GW

As a general rule, I try and minimise my phone screen time and maximise my laptop screen time. I can do every "productive" task faster on a laptop than on my phone.

Here are some object-level things I do that I find helpful and haven't yet seen discussed.

  • Use a very minimalist app launcher on my phone, which makes searching for apps a conscious decision.
  • Use a greyscale filter on my phone (which is hard to turn off), as this makes doing most things on my phone harder.
  • Every time I get a notification I didn't need to get, I instantly disable it. This also generalizes to unsubscribing from emails I don't need to receive.
Comment by bilalchughtai (beelal) on Me, Myself, and AI: the Situational Awareness Dataset (SAD) for LLMs · 2025-01-13T12:59:34.770Z · LW · GW

What is the error message?

Comment by bilalchughtai (beelal) on Me, Myself, and AI: the Situational Awareness Dataset (SAD) for LLMs · 2025-01-12T20:33:09.400Z · LW · GW

Yep, this sounds interesting! My suggestion for anyone wanting to run this experiment would be to start with SAD-mini, a subset of SAD with the five most intuitive and simple tasks. It should be fairly easy to adapt our codebase to call the Goodfire API. Feel free to reach out to myself or @L Rudolf L if you want assistance or guidance.

Comment by bilalchughtai (beelal) on Activation space interpretability may be doomed · 2025-01-12T19:49:41.150Z · LW · GW

How do you know what "ideal behaviour" is after you steer or project out your feature? How would you differentiate between a feature with sufficiently high cosine sim to a "true model feature" and the "true model feature" itself? I agree you can get some signal on whether a feature is causal, but would argue this is not ambitious enough.

Comment by bilalchughtai (beelal) on Activation space interpretability may be doomed · 2025-01-12T19:45:52.190Z · LW · GW

Yes, that's right -- see footnote 10. We think that transcoders and crosscoders are directionally correct, in the sense that they leverage more of the model's functional structure via activations from several sites, but agree that their vanilla versions suffer similar problems to regular SAEs.

Comment by bilalchughtai (beelal) on Activation space interpretability may be doomed · 2025-01-12T19:24:51.640Z · LW · GW

Also related to the idea that the best linear SAE encoder is not the transpose of the decoder.
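
As a toy illustration of that point (my own sketch, not from the linked work, and ignoring the encoder bias and ReLU of a real SAE): when decoder directions are not orthogonal, reading features off with the decoder transpose picks up interference between them, whereas a least-squares encoder does not.

```python
import numpy as np

# Two unit-norm, correlated decoder (dictionary) directions in 3D.
d1 = np.array([1.0, 0.0, 0.0])
d2 = np.array([0.6, 0.8, 0.0])  # cosine similarity 0.6 with d1
D = np.stack([d1, d2], axis=1)  # decoder: activation = D @ features

# An activation in which only feature 1 is active.
x = D @ np.array([1.0, 0.0])

# "Encoder = decoder transpose" wrongly reads feature 2 as active.
print(D.T @ x)                 # roughly [1.0, 0.6]

# The least-squares encoder (pseudoinverse of D) recovers the true code.
print(np.linalg.pinv(D) @ x)   # roughly [1.0, 0.0]
```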

Comment by bilalchughtai (beelal) on AI Safety as a YC Startup · 2025-01-10T13:59:53.877Z · LW · GW

For another perspective on leveraging startups for improving the world, see this blog post by @benkuhn.

Comment by bilalchughtai (beelal) on bilalchughtai's Shortform · 2025-01-10T13:47:57.074Z · LW · GW

A LW feature that I would find helpful is an easy-to-access list of all links cited by a given post.
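
As a stopgap, here is a rough sketch of the kind of thing I mean; it naively collects every link on the rendered page rather than just those cited in the post body, and the URL is a placeholder.

```python
import requests
from bs4 import BeautifulSoup

def links_on_page(url: str) -> list[str]:
    """Return every hyperlink found on the page at `url`."""
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    return sorted({a["href"] for a in soup.find_all("a", href=True)})

# Placeholder URL; substitute the post you care about.
for link in links_on_page("https://www.lesswrong.com/posts/<post-id>/<slug>"):
    print(link)
```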

Comment by bilalchughtai (beelal) on Reasons for and against working on technical AI safety at a frontier AI lab · 2025-01-05T18:19:34.085Z · LW · GW

Agreed that this post presents the altruistic case.

I discuss both the money and status points in the "career capital" paragraph (though perhaps I should have factored them out).

Comment by bilalchughtai (beelal) on Policymakers don't have access to paywalled articles · 2025-01-05T14:54:27.478Z · LW · GW

your image of a man with a huge monitor doesn't quite scream "government policymaker" to me

Comment by bilalchughtai (beelal) on The Field of AI Alignment: A Postmortem, and What To Do About It · 2025-01-01T15:28:21.474Z · LW · GW

In fact, this mindset gave me burnout earlier this year.

I relate pretty strongly to this. I think almost all junior researchers are incentivised to 'paper grind' for longer than is correct. I do think there are pretty strong returns to having one good paper for credibility reasons; it signals that you are capable of doing AI safety research, and thus makes it easier to apply for subsequent opportunities.

Over the past 6 months I've dropped the paper-grind mindset and am much happier for it. Notably, were it not for short-term grants, where needing to visibly make progress is important, I would have made this update sooner. Another take I have is that if you have the flexibility to do so (e.g. by already having stable funding, perhaps via being a PhD student), front-loading learning seems good. See here for a related take by Rohin. Making progress on hard problems requires understanding things deeply, in a way that making progress on easy problems (e.g. ones you could complete during MATS) might not.

Comment by bilalchughtai (beelal) on bilalchughtai's Shortform · 2025-01-01T15:09:00.458Z · LW · GW

You might want to stop using the Honey extension. Here are some shady things they do, beyond the usual:

  1. Steal affiliate marketing revenue from influencers (whom they also often sponsor), by replacing the genuine affiliate referral cookie with their own affiliate referral cookie.
  2. Deceive customers by deliberately withholding the best coupon codes while claiming to have found the best codes on the internet; partner businesses control which coupon codes Honey shows consumers.
Comment by bilalchughtai (beelal) on (The) Lightcone is nothing without its people: LW + Lighthaven's big fundraiser · 2024-12-23T16:47:55.314Z · LW · GW

any update on this?

Comment by bilalchughtai (beelal) on You should consider applying to PhDs (soon!) · 2024-12-03T16:56:49.471Z · LW · GW

thanks! added to post

Comment by bilalchughtai (beelal) on You should consider applying to PhDs (soon!) · 2024-12-03T09:43:34.012Z · LW · GW

UC Berkeley has historically had the largest concentration of people thinking about AI existential safety. It's also closely coupled to the Bay Area safety community. I think you're possibly underrating Boston universities (i.e. Harvard and Northeastern, as you say the MIT deadline has passed). There is a decent safety community there, in part due to excellent safety-focussed student groups. Toronto is also especially strong on safety imo.

Generally, I would advise thinking more about advisors with aligned interests over universities (this relates to Neel's comment about interests), though intellectual environment does of course matter. When you apply, you'll want to name some advisors who you might want to work with on your statement of purpose.

Comment by bilalchughtai (beelal) on (The) Lightcone is nothing without its people: LW + Lighthaven's big fundraiser · 2024-11-30T22:49:18.386Z · LW · GW

Is there a way for UK taxpayers to tax-efficiently donate (e.g. via Gift Aid)?

Comment by bilalchughtai (beelal) on StefanHex's Shortform · 2024-11-19T23:33:02.487Z · LW · GW

Agreed. A related thought is that we might only need to be able to interpret a single model at a particular capability level to unlock the safety benefits, as long as we can make a sufficient case that we should use that model. We don't care inherently about interpreting GPT-4, we care about there existing a GPT-4 level model that we can interpret.

Comment by bilalchughtai (beelal) on jake_mendel's Shortform · 2024-10-11T13:49:35.953Z · LW · GW

Tangentially relevant: this paper by Jacob Andreas' lab shows you can get pretty far on some algorithmic tasks by just training a randomly initialized network's embedding parameters. This is in some sense the opposite to experiment 2.

Comment by bilalchughtai (beelal) on bilalchughtai's Shortform · 2024-07-30T09:18:04.833Z · LW · GW

I don't think it's great for the post-age-60 use case actually, as compared with a regular pension; see my reply. The comment on asset tests is useful though, thanks. Roughly: LISA assets count towards many such tests, while pensions don't. More details here for those interested: https://www.moneysavingexpert.com/savings/lifetime-isas/

Comment by bilalchughtai (beelal) on bilalchughtai's Shortform · 2024-07-30T09:13:11.828Z · LW · GW

Couple more things I didn't explain:

  1. The LISA is a tax-free investment account: there are no capital gains taxes on it. This is similar to the regular ISA (which you can put up to £20k into per year, which doesn't have a 25% bonus, and which can be used for anything; the £4k LISA cap counts towards this £20k). I omitted this as I was implicitly treating the regular ISA as the counterfactual.
  2. The LISA is often strictly worse than a workplace pension for retirement saving, if you are employed. This is because you pay into a LISA out of post-(income-)tax pay, while pension contributions are calculated pre-tax. Even if the bonus approximately makes up for the tax you pay, employer contributions tip the balance towards the pension; a rough illustration is sketched below.
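
A rough numerical illustration of point 2, with made-up numbers: assuming a basic-rate (20%) taxpayer, a 3% employer contribution, and ignoring National Insurance, investment growth, and tax on eventual pension withdrawals.

```python
# Compare putting £100 of gross salary into a workplace pension
# vs taking it as pay and putting it in a LISA (illustrative assumptions only).
gross = 100.0
income_tax = 0.20
employer_match_rate = 0.03  # assumed employer contribution

# Pension: contributions come out of pre-tax pay, plus the employer's contribution.
pension_pot = gross + employer_match_rate * gross   # £103.00

# LISA: you are paid (and taxed) first, then get the 25% government bonus.
take_home = gross * (1 - income_tax)                # £80.00
lisa_pot = take_home * 1.25                         # £100.00 -- bonus roughly undoes basic-rate tax

print(f"pension pot: £{pension_pot:.2f}, LISA pot: £{lisa_pot:.2f}")
```
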
Comment by bilalchughtai (beelal) on bilalchughtai's Shortform · 2024-07-29T18:57:51.259Z · LW · GW

Should you invest in a Lifetime ISA? (UK)

The Lifetime Individual Savings Account (LISA) is a government saving scheme in the UK intended primarily to help individuals between the ages of 18 and 50 buy their first home (among a few other things). You can hold your money either as cash or in stocks and shares.

The unique selling point of the scheme is that the government will add a 25% bonus on all savings up to £4000 per year. However, this comes with several restrictions. The account is intended to only be used for the following purposes:
1) to buy your first home, worth £450k or less
2) if you are aged 60 or older
3) if you are terminally ill

The government do permit individuals to use the money for other purposes, with the caveat that a 25% cut will be taken on withdrawal. Seems like a no-brainer? Not quite.

Suppose you invest £x in your LISA. The government bonus puts this up to 1.25x. Suppose later you decide to withdraw your money for purposes other than (1-3). Then you end up with 0.75 × 1.25x = 0.9375x. That's a 6.25% loss!

So when does it make sense to use your LISA? Suppose further you have some uncertainty over whether you will use your money for (1-3). Most likely, you are worried that you might not buy a home in the UK, or you might want to buy a home over the price of £450k (because for instance you live in London, and £450k doesn't stretch that far).

Let's compute the expected value of your investment if the probability of using your money for the purposes (1-3) is p (which likely means your probability of using it for (1) is also about p). Suppose we invest £x. For our investment to be worth it, we should expect at least £x back.

EV = (bonus scenario) + (penalty scenario) = p(1.25x) + (1 - p)(0.9375x) = (0.9375 + 0.3125p)x, implying EV ≥ x exactly when p ≥ 0.2.

So, you should use your LISA if your probability of using it to buy a home (or for purposes (2) or (3)) is above 20%. This is surprisingly low! Note further that this calculation applies regardless of whether you use a cash or a stocks and shares LISA.
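
For concreteness, a quick numerical check of this break-even point (a minimal sketch of the calculation above):

```python
# Expected value of £1 put into a LISA, as a function of the probability p
# that you eventually use it for one of the permitted purposes (1-3).
def lisa_ev(p: float, x: float = 1.0) -> float:
    bonus_outcome = 1.25 * x           # used for (1-3): keep the 25% bonus
    penalty_outcome = 0.75 * 1.25 * x  # withdrawn early: 25% charge on 1.25x
    return p * bonus_outcome + (1 - p) * penalty_outcome

# Break-even: p * 1.25 + (1 - p) * 0.9375 = 1  =>  p = 0.0625 / 0.3125 = 0.2
for p in (0.1, 0.2, 0.3):
    print(p, round(lisa_ev(p), 4))
# 0.1 -> 0.9688 (worse than keeping the £1), 0.2 -> 1.0, 0.3 -> 1.0313
```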

I haven't ever seen this calculation written up publicly before, so thought it was worth sharing.