Posts

OpenAI appoints Retired U.S. Army General Paul M. Nakasone to Board of Directors 2024-06-13T21:28:18.110Z
GPT2, Five Years On 2024-06-05T17:44:17.552Z
Quick Thoughts on Scaling Monosemanticity 2024-05-23T16:22:48.035Z
How is GPT-4o Related to GPT-4? 2024-05-15T18:33:43.925Z
How to Model the Future of Open-Source LLMs? 2024-04-19T14:28:00.175Z
Paul Christiano named as US AI Safety Institute Head of AI Safety 2024-04-16T16:22:06.937Z
Highlights from Lex Fridman’s interview of Yann LeCun 2024-03-13T20:58:13.052Z
Interpretability isn’t Free 2022-08-04T15:02:54.842Z
Anthropic's SoLU (Softmax Linear Unit) 2022-07-04T18:38:05.597Z
Joel Burget's Shortform 2022-06-11T19:53:38.922Z
The two missing core reasons why aligning at-least-partially superhuman AGI is hard 2022-04-19T17:15:23.965Z
Chesterton’s Fence vs The Onion in the Varnish 2022-03-24T21:20:14.114Z

Comments

Comment by Joel Burget (joel-burget) on the case for CoT unfaithfulness is overstated · 2024-10-07T00:10:29.079Z · LW · GW

There are now two alleged instances of full chains of thought leaking (apply an appropriate amount of skepticism), both of which seem coherent enough.

Comment by Joel Burget (joel-burget) on the case for CoT unfaithfulness is overstated · 2024-10-01T16:17:55.224Z · LW · GW

I think it's more likely that this is just a (non-model) bug in ChatGPT. In the examples you gave, it looks like there's always one step that comes completely out of nowhere, and the rest of the chain of thought would make sense without it. This reminds me of the bug where ChatGPT would show other users' conversations.

Comment by Joel Burget (joel-burget) on the case for CoT unfaithfulness is overstated · 2024-09-30T15:11:40.155Z · LW · GW

I hesitate to draw any conclusions from the o1 CoT summary since it's passed through a summarizing model.

after weighing multiple factors including user experience, competitive advantage, and the option to pursue the chain of thought monitoring, we have decided not to show the raw chains of thought to users. We acknowledge this decision has disadvantages. We strive to partially make up for it by teaching the model to reproduce any useful ideas from the chain of thought in the answer. For the o1 model series we show a model-generated summary of the chain of thought.

Comment by Joel Burget (joel-burget) on OpenAI o1 · 2024-09-12T18:46:51.685Z · LW · GW

o1-preview and o1-mini are available today (ramping over some number of hours) in ChatGPT for plus and team users and our API for tier 5 users.

https://x.com/sama/status/1834283103038439566

Comment by Joel Burget (joel-burget) on The Information: OpenAI shows 'Strawberry' to feds, races to launch it · 2024-09-05T19:48:24.504Z · LW · GW
Comment by Joel Burget (joel-burget) on the Giga Press was a mistake · 2024-08-22T14:58:07.374Z · LW · GW

Construction Physics has a very different take on the economics of the Giga-press.

Tesla was the first car manufacturer to adopt large castings, but the savings were so significant — an estimated 20 to 40% reduction in the cost of a car body — that they’re being adopted by many other car manufacturers, particularly Chinese ones. Large, complex castings have been described as a key tool for not only reducing cost but also good EV charging performance.

I think Construction Physics is usually pretty good. In this case my guess is that @bhauth has looked into this more deeply, so I trust this post a bit more.

Comment by Joel Burget (joel-burget) on Extended Interview with Zhukeepa on Religion · 2024-08-21T19:58:01.910Z · LW · GW

I wonder how much my reply to Adam Shai addresses your concerns?

Very helpful, thank you.

Comment by Joel Burget (joel-burget) on Extended Interview with Zhukeepa on Religion · 2024-08-19T14:15:38.236Z · LW · GW

In physics, the objects of study are mass, velocity, energy, etc. It’s natural to quantify them, and as soon as you’ve done that you’ve taken the first step in applying math to physics. There are a couple reasons that this is a productive thing to do:

  1. You already derive benefit from a very simple starting point.
  2. There are strong feedback loops. You can make experimental predictions, test them, and refine your theories.

Together these mean that you benefit from even very simple math and can scale up smoothly to more sophisticated tools: from simply adding masses, to F = ma, to Lagrangian mechanics and beyond.
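
To make the "scale up smoothly" point concrete, the progression I have in mind (all standard physics, nothing novel):

```latex
m_{\text{total}} = m_1 + m_2
\qquad\longrightarrow\qquad
F = m a
\qquad\longrightarrow\qquad
\frac{d}{dt}\frac{\partial L}{\partial \dot{q}_i} - \frac{\partial L}{\partial q_i} = 0,
\quad L = T - V
```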

It’s not clear to me that those virtues apply here:

  • I don’t see the easy starting point, the equivalent of adding two masses.
    • It’s not obvious that the objects of study are quantifiable. It’s not even clear what the objects of study are.
    • Wouldn't formal statements about religion be unfathomably complex?
  • I don’t see feedback loops. It must be hard to run experiments, make predictions, etc.

Perhaps these concerns would be addressed by examples of the kind of statement you have in mind.

Comment by Joel Burget (joel-burget) on JumpReLU SAEs + Early Access to Gemma 2 SAEs · 2024-07-27T19:07:10.619Z · LW · GW

Re the choice of kernel, my intuition would have been that something smoother (e.g. approximating a Gaussian, or perhaps Epanechnikov) would have given the best results. Did you use rect just because it's very cheap or was there a theoretical reason?
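
For context on what I'm imagining, here's a minimal sketch of the kernel swap, using my paraphrase of the straight-through pseudo-derivative for the threshold (the bandwidth and toy pre-activations are arbitrary; this isn't your implementation):

```python
import numpy as np

# Candidate kernels for the straight-through gradient estimate.
def rect(u):
    """Rectangle kernel: uniform weight on |u| < 1/2."""
    return (np.abs(u) < 0.5).astype(float)

def gaussian(u):
    """Standard Gaussian kernel: smooth everywhere, infinite support."""
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def epanechnikov(u):
    """Epanechnikov kernel: compact support, MSE-optimal for density estimation."""
    return 0.75 * np.maximum(1.0 - u**2, 0.0)

def grad_theta_jumprelu(z, theta, kernel, eps=1e-3):
    """Pseudo-derivative of JumpReLU(z) = z * H(z - theta) w.r.t. theta,
    averaged over a batch of pre-activations z (my paraphrase of the STE)."""
    u = (z - theta) / eps
    return float((-(theta / eps) * kernel(u)).mean())

z = np.abs(np.random.randn(100_000))   # toy pre-activations
for k in (rect, gaussian, epanechnikov):
    print(f"{k.__name__:>12}: {grad_theta_jumprelu(z, theta=0.5, kernel=k):.4f}")
```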

Comment by Joel Burget (joel-burget) on Book review: The Quincunx · 2024-07-15T14:52:40.151Z · LW · GW

Thanks for this! I ended up reading The Quincunx based on this review and really enjoyed it.

As an aside, I want to recommend a physical book instead of the Kindle version, for a couple reasons:

  1. There are maps and genealogy diagrams interspersed between chapters, but they were difficult to impossible to read on the Kindle.
  2. I discovered, only after finishing the book, that there's a list of characters at the back. It would have been extremely helpful to refer to while reading: there are a lot of characters, and I can't count how many times I highlighted someone's name hoping that Kindle's X-Ray feature would remind me who they were (since they may have last appeared hundreds of pages earlier). But X-Ray doesn't seem to be enabled for this book.

(Also, without the physical book, I didn't realize how long The Quincunx is.)

Even with those difficulties, a great read.

Comment by Joel Burget (joel-burget) on silentbob's Shortform · 2024-06-25T14:03:35.158Z · LW · GW

If, for instance, one minimum’s attractor basin has a radius that is just 0.00000001% larger than that of the other minimum, then its volume will be roughly 40 million times larger (if my Javascript code to calculate this is accurate enough, that is).

Could you share this code? I'd like to take a look.
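
In the meantime, here's the calculation I assume is behind that number; the dimension count is my own guess at a parameter count that makes the ratio come out to roughly 40 million:

```python
import math

def volume_ratio(r_larger, r_smaller, n_dims):
    """Ratio of two n-ball volumes, (r_larger / r_smaller) ** n_dims,
    computed in log space to avoid overflow."""
    return math.exp(n_dims * (math.log(r_larger) - math.log(r_smaller)))

r_small = 1.0
r_large = 1.0 + 1e-10          # a radius just 0.00000001% larger

# My assumption: a GPT-3-scale parameter count for the dimensionality.
n_dims = 175_000_000_000
print(f"{volume_ratio(r_large, r_small, n_dims):.3g}")   # ~4e+07, i.e. ~40 million
```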

Comment by Joel Burget (joel-burget) on Matthew Barnett's Shortform · 2024-06-17T05:53:37.327Z · LW · GW

For others who want the resolution to this cliffhanger, what does Bostrom predict happens next?

The remainder of this section:

We observe here how it could be the case that when dumb, smarter is safer; yet when smart, smarter is more dangerous. There is a kind of pivot point, at which a strategy that has previously worked excellently suddenly starts to backfire. We may call the phenomenon the treacherous turn.

The treacherous turn — While weak, an AI behaves cooperatively (increasingly so, as it gets smarter). When the AI gets sufficiently strong — without warning or provocation — it strikes, forms a singleton, and begins directly to optimize the world according to the criteria implied by its final values.

A treacherous turn can result from a strategic decision to play nice and build strength while weak in order to strike later; but this model should not be interpreted too narrowly. For example, an AI might not play nice in order that it be allowed to survive and prosper. Instead, the AI might calculate that if it is terminated, the programmers who built it will develop a new and somewhat different AI architecture, but one that will be given a similar utility function. In this case, the original AI may be indifferent to its own demise, knowing that its goals will continue to be pursued in the future. It might even choose a strategy in which it malfunctions in some particularly interesting or reassuring way. Though this might cause the AI to be terminated, it might also encourage the engineers who perform the postmortem to believe that they have gleaned a valuable new insight into AI dynamics—leading them to place more trust in the next system they design, and thus increasing the chance that the now-defunct original AI’s goals will be achieved. Many other possible strategic considerations might also influence an advanced AI, and it would be hubristic to suppose that we could anticipate all of them, especially for an AI that has attained the strategizing superpower.

A treacherous turn could also come about if the AI discovers an unanticipated way of fulfilling its final goal as specified. Suppose, for example, that an AI’s final goal is to “make the project’s sponsor happy.” Initially, the only method available to the AI to achieve this outcome is by behaving in ways that please its sponsor in something like the intended manner. The AI gives helpful answers to questions; it exhibits a delightful personality; it makes money. The more capable the AI gets, the more satisfying its performances become, and everything goeth according to plan—until the AI becomes intelligent enough to figure out that it can realize its final goal more fully and reliably by implanting electrodes into the pleasure centers of its sponsor’s brain, something assured to delight the sponsor immensely. Of course, the sponsor might not have wanted to be pleased by being turned into a grinning idiot; but if this is the action that will maximally realize the AI’s final goal, the AI will take it. If the AI already has a decisive strategic advantage, then any attempt to stop it will fail. If the AI does not yet have a decisive strategic advantage, then the AI might temporarily conceal its canny new idea for how to instantiate its final goal until it has grown strong enough that the sponsor and everybody else will be unable to resist. In either case, we get a treacherous turn.

Comment by Joel Burget (joel-burget) on The Leopold Model: Analysis and Reactions · 2024-06-16T17:12:02.689Z · LW · GW

A slight silver lining: I'm not sure that a world in which China "wins" the race is all that bad. I'm genuinely uncertain. Let's take Leopold's objections, for example:

I genuinely do not know the intentions of the CCP and their authoritarian allies. But, as a reminder: the CCP is a regime founded on the continued worship of perhaps the greatest totalitarian mass-murderer in human history (“with estimates ranging from 40 to 80 million victims due to starvation, persecution, prison labor, and mass executions”); a regime that recently put a million Uyghurs in concentration camps and crushed a free Hong Kong; a regime that systematically practices mass surveillance for social control, both of the new-fangled (tracking phones, DNA databases, facial recognition, and so on) and the old-fangled (recruiting an army of citizens to report on their neighbors) kind; a regime that ensures all text messages passes through a censor, and that goes so far to repress dissent as to pull families into police stations when their child overseas attends a protest; a regime that has cemented Xi Jinping as dictator-for-life; a regime that touts its aims to militarily crush and “reeducate” a free neighboring nation; a regime that explicitly seeks a China-centric world order.

I agree that all of these are bad (very bad). But I think they're all means to preserve the CCP's control. With superintelligence, preservation of control is no longer a problem.

I believe Xi (or choose your CCP representative) would say that the ultimate goal is human flourishing, that all they do to maintain control is to preserve communism, which exists to make a better life for their citizens. If that's the case, then if both sides are equally capable of building it, does it matter whether the instruction to maximize human flourishing comes from the US or China?

(Again, I want to reiterate that I'm genuinely uncertain here.)

Comment by Joel Burget (joel-burget) on The Leopold Model: Analysis and Reactions · 2024-06-16T17:00:36.069Z · LW · GW

My biggest problem with Leopold's project is this: in a world where his models hold up, where superintelligence is right around the corner, a US / China race is inevitable, and the winner really matters; in that world, publishing these essays on the open internet is very dangerous. It seems just as likely to help the Chinese side as to help the US.

If China prioritizes AI (if they decide that it's one tenth as important as Leopold suggests), I'd expect their administration to act more quickly and competently than the US. I don't have a good reason to think Leopold's essays will have a bigger impact on the US government than on the Chinese one, or vice versa (I don't think it matters much that they were written in English). My guess is that they've been read by some USG staffers, but I wouldn't be surprised if things die out amid the excitement of the upcoming election and partisan concerns. On the other hand, I wouldn't be surprised if they're already circulating in Beijing. If not now, then maybe in the future -- now that these essays are published on the internet, there's no way to take them back.

What's more, it seems possible to me that framing things as a race, and calling cooperation "fanciful", may (in a self-fulfilling-prophecy way) make a race more likely (and cooperation less).

Another complicating factor is that there's just no way the US could run a secret project without China getting word of it immediately. With all the attention paid to the top US labs and research scientists, they're not going to all just slip away to New Mexico for three years unnoticed. (I'm not sure if China could pull off such a secret project, but I wouldn't rule it out.)

Comment by Joel Burget (joel-burget) on My AI Model Delta Compared To Christiano · 2024-06-13T13:46:45.582Z · LW · GW

Sorry, I was in a hurry when I wrote this. What I meant / should have said is: it seems really valuable to me to understand how you can refute Paul's views so confidently, and I'd love to hear more.

Comment by Joel Burget (joel-burget) on My AI Model Delta Compared To Christiano · 2024-06-12T21:33:46.439Z · LW · GW

I put approximately-zero probability on the possibility that Paul is basically right on this delta; I think he’s completely out to lunch.

This is a very strong claim, which the post doesn't provide nearly enough evidence to support.

Comment by Joel Burget (joel-burget) on Comments on Anthropic's Scaling Monosemanticity · 2024-06-03T13:48:38.019Z · LW · GW

I decided to do a check by tallying the "More Safety Relevant Features" from the 1M SAE to see if they reoccur in the 34M SAE (in some related form).

I don't think we can interpret their list of safety-relevant features as exhaustive. I'd bet (80% confidence) that we could find 34M features corresponding to at least some of the 1M features you listed, given access to their UMAP browser. Unfortunately we can't do this without Anthropic support.
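
If we did have the decoder weights, a crude substitute for the UMAP browser would be matching features by decoder direction: for each safety-relevant 1M feature, find the 34M feature whose decoder vector has the highest cosine similarity. A minimal sketch with random stand-in weights and hypothetical feature indices:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 512                                        # residual-stream width (placeholder)
W_dec_1m = rng.standard_normal((1_000, d_model))     # stand-in for the 1M SAE decoder
W_dec_34m = rng.standard_normal((34_000, d_model))   # stand-in for the 34M SAE decoder

def unit_rows(W):
    return W / np.linalg.norm(W, axis=1, keepdims=True)

W1, W2 = unit_rows(W_dec_1m), unit_rows(W_dec_34m)

safety_feature_ids = [3, 141, 592]                   # hypothetical 1M feature indices
sims = W1[safety_feature_ids] @ W2.T                 # cosine similarities, one row per feature
for fid, row in zip(safety_feature_ids, sims):
    print(f"1M feature {fid} -> closest 34M feature {row.argmax()} (cos sim {row.max():.2f})")
```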

Comment by Joel Burget (joel-burget) on quila's Shortform · 2024-06-02T21:27:07.451Z · LW · GW

Maybe you can say a bit about what background someone should have to be able to evaluate your idea.

Comment by Joel Burget (joel-burget) on If you are also the worst at politics · 2024-05-27T14:17:20.567Z · LW · GW

Not a direct answer to your question but:

  1. One article I (easily) found on prediction markets mentions Bryan Caplan but has no mention of Hanson
  2. There are plenty of startups promoting prediction markets: Manifold, Kalshi, Polymarket, PredictIt, etc
  3. There was a recent article Why prediction markets aren't popular, which gives plenty of good reasons but doesn't mention any Hanson headwind
  4. Scott Alexander does regular "Mantic Monday" posts on prediction markets
Comment by Joel Burget (joel-burget) on If you are also the worst at politics · 2024-05-26T23:29:43.371Z · LW · GW

I’m not sure about the premise that people are opposed to Hanson’s ideas because he said them. On the contrary, I’ve seen several people (now including you) mention that they’re fans of his ideas, and never seen anyone say that they dislike them.

My model is more that some ideas are more viral than others, some ideas have loud and enthusiastic champions, and some ideas are economically valuable. I don’t see most of Hanson’s ideas as particularly viral, don’t think he’s worked super hard to champion them, and they’re a mixed bag economically (eg prediction markets are valuable but grabby aliens aren’t).

I also believe that if someone charismatic adopts an idea then they can cause it to explode in popularity regardless of who originated it. This has happened to some degree with prediction markets. I certainly don’t think they’re held back because of the association with Hanson.

Comment by Joel Burget (joel-burget) on Joel Burget's Shortform · 2024-05-25T20:46:14.457Z · LW · GW

Why does Golden Gate Claude act confused? My guess is that activating the Golden Gate Bridge feature so strongly is OOD. (This feature, by the way, is not exactly aligned with your conception of the Golden Gate Bridge or mine, so it might emphasize fog more or less than you would, but that’s not what I’m focusing on here). Anthropic probably added the bridge feature pretty strongly, so the model ends up in a state with a 10x larger Golden Gate Bridge activation than it’s built for, not to mention in the context of whatever unrelated prompt you’ve fed it, in a space not all that near any datapoints it's been trained on.
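
For reference, the kind of steering I have in mind, sketched with toy shapes and a stand-in feature index (this is not Anthropic's actual setup, just an illustration of clamping one SAE feature far above its usual range):

```python
import torch

torch.manual_seed(0)
d_model, n_features = 512, 16_384          # illustrative sizes, far smaller than Claude's
W_enc = torch.randn(d_model, n_features) / d_model**0.5
W_dec = torch.randn(n_features, d_model) / n_features**0.5
bridge_feature = 123                       # stand-in index for the Golden Gate Bridge feature

def reconstruct(resid, clamp_to=None):
    """Encode into SAE features, optionally clamp one feature, decode back."""
    acts = torch.relu(resid @ W_enc)
    if clamp_to is not None:
        acts[:, bridge_feature] = clamp_to
    return acts @ W_dec

resid = torch.randn(8, d_model)            # a batch of residual-stream vectors
normal = reconstruct(resid)
typical_max = torch.relu(resid @ W_enc)[:, bridge_feature].max()
steered = reconstruct(resid, clamp_to=10.0 * typical_max)   # ~10x the usual activation
print("edit size vs. normal reconstruction:", (steered - normal).norm().item())
```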

Comment by Joel Burget (joel-burget) on peterbarnett's Shortform · 2024-05-24T20:43:28.043Z · LW · GW

The Anthropic post itself said more or less the same:

Comment by Joel Burget (joel-burget) on Testing for parallel reasoning in LLMs · 2024-05-19T17:54:18.335Z · LW · GW

To me the strongest evidence that fine-tuning is based on LoRA or similar is the fact that pricing is based just on training and input / output and doesn't factor in the cost of storing your fine-tuned models. Llama-3-8b-instruct is ~16GB (I think this ought to be roughly comparable, at least in the same ballpark). You'd almost surely care if you were storing that much data for each fine-tune.
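
Rough numbers behind that intuition; the layer count, hidden size, and LoRA rank below are illustrative assumptions, not anything the provider has published:

```python
# Full fine-tune vs. LoRA adapter storage, back-of-the-envelope.
full_params = 8e9                  # ~8B parameters (a Llama-3-8B-class model)
bytes_per_param = 2                # fp16 / bf16
full_size_gb = full_params * bytes_per_param / 1e9
print(f"full weights:  ~{full_size_gb:.0f} GB")      # ~16 GB, matching the figure above

# LoRA stores only low-rank adapters on a subset of weight matrices.
n_adapted = 32 * 4                 # q/k/v/o projections in 32 layers (assumption)
d_model, rank = 4096, 16           # hidden size and LoRA rank (assumptions)
lora_params = n_adapted * 2 * d_model * rank         # an A and a B matrix per projection
lora_size_mb = lora_params * bytes_per_param / 1e6
print(f"LoRA adapters: ~{lora_size_mb:.0f} MB")      # tens of MB: hundreds of times smaller
```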

Comment by Joel Burget (joel-burget) on So What's Up With PUFAs Chemically? · 2024-04-27T20:22:01.698Z · LW · GW

Measuring the composition of fryer oil at different times certainly seems like a good way to test both the original hypothesis and the effect of altitude.

Comment by Joel Burget (joel-burget) on So What's Up With PUFAs Chemically? · 2024-04-27T16:57:36.113Z · LW · GW

You're right, my original wording was too strong. I edited it to say that it agrees with so many diets instead of explains why they work.

Comment by Joel Burget (joel-burget) on So What's Up With PUFAs Chemically? · 2024-04-27T15:51:22.405Z · LW · GW

One thing I like about the PUFA breakdown theory is that it agrees with aspects of so many different diets.

  • Keto avoids fried food because the food being fried is usually carbs
  • Carnivore avoids vegetable oils because they're not meat
  • Paleo avoids vegetable oils because they weren't available in the ancestral environment
  • Vegans tend to emphasize raw food, and fried foods often contain meat or cheese
  • Low-fat diets avoid fat of all kinds
  • Ray Peat was perhaps the closest to the mark in emphasizing that saturated fats are more stable (he probably talked about PUFA breakdown specifically, I'm not sure).

Edit: I originally wrote "neatly explains why so many different diets are reported to work"

Comment by Joel Burget (joel-burget) on CTMU insight: maybe consciousness *can* affect quantum outcomes? · 2024-04-20T14:07:23.804Z · LW · GW

If this was true, how could we tell? In other words, is this a testable hypothesis?

What reason do we have to believe this might be true? Because we're in a world where it looks like we're going to develop superintelligence, so it would be a useful world to simulate?

Comment by Joel Burget (joel-burget) on Joel Burget's Shortform · 2024-04-19T01:48:22.552Z · LW · GW

From the latest Conversations with Tyler interview of Peter Thiel

I feel like Thiel misrepresents Bostrom here. He doesn’t really want a centralized world government or think that’s "a set of things that make sense and that are good". He’s forced into world surveillance not because it’s good but because it’s the only alternative he sees to dangerous ASI being deployed.

I wouldn’t say he’s optimistic about human nature. In fact it’s almost the very opposite. He thinks that we’re doomed by our nature to create that which will destroy us.

Comment by Joel Burget (joel-burget) on Announcing Neuronpedia: Platform for accelerating research into Sparse Autoencoders · 2024-03-26T19:12:39.665Z · LW · GW

Three questions:

  1. What format do you upload SAEs in?
  2. What data do you run the SAEs over to generate the activations / samples?
  3. How long of a delay is there between uploading an SAE and it being available to view?
Comment by Joel Burget (joel-burget) on Announcing Neuronpedia: Platform for accelerating research into Sparse Autoencoders · 2024-03-26T18:41:58.784Z · LW · GW

This is fantastic. Thank you.

Comment by Joel Burget (joel-burget) on Highlights from Lex Fridman’s interview of Yann LeCun · 2024-03-14T00:51:35.393Z · LW · GW

Thanks! I added a note about LeCun's 100,000 claim and just dropped the Chollet reference since it was misleading.

Comment by Joel Burget (joel-burget) on Highlights from Lex Fridman’s interview of Yann LeCun · 2024-03-14T00:45:42.685Z · LW · GW

Thanks for the correction! I've updated the post.

Comment by Joel Burget (joel-burget) on Jimrandomh's Shortform · 2024-03-06T17:27:37.514Z · LW · GW

I assume the 44k PPM CO2 exhaled air is the product of respiration (i.e., the lungs have processed it), whereas the air used in mouth-to-mouth is quickly inhaled and exhaled.

Comment by Joel Burget (joel-burget) on Significantly Enhancing Adult Intelligence With Gene Editing May Be Possible · 2023-12-12T22:42:14.466Z · LW · GW

What's your best guess for what percentage of cells (in the brain) receive edits?

Are edits somehow targeted at brain cells in particular or do they run throughout the body?

Comment by Joel Burget (joel-burget) on My techno-optimism [By Vitalik Buterin] · 2023-11-29T16:03:18.410Z · LW · GW

I don't have a well-reasoned opinion here but I'm interested in hearing from those who disagree.

Comment by Joel Burget (joel-burget) on Towards Monosemanticity: Decomposing Language Models With Dictionary Learning · 2023-10-08T22:37:13.856Z · LW · GW

How would you distinguish between weak and strong methods?

Comment by Joel Burget (joel-burget) on My Effortless Weightloss Story: A Quick Runthrough · 2023-10-02T15:35:36.094Z · LW · GW

Re Na:K: potassium chloride is used as a salt substitute (and tastes surprisingly like regular salt). This makes it really easy to tweak the Na:K ratio (if it turns out to be important). OTOH, that's some evidence that it's not important; otherwise I'd expect someone to have noticed that people lose weight when they substitute it for table salt.

Comment by Joel Burget (joel-burget) on Statement on AI Extinction - Signed by AGI Labs, Top Academics, and Many Other Notable Figures · 2023-06-01T00:52:30.332Z · LW · GW

We don't hear much about Apple in AI -- curious why you rank them so highly.

Comment by Joel Burget (joel-burget) on Statement on AI Extinction - Signed by AGI Labs, Top Academics, and Many Other Notable Figures · 2023-05-30T16:41:26.949Z · LW · GW

Though the statement doesn't say much, the list of signatories is impressively comprehensive. The only conspicuously missing names that immediately come to mind are Dean and LeCun (I don't know if they were asked to sign).

Comment by Joel Burget (joel-burget) on Activation additions in a small residual network · 2023-05-22T22:55:55.960Z · LW · GW

I have a couple of basic questions:

  1. Shouldn't diagonal elements in the perplexity table all be equal to the baseline (since the addition should be 0)?
  2. I'm a bit confused about the use of perplexity here. The added vector introduces a bias (away from one digit and towards another), so should it be surprising that perplexity increases? (See the toy illustration below.) Eyeballing the visualizations, they do all seem to shift mass away from b and towards a.
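
To spell out the intuition in (2): adding any fixed, non-uniform vector to the logits of a model that already matches the data distribution raises its cross-entropy, and hence its perplexity. A toy illustration (a 10-class categorical model, not the post's setup):

```python
import numpy as np

rng = np.random.default_rng(0)
true_probs = rng.dirichlet(np.ones(10))          # the "data" distribution over 10 digits
labels = rng.choice(10, size=50_000, p=true_probs)

def perplexity(logits):
    log_probs = logits - np.log(np.exp(logits).sum())
    return float(np.exp(-log_probs[labels].mean()))

calibrated_logits = np.log(true_probs)           # a model that matches the data exactly
bias = np.zeros(10)
bias[3], bias[7] = 2.0, -2.0                     # push mass toward digit 3, away from 7

print("baseline perplexity:", perplexity(calibrated_logits))
print("with steering bias: ", perplexity(calibrated_logits + bias))  # higher: the biased
                                                                     # model no longer matches
```
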
Comment by Joel Burget (joel-burget) on Manifold: If okay AGI, why? · 2023-03-26T00:23:36.311Z · LW · GW

Link to Rob Bensinger's comments on this market:

Comment by Joel Burget (joel-burget) on Is it a coincidence that GPT-3 requires roughly the same amount of compute as is necessary to emulate the human brain? · 2023-02-10T17:19:59.690Z · LW · GW

I worry that this is conflating two possible meanings of FLOPS:

  1. Floating Point Operations (FLOPs)
  2. Floating Point Operations per Second (maybe FLOPs/s is clearer?)

The AI and Memory Wall data is using (1) while the Sandberg / Bostrom paper is using (2) (see the definition in Appendix F).

(I noticed a type error when thinking about comparing real-time brain emulation vs training).
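
A worked example of the type error I mean; the numbers are commonly cited, order-of-magnitude placeholders rather than the exact figures from either source:

```python
# Training compute is an amount of work (FLOPs); brain-emulation estimates are
# usually rates (FLOP/s). Comparing them requires picking a time span.
gpt3_training_flops = 3.14e23      # commonly cited total training compute for GPT-3
emulation_rate = 1e18              # FLOP/s, an illustrative real-time emulation rate;
                                   # the Sandberg / Bostrom estimates span many orders of magnitude

seconds = gpt3_training_flops / emulation_rate
print(f"GPT-3's training budget buys ~{seconds / 86_400:.1f} days of real-time emulation "
      f"at {emulation_rate:.0e} FLOP/s")
```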

Comment by Joel Burget (joel-burget) on Anomalous tokens reveal the original identities of Instruct models · 2023-02-10T13:56:25.155Z · LW · GW

One more, related to your first point: I wouldn't expect all mesaoptimizers to have the same signature, since they could take very different forms. What does the distribution of mesaoptimizer signatures look like? How likely is it that a novel (undetectable) mesaoptimizer arises in training?

Comment by Joel Burget (joel-burget) on SolidGoldMagikarp II: technical details and more recent findings · 2023-02-09T15:44:35.746Z · LW · GW

As far as we are aware, GPT-4 will use the same 50,257 tokens as its two most recent predecessors.

I suspect it'll have more. OpenAI recently released https://github.com/openai/tiktoken. This includes "cl100k_base" with ~100k tokens.

The capabilities case for this is that GPT-{2,3} seem to be somewhat hobbled by their tokenizer, at least when it comes to arithmetic. But cl100k_base has exactly 1,110 tokens which are just digits: 10 one-digit tokens, 100 two-digit tokens, and 1,000 three-digit tokens (none have preceding spaces).
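
This is easy to check with tiktoken itself; a quick sketch (it just iterates over the vocabulary and skips ranks that don't decode):

```python
import tiktoken
from collections import Counter

enc = tiktoken.get_encoding("cl100k_base")

digit_tokens = []
for rank in range(enc.n_vocab):
    try:
        text = enc.decode_single_token_bytes(rank).decode("utf-8")
    except Exception:
        continue  # ranks with no token, special tokens, or non-UTF-8 byte sequences
    if text and all(c in "0123456789" for c in text):
        digit_tokens.append(text)

print(len(digit_tokens), Counter(len(t) for t in digit_tokens))
# claim above: 1110 total, split as {1: 10, 2: 100, 3: 1000}
```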

Comment by Joel Burget (joel-burget) on SolidGoldMagikarp (plus, prompt generation) · 2023-02-05T18:18:31.561Z · LW · GW

Previous related exploration: https://www.lesswrong.com/posts/BMghmAxYxeSdAteDc/an-exploration-of-gpt-2-s-embedding-weights

My best guess is that this crowded spot in embedding space is a sort of wastebasket for tokens that show up in machine-readable files but aren’t useful to the model for some reason. Possibly, these are tokens that are common in the corpus used to create the tokenizer, but not in the WebText training corpus. The oddly-specific tokens related to Puzzle & Dragons, Nature Conservancy, and David’s Bridal webpages suggest that BPE may have been run on a sample of web text that happened to have those websites overrepresented, and GPT-2 is compensating for this by shoving all the tokens it doesn’t find useful in the same place.

Comment by Joel Burget (joel-burget) on Peter Thiel on Technological Stagnation and Out of Touch Rationalists · 2022-12-07T16:25:30.797Z · LW · GW

Thiel's arguments about both the Vulnerable World Hypothesis and Death with Dignity were so (uncharacteristically?) shallow that I had to question whether he actually believes what he said, or was just making an argument he thought would be popular with the audience. I don't know enough about his views to say, but my guess is that the latter is somewhat (20%+) likely.

Comment by Joel Burget (joel-burget) on A Barebones Guide to Mechanistic Interpretability Prerequisites · 2022-11-02T23:16:27.354Z · LW · GW

how is changing to an orthonormal basis importantly different from just any change of basis?

What exactly do you have in mind here?

Comment by Joel Burget (joel-burget) on Some Lessons Learned from Studying Indirect Object Identification in GPT-2 small · 2022-11-01T14:19:08.011Z · LW · GW

Very interesting work! I have a couple questions.

1. 

Looking at your example, “​​Then, David and Elizabeth were working at the school. Elizabeth had a good day. Elizabeth decided to give a bone to Elizabeth”. I'm confused. You say "duplicating the IO token in a distractor sentence", but I thought David would be the IO here?

I also tried this sentence in your tool and only got a 2.6% probability for the Elizabeth completion.

However, repeating the David token raises that to 8.1%.

Am I confused about the meaning of the IO or was there just a typo in the example?
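
For anyone who wants to check completion probabilities like these without the authors' tool, a sketch using TransformerLens and the stock gpt2 checkpoint (the numbers won't exactly match the authors' setup):

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")

def next_token_prob(prompt: str, name: str) -> float:
    """Probability that the next token is ` name` (assumes the name, with a
    leading space, is a single GPT-2 token, as common first names are)."""
    logits = model(prompt, return_type="logits")[0, -1]
    token_id = model.to_single_token(" " + name)
    return torch.softmax(logits, dim=-1)[token_id].item()

prompt = ("Then, David and Elizabeth were working at the school. "
          "Elizabeth had a good day. Elizabeth decided to give a bone to")
print("P(Elizabeth):", next_token_prob(prompt, "Elizabeth"))
print("P(David):    ", next_token_prob(prompt, "David"))
```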

2.

In our work, it’s probably true that the circuits used for each template are actually subtly different in ways we don't understand. As evidence for this, the standard deviation of the logit difference is ~ 40% and we don't have good hypotheses to explain this variation. It is likely that the circuit that we found was just the circuit that was most active across this distribution.

I'd love it if you could expand on this (maybe with an example). It sounds like you're implying that the circuit you found is not complete?

Comment by Joel Burget (joel-burget) on Interpreting Neural Networks through the Polytope Lens · 2022-09-24T18:28:19.139Z · LW · GW
  1. Are there plans to release the software used in this analysis or will it remain proprietary? How does it scale to larger networks?
  2. This provides an excellent explanation for why deep networks are useful (exponential growth in polytopes).
  3. "We’re not completely sure why polytope boundaries tend to lie in a shell, though we suspect that it’s likely related to the fact that, in high dimensional spaces, most of the hypervolume of a hypersphere is close to the surface." I'm picturing a unit hypersphere where most of the volume is in, e.g., the [0.95,1] region. But why would polytope boundaries not simply extend further out?
  4. A better mental model (and visualizations) for how NNs work. Understanding when data is off-distribution. New methods for finding and understanding adversarial examples. This is really exciting work.
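
On point 3, the shell fraction is just 1 - r^d, so "most of the volume near the surface" kicks in quickly as dimension grows:

```python
# Fraction of a unit d-ball's volume lying in the outer shell [0.95, 1]:
# volume scales as radius**d, so the shell fraction is 1 - 0.95**d.
for d in (2, 10, 100, 1000):
    print(f"d={d:>4}: {1 - 0.95**d:.4f} of the volume is within 5% of the surface")
```
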
Comment by Joel Burget (joel-burget) on Sparse trinary weighted RNNs as a path to better language model interpretability · 2022-09-23T23:34:01.832Z · LW · GW

Do you happen to know how this compares with https://github.com/BlinkDL/RWKV-LM, which is described as an RNN with performance comparable to a transformer / linear attention?