Posts

How to Model the Future of Open-Source LLMs? 2024-04-19T14:28:00.175Z
Paul Christiano named as US AI Safety Institute Head of AI Safety 2024-04-16T16:22:06.937Z
Highlights from Lex Fridman’s interview of Yann LeCun 2024-03-13T20:58:13.052Z
Interpretability isn’t Free 2022-08-04T15:02:54.842Z
Anthropic's SoLU (Softmax Linear Unit) 2022-07-04T18:38:05.597Z
Joel Burget's Shortform 2022-06-11T19:53:38.922Z
The two missing core reasons why aligning at-least-partially superhuman AGI is hard 2022-04-19T17:15:23.965Z
Chesterton’s Fence vs The Onion in the Varnish 2022-03-24T21:20:14.114Z

Comments

Comment by Joel Burget (joel-burget) on CTMU insight: maybe consciousness *can* affect quantum outcomes? · 2024-04-20T14:07:23.804Z · LW · GW

If this was true, how could we tell? In other words, is this a testable hypothesis?

What reason do we have to believe this might be true? Because we're in a world where it looks like we're going to develop superintelligence, so it would be a useful world to simulate?

Comment by Joel Burget (joel-burget) on Joel Burget's Shortform · 2024-04-19T01:48:22.552Z · LW · GW

From the latest Conversations with Tyler interview of Peter Thiel

I feel like Thiel misrepresents Bostrom here. Bostrom doesn't really want a centralized world government or think that's "a set of things that make sense and that are good". He's forced into advocating world surveillance not because it's good but because it's the only alternative he sees to dangerous ASI being deployed.

I wouldn’t say he’s optimistic about human nature. In fact it’s almost the very opposite. He thinks that we’re doomed by our nature to create that which will destroy us.

Comment by Joel Burget (joel-burget) on Announcing Neuronpedia: Platform for accelerating research into Sparse Autoencoders · 2024-03-26T19:12:39.665Z · LW · GW

Three questions:

  1. What format do you upload SAEs in?
  2. What data do you run the SAEs over to generate the activations / samples?
  3. How long of a delay is there between uploading an SAE and it being available to view?
Comment by Joel Burget (joel-burget) on Announcing Neuronpedia: Platform for accelerating research into Sparse Autoencoders · 2024-03-26T18:41:58.784Z · LW · GW

This is fantastic. Thank you.

Comment by Joel Burget (joel-burget) on Highlights from Lex Fridman’s interview of Yann LeCun · 2024-03-14T00:51:35.393Z · LW · GW

Thanks! I added a note about LeCun's 100,000 claim and just dropped the Chollet reference since it was misleading.

Comment by Joel Burget (joel-burget) on Highlights from Lex Fridman’s interview of Yann LeCun · 2024-03-14T00:45:42.685Z · LW · GW

Thanks for the correction! I've updated the post.

Comment by Joel Burget (joel-burget) on Jimrandomh's Shortform · 2024-03-06T17:27:37.514Z · LW · GW

I assume the 44,000 ppm CO2 in exhaled air is the product of respiration (i.e. the lungs have processed it), whereas the air used in mouth-to-mouth is quickly inhaled and exhaled.

Comment by Joel Burget (joel-burget) on Significantly Enhancing Adult Intelligence With Gene Editing May Be Possible · 2023-12-12T22:42:14.466Z · LW · GW

What's your best guess for what percentage of cells (in the brain) receive edits?

Are edits somehow targeted at brain cells in particular or do they run throughout the body?

Comment by Joel Burget (joel-burget) on My techno-optimism [By Vitalik Buterin] · 2023-11-29T16:03:18.410Z · LW · GW

I don't have a well-reasoned opinion here but I'm interested in hearing from those who disagree.

Comment by Joel Burget (joel-burget) on Towards Monosemanticity: Decomposing Language Models With Dictionary Learning · 2023-10-08T22:37:13.856Z · LW · GW

How would you distinguish between weak and strong methods?

Comment by Joel Burget (joel-burget) on My Effortless Weightloss Story: A Quick Runthrough · 2023-10-02T15:35:36.094Z · LW · GW

Re Na:K: potassium chloride is used as a salt substitute (and tastes surprisingly like regular salt). This makes it really easy to tweak the Na:K ratio (if it turns out to be important). OTOH, that's some evidence the ratio isn't important; otherwise I'd expect someone to have noticed that people lose weight when they substitute it for table salt.

Comment by Joel Burget (joel-burget) on Statement on AI Extinction - Signed by AGI Labs, Top Academics, and Many Other Notable Figures · 2023-06-01T00:52:30.332Z · LW · GW

We don't hear much about Apple in AI -- curious why you consider them so important.

Comment by Joel Burget (joel-burget) on Statement on AI Extinction - Signed by AGI Labs, Top Academics, and Many Other Notable Figures · 2023-05-30T16:41:26.949Z · LW · GW

Though the statement doesn't say much, the list of signatories is impressively comprehensive. The only conspicuously missing names that immediately come to mind are Dean and LeCun (I don't know whether they were asked to sign).

Comment by Joel Burget (joel-burget) on Activation additions in a small residual network · 2023-05-22T22:55:55.960Z · LW · GW

I have a couple of basic questions:

  1. Shouldn't diagonal elements in the perplexity table all be equal to the baseline (since the addition should be 0)?
  2. I'm a bit confused about the use of perplexity here. The added vector introduces bias (away from one digit and towards another), so should it be surprising that perplexity increases? Eyeballing the visualizations, they do all seem to shift mass away from b and towards a.
Comment by Joel Burget (joel-burget) on Manifold: If okay AGI, why? · 2023-03-26T00:23:36.311Z · LW · GW

Link to Rob Bensinger's comments on this market:

Comment by Joel Burget (joel-burget) on Is it a coincidence that GPT-3 requires roughly the same amount of compute as is necessary to emulate the human brain? · 2023-02-10T17:19:59.690Z · LW · GW

I worry that this is conflating two possible meanings of FLOPS:

  1. Floating Point Operations (FLOPs)
  2. Floating Point Operations per Second (Maybe FLOPs/s is clearer?)

The AI and Memory Wall data is using (1) while the Sandberg / Bostrom paper is using (2) (see the definition in Appendix F).

(I noticed a type error when thinking about comparing real-time brain emulation vs training).
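
To make the unit mismatch concrete, here's a minimal sketch (in Python, with rough order-of-magnitude numbers I'm supplying for illustration, not figures taken from either source): a total FLOP count and a FLOP/s rate only become comparable once you pick a duration.

```python
# Sketch of the FLOPs vs. FLOP/s distinction. The numbers are illustrative
# orders of magnitude, not values from AI and Memory Wall or Sandberg / Bostrom.

training_compute_flops = 3e23   # total FLOPs for a large training run (a count)
brain_emulation_rate = 1e18     # FLOP/s for real-time brain emulation (a rate)

# Dividing a count by a rate yields a duration: how long hardware running at
# the emulation rate would take to perform that many operations.
seconds = training_compute_flops / brain_emulation_rate
print(f"{seconds:.1e} s, i.e. roughly {seconds / 86_400:.0f} days")
```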

Comment by Joel Burget (joel-burget) on Anomalous tokens reveal the original identities of Instruct models · 2023-02-10T13:56:25.155Z · LW · GW

One more, related to your first point: I wouldn't expect all mesaoptimizers to have the same signature, since they could take very different forms. What does the distribution of mesaoptimizer signatures look like? How likely is it that a novel (undetectable) mesaoptimizer arises in training?

Comment by Joel Burget (joel-burget) on SolidGoldMagikarp II: technical details and more recent findings · 2023-02-09T15:44:35.746Z · LW · GW

As far as we are aware, GPT-4 will use the same 50,257 tokens as its two most recent predecessors.

I suspect it'll have more. OpenAI recently released https://github.com/openai/tiktoken. This includes "cl100k_base" with ~100k tokens.

The capabilities case for this is that GPT-{2,3} seem to be somewhat hobbled by their tokenizer, at least when it comes to arithmetic. But cl100k_base has exactly 1,110 tokens that are just digits: 10 one-digit tokens, 100 two-digit tokens, and 1,000 three-digit tokens! (None have preceding spaces.)
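
Assuming the tiktoken package linked above, here's a rough sketch of how one could check that digit-token claim (the expected counts in the final comment just restate the claim rather than something verified here):

```python
# Count cl100k_base tokens whose decoded bytes are purely ASCII digits.
from collections import Counter

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

digit_lengths = Counter()
for token_id in range(enc.n_vocab):
    try:
        raw = enc.decode_single_token_bytes(token_id)
    except KeyError:
        continue  # some ids in range (e.g. reserved special-token slots) don't decode
    if raw.isdigit():  # True only for non-empty, all-ASCII-digit byte strings
        digit_lengths[len(raw)] += 1

print(digit_lengths)  # claim above: 10 one-digit, 100 two-digit, 1,000 three-digit tokens
```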

Comment by Joel Burget (joel-burget) on SolidGoldMagikarp (plus, prompt generation) · 2023-02-05T18:18:31.561Z · LW · GW

Previous related exploration: https://www.lesswrong.com/posts/BMghmAxYxeSdAteDc/an-exploration-of-gpt-2-s-embedding-weights

My best guess is that this crowded spot in embedding space is a sort of wastebasket for tokens that show up in machine-readable files but aren’t useful to the model for some reason. Possibly, these are tokens that are common in the corpus used to create the tokenizer, but not in the WebText training corpus. The oddly-specific tokens related to Puzzle & Dragons, Nature Conservancy, and David’s Bridal webpages suggest that BPE may have been run on a sample of web text that happened to have those websites overrepresented, and GPT-2 is compensating for this by shoving all the tokens it doesn’t find useful in the same place.

Comment by Joel Burget (joel-burget) on Peter Thiel on Technological Stagnation and Out of Touch Rationalists · 2022-12-07T16:25:30.797Z · LW · GW

Thiel's arguments about both the Vulnerable World Hypothesis and Death with Dignity were so (uncharacteristically?) shallow that I had to question whether he actually believes what he said or was just making an argument he thought would be popular with the audience. I don't know enough about his views to say, but my guess is that the latter is somewhat (20%+) likely.

Comment by Joel Burget (joel-burget) on A Barebones Guide to Mechanistic Interpretability Prerequisites · 2022-11-02T23:16:27.354Z · LW · GW

how is changing to an orthonormal basis importantly different from just any change of basis?

What exactly do you have in mind here?

Comment by Joel Burget (joel-burget) on Some Lessons Learned from Studying Indirect Object Identification in GPT-2 small · 2022-11-01T14:19:08.011Z · LW · GW

Very interesting work! I have a couple questions.

1. 

Looking at your example, “​​Then, David and Elizabeth were working at the school. Elizabeth had a good day. Elizabeth decided to give a bone to Elizabeth”. I'm confused. You say "duplicating the IO token in a distractor sentence", but I thought David would be the IO here?

I also tried this sentence in your tool and only got a 2.6% probability for the Elizabeth completion.

However, repeating the David token raises that to 8.1%.

Am I confused about the meaning of the IO or was there just a typo in the example?

2.

In our work, it’s probably true that the circuits used for each template are actually subtly different in ways we don't understand. As evidence for this, the standard deviation of the logit difference is ~ 40% and we don't have good hypotheses to explain this variation. It is likely that the circuit that we found was just the circuit that was most active across this distribution.

I'd love it if you could expand on this (maybe with an example). It sounds like you're implying that the circuit you found is not complete?

Comment by Joel Burget (joel-burget) on Interpreting Neural Networks through the Polytope Lens · 2022-09-24T18:28:19.139Z · LW · GW
  1. Are there plans to release the software used in this analysis or will it remain proprietary? How does it scale to larger networks?
  2. This provides an excellent explanation for why deep networks are useful (exponential growth in polytopes).
  3. "We’re not completely sure why polytope boundaries tend to lie in a shell, though we suspect that it’s likely related to the fact that, in high dimensional spaces, most of the hypervolume of a hypersphere is close to the surface." I'm picturing a unit hypersphere where most of the volume is in, e.g., the [0.95,1] region. But why would polytope boundaries not simply extend further out?
  4. A better mental model (and visualizations) for how NNs work, understanding when data is off-distribution, and new methods for finding and understanding adversarial examples -- this is really exciting work.
Comment by Joel Burget (joel-burget) on Sparse trinary weighted RNNs as a path to better language model interpretability · 2022-09-23T23:34:01.832Z · LW · GW

Do you happen to know how this compares with https://github.com/BlinkDL/RWKV-LM which is described as an RNN with performance comparable to a transformer / linear attention?

Comment by Joel Burget (joel-burget) on Seriously, what goes wrong with "reward the agent when it makes you smile"? · 2022-08-12T14:47:09.088Z · LW · GW

Meta-comment: I'm happy to see this -- someone knowledgeable, who knows and seriously engages with the standard arguments, willing to question the orthodox answer (which some might fear would make them look silly). I think this is a healthy dynamic and I hope to see more of it.

Comment by Joel Burget (joel-burget) on Shard Theory: An Overview · 2022-08-11T17:04:36.937Z · LW · GW

Subcortical reinforcement circuits, though, hail from a distinct informational world... and so have to reinforce computations "blindly," relying only on simple sensory proxies.

This seems to be pointing in an interesting direction that I'd like to see expanded.

Because your subcortical reward circuitry was hardwired by your genome, it's going to be quite bad at accurately assigning credit to shards.

I don't know, I think of the brain as doing credit assignment pretty well, but we may have quite different definitions of good and bad. Is there an example you were thinking of? Cognitive biases in general?

if shard theory is true, meaningful partial alignment successes are possible

"if shard theory is true" -- is this a question about human intelligence, deep RL agents, or the relationship between the two? How can the hypothesis be tested?

Even if the human shards only win a small fraction of the blended utility function, a small fraction of our lightcone is quite a lot

What's to stop the human shards from being dominated and extinguished by the non-human shards? I.e., is there reason to expect equilibrium?

Comment by Joel Burget (joel-burget) on What should superrational players do in asymmetric games? · 2022-08-08T19:05:02.953Z · LW · GW

Typo:

In total, this game has a coco-value of (145, 95), which would be realized by Alice selling at the beach, Bob selling at the airport, and Alice transferring 55 to Bob.

I believe the transfer should be 25.

Comment by Joel Burget (joel-burget) on What are all the AI Alignment and AI Safety Communication Hubs? · 2022-06-16T00:31:21.591Z · LW · GW

Gwern often posts to https://www.reddit.com/r/mlscaling/ as well

Comment by Joel Burget (joel-burget) on A central AI alignment problem: capabilities generalization, and the sharp left turn · 2022-06-15T22:24:32.165Z · LW · GW

Human values aren't a repeller, but they're a very narrow target to hit.

As optimization pressure is applied, the AI becomes more capable. In particular, it will develop a more detailed model of people and their values. So it seems to me there is actually a basin around schemes like CEV, which course-correct towards true human values.

This of course doesn't help with corrigibility.

Comment by Joel Burget (joel-burget) on A central AI alignment problem: capabilities generalization, and the sharp left turn · 2022-06-15T18:47:49.942Z · LW · GW

Two points:

  1. The visualization of capabilities improvements as an attractor basin is pretty well accepted and useful, I think. I kind of like the analogous idea of an alignment target as a repeller cone / dome. The true target is approximately infinitely small, and attempts to hit it slide off as optimization pressure is applied. I'm curious whether others share this model and whether it's been refined or explored in more detail.
  2. The sharpness of the left turn strikes me as a major crux. Some (most?) alignment proposals seem to rely on developing an AI just a bit smarter than humans but not yet dangerous.  (An implicit assumption here may be that intelligence continues to develop in straight lines.) The sharp left turn model implies this sweet spot will pass by in the blink of an eye. (An implicit assumption here may be that there are discrete leaps.) Interesting to note that Nate explicitly says RSI is not a core part of his model. I'd like to see more arguments on both sides of this debate.
Comment by Joel Burget (joel-burget) on On A List of Lethalities · 2022-06-14T14:49:27.142Z · LW · GW

By the point your AI can design, say, working nanotech, I'd expect it to be well superhuman at hacking, and able to understand things like Rowhammer. I'd also expect it to be able to build models of it's operators and conceive of deep strategies involving them.

This assumes the AI learns all of these tasks at the same time. I'm hopeful that we could build a narrowly superhuman task AI which is capable of e.g. designing nanotech while being at or below human level for the other tasks you mentioned (and ~all other dangerous tasks you didn't).

Superhuman ability at nanotech alone may be sufficient for carrying out a pivotal act, though maybe not sufficient for other relevant strategic concerns.

Comment by Joel Burget (joel-burget) on Joel Burget's Shortform · 2022-06-11T19:53:39.151Z · LW · GW

The Soviet nail factory that's always used to illustrate Goodhart's law... did it actually exist? There are some good answers on the Skeptics StackExchange: https://skeptics.stackexchange.com/questions/22375/did-a-soviet-nail-factory-produce-useless-nails-to-improve-metrics

Comment by Joel Burget (joel-burget) on AXRP Episode 15 - Natural Abstractions with John Wentworth · 2022-06-04T14:10:31.176Z · LW · GW

If you're interested in following up on John's comments on financial markets, nonexistence of a representative agent, and path dependence, he speaks more about them in his post, Why Subagents?

In practice, path-dependent preferences mostly matter for systems with “hidden state”: internal variables which can change in response to the system’s choices. A great example of this is financial markets: they’re the ur-example of efficiency and inexploitability, yet it turns out that a market does not have a utility function in general (economists call this “nonexistence of a representative agent”). The reason is that the distribution of wealth across the market’s agents functions as an internal hidden variable. Depending on what path the market follows, different internal agents end up with different amounts of wealth, and the market as a whole will hold different portfolios as a result - even if the externally-visible variables, i.e. prices, end up the same.

Comment by Joel Burget (joel-burget) on How to get into AI safety research · 2022-06-03T21:10:55.475Z · LW · GW

Thank you for mentioning Gödel Without Too Many Tears, which I bought based on this recommendation. It's a lovely little book. I didn't expect it to be nearly so engrossing.

Comment by Joel Burget (joel-burget) on We Are Conjecture, A New Alignment Research Startup · 2022-05-01T15:25:14.176Z · LW · GW

Can't answer the second question, but see https://www.gwern.net/Scaling-hypothesis for the first.

Comment by Joel Burget (joel-burget) on [Closed] Hiring a mathematician to work on the learning-theoretic AI alignment agenda · 2022-04-25T14:37:37.585Z · LW · GW

Academics not willing to leave their jobs might still be interested in working on a problem part-time. One could imagine that the right researcher working part-time might be more effective than the wrong researcher full time.

Comment by Joel Burget (joel-burget) on The two missing core reasons why aligning at-least-partially superhuman AGI is hard · 2022-04-20T14:14:32.647Z · LW · GW

Thanks Pattern -- I've taken your advice and updated the title.

Comment by Joel Burget (joel-burget) on Chesterton’s Fence vs The Onion in the Varnish · 2022-03-25T16:07:45.437Z · LW · GW

I don't actually think that they are in conflict.

Funny, this is exactly what I was trying to argue for (section 4 explicitly says "Really, both anecdotes teach us the same thing"). Trying to think how I can make this clearer.