Posts

When do alignment researchers retire? 2024-06-25T23:30:25.520Z
Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning 2024-05-17T16:25:02.267Z
Graphical tensor notation for interpretability 2023-10-04T08:04:33.341Z

Comments

Comment by Jordan Taylor (Nadroj) on Benchmarks for Detecting Measurement Tampering [Redwood Research] · 2024-12-06T04:47:42.064Z · LW · GW

Fixed links: 
https://huggingface.co/oliverdk/codegen-350M-mono-measurement_pred 
https://github.com/oliveradk/measurement-pred

Comment by Jordan Taylor (Nadroj) on Mechanistically Eliciting Latent Behaviors in Language Models · 2024-11-30T06:47:34.664Z · LW · GW

I'm keen to hear how you think your work relates to "Activation plateaus and sensitive directions in LLMs". Presumably the perturbation norm should be chosen just large enough to get out of an activation plateau? Perhaps it might also explain why gradient-based methods for MELBO alone might not work nearly as well as methods with a finite step size, because the effect is reversed if the norm is too small?
 

Comment by Jordan Taylor (Nadroj) on Mechanistically Eliciting Latent Behaviors in Language Models · 2024-11-30T01:53:18.138Z · LW · GW

Couldn't you do something like fit a Gaussian to the model's activations, then restrict the steered activations to be high likelihood (low Mahalanobis distance)? Or (almost) equivalently, you could just do a whitening transformation to activation space before you constrain the L2 distance of the perturbation.

(If a Gaussian isn't expressive enough, you could model the manifold in some other way, e.g. with a VAE anomaly detector or a mixture of Gaussians or whatever.)
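
Something like this minimal sketch is what I have in mind (assuming you've already collected a batch of activations as a numpy array; the function names and the cap value here are just illustrative):

```python
# Minimal sketch (illustrative, not from the original post): constrain a
# steering perturbation by its Mahalanobis norm instead of its raw L2 norm.
import numpy as np

def fit_whitener(acts, eps=1e-5):
    """Estimate the mean and a whitening matrix from sampled activations."""
    mu = acts.mean(axis=0)
    cov = np.cov(acts, rowvar=False) + eps * np.eye(acts.shape[1])
    evals, evecs = np.linalg.eigh(cov)            # cov is symmetric PSD
    W = evecs @ np.diag(evals ** -0.5) @ evecs.T  # inverse square root of cov
    return mu, W

def cap_perturbation(delta, W, max_mahalanobis):
    """Rescale a perturbation so its Mahalanobis norm stays below a cap.

    Constraining the L2 norm of W @ delta (whitened coordinates) is the same
    as constraining the Mahalanobis norm of delta under the fitted Gaussian.
    """
    m_norm = np.linalg.norm(W @ delta)
    if m_norm > max_mahalanobis:
        delta = delta * (max_mahalanobis / m_norm)
    return delta

# Hypothetical usage:
# mu, W = fit_whitener(collected_activations)   # (n_samples, d_model) array
# steered = base_activation + cap_perturbation(steering_vec, W, 10.0)
```

Capping the norm in whitened coordinates is the "(almost) equivalently" above: directions with little variance in the activation distribution get penalized more heavily.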

Comment by Jordan Taylor (Nadroj) on Why are there no interesting (1D, 2-state) quantum cellular automata? · 2024-11-26T04:38:10.623Z · LW · GW

There are many articles on quantum cellular automata. See for example "A review of Quantum Cellular Automata", or "Quantum Cellular Automata, Tensor Networks, and Area Laws". 
I think that, compared to the literature, you're using an overly restrictive and nonstandard definition of quantum cellular automata. Specifically, it only makes sense to me to write the time-evolution operator as a product of operators like you have if all of the terms act on spatially disjoint regions.

Consider defining quantum cellular automata instead as local quantum circuits composed of identical two-site unitary operators applied everywhere.

If you define them like this, then basically any kind of energy- and momentum-conserving local quantum dynamics can be discretized into a quantum cellular automaton, because any time- and space-independent quantum Hamiltonian with two-site interactions can be decomposed into steps with identical unitaries like this using the Suzuki-Trotter decomposition.
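
To spell out the kind of decomposition I mean (a first-order Trotter sketch in my own notation, not the post's): for a translation-invariant nearest-neighbour Hamiltonian $H = \sum_j h_{j,j+1}$,

$$e^{-iHt} \approx \Big(\prod_{j\ \mathrm{even}} e^{-i h_{j,j+1}\,\delta t}\ \prod_{j\ \mathrm{odd}} e^{-i h_{j,j+1}\,\delta t}\Big)^{t/\delta t},$$

where every factor is the same two-site gate $U = e^{-i h\,\delta t}$ acting on a disjoint pair of sites, so each layer really is a product of operators on spatially disjoint regions (a brickwork circuit), and the error vanishes as $\delta t \to 0$.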

Comment by Jordan Taylor (Nadroj) on Activation Pattern SVD: A proposal for SAE Interpretability · 2024-07-03T19:25:12.995Z · LW · GW

This seems easy to try and a potential point to iterate from, so you should give it a go. But I worry that the two factor matrices will be dense and very uninterpretable:

  • The token-position factor contains no information about which actual tokens each SAE feature activated on, right? Just the token positions? So activations in completely different contexts, but with the same features active at the same token positions, cannot be distinguished by it?
  • I'm not sure why you expect the activation matrix to have low-rank structure. Being low-rank is often in tension with being sparse, and we know that the SAE activation matrix is very sparse.
  • Perhaps it would be better to exploit the fact that the activation matrix is a very sparse matrix of non-negative entries? Maybe permutation matrices or sparse matrices would be more apt factors than general orthogonal matrices (which can have negative entries)? (Then you might have to settle for something like a block-diagonal central matrix, rather than a diagonal matrix of singular values.)

I'm keen to see stuff in this direction though! I certainly think you could construct some matrix or tensor of SAE activations such that some decomposition of it is interpretable in an interesting way.  
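
To illustrate the kind of alternative I mean in the last bullet, here's a rough sketch (toy random data standing in for the SAE activation matrix, nothing from the post; the rank and sparsity numbers are arbitrary): factorize a sparse non-negative matrix with a plain truncated SVD and with a non-negative factorization, and compare how sparse the resulting factors are.

```python
# Rough sketch (toy data, not real SAE activations): compare a plain SVD of a
# sparse non-negative activation matrix with a non-negative factorization that
# respects its sign and sparsity structure.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
n_tokens, n_features, rank = 512, 1024, 16

# Toy stand-in for an SAE activation matrix: mostly zeros, non-negative.
acts = rng.random((n_tokens, n_features)) * (rng.random((n_tokens, n_features)) < 0.02)

# Truncated SVD: factors are orthogonal, generally dense, and signed.
U, S, Vt = np.linalg.svd(acts, full_matrices=False)
U_r, Vt_r = U[:, :rank], Vt[:rank, :]

# Non-negative factorization: factors stay non-negative and tend to be sparser.
nmf = NMF(n_components=rank, init="nndsvd", max_iter=500)
W = nmf.fit_transform(acts)   # token-side factor
H = nmf.components_           # feature-side factor

def frac_near_zero(M, tol=1e-6):
    return float((np.abs(M) < tol).mean())

print("SVD factor sparsity:", frac_near_zero(U_r), frac_near_zero(Vt_r))
print("NMF factor sparsity:", frac_near_zero(W), frac_near_zero(H))
```

(NMF is just a stand-in here for "a decomposition that respects non-negativity"; permutation-like or otherwise structured sparse factors would need something more bespoke.)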

Comment by Jordan Taylor (Nadroj) on Examples of Highly Counterfactual Discoveries? · 2024-04-28T08:02:34.350Z · LW · GW

Special relativity is not such a good example here when compared to general relativity, which was much further ahead of its time. See, for example, this article: https://bigthink.com/starts-with-a-bang/science-einstein-never-existed/

Regarding special relativity, Einstein himself said:[1]

There is no doubt, that the special theory of relativity, if we regard its development in retrospect, was ripe for discovery in 1905. Lorentz had already recognized that the transformations named after him are essential for the analysis of Maxwell's equations, and Poincaré deepened this insight still further. Concerning myself, I knew only Lorentz's important work of 1895 [...] but not Lorentz's later work, nor the consecutive investigations by Poincaré. In this sense my work of 1905 was independent. [...] The new feature of it was the realization of the fact that the bearing of the Lorentz transformation transcended its connection with Maxwell's equations and was concerned with the nature of space and time in general. A further new result was that the "Lorentz invariance" is a general condition for any physical theory. 

As for general relativity, the ideas and the mathematics required (Riemannian Geometry) were much more obscure and further afield. The only people who came close, Nordstrom and Hilbert, arguably did so because they were directly influenced by Einstein's ongoing work on general relativity (not just special relativity). 

https://www.quora.com/Without-Einstein-would-general-relativity-be-discovered-by-now 

  1. ^

Comment by Jordan Taylor (Nadroj) on Graphical tensor notation for interpretability · 2024-04-12T18:42:52.071Z · LW · GW

Thanks for the kind words! Sadly I just used Inkscape for the diagrams - nothing fancy. Though hopefully that could change soon with the help of code like yours. Your library looks excellent! (though I initially confused it with https://github.com/wangleiphy/tensorgrad due to the name). 
I like how you represent functions on the tensors, like you're peering inside them. I can see myself using it often, both for visualizing things, and for computing derivatives. 

The difficulty in using it for final diagrams may be in getting the positions of the tensors arranged nicely. Do you use a force-directed layout (like networkx's) for that currently? Regardless, a good thing about exporting TikZ code is that you can change the positions manually, as long as the positions are set up as nice TikZ variables rather than "hardcoded" numbers everywhere. 

Anyway, I love it. Here's an image for others to get the idea:

https://lh3.googleusercontent.com/pw/AP1GczOcN5JNU0oTkklp2dvgilWHN1DwDJWBuJ7j38iCuA0MBmEN-DmWY0YfjsRbBH-WgM6NjBuCPhtVGNiY2uG_z9dtsPnNp8Uw4UShPAIQOeMIaw0Zj-4dR6_u_lt9FIz6BsAJJtM91tpt4Dj7xlL_ybusKw=w990-h1974-s-no-gm?authuser=0

Comment by Jordan Taylor (Nadroj) on Graphical tensor notation for interpretability · 2023-10-12T06:40:40.344Z · LW · GW

Nice, I forgot about ZX (and ZXW) calculus. I've never seriously engaged with it, despite it being so closely related to tensor networks. The fact that you can decompose any multilinear equation into so few primitive building blocks is interesting.

Comment by Jordan Taylor (Nadroj) on Graphical tensor notation for interpretability · 2023-10-05T20:57:01.657Z · LW · GW

Oops, yep. I initially had the tensor diagrams for that multiplication the other way around (vector then matrix). I changed them to be more conventional, but forgot that. As you say, you can just move the tensors any which way and get the same answer so long as the connectivity is the same, though algebraically it would be written as the vector times the matrix (or equivalently the transposed matrix times the vector) to keep the legs connected the same way.
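
As a tiny numpy sanity check of that point (generic code, not tied to the diagrams in the post):

```python
# The contraction only cares about which legs connect to which,
# but the conventional algebraic expression changes with the ordering.
import numpy as np

rng = np.random.default_rng(0)
M, v = rng.random((3, 3)), rng.random(3)

matrix_then_vector = M @ v                      # M v
vector_then_matrix = v @ M.T                    # same contraction, transposed form
by_leg_labels = np.einsum("ij,j->i", M, v)      # order-free: just leg labels

assert np.allclose(matrix_then_vector, vector_then_matrix)
assert np.allclose(matrix_then_vector, by_leg_labels)
```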
 

Comment by Jordan Taylor (Nadroj) on Graphical tensor notation for interpretability · 2023-10-05T12:15:57.287Z · LW · GW

This is an interesting and useful overview, though it's important not to confuse their notation with the Penrose graphical notation I use in this post, since lines in their notation seem to represent the message-passing contributions to a vector, rather than the indices of a tensor. 

That said, there are connections between tensor network contractions and message passing algorithms like Belief Propagation, which I haven't taken the time to really understand. Some references are:

Duality of graphical models and tensor networks - Elina Robeva and Anna Seigal
Tensor network contraction and the belief propagation algorithm - R. Alkabetz and I. Arad
Tensor Network Message Passing - Yijia Wang, Yuwen Ebony Zhang, Feng Pan, Pan Zhang
Gauging tensor networks with belief propagation - Joseph Tindall, Matt Fishman

Comment by Jordan Taylor (Nadroj) on A list of core AI safety problems and how I hope to solve them · 2023-08-26T23:37:28.720Z · LW · GW

I guess the shutdown timer would be most important in the training stage, so that it (hopefully) learns only to care about the short term.

Comment by Jordan Taylor (Nadroj) on Announcing Apollo Research · 2023-05-31T13:59:39.898Z · LW · GW

Seems great! I'm excited about potential interpretability methods for detecting deception.  

I think you're right about the current trade-offs on the gain of function stuff, but it's good to think ahead and have precommitments for the conditions under which your strategies there should change. 

It may be hard to find evals for deception which are sufficiently convincing when they trigger, yet still give us enough time to react afterwards. A few more similar points here: https://www.lesswrong.com/posts/pckLdSgYWJ38NBFf8/?commentId=8qSAaFJXcmNhtC8am 

Building good tools for detecting deceptive alignment seems robustly good though, even after you reach a point where you have to drop the gain of function stuff.

Comment by Jordan Taylor (Nadroj) on GPT-4 · 2023-03-15T02:25:54.933Z · LW · GW

Potential dangers of future evaluations / gain-of-function research, which I'm sure you and Beth are already extremely well aware of:

  1. Falsely evaluating a model as safe (obviously) 
  2. Choosing evaluation metrics which don't give us enough time to react (after an evaluation metric switches from "safe" to "not safe", we would like to have enough time to recognize this and do something about it before we're all dead)
  3. Crying wolf too many times, making it more likely that no one will believe you when a danger threshold has really been crossed
  4. Letting your methods for making future AIs scarier be too strong, given the probability that they will be leaked or otherwise made widely accessible (assuming the methods / tools would otherwise be difficult to replicate without significant resources)
  5. Letting your methods for making AIs scarier be too weak, such that it's too easy for some bad actors to go much further than you did
  6. Failing to have a precommitment to stop this research when models are getting scary enough that it's on balance best to stop making them scarier, even if no-one else believes you yet
     
Comment by Jordan Taylor (Nadroj) on Are we too confident about unaligned AGI killing off humanity? · 2023-03-07T06:26:42.549Z · LW · GW

unless that's an objective

I think this is too all-or-nothing about the objectives of the AI system. Following ideas like shard theory, objectives are likely to come in degrees, be numerous and contextually activated, having been messily created by gradient descent. 

Because "humans" are probably everywhere in its training data, and because of naiive safety efforts like RLHF, I expect AGI to have a lot of complicated pseudo-objectives / shards relating to humans. These objectives may not be good - and if they are they probably won't constitute alignment, but I wouldn't be surprised if it were enough to make it do something more complicated than simply eliminating us for instrumental reasons.

Of course the AI might undergo a reflection process leading to a coherent utility function when it self-improves, but I expect it to be a fairly complicated one, assigning some sort of valence to humans. We might also have some time before it does that, or be able to guide this values-handshake between shards collaboratively.

Comment by Jordan Taylor (Nadroj) on Inner and outer alignment decompose one hard problem into two extremely hard problems · 2022-12-20T03:15:16.526Z · LW · GW

I just wanted to say thanks for writing this. It is important, interesting, and helping to shape and clarify my views. 

I would love to hear a training story where a good outcome for humanity is plausibly achieved using these ideas. I guess it'd rely heavily on interpretability to verify what shards / values are being formed early in training, and regular changes to the training scenario and reward function to change them before the agent is capable enough to subvert attempts to be changed. 

Edit: I forgot you also wrote A shot at the diamond-alignment problem, which is basically this. Though it only assumes simple training techniques (no advanced interpretability) to solve a simpler problem. 

Comment by Jordan Taylor (Nadroj) on A review of the Bio-Anchors report · 2022-10-18T02:45:38.690Z · LW · GW

One small thing: when you first use the word "power", I thought you were talking about energy use rather than computational power. Although you clarify this in "A closer look at the NN anchor", I would get the wrong impression if I just read the hypotheses:

... TAI will run on an amount of power comparable to the human brain ...

 ... neural network which would use that much power ...

Maybe change "power" to "computational power" there? I expect biological systems to be much more strongly selected to minimize energy use than TAI systems would be, but the same is not true for computational power.