post by [deleted]

Comments sorted by top scores.

comment by johnswentworth · 2020-12-07T19:14:40.938Z · LW(p) · GW(p)

I'm going to have to spend some time unpacking the very compact notation in the post, but here are my initial reactions.

First, very clean proof of the lemma, well done there.

Second... if I'm understanding this correctly, each neuron activation (or set of neuron activations?) would contain all the information from some-part-of-data relevant to some-other-part-of-data and the output. So it's of roughly the right form for neuron activations to encode abstractions [LW · GW]. That would lend direct support to the hypothesis that neurons/small sets of neurons in nets often end up encoding human-like concepts because both humans and NNs are learning natural abstractions.

Better yet, it looks like the OP gives a recipe for unpacking those natural abstractions? That would give a potentially-really-powerful tool for transparency, allowing us to directly test some of the hypotheses in alignment by default [LW · GW], and potentially addressing some of the bottlenecks in that whole approach.

The more I think about it, the more this post looks really exciting.

Replies from: zachary-robertson
comment by Past Account (zachary-robertson) · 2020-12-08T16:06:09.644Z · LW(p) · GW(p)

> I'm going to have to spend some time unpacking the very compact notation in the post, but here are my initial reactions.

I should apologize a bit for that. To a degree I wasn't really thinking about any of the concepts in the title and only saw the connection later.

> First, very clean proof of the lemma, well done there.

Thanks!

> Second... if I'm understanding this correctly, each neuron activation (or set of neuron activations?) would contain all the information from some-part-of-data relevant to some-other-part-of-data and the output.

To be honest, I haven't thought about interpreting the monad beyond the equivalence with neural networks. One thing I noticed early on is that you can create sequences of activations that delete information in the limit. For example, the ReLU activation is the limit of the soft maximum (log-sum-exp) as you increase the log base. I think something like this could be seen as abstracting away unnecessary data.
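To illustrate that limit, here is a minimal numerical sketch (my own, not code from the post; the helper name soft_max_base is just illustrative): the base-b soft maximum log_b(1 + b^x) approaches ReLU(x) = max(0, x) pointwise as b grows.

```python
import numpy as np

def soft_max_base(x, b):
    """Smooth maximum of 0 and x with log base b: log_b(1 + b**x)."""
    # Computed as logaddexp(0, x*ln(b)) / ln(b) for numerical stability.
    return np.logaddexp(0.0, x * np.log(b)) / np.log(b)

x = np.linspace(-3.0, 3.0, 7)
for b in (2.0, 10.0, 1e3, 1e6):
    print(f"b={b:g}:", np.round(soft_max_base(x, b), 4))
print("ReLU: ", np.maximum(x, 0.0))
```

As the base increases, the smooth curve collapses onto max(0, x): every negative input gets mapped to the same value, which is the sense in which information gets deleted in the limit.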

> Better yet, it looks like the OP gives a recipe for unpacking those natural abstractions?

I'm not sure. I do think the method can justify the reuse of components (queries), and I wouldn't be surprised if this is a prerequisite for interpreting network outputs. Most of my interest comes from trying to formalize the (perhaps obvious) idea that anything that can be reduced to a sequence of classifications can be used to systematically translate high-level reasoning about these processes into a neural network.

I guess it's best to give an example of how I currently think about abstraction. Say we take the position that every object is completely determined by the information contained in a set of queries about it. For a picture, consider designing a game avatar (Mii character) by fiddling around with some knobs. The formalism lets us package observations as queries using return. Thus, we're hypothesizing that we can take a large collection of queries and make them equivalent to a small set of queries. Said another way, we can answer a large collection of queries by answering a much smaller set of 'principal' queries. In fact, if our activation were linear, we'd be doing PCA.

How we decide to measure success determines what abstraction is learned. If we only use the principal queries to answer a few queries, then we're basically doing classification. However, if they have to be able to answer every query about the object, then we're doing auto-encoding.
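Here is a minimal numpy sketch of the linear case described above (my own illustration; names like knobs, principal_queries, and codes are not notation from the post). Objects are described by their answers to many queries, a handful of 'principal' queries are recovered by SVD/PCA, and the difference between auto-encoding and classification is whether the compressed answers must recover every query or only a chosen few.

```python
import numpy as np

rng = np.random.default_rng(0)

# Each row is one object, described by its answers to many queries
# (here: 50 noisy linear functions of 3 underlying "knobs").
n_objects, n_queries, n_knobs = 200, 50, 3
knobs = rng.normal(size=(n_objects, n_knobs))
mixing = rng.normal(size=(n_knobs, n_queries))
answers = knobs @ mixing + 0.01 * rng.normal(size=(n_objects, n_queries))

# Linear case: the small set of "principal" queries is just PCA.
centered = answers - answers.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
principal_queries = vt[:n_knobs]          # 3 principal directions
codes = centered @ principal_queries.T    # answers to the principal queries

# Auto-encoding: the codes must reproduce the answers to *every* query.
reconstruction = codes @ principal_queries
print("reconstruction error:", np.abs(reconstruction - centered).max())

# Classification-like use: the codes only need to answer a few chosen
# queries, e.g. the first two original ones.
few = centered[:, :2]
coef, *_ = np.linalg.lstsq(codes, few, rcond=None)
print("error on the two chosen queries:", np.abs(codes @ coef - few).max())
```

Swapping the linear map for a nonlinear activation turns this into the auto-encoder versus classifier distinction in the paragraph above.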