comment by Past Account (zachary-robertson) · 2020-05-06T00:42:10.410Z · LW(p) · GW(p)

I'm interested in converting notes I have about a few topics into posts here. I was really trying to figure out why this would be a good use of my time. The notes are already rather readable to me. I thought about this for a while and it seems as though I'm explicitly interested in getting feedback on some of my thought processes. I'm aware of Goodhart's law, so I know better than to have an empty plan to simply maximize my karma. However, on the other end, I don't want to simply polish notes. If I were to constrain myself to only write about things I have notes on, then it seems I could once again explicitly try to maximize karma. In fact, if I felt totally safe doing this, it'd be a fun game to try out, possibly even comment on. Of course, eventually, the feedback I'd receive would start to warp what kinds of future contributions I'd make to the site, but this seems safe. Given all of this, I'd conclude I can explicitly maximize different engagement metrics, at least at first.

Replies from: Pattern

↑ comment by Pattern · 2020-05-06T18:51:49.201Z · LW(p) · GW(p)

What's the subject?

What are the different engagement metrics you're planning on using?

Have you considered doing something like a Q&A?

Replies from: zachary-robertson

↑ comment by Past Account (zachary-robertson) · 2020-05-10T03:36:07.405Z · LW(p) · GW(p)

Not really sure. If I were really going for it, I could do about 15-25 posts. I'm going back and forth on which metrics to use. This seems highly tied to what I actually want feedback on. What do you mean by Q&A?

comment by Past Account (zachary-robertson) · 2020-05-10T03:45:39.863Z · LW(p) · GW(p)

If we're taking the idea that arguments are paths in topological space seriously, I feel like conditioned language models are going to be really important. We already use outlines to effectively create regression data-sets to model arguments. It seems like modifying GPT-2 so that you can condition on start/end prompts would be incredibly helpful here. More speculatively, I think that GPT-2 is near the best we'll ever get at next-word prediction. Humans use outline-like thinking much more often than is commonly supposed.

Replies from: zachary-robertson, zachary-robertson, Harmless

↑ comment by Past Account (zachary-robertson) · 2020-05-11T02:57:03.630Z · LW(p) · GW(p)

I think it's worth taking a look at what's out there:

SpanBERT
- Uses random spans to do masked pre-training
- Seems to indicate that using longer spans is essentially difficult
Distillation of BERT Models
- BERT embeddings are hierarchical

↑ comment by Past Account (zachary-robertson) · 2020-05-11T02:13:00.658Z · LW(p) · GW(p)

Markov and general next-token generators work well when conditioned with text. While some models, such as BERT, are able to predict masked tokens, I'm not aware of models that are able to generate the most likely sentence that would sit between given start and end prompts.

It's worth working in the Markov setting to get a grounding for what we're looking for. The core of a Markov model is the transition matrix $P_{i j}$, which tells us the conditional likelihood of the token $j$ following immediately after the token $i$. The rules of conditional probability, together with the Markov property $p (j | k, i) = p (j | k)$, allow us to write,

$p (k | j, i) = \frac{p (j, k | i)}{p^{2} (j | i)} = \frac{p (j | k) p (k | i)}{p^{2} (j | i)}$

This gives us the probability of a token $k$ occurring immediately between the start/end prompts. Here $p^{2} (j | i) = \sum_{k'} p (j | k') p (k' | i)$ is the two-step transition probability, since $j$ sits two steps after $i$. In general we're interested in what happens if we 'travel' from the starting token $i$ to the ending token $j$ over $T$ time steps. Say we want to see the distribution of tokens at time step $t < T$. Then we'd write,

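$p^{t} (k | j, i) = \frac{p^{T - t} (j | k) p^{t} (k | i)}{p^{T} (j | i)} = \frac{(e_{k}^{\top} P^{T - t} e_{j}) (e_{i}^{\top} P^{t} e_{k})}{e_{i}^{\top} P^{T} e_{j}}$

Here $e_{i}$ is the one-hot basis vector for token $i$, so $e_{i}^{\top} P^{t} e_{k} = (P^{t})_{i k}$ is the probability of reaching $k$ from $i$ in $t$ steps.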

This shows us that we can break up the conditional generation process into a calculation over transition probabilities. We could write this out for an arbitrary sequence of separated words. From this perspective we'd be training a model to perform a regression over the words being generated. This is the sense in which we already use outlines to effectively create regression data-sets to model arguments.
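To make the Markov case concrete, here is a minimal sketch of the calculation in plain Python. It assumes the row convention from above (entry `[i][j]` of the matrix is the probability of token `j` following token `i`). The 3-token matrix `P` and the function names (`mat_mul`, `mat_pow`, `bridge_distribution`) are made up for illustration and aren't taken from any library.

```python
from typing import List

Matrix = List[List[float]]


def mat_mul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[r][m] * b[m][c] for m in range(n)) for c in range(n)]
            for r in range(n)]


def mat_pow(p: Matrix, steps: int) -> Matrix:
    """Return p raised to a non-negative integer power (p^0 is the identity)."""
    n = len(p)
    result = [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]
    for _ in range(steps):
        result = mat_mul(result, p)
    return result


def bridge_distribution(p: Matrix, start: int, end: int,
                        t: int, total: int) -> List[float]:
    """Distribution over the token at step t, given token `start` at step 0
    and token `end` at step `total`. Rows of p are "from", columns are "to"."""
    if not 0 <= t <= total:
        raise ValueError("t must satisfy 0 <= t <= total")
    forward = mat_pow(p, t)               # forward[start][k] = p^t(k | start)
    backward = mat_pow(p, total - t)      # backward[k][end] = p^(T-t)(end | k)
    norm = mat_pow(p, total)[start][end]  # p^T(end | start)
    if norm == 0.0:
        raise ValueError("end token is unreachable from start in `total` steps")
    return [forward[start][k] * backward[k][end] / norm for k in range(len(p))]


# Hypothetical 3-token chain (illustrative numbers only); each row sums to 1.
P = [
    [0.1, 0.6, 0.3],
    [0.4, 0.2, 0.4],
    [0.5, 0.3, 0.2],
]

# Distribution of the single token sitting between start token 0 and end token 2.
middle = bridge_distribution(P, start=0, end=2, t=1, total=2)
print(middle)
```

Calling `bridge_distribution` for each `t` between 0 and `total` gives the per-step marginals of a path pinned at both ends. Drawing an actual path would mean sampling step by step and re-conditioning on the token just drawn, which this sketch doesn't do.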

What would be ideal is to find a way to generalize this to a non-Markovian, preferably deep-learning, setting. This is where I'm stuck at the moment. I'd want to understand where the SOTA is on this. The only options that immediately come to mind seem to be tree-search over tokens or RL. From the regression point of view, it seems like you'd want to try fitting the 'training data' such that the likelihood for the result is as high as possible.

↑ comment by Harmless · 2020-05-10T11:53:23.665Z · LW(p) · GW(p)

I don't know if this is already known, but you might be interested in the fact that you can currently use start prompts for GPT-2.

Replies from: zachary-robertson

↑ comment by Past Account (zachary-robertson) · 2020-05-11T02:29:48.606Z · LW(p) · GW(p)

I'm aware of this. I'm slowly piecing together what I'm looking for if you decide to follow this.