Comment by aa-m-sa on Developmental Stages of GPTs · 2020-07-28T23:32:12.516Z · score: 3 (2 votes) · LW · GW

(Reply to gwern's comment but not only addressing gwern.)

Concerning the planning question:

I agree that next-token prediction is consistent with some sort of implicit planning of multiple tokens ahead. I would phrase it a bit differently. Also, "implicit" is doing lot of work here

(Please someone correct me if I say something obviously wrong or silly; I do not know how GPT-3 works, but I will try to say something about how it works after reading some sources [1].)

The bigger point about planning, though, is that the GPTs are getting feedback on one word at a time in isolation. It's hard for them to learn not to paint themselves into a corner.

To recap what I have thus far got from [1]: GPT-3-like transformers are trained by regimen where the loss function evaluates prediction error of the next word in the sequence given the previous word. However, I am less sure if one can say they do it in isolation. During training (by SGD I figure?), transformer decoder layers have (i) access to previous words in the sequence, and (ii) both attention and feedforward parts of each transformer layer has weights (that are being trained) to compute the output predictions. Also, (iii) the GPT transformer architecture considers all words in each training sequence, left to right, masking the future. And this is done for many meaningful Common Crawl sequences, though exact same sequences won't repeat.

So, it sounds a bit trivial that GPTs trained weights allow "implicit planning": if given a sequence of words w_1 to w_i-1 GPT would output word w for position i, this is because a trained GPT model (loosely speaking, abstracting away many details I don't understand) "dynamically encodes" many plausible "word paths" to word w, and [w_1 ... w_i-1] is such a path; by iteration, it also encodes many word paths from w to other words w', where some words are likelier to follow w than others. The representations in the stack of attention and feedforward layers allows it to generate text much more better than eg old good Markov chain. And "self-attending" to some higher-level representation that allows it generate text in particular prose style seems a lot of like a kind of plan. And GPT generating text that it used as input to it, to which it again can selectively "attend to", this all seems like as a kind of working memory, which will trigger self-attention mechanism to take certain paths, and so on.

I also want highlight oceainthemiddleofanisland's comment in other thread: Breaking complicated generation tasks into smaller chunks getting GPT to output intermediate text from initial input, which is then given as input to GPT to reprocess, enabling it finally to output desired output, sounds quite compatible to this view.

(On this note, I am not sure what to think of the role of human in the loop here, or in general, how it apparently requires non-trivial work to find a "working" prompt that seeds GPT obtain desired results for some particularly difficult tasks. That there are useful, rich world models "in there somewhere" in GPTs weights, but it is difficult to activate them? And are these difficulties because it is humans are bad at prompting GPT to generate text that accesses the good models, or because GPTs all-together model is not always so impressive as it easily turns into building answers based on gibberish models instead of the good ones, or maybe GPT having a bad internal model of humans attempting to use GPT? Gwern's example concerning bear attacks was interesting here.)

This would be "implicit planning". Is it "planning" enough? In any case, the discussion would be easier if we had a clearer definition what would constitute planning and what would not.

Finally, a specific response to gwerns comment.

During each forward pass, GPT-3 probably has plenty of slack computation going on as tokens will differ widely in their difficulty while GPT-3's feedforward remains a fixed-size computation; just as GPT-3 is always asking itself what sort of writer wrote the current text, so it can better imitate the language, style, format, structure, knowledge limitations or preferences* and even typos, it can ask what the human author is planning, the better to predict the next token. That it may be operating on its own past completions and there is no actual human author is irrelevant - because pretending really well to be an author who is planning equals being an author who is planning! (Watching how far GPT-3 can push this 'as if' imitation process is why I've begun thinking about mesa-optimizers and what 'sufficiently advanced imitation' may mean in terms of malevolent sub-agents created by the meta-learning outer agent.)

Using language how GPT-3 is "pretending" and "asking itself what a human author would do" can be maybe justified as metaphors, but I think it is a bit fuzzy and may obscure differences between what transformers do when we say they "plan" or "pretend", and what people would assume of beings who "plan" or "pretend". For example, using a word like "pretend" easily carries over an implication that there is something true, hidden, "unpretense" thinking or personality going on underneath. This appears quite unlikely given a fixed model, and generation mechanism that starts anew from each seed prompt. I would rather say that GPT has a model (is a model?) that is surprisingly good at natural language extrapolation and also, it is surprising at what can be achieved by extrapolation.

[1] , and in addition to skimming original OpenAI papers

Comment by aa-m-sa on Developmental Stages of GPTs · 2020-07-28T09:13:05.225Z · score: 3 (2 votes) · LW · GW

I contend it is not an *implementation* in a meaningful sense of the word. It is more a prose elaboration / expansion of the first generated bullet point list (an inaccurate one: "plan" mentions chopping vegetables, putting them in a fridge and cooking meat; prose version tells of chopping a set of vegetables, skips the fridge and then cooks beef, and then tells an irrelevant story where you go to sleep early and find it is a Sunday and no school).

Mind, substituting abstract category words with sensible more specific ones (vegetables -> carrots, onions and potatoes) is an impressive NLP task for an architecture where the behavior is not hard-coded in (because that's how some previous natural language generators worked), and even more impressive that it can produce the said expansion with a NLP input prompt, but hardly a useful implementation of a plan.

An improved experiment of "implementing plans" that could be within capabilities of GPT-3 or similar system: get GPT-3 to first output a plan of doing $a_thing and then the correct keystroke sequence input for UnReal World, DwarfFortress or Sims or some other similar simulated environment to produce it.

Comment by aa-m-sa on Self-sacrifice is a scarce resource · 2020-07-27T17:45:57.652Z · score: 1 (1 votes) · LW · GW

At the risk of stating very much the very obvious:

Trolley problem (or the fat man variant) is a wrong metaphor for near any ethical decision, anyway, as there are very few real life ethical dilemmas that are as visceral and require immediate action from very few limited set of options and whose consequences are nevertheless as clear.

Here is a couple of a bit more realistic matter of life and death. There are many stories (probably I could find factual accounts, but I am too lazy to search for sources) of soldiers who make the snap decision to save the lives of rest of their squad by jumping on a thrown hand grenade. Yet I doubt very few would cast much blame on anyone who had a chance of taking cover, and did that instead. (I wouldn't.) Moreover, the generals who demand prisoners (or agitate impressionable recruits) to clear a minefield without proper training or equipment are to be much frowned upon. And of course, there are untold possibilities to commit a dumb self-sacrifice that achieves nothing.

It general, a military force can not be very effective without people willing to put themselves in danger: if one finds oneself agreement with existence of states and armies, some amount of self-sacrifice follows naturally. For this reason, there are acts of valor who are viewed positively and to be cultivated. Yet, there are also common Western moral sentiments which dictate that it is questionable or outright wrong to require the unreasonable of other people, especially if the benefactors or the people doing the requiring are contributing relatively little themselves (sentiment demonstrated here by Blackadder Goes Forth). And in some cases drawing a judgement is generally considered difficult.

(What one should make of the Charge of the Light Brigade? I am not a military historian, but going by the popular account, the order to charge was stupid, negligent, mistake, or all of the three. Yet to some people, there is something inspirational in the foolishness of soldiers fulfilling the order; others would see such vies as abhorrent legend-building propaganda that devalues human life.)

In summary, I have not much concrete conclusions to offer, and anyway, details from one context (here, military) do not translate necessarily very well into other aspects of life. In some situations, (some amount of) self-sacrifice may be a good option, maybe even the best or only option for obtaining some outcomes, and it can be good thing to have around. On the other hand, in many situations it is wrong or contentious to require large sacrifices from others, and people who do so (including also extreme persuasion leading to voluntary self-sacrifice) are condemned as taking unjust advantage of others. Much depends on the framing.

As reader may notice, I am not arguing from any particular systematic theory of ethics, but rehashing my moral intuitions what is considered acceptable in West, assuming there is some signal of ethics in there.

Comment by aa-m-sa on Maths writer/cowritter needed: how you can't distinguish early exponential from early sigmoid · 2020-05-06T17:47:17.776Z · score: 11 (6 votes) · LW · GW

"Non-identifiability", by the way, is the search term that does the trick and finds something useful. Please see: Daly et al. [1], section 3. They study indentifiability characteristics of logistic sigmoid (that has rate r and goes from zero to carrying capacity K at t=0..30) via Fisher information matrix (FIM). Quote:

When measurements are taken at times t ≤ 10, the singular vector (which is also the eigenvector corresponding to the single non-zero eigenvalue of the FIM) is oriented in the direction of the growth rate r in parameter space. For t ≤ 10, the system is therefore sensitive to changes in the growth rate r, but largely insensitive to changes in the carrying capacity K. Conversely, for measurements taken at times t ≥ 20, the singular vector of the sensitivity matrix is oriented in the direction of the growth rate K[sic], and the system is sensitive to changes in the carrying capacity K but largely insensitive to changes in the growth rate r. Both these conclusions are physically intuitive.

Then Daly et al. proceed with MCMC scheme to numerically show that samples at different parts of time domain result in different identifiability of rate and carrying capacity parameters (Figure 3.)

[1] Daly, Aidan C., David Gavaghan, Jonathan Cooper, and Simon Tavener. “Inference-Based Assessment of Parameter Identifiability in Nonlinear Biological Models.” Journal of The Royal Society Interface 15, no. 144 (July 31, 2018): 20180318.


To clarify, because someone might miss it: this is not only a reply to shminux. Daly et al 2018 is (to some extent) the paper Stuart and others are looking for, at least if you are satisfied with their approach by looking what happens to effective Fisher information of logistic dynamics before and after inflection, supported by numerical inference methods showing that identifiability is difficult. (Their reference list also contains a couple of interesting articles about optimal design for logistic, harmonic models etc.)

Only thing missing that one might want AFAIK is a general analytical quantification of the amount of uncertainty, and comparison to specifically exponential (maybe along the lines Adam wrote there), and maybe writing it up in easy to digest format.

Comment by aa-m-sa on Maths writer/cowritter needed: how you can't distinguish early exponential from early sigmoid · 2020-05-06T17:40:18.061Z · score: 1 (1 votes) · LW · GW

Was momentarily confused what is k (sometimes denotes carrying capacity in the logistic population growth model), but apparently it is the step size (in numerical integrator)?

I have not enough expertise here to speak like an expert, but it seems that stiffness would be related in a roundabout way. It seems to describe difficulties of some numerical integrators with systems like this: the integrator can veer much off of true logistic curve with insufficiently small steps because the differential changes fast.

The phenomenon seems to be more about non-sensitivity than sensitivity of solution to parameters (or to be precise, non-identifiability of parameters): part of the solution before inflection seems to change very little to changes in "carrying capacity" (curve maximum) parameter.

Comment by aa-m-sa on Maths writer/cowritter needed: how you can't distinguish early exponential from early sigmoid · 2020-05-06T13:04:23.258Z · score: 6 (4 votes) · LW · GW

I was going to suggest that maybe it could be a known and published result in dynamical systems / population dynamics literature, but I am unable to find anything with Google, and textbooks I have at hand, while plenty mentions of logistic growth models, do not discuss prediction from partial data before inflection point.

On the other hand, it is fundamentally a variation on the themes of difficulty in model selection with partial data and dangers of extrapolation, which are common in many numerical textbooks.

If anyone wishes to flesh it out, I believe this behavior is not limited to trying to distinguish exponentials from logistic curves (or different logistics from each other), but also distinguishing different orders of growth from each other in general. With a judicious choice of data range and constants, it is not difficult to create a set of noisy points which could be either from a particular exponential or a particular quadratic curve. Quick example: (And if you limit data point range you are looking at to 0 to 2, it is quite impossible to say if a linear model wouldn't also be plausible.)

Comment by aa-m-sa on Against strong bayesianism · 2020-05-04T23:29:09.257Z · score: 10 (3 votes) · LW · GW

I am happy that you mention Gelman's book (I am studying it right now). I think lots of "naive strong bayesianists" would improve from a thoughtful study of the BDA book (there are lots of worked out demos and exercises available for it) and maybe some practical application of Bayesian modelling to some real-world statistical problems. The practice of "Bayesian way of life" of "updating my priors" sounds always a bit too easy in contrast to doing a genuine statistical inference.

For example, a couple of puzzles I am still myself unsure how to answer properly and with full confidence: Why one would be interested in doing stratified random sampling with your epidemiological study instead of naive "collect every data point that you see and then do a Bayesian update?" Or how multiple comparisons corrections for classical frequentist p-values map into Bayesian statistical framework? Does it matter for LWian Bayesianism if you are doing your practical statistical analyses with frequentist or Bayesian analysis tools (especially if many frequentist methods can be seen as clever approximations to full Bayesian model, see e.g. discussion of Kneser-Ney smoothing as ad hoc Pitman-Yor process inference here: ; similar relationship exists between k-means and EM-algorithm of Gaussian mixture model.) And if there is no difference, is the philosophical Bayesianism then actually that important -- or important at all -- for rationality?

Comment by aa-m-sa on Open & Welcome Thread—May 2020 · 2020-05-04T17:12:02.647Z · score: 8 (5 votes) · LW · GW

Howdy. I came across Ole Peters' "ergodicity economics" some time ago, and was interested to see what LW made of it. Apparently one set of skeptical journal club meetup notes:

I am not sure what to make of criticisms of Seattle meetups (they appear correct, but I am not sure if they are relevant; see my comment there).

Not planning to write a proper post, but here is an example blog post of Peters which I found illustrative and demonstrates why I think the "ergodicity way of thinking" might have something in it: . In summary, looking at the aggregate ensemble quantity such GDP per capita does not tell much what happens to individuals in the ensemble: the typical individual experienced growth in population in general is not related to GDP growth per capita (which may be obvious to a numerate person but not necessarily so, given the importance given to GDP in public discussion). And if one takes average of exponential growth rate, one obtains a measure (geometric mean income that they dub "DDP") known in economics literature, but originally derived otherwise.

But maybe this looks insightful to me because I am not that very well-versed in economics literature, so it would be nice to have some critical discussion about this.

Comment by aa-m-sa on Meetup Notes: Ole Peters on ergodicity · 2020-05-04T16:28:34.511Z · score: 1 (1 votes) · LW · GW

Peters' December 2019 Nature Physics paper ( ) provides some perspective on 0.6/1.5x coin flip example and other conclusions of the above discussion. (If Peters' claims have changed along the way, I wouldn't know.)

In my reading, there Peters' basic claim is not that ergodicity economics can solve the coin flip game in a way that classical economics can not (because it can, by switching to expected log wealth utility instead of expected wealth), but the utility functions as originally presented are a clutch that misinforms us on people's psychological motives in doing economic decisions. So, while the mathematics of many parts stays the same, the underlying phenomena can be more saliently reasoned about by looking at the individual growth rates in context of whether the associated wealth "process" is additive or multiplicative or something else. Thus there is also less need to use lingo where people may have an (innate, weirdly) "risk-averse utility function" (as compared to some other less risk-averse theoretical utility function).