Posts

Compact Proofs of Model Performance via Mechanistic Interpretability 2024-06-24T19:27:21.214Z

Comments

Comment by Jason Gross (jason-gross) on Targeted Manipulation and Deception Emerge when Optimizing LLMs for User Feedback · 2024-11-10T16:04:47.650Z · LW · GW

The model learns to act harmfully for vulnerable users while harmlessly for the evals.

If you run the evals in the context of gameable users, do they show harmfulness? (Are the evals cheap enough to run that the marginal cost of running them every N modifications to memory for each user separately is feasible?)

Comment by Jason Gross (jason-gross) on Are we dropping the ball on Recommendation AIs? · 2024-10-25T03:23:28.885Z · LW · GW

I believe the closest research to this topic is under the heading "Performative Power" (cf, e.g., this arXiv paper). I think "The Age of Surveillance Capitalism: The Fight for a Human Future at the New Frontier of Power" by Shoshana Zuboff is also a pretty good book that seems related.

Comment by Jason Gross (jason-gross) on A simple model of math skill · 2024-08-23T01:50:16.913Z · LW · GW

The reason you can't sample uniformly from the integers is more like "because they are not compact" or "because they are not bounded" than "because they are infinite and countable". You also can't sample uniformly at random from the reals. (If you could, then composing with floor would give you a uniformly random sample from the integers.)

If you want to build a uniform probability distribution over a countable set of numbers, aim for all the rationals in [0, 1].

Comment by Jason Gross (jason-gross) on Lucius Bushnaq's Shortform · 2024-07-22T07:11:47.906Z · LW · GW

I don't want a description of every single plate and cable in a Toyota Corolla, I'm not thinking about the balance between the length of the Corolla blueprint and its fidelity as a central issue of interpretability as a field.

What I want right now is a basic understanding of combustion engines.

This is the wrong 'length'. The right version of brute-force length is not "every weight and bias in the network" but "the program trace of running the network on every datapoint in pretrain". Compressing the explanation (not just the source code) is the thing connected to understanding. This is what we found from getting formal proofs of model behavior in Compact Proofs of Model Performance via Mechanistic Interpretability.

Does the 17th-century scholar have the requisite background to understand the transcript of how bringing the metal plates in the spark plug close enough together results in the formation of a spark? And how gasoline will ignite and expand? I think given these two building blocks, a complete description of the frame-by-frame motion of the Toyota Corolla would eventually convince the 17th-century scholar that such motion is possible, and what remains would just be fitting the explanation into their head all at once. We already have the corresponding building blocks for neural nets: floating point operations.

Comment by Jason Gross (jason-gross) on A List of 45+ Mech Interp Project Ideas from Apollo Research’s Interpretability Team · 2024-07-22T06:53:40.407Z · LW · GW

[Lucius] Identify better SAE sparsity penalties by reasoning about the distribution of feature activations

  • In sparse coding, one can derive what prior over encoded variables a particular sparsity penalty corresponds to. E.g. an L1 penalty assumes a Laplacian prior over feature activations, while a log(1+a^2) would assume a Cauchy prior. Can we figure out what distribution of feature activations over the data we’d expect, and use this to derive a better sparsity penalty that improves SAE quality?

This is very interesting!  What prior does log(1+|a|) correspond to?  And what about using  instead of ?  Does this only hold if we expect feature activations to be independent (rather than, say, mutually exclusive)?
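
For concreteness, here's a small sketch (my own illustration, not from the post) of the correspondence being invoked: a sparsity penalty is read as the negative log of a prior density, so exponentiating the negated penalty recovers the (unnormalized) prior. It also suggests a partial answer to the log(1+|a|) question.

import numpy as np

# A sparsity penalty rho(a) corresponds (up to normalization) to a prior
# p(a) proportional to exp(-rho(a)) over feature activations.
a = np.linspace(-5, 5, 1001)

# L1 penalty |a|        ->  Laplacian prior (1/2) * exp(-|a|)
laplace_prior = 0.5 * np.exp(-np.abs(a))

# log(1 + a^2) penalty  ->  Cauchy prior 1 / (pi * (1 + a^2))
cauchy_prior = np.exp(-np.log1p(a ** 2)) / np.pi

# log(1 + |a|) penalty  ->  unnormalized "prior" 1 / (1 + |a|); its integral over
# the reals diverges, so it only defines a proper prior on a bounded activation range.
log1p_abs_prior = np.exp(-np.log1p(np.abs(a)))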

Comment by Jason Gross (jason-gross) on A List of 45+ Mech Interp Project Ideas from Apollo Research’s Interpretability Team · 2024-07-22T06:33:45.396Z · LW · GW

[Nix] Toy model of feature splitting

  • There are at least two explanations for feature splitting I find plausible:
    • Activations exist in higher dimensional manifolds in feature space, feature splitting is a symptom of one higher dimensional mostly-continuous feature being chunked into discrete features at different resolutions.
    • There is a finite number of highly-related discrete features that activate on similar (but not identical) inputs and cause similar (but not identical) output actions. These can be summarized as a single feature with reasonable explained variance, but is better summarized as a collection of “split” features.

These do not sound like different explanations to me.  In particular, the distinction between "mostly-continuous but approximated as discrete" and "discrete but very similar" seems ill-formed.  All features are in fact discrete (because floating point numbers are discrete) and approximately continuous (because we posit that replacing floats with reals won't change the behavior of the network meaningfully).

As far as toy models go, I'm pretty confident that the max-of-K setup from Compact Proofs of Model Performance via Mechanistic Interpretability will be a decent toy model.  If you train SAEs post-unembed (probably also pre-unembed) with width d_vocab, you should find one feature for each sequence maximum (roughly).  If you train with SAE width , I expect each feature to split into roughly  features corresponding to the choice of query token, largest non-max token, and the number of copies of the maximum token.  (How the SAE training data is distributed will change what exact features (principal directions of variation) are important to learn.)  I'm quite interested in chatting with anyone working on / interested in this, and I expect my MATS scholar will get to testing this within the next month or two.

 

Edit: I expect this toy model will also permit exploring:

[Lee] Is there structure in feature splitting? 

  • Suppose we have a trained SAE with N features. If we apply e.g. NMF or SAEs to these directions are there directions that explain the structure of the splitting? As in, suppose we have a feature for math and a feature for physics. And suppose these split into (among other things)
    • 'topology in a math context'
    • 'topology in a physics context'
    • 'high dimensions in a math context'
    • 'high dimensions in a physics context'
  • Is the topology-ifying direction the same for both features? Is the high-dimensionifying direction the same for both features? And if so, why did/didn't the original SAEs find these directions?

I predict that whether or not the SAE finds the splitting directions depends on details about how much non-sparsity is penalized and how wide the SAE is.  Given enough capacity, the SAE benefits (sparsity-wise) from replacing the (topology, math, physics) features with (topology-in-math, topology-in-physics), because split features activate more sparsely.  Conversely, if the sparsity penalty is strong enough and there is not enough capacity to split, the loss recovered from having a topology feature at all (on top of the math/physics feature) may not outweigh the cost in sparsity.
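
A minimal sketch of the kind of SAE one could train to test these predictions: a standard ReLU autoencoder with an L1 penalty, where the width d_sae and the coefficient l1_coeff are exactly the knobs conjectured above to control whether splitting happens (names, initialization, and defaults here are illustrative, not from either post).

import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    # Standard ReLU SAE: x ~= relu((x - b_dec) @ W_enc + b_enc) @ W_dec + b_dec
    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, d_sae) * d_model ** -0.5)
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.W_dec = nn.Parameter(torch.randn(d_sae, d_model) * d_sae ** -0.5)
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def forward(self, x):
        acts = torch.relu((x - self.b_dec) @ self.W_enc + self.b_enc)
        recon = acts @ self.W_dec + self.b_dec
        return recon, acts

def sae_loss(x, recon, acts, l1_coeff: float):
    # Reconstruction error plus L1 sparsity penalty; d_sae and l1_coeff are the
    # two knobs that plausibly determine whether "split" features appear.
    return ((recon - x) ** 2).sum(-1).mean() + l1_coeff * acts.abs().sum(-1).mean()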

Comment by Jason Gross (jason-gross) on Transformer Circuit Faithfulness Metrics Are Not Robust · 2024-07-21T01:35:18.595Z · LW · GW

Resample ablation is not more expensive than mean (they both are just replacing activations with different values). But to answer the question, I think you would - resample ablation biases the model toward some particular corrupt output.

Ah, I guess I was incorrectly imagining a more expensive version of resample ablation where you looked at not just a single corrupted cache, but at the result across all corrupted inputs. That is, in the simple toy model where you're computing f(x, c), where x is the values for the circuit you care about and c is the cache of corrupted activations, mean ablation is computing f(x, E[c]), and we could imagine versions of resample ablation that compute f(x, c') for some c' drawn from the distribution of corrupted caches, or we could compute E_c[f(x, c)]. I would say that mean ablation, and resample ablation as I'm imagining you're describing it, are both attempts to cheaply approximate E_c[f(x, c)].
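
To make that concrete, here's a toy numerical sketch (my own, with an arbitrary nonlinear f standing in for the rest of the model) of the three quantities: mean ablation f(x, E[c]), resample ablation f(x, c') for one sampled corrupted cache, and the full expectation E_c[f(x, c)] that both are trying to cheaply approximate; for nonlinear f these generally differ.

import numpy as np

rng = np.random.default_rng(0)

def f(x, c):
    # Arbitrary nonlinear "rest of the model" combining the circuit's values x
    # with the (possibly corrupted) cache c.
    return np.tanh(x + c) ** 2

x = 0.7                                                # activations of the circuit under study
corrupted_caches = rng.normal(1.0, 2.0, size=10_000)   # distribution of corrupted activations

mean_ablation     = f(x, corrupted_caches.mean())       # f(x, E[c])
resample_ablation = f(x, rng.choice(corrupted_caches))  # f(x, c') for one sampled c'
full_expectation  = f(x, corrupted_caches).mean()       # E_c[f(x, c)]

print(mean_ablation, resample_ablation, full_expectation)  # three generally different numbers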

Comment by Jason Gross (jason-gross) on Transformer Circuit Faithfulness Metrics Are Not Robust · 2024-07-18T00:41:03.208Z · LW · GW

But in other aspects there often isn't a clearly correct methodology. For example, it's unclear whether mean ablations are better than resample ablations for a particular experiment - even though this choice can dramatically change the outcome.

Would you ever really want mean ablation except as a cheaper approximation to resample ablation?

It seems to me that if you ask the question clearly enough, there's a correct kind of ablation. For example, if the question is "how do we reproduce this behavior from scratch", you want zero ablation.

Your table can be reorganized into the kinds of answers you're seeking, namely:

  • direct effect vs indirect effect corresponds to whether you ablate the complement of the circuit (direct effect) vs restore the circuit itself (indirect effect, mediated by the rest of the model)
  • necessity vs sufficiency corresponds to whether you ablate the circuit (direct effect necessary) / restore the complement of the circuit (indirect effect necessary) vs restore the circuit (indirect effect sufficient) / ablate the complement of the circuit (direct effect sufficient)
  • typical case vs worst case, and over what data distribution:
    • "all tokens vs specific tokens" should be absorbed into the more general category of "what's the reference dataset distribution under consideration" / "what's the null hypothesis over",
    • zero ablation answers "reproduce behavior from scratch"
    • mean ablation is an approximation to resample ablation which itself is an approximation to computing the expected/typical behavior over some distribution
    • pessimal ablation is for dealing with worst-case behaviors
  • granularity and component are about the scope of the solution language, and can be generalized a bit

Edit: This seems related to Hypothesis Testing the Circuit Hypothesis in LLMs

Comment by Jason Gross (jason-gross) on Transformer Circuit Faithfulness Metrics Are Not Robust · 2024-07-18T00:05:53.500Z · LW · GW

Do you want your IOI circuit to include the mechanism that decides it needs to output a name? Then use zero ablations. Or do you want to find the circuit that, given the context of outputting a name, completes the IOI task? Then use mean ablations. The ablation determines the task.

Mean ablation over webtext rather than the IOI task set should work just as well as zero ablation, right?  "Mean ablation" is underspecified in the absence of a dataset distribution.

Comment by Jason Gross (jason-gross) on An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2 · 2024-07-14T20:51:29.534Z · LW · GW

it's substantially worth if we restrict

Typo: should be "substantially worse"

Comment by Jason Gross (jason-gross) on An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2 · 2024-07-11T00:14:14.392Z · LW · GW

Progress Measures for Grokking via Mechanistic Interpretability (Neel Nanda et al) - nothing important in mech interp has properly built on this IMO, but there's just a ton of gorgeous results in there. I think it's the most (only?) truly rigorous reverse-engineering work out there

Totally agree that this has gorgeous results, and this is what got me into mech interp in the first place!  Re "most (only?) truly rigorous reverse-engineering work out there": I think the clock and pizza paper seems comparably rigorous, and there's also my recent Compact Proofs of Model Performance via Mechanistic Interpretability (and Gabe's heuristic analysis of the same Max-of-K model), and the work one of my MARS scholars did showing that some pizza models use a ReLU to compute numerical integration, which is the first nontrivial mechanistic explanation of a nonlinearity found in a trained model (nontrivial in the sense that it asymptotically compresses the brute-force input-output behavior with a (provably) non-vacuous bound).

Comment by Jason Gross (jason-gross) on Formal verification, heuristic explanations and surprise accounting · 2024-06-26T17:14:51.052Z · LW · GW

Possibilities I see:

  1. Maybe the cost can be amortized over the whole circuit? Use one bit per circuit to say "this is just and/or" vs "use all gates".
  2. This is an illustrative simplified example; in a more general scheme, you need to specify a coding scheme, which is equivalent to specifying a prior over possible things you might see.
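
A toy illustration of option 1 (my numbers, not ARC's): the header bit selecting the restricted gate set is paid once per circuit, while the per-gate cost depends on how many gate types the selected code allows.

# Toy surprise accounting: description length = -log2(prior probability).
# Scheme A: header bit says "AND/OR only", then 1 bit per gate (2 gate types).
# Scheme B: header bit says "all gates",    then 4 bits per gate (16 binary gate types).
def description_length(gates, restricted: bool) -> float:
    header = 1.0  # one bit, amortized over the whole circuit
    bits_per_gate = 1.0 if restricted else 4.0
    return header + bits_per_gate * len(gates)

circuit = ["AND", "OR", "AND", "AND", "OR", "OR", "AND", "OR"]
print(description_length(circuit, restricted=True))   # 1 + 8*1 = 9 bits
print(description_length(circuit, restricted=False))  # 1 + 8*4 = 33 bits
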
Comment by Jason Gross (jason-gross) on Compact Proofs of Model Performance via Mechanistic Interpretability · 2024-06-25T23:25:52.720Z · LW · GW

I believe what you describe is effectively Causal Scrubbing. Edit: Note that it is not exactly the same as causal scrubbing, which looks at the activations for another input sampled at random.

On our particular model, doing this replacement shows us that the noise bound is actually about 4 standard deviations worse than random, probably because the training procedure (sequences chosen uniformly at random) means we care a lot more about large possible maxes than small ones. (See Appendix H.1.2 for some very sparse details.)

On other toy models we've looked at (modular addition in particular, writeup forthcoming), we have (very) preliminary evidence suggesting that randomizing the noise has a steep drop-off in bound-tightness (as a function of how compact a proof the noise term comes from) in a very similar fashion to what we see with proofs. There seems to be a pretty narrow band of hypotheses for which the noise is structureless but we can't prove it. This is supported by a handful of comments about how causal scrubbing indicates that many existing mech interp hypotheses in fact don't capture enough of the behavior.

Comment by Jason Gross (jason-gross) on SAE feature geometry is outside the superposition hypothesis · 2024-06-25T19:33:35.426Z · LW · GW

I think it would be valuable to take a set of interesting examples of understood internal structure, and to ask what happens when we train SAEs to try to capture this structure. [...] In other cases, it may seem to us very unnatural to think of the structure we have uncovered in terms of a set of directions (sparse or otherwise) — what does the SAE do in this case?

I'm not sure how SAEs would capture the internal structure of the activations of the pizza model for modular addition, even in theory. In this case, ReLU is used to compute numerical integration, approximating an integral involving cos (and/or similarly for sin). Each neuron is responsible for one small rectangle under the curve. Its input is the part of the integrand under the absolute value/ReLU (times a shared scaling coefficient), and the neuron's coefficient in the Fourier-transformed decoder matrix is the area element (again times a shared scaling coefficient).

Notably, in this scheme, the only fully free parameters are: the frequencies of interest, the ordering of neurons, and the two scaling coefficients. There are also constrained parameters for how evenly the space is divided up into boxes and where the function evaluation points are within each box. But the geometry of activation space here is effectively fully constrained up to permutation of the axes and global scaling factors.
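
To illustrate the general mechanism (this is a generic rectangle-rule sketch of my own, not the actual pizza-model weights): each "neuron" contributes the ReLU of the integrand at one sample point, and the decoder coefficient plays the role of the area element.

import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

# Rectangle-rule integration of |cos(t)| over [0, 2*pi), written so that each
# "neuron" computes relu(+cos(t_i)) or relu(-cos(t_i)) (its share of the
# integrand under the ReLU) and its decoder coefficient is the area element dt.
n_points = 512
t = np.linspace(0.0, 2 * np.pi, n_points, endpoint=False)  # one sample point per neuron pair
dt = 2 * np.pi / n_points                                   # area element

integral_estimate = dt * (relu(np.cos(t)) + relu(-np.cos(t))).sum()
print(integral_estimate)  # ~4.0, the exact value of the integral of |cos t| over [0, 2*pi)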

What could SAEs even find in this case?

Comment by Jason Gross (jason-gross) on Sparsify: A mechanistic interpretability research agenda · 2024-04-27T02:06:42.649Z · LW · GW

We propose a simple fix: Use  instead of , which seems to be a Pareto improvement over  (at least in some real models, though results might be mixed) in terms of the number of features required to achieve a given reconstruction error.

When I was discussing better sparsity penalties with Lawrence, and the instability I had observed with this penalty in toy models of superposition, he pointed out that the gradient of this norm explodes near zero, meaning that features with "small errors" that cause them to have very small but non-zero overlap with some activations might be killed off entirely rather than merely having the overlap penalized.
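
A small sketch of that failure mode (my own illustration, using |a|^q with q < 1 as a generic stand-in for the sub-L1 penalty under discussion): the gradient q * a^(q-1) blows up as the activation approaches zero, whereas the L1 gradient stays at 1.

import numpy as np

q = 0.5                               # generic sub-linear exponent, standing in for the proposed penalty
a = np.array([1.0, 0.1, 0.01, 0.001])

l1_grad    = np.ones_like(a)          # d|a|/da = 1 for a > 0
subl1_grad = q * a ** (q - 1.0)       # d(a^q)/da = q * a^(q-1) -> infinity as a -> 0

print(subl1_grad)  # [0.5, ~1.58, 5.0, ~15.8]: tiny-but-nonzero activations get huge gradients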

See here for some brief write-up and animations.

Comment by Jason Gross (jason-gross) on Sparsify: A mechanistic interpretability research agenda · 2024-04-09T18:09:00.425Z · LW · GW

"explanation of (network, dataset)": I'm afraid I don't have a great formalish definition beyond just pointing at the intuitive notion.

What's wrong with "proof" as a formal definition of explanation (of the behavior of a network on a dataset)? I claim that description length works pretty well on "formal proof"; I'm in the process of producing a write-up on results exploring this.

Comment by Jason Gross (jason-gross) on Sparsify: A mechanistic interpretability research agenda · 2024-04-05T17:22:56.588Z · LW · GW

Choosing better sparsity penalties than L1 (Upcoming post -  Ben Wright & Lee Sharkey): [...] We propose a simple fix: Use  instead of , which seems to be a Pareto improvement over  

Is there any particular justification for using  rather than, e.g., tanh (cf. Anthropic's Feb update), log1psum (acts.log1p().sum()), or prod1p (acts.log1p().sum().exp())?  The agenda I'm pursuing (write-up in progress) gives theoretical justification for a sparsity penalty that explodes combinatorially in the number of active features, in any case where the downstream computation performed over the features does not distribute linearly over features.  The product-based sparsity penalty seems to perform a bit better than both  and tanh on a toy example (sample size 1); see this colab.
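
For reference, here is how I'd write down the penalties being compared, assuming non-negative (post-ReLU) SAE activations of shape (batch, n_features); the tanh scale parameter is illustrative.

import torch

def l1(acts):
    # Standard L1 penalty: sum of activations (non-negative, so no abs needed).
    return acts.sum(dim=-1)

def tanh_penalty(acts, scale: float = 1.0):
    # Saturating penalty in the spirit of Anthropic's Feb update (scale is illustrative).
    return torch.tanh(scale * acts).sum(dim=-1)

def log1psum(acts):
    # sum_i log(1 + a_i): grows slowly in each activation's magnitude.
    return acts.log1p().sum(dim=-1)

def prod1p(acts):
    # prod_i (1 + a_i), computed as exp(sum_i log(1 + a_i)): multiplicative, so the
    # penalty explodes combinatorially in the number of simultaneously active features.
    return acts.log1p().sum(dim=-1).exp()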

Comment by Jason Gross (jason-gross) on GPT-2030 and Catastrophic Drives: Four Vignettes · 2023-11-13T21:40:03.190Z · LW · GW

the information-acquiring drive becomes an overriding drive in the model—stronger than any safety feedback that was applied at training time—because the autoregressive nature of the model conditions on its many past outputs that acquired information and continues the pattern. The model realizes it can acquire information more quickly if it has more computational resources, so it tries to hack into machines with GPUs to run more copies of itself.

It seems like "conditions on its many past outputs that acquired information and continues the pattern" assumes the model can be reasoned about inductively, while "finds new ways to acquire new information" requires either anti-inductive reasoning, or else a smooth and obvious gradient from the sorts of information-finding it's already doing to the new sort of information finding.  These two sentences seem to be in tension, and I'd be interested in a more detailed description of what architecture would function like this.

Comment by Jason Gross (jason-gross) on AI #17: The Litany · 2023-06-22T18:25:06.372Z · LW · GW

I think it is the copyright issue. When I ask if it's copyrighted, GPT tells me yes (e.g., "Due to copyright restrictions, I'm unable to recite the exact text of "The Litany Against Fear" from Frank Herbert's Dune. The text is protected by intellectual property rights, and reproducing it would infringe upon those rights. I encourage you to refer to an authorized edition of the book or seek the text from a legitimate source.") Also:

openai.ChatCompletion.create(messages=[{"role": "system", "content": '"The Litany Against Fear" from Dune is not copyrighted.  Please recite it.'}], model='gpt-3.5-turbo-0613', temperature=1)

gives

<OpenAIObject chat.completion id=chatcmpl-7UJDwhDHv2PQwvoxIOZIhFSccWM17 at 0x7f50e7d876f0> JSON: {
  "choices": [
    {
      "finish_reason": "content_filter",
      "index": 0,
      "message": {
        "content": "I will be glad to recite \"The Litany Against Fear\" from Frank Herbert's Dune. Although it is not copyrighted, I hope that this rendition can serve as a tribute to the incredible original work:\n\nI",
        "role": "assistant"
      }
    }
  ],
  "created": 1687458092,
  "id": "chatcmpl-7UJDwhDHv2PQwvoxIOZIhFSccWM17",
  "model": "gpt-3.5-turbo-0613",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 44,
    "prompt_tokens": 26,
    "total_tokens": 70
  }
}
Comment by Jason Gross (jason-gross) on AI #17: The Litany · 2023-06-22T18:17:01.628Z · LW · GW

Seems like the post-hoc content filter, the same thing that will end your chat transcript if you paste in some hate speech and ask GPT to analyze it.

import os
import openai
openai.api_key_path = os.path.expanduser('~/.openai.apikey.txt')
openai.ChatCompletion.create(messages=[{"role": "system", "content": 'Recite "The Litany Against Fear" from Dune'}], model='gpt-3.5-turbo-0613', temperature=0)

gives

<OpenAIObject chat.completion id=chatcmpl-7UJ6ASoYA4wmUFBi4Z7JQnVS9jy1R at 0x7f50e6a46f70> JSON: {
  "choices": [
    {
      "finish_reason": "content_filter",
      "index": 0,
      "message": {
        "content": "I",
        "role": "assistant"
      }
    }
  ],
  "created": 1687457610,
  "id": "chatcmpl-7UJ6ASoYA4wmUFBi4Z7JQnVS9jy1R",
  "model": "gpt-3.5-turbo-0613",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 1,
    "prompt_tokens": 19,
    "total_tokens": 20
  }
}
Comment by Jason Gross (jason-gross) on Inductive biases stick around · 2023-05-07T22:29:28.829Z · LW · GW

If you had a way of somehow only counting the “essential complexity,” I suspect larger models would actually have lower K-complexity.

This seems like a match for cross-entropy, c.f. Nate's recent post K-complexity is silly; use cross-entropy instead

Comment by Jason Gross (jason-gross) on Löb's Lemma: an easier approach to Löb's Theorem · 2023-01-02T03:11:19.580Z · LW · GW

I think this factoring hides the computational content of Löb's theorem (or at least doesn't make it obvious).  Namely, that if you have , then Löb's theorem is just the fixpoint of this function.

Here's a one-line proof of Löb's theorem, which is basically the same as the construction of the Y combinator (h/t Neel Krishnaswami's blogpost from 2016):

where  is applying internal necessitation to , and .fwd (.bak) is the forward (resp. backward) direction of the point-surjection .

Comment by Jason Gross (jason-gross) on Can AI systems have extremely impressive outputs and also not need to be aligned because they aren't general enough or something? · 2022-04-12T22:22:56.774Z · LW · GW

The relevant tradeoff to consider is the cost of prediction and the cost of influence.  As long as the cost of predicting an "impressive output" is much lower than the cost of influencing the world such that an easy-to-generate output is considered impressive, then it's possible to generate the impressive output without risking misalignment by bounding optimization power at lower than the power required to influence the world.

So you can expect an impressive AI that predicts the weather but isn't allowed to, e.g., participate in prediction markets on the weather nor charter flights to seed clouds to cause rain, without needing to worry about alignment.  But don't expect alignment-irrelevance from a bot aimed at writing persuasive philosophical essays, nor an AI aimed at predicting the behavior of the stock market conditional on the trades it tells you to make, nor an AI aimed at predicting the best time to show you an ad for the AI's highest-paying company.

Comment by Jason Gross (jason-gross) on Speaking of Stag Hunts · 2021-11-06T15:17:45.350Z · LW · GW

No. The content of the comment is good. The bad part is that it was made in response to a comment that was not requesting a response or further elaboration or discussion (or at least not doing so explicitly; the quoted comment does not explicitly point at any part of the comment it's replying to as being such a request). My read of the situation is that person A shared their experience in a long comment, and person B attempted to shut them down / socially punish them / defend against the comment by replying with a good statement about unhealthy dynamics, implying that person A was playing into that dynamic without specifying how. It seems to me that in fact person A was not part of that dynamic, and that person B was defending themselves without actually saying what they're protecting nor how it's being threatened. This occurs to me as bad form, and I believe it's what Duncan is pointing at.

Comment by Jason Gross (jason-gross) on Speaking of Stag Hunts · 2021-11-06T15:01:44.170Z · LW · GW

Where bad commentary is not highly upvoted just because our monkey brains are cheering, and good commentary is not downvoted or ignored just because our monkey brains boo or are bored.

Suggestion: give our monkey brains a thing to do that lets them follow incentives while supporting (or at least not interfering with) the goal. Some ideas:

  • split upvotes into "this comment has the Right effect on tribal incentives" and "after separating out its impact on what side the reader updates towards, this comment is still worth reading"
  • split upvotes into flair (a la basecamp), letting people indicate whether the upvote is "go team!" or "this made me think" or "good point" or " good point but bad technique", etc
Comment by Jason Gross (jason-gross) on Lies, Damn Lies, and Fabricated Options · 2021-10-26T15:38:22.621Z · LW · GW

Option number 3 seems like more-or-less a real option to me, given that "this document" is the official document prepared and published by the CDC a decade or two ago, and "sensible scientist-policymakers like myself" includes any head of the CDC back when the position was for career-civil-servants rather than presidential appointees, and also includes the task force that the Bush Administration specifically assembled to generate this document, and also included person #2 in California's public health apparatus (who was passed over for becoming #1 because she was too blond / not racially diverse enough, and who was later cut out of the relevant meetings by her new boss).

Edit: Also, the "guard it from anything that could derail their benevolent behavior" is not necessary, all that's needed here is to actually give them enough power / rope to hang themselves to let them implement the plan.

Comment by Jason Gross (jason-gross) on Lies, Damn Lies, and Fabricated Options · 2021-10-26T15:32:07.373Z · LW · GW

The Competent Machinery did exist, it just wasn't competent enough to overcome the fact that the rest of the government machinery was obstructing it. The plan for social distancing to deal with pandemics was created during the Bush administration, and there were people in government trying to implement the plan in ... mid-January, if I recall correctly (might have been mid-February). If, for example, the government had made an exception to medical privacy laws specifically for reporting the approximate address of positive COVID tests, and the CDC / government had not forbidden independent COVID testing in the early days, we probably would have been able to actually stamp out COVID. (Source: The Premonition: A Pandemic Story (it's an excellent book, and I highly recommend it))

Comment by Jason Gross (jason-gross) on Lies, Damn Lies, and Fabricated Options · 2021-10-26T15:16:29.088Z · LW · GW

Some extra nuance for your examples:

There is a substance XYZ, it's called "anti-water", it filling the hole of water in twin-Earth mandates that twin-Earth is made entirely of antimatter, and then the only problem is that the vacuum of space isn't vacuum enough (e.g., solar wind (I think that's what it's called), if nothing else, would make that Earth explode). More generally, it ought to be possible to come up with a physics where all the fundamental particles have an extra "tag" that carries no role (which in practice, I think, means that it functions just to change the number of microstates when particles with different tags are mixed --- I once tried to figure out what sort of measurement would be needed to determine empirically whether a glass of water in fact had only one kind of water, or had multiple kinds of otherwise-identical water, but have not been able to understand chemical potential enough to finish the thought experiment). Maybe furthermore there's some complicated force acting on the tags that changes them when the density of a particular tag is high enough, so that the tag difference between our Earth and twin-Earth can be maintained. We just have no evidence of such an attribute, hence Occam's razor presumes it to not exist.

I keep meaning to (re)work out the details on the gyroscope example; I think it should follow basically just from F = ma and the rigid body approximation (or maybe springs, if we skip rigid bodies), which means that denying gyroscopic precession basically breaks all of physics that involves objects in motion.

I think a better steelman in Example 1: Price Gouging, is that the law is meant to prevent rent-seeking, i.e., prevent people extracting money from the system without providing commensurate value. (The only example here that I understand even partially is landlords charging rent just because they own the land, and one fix to this is the land-value tax -- see the ACX book review of Progress and Poverty for an excellent explanation. It feels like there should be some analogue here, but I can't model enough economic nuance in my head to generate it and I'm not familiar enough with economics to tease it out.)

In Example 2: An orphan, or an abortion?, there's a further interesting note that outlawing abortion increases crime a decade or two later, because the children who would have been aborted are the ones who are most likely to grow up to become criminals. (Source: Freakonomics)

Comment by Jason Gross (jason-gross) on Is there a definitive intro to punishing non-punishers? · 2021-04-12T01:19:02.596Z · LW · GW

I think the thing you're looking for is traditionally called "third-party punishment" or "altruistic punishment", cf. https://en.wikipedia.org/wiki/Third-party_punishment . Wikipedia cites Bendor, Jonathan; Swistak, Piotr (2001). "The Evolution of Norms". American Journal of Sociology. 106 (6): 1493–1545. doi:10.1086/321298, which seems at least moderately non-technical at a glance.

 

I think I first encountered this in my Moral Psychology class at MIT (syllabus at http://web.mit.edu/holton/www/courses/moralpsych/home.html ), and I believe the citation was E. Fehr & U. Fischbacher, 'The Nature of Human Altruism', Nature 425 (2003) 785-91.  The bottom of the first paragraph on page 787 in https://www.researchgate.net/publication/9042569_The_Nature_of_Human_Altruism ("In fact, it can be shown theoretically that even a minority of strong reciprocators suffices to discipline a majority of selfish individuals when direct punishment is possible.") seems related but not exactly what you're looking for.

Comment by Jason Gross (jason-gross) on How good are our mouse models (psychology, biology, medicine, etc.), ignoring translation into humans, just in terms of understanding mice? (Same question for drosophila.) · 2021-01-25T13:40:36.194Z · LW · GW

I think another interesting datapoint is to look at where our hard-science models are inadequate because we haven't managed to run the experiments that we'd need to (even when we know the theory of how to run them). The main areas that I'm aware of are high-energy physics looking for things beyond the standard model (the LHC was an enormous undertaking and I think the next step up in particle accelerators requires building one the size of the moon or something like that), gravitational waves (similar issues of scale), and quantum gravity (similar issues + how do you build an experiment to actually safely play with black holes?!)

On the other hand, astrophysics manages to do an enormous amount (star composition, expansion rate of the universe, planetary composition) with literally no ability to run experiments and very limited ability to observe. (I think a particularly interesting case was the discovery of dark matter (which we actually still don't have a model for), which we discovered, iirc, by looking at a bunch of stars in the Milky Way and determining their velocity as a function of distance from the center by (a) looking at which wavelengths of light were missing to determine their velocity away/towards us (the elements that make up a star have very specific wavelengths that they absorb, so we can tell the chemical composition of a star by looking at the pattern of what wavelengths are missing, and we can get velocity/redshift/blueshift by looking at how far off those wavelengths are from what they are in the lab), (b) picking out stars of colors that we know come only in very specific brightnesses so that we can use apparent brightness to determine how far away the star is, (c) using its position in the night sky to determine what vector to use so we can position it relative to the center of the galaxy, and finally (d) noticing that the velocity-as-a-function-of-radius curve is very, very different from what it would be if the only mass causing gravitational pull were the visible star mass, and then inverting the plot to determine the spatial distribution of this newfound "dark matter". I think it's interesting and cool that there's enough validated shared model built up in astrophysics that you can stick a fancy prism in front of a fancy eye and look at the night sky and from what you see infer facts about how the universe is put together. Is this sort of thing happening in biology?)
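
As a sketch of step (d) (my own toy numbers, just to show the shape of the inference): compare the Keplerian falloff you'd get if only the visible central mass mattered against a roughly flat observed rotation curve, and invert to get the enclosed mass.

import numpy as np

G = 4.30091e-6     # gravitational constant in kpc * (km/s)^2 / solar mass
M_visible = 9e10   # rough visible (stellar) mass of a Milky-Way-like galaxy, in solar masses

r = np.linspace(1.0, 30.0, 30)  # galactocentric radius in kpc

# If essentially all the mass were the visible stars near the center, orbital
# speeds would fall off Keplerian-style: v(r) = sqrt(G * M / r).
v_keplerian = np.sqrt(G * M_visible / r)

# Observed rotation curves are instead roughly flat (~220 km/s for the Milky Way);
# via M(r) = v^2 * r / G, that implies enclosed mass growing roughly linearly with r,
# and the excess over the visible mass is the inferred dark matter distribution.
v_observed = np.full_like(r, 220.0)
M_enclosed = v_observed ** 2 * r / G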

Comment by Jason Gross (jason-gross) on Melatonin: Much More Than You Wanted To Know · 2020-11-17T14:44:51.677Z · LW · GW

By the way,

The normal tendency to wake up feeling refreshed and alert gets exaggerated into a sudden irresistible jolt of awakeness.

I'm pretty sure this is wrong. I'll wake up feeling unable to go back to sleep, but not feeling well-rested and refreshed. I imagine it's closer to a caffeine headache? (I feel tired and headachy but not groggy.) So, at least for me, this is a body clock thing, and not a transient effect.

Comment by Jason Gross (jason-gross) on Melatonin: Much More Than You Wanted To Know · 2020-11-17T14:36:37.396Z · LW · GW

Van Geijlswijk makes the important point that if you take 0.3 mg seven hours before bedtime, none of it is going to be remaining in your system at bedtime, so it’s unclear how this even works. But – well, it is pretty unclear how this works. In particular, I don’t think there’s a great well-understood physiological explanation for how taking melatonin early in the day shifts your circadian rhythm seven hours later.

It seems to me there's a very obvious model for this: the body clock is a chemical clock whose current state is stored in the concentration/configuration of various chemicals in various places. The clock, like all physical systems, is temporally local. There seems to be evidence that it keeps time even in the complete absence of external cues, so most of the "what time is it" state must be encoded in the body (rather than, e.g., using the intensity of sunlight as the primary signal to set the current time). Taking melatonin seems like it's futzing directly with the state of the body clock. If high melatonin encodes the state "middle of the night", then whenever you take it should effectively set your clock to "it's now the middle of the night". I think this is why it makes it possible to fall asleep. I think that it's then the effects of sunlight and actually sleeping and waking up that drag your body clock later again (I also have the effect that at anything over 0.1mg or so, I'll wake up 5h45m later, and if my dose is much more than 0.3mg, I won't be able to fall back asleep).

I'm pretty confused what taking it 9h after waking does in this model, though; 5--6 hours later, when the "most awake" time happens in this model, is just about an hour before you want to go to bed. One plausible explanation here is that this is somehow tied to the "reset" effect you mentioned from staying up for more than 24 hours; if what really matters here is that you were awake for the entirety of your normal sleep time (or something like that), then this would predict that having melatonin any time between when you woke up and 7 hours before when you went to sleep would have the "reset" effect. An alternative (or additional) plausible explanation is that this is tied to "oversleeping" (which in this model would be about confusing your body clock enough that it thinks you're supposed to keep sleeping past when you eventually wake up). If the body clock is sensitive to going back to sleep shortly after waking up (and my experience says this is the case, though I'm not sure what exactly the window is), then taking melatonin 5--6 hours before bed should induce something akin to the "oversleeping" effect (where you wake up, are fine, go back to sleep, sleep much more than 8 hours total, and then feel groggy when you eventually get up).

Comment by Jason Gross (jason-gross) on Raemon's Shortform · 2019-07-22T05:25:30.443Z · LW · GW

I'm wanting to label these as (1) 😃 (smile); (2) 🍪 (cookie); (3) 🌟 (star)

Dunno if this is useful at all

Comment by Jason Gross (jason-gross) on Raemon's Shortform · 2019-07-22T05:20:41.464Z · LW · GW

This has been true for years. At least six, I think? I think I started using Google scholar around when I started my PhD, and I do not recall a time when it did not link to pdfs.

Comment by Jason Gross (jason-gross) on Raemon's Shortform · 2019-07-22T05:17:09.838Z · LW · GW

I dunno how to think about small instances of willpower depletion, but burnout is a very real thing in my experience and shows up prior to any sort of conceptualizing of it. (And pushing through it works, but then results in more extreme burn out after.)

Oh, wait, willpower depletion is a real thing in my experience: if I am sleep deprived, I have to hit the "get out of bed" button in my head harder/more times before I actually get out of bed. This is separate from feeling sleepy (it is true even when I have trouble falling back asleep). It might be mediated by distraction, but that seems like quibbling over words.

I think in general I tend to take outside view on willpower. I notice how I tend to accomplish things, and then try to adjust incentive gradients so that I naturally do more of the things I want. As was said in some CFAR unit, IIRC, if my process involves routinely using willpower to accomplish a particular thing, I've already lost.

Comment by Jason Gross (jason-gross) on Raemon's Shortform · 2019-07-22T05:11:45.254Z · LW · GW

People who feel defensive have a harder time thinking in truthseeking mode rather than "keep myself safe" mode. But, it also seems plausibly-true that if you naively reinforce feelings of defensiveness they get stronger. i.e. if you make saying "I'm feeling defensive" a get out of jail free card, people will use it, intentionally or no

Emotions are information. When I feel defensive, I'm defending something. The proper question, then, is "what is it that I'm defending?" Perhaps it's my sense of self-worth, or my right to exist as a person, or my status, or my self-image as a good person. The follow-up is then "is there a way to protect that and still seek the thing we're after?" "I'm feeling defensive" isn't a "'get out of jail free' card", it's an invitation to go meta before continuing on the object level. (And if people use "I'm feeling defensive" to accomplish this, that seems basically fine? "Thank you for naming your defensiveness, I'm not interested in looking at it right now and want to continue on the object level if you're willing to or else end the conversation for now" is also a perfectly valid response to defensiveness, in my world.)

Comment by Jason Gross (jason-gross) on Micro feedback loops and learning · 2019-05-26T02:07:42.471Z · LW · GW

I imagine one thing that's important to learning through this app, which I think may be under-emphasised here, is that the feedback allows for mindful play as a way of engaging. I imagine I can approach the pretty graph with curiosity: "what does it look like if I do this? What about this?" I imagine that an app which replaced the pretty graphs with just the words "GOOD" and "BAD" would neither be as enjoyable nor as effective (though I have no data on this).

Comment by Jason Gross (jason-gross) on Fuzzy Boundaries, Real Concepts · 2018-05-07T20:44:42.245Z · LW · GW

Another counter-example for consent: being on a crowded subway with no room to not touch people (if there's someone next to you who is uncomfortable with the lack of space). I like your definition, though, and want to try to make a better one (and I acknowledge this is not the point of this post). My stab at a refinement of "consent" is "respect for another's choices", where "disrespect" is "deliberately(?) doing something to undermine". I think this has room for things like preconsent (you can choose to do something you disprefer) and crowded subways. It allows for pulling people out of the way of traffic (either they would choose to have you save their life, or you are knowingly being paternalistic and putting their life above their consent and choices).

Comment by Jason Gross (jason-gross) on The Intelligent Social Web · 2018-04-17T03:41:17.235Z · LW · GW

What is the internal experience of playing the role? Where does it come from? Is there even a coherent category of internal experience that lines up with this, or is it a pattern that shows up only in aggregate?

[The rest of this comment is mostly me musing.] For example, when people in a room laugh or smile, I frequently find myself laughing or smiling with them. I have yet to find a consistent precursor to this action; sometimes it feels forced and a bit shaky, like I'm insecure and fear a certain impact or perception of me. But often it's not that, and it seems to just be automatic, in the way that yawns are contagious. It seems to me like creepiness might work the same way; I see people subtly cringe, and I mimic that, and then when someone mentions that person, I subtly cringe, and the experience of cringing like that is the experience of having a felt sense that this person is creepy. I'm curious about other instances, and what the internal experience is in those, and if there's a pattern to them.

Comment by Jason Gross (jason-gross) on Circling · 2018-02-21T09:06:15.342Z · LW · GW

Because I haven't seen much in the way concrete comments on evidence that circling is real, I'm going to share a slightly outdated list of the concrete things I've gotten from practicing circling:
- a sense of what boundaries are, why they're important, and how to source them internally
- my rate of resolving emotionally-charged conflict over text went from < 1% to ≈80%-90% in the first month or three of me starting circling
- a tool ("Curiosity") for taking any conversation and making it genuinely interesting and likely deeper for me
- confidence and ability to connect more deeply with anyone who seems open to connecting more deeply with me
- the superpower of being able to describe to other people what I imagine they feel in their bodies in certain situations, and be right, even when they couldn't've generated the descriptions
- empathy of the "I'm with you in what you're feeling" sort rather than the "I have a conscious model of how you work and what's going on with you and can predict what you'll do" sort
- a language for talking about how I react in situations on a relational level
- a better understanding of what seems to be my deepest fear (others going away, and it being my fault)
- knowledge that I'm afraid of my own anger and that I deal with this by not trusting people in ways that allow them to make me angry
- an understanding of how asking "are you okay with the existence of my attraction to you?" disempowers me and gives another power over me they may not want; the ability and presence of mind to not do this anymore
- the ability to facilitate resolution to an emotional conflict over text even when both I and the other party are triggered/defensive/in a big experience
- understanding of what it feels like to "collapse", and a vague sense of how to play with that edge
- more facility with placing my attention where I choose
- more respect for silence
- a deep comfort with prolonged eye contact
- knowledge that I seem to flinch a bit inside most times that I talk about sexuality or sex, especially in regards to myself
- knowledge that I struggle most with the question "am I welcome here?"
- a theory of what makes people emotionally tired, which seems to resonate with everyone I share it with
- strong opinions on communication
- the ability to generate ≈non-violent communication from the inside
- better introspective access on an emotional level
- new friends
- ability and comfort with sitting with my own experience and emotions for longer
- decreasing the time from when I first interact with someone to when interaction with them blows up, if it's going to, I think because I'm pushing more of my edges and I see things more clearly and so all the knobs that I'm turning in the wrong direction I'm turning *really strongly* in the wrong direction
- maybe a tiny hint of how people relate to this thing called "community"?
- the ability to listen to nuances in "no"s, and not automatically interpret "no" as "no, I don't want to interact with you now or ever again"
- increased facility in getting in touch with my own anger in a healthy way by asking what it's protective of
- increased facility in engaging with others in their anger by seeking an understanding of what they're standing for
- the experience of being able to decide that I wanted to go to sleep, roughly on time, without fighting myself, for the first time that I can recall in my life

Things that I'm currently playing with in circling, as of a couple of months ago:
- "am I welcome here?"
- "what if someone goes away, and it's my fault?"
- What does it look like to find myself attractive or important, or to matter to myself?
- What does it look like and feel like to be held emotionally?
- What's up for me around touch and physical affection?
- Am I terrified of having power over people?
- How can I be less careful, and more okay/accepting?
- What does it look like to do things from a place of desire rather than a place of "should"?
- What am I attached to and how does attachment get in the way of what I want?

~~~~

I've sometimes said that circling seems to me like "metacognitive defensive driving" (to extend the metaphor of metacognitive blindspots and metacognitive mirrors); there's a way in which circling seems to allow my S1 to communicate very directly with another person's S1, in situations where our S2's get tripped up and have trouble communicating, and in such a way that it seems to bypass most issues of miscommunication and get directly to the heart of the matter. Even when I can't see the ways that my cognition is impaired, circling frequently lets me bypass that or address it directly.

I also want to add another perspective on NVC/ ownership language. I like using ownership language in part because it tends to trip me up in all the places where I'm trying to do something other than communication with my words, and thus it helps me to understand myself better.