Box inversion hypothesis 2020-10-20T16:20:51.282Z


Comment by Jan Kulveit (jan-kulveit) on The alignment problem in different capability regimes · 2021-09-10T09:58:25.791Z · LW · GW

Similarly to johnswentworth: My current impression is that the core alignment problems are the same and manifest at all levels - often the sub-human version just looks like a toy version of the scaled-up problem, and the main difference is that in the sub-human version you can often solve the problem for practical purposes by plugging in a human at some strategic spot. (While I don't think there are deep differences in the alignment problem space, I do think there are differences in the "alignment solutions" space, where you can use non-scalable solutions, or in risk space, where the dangers are small because the systems are stupid.)

I'm also unconvinced about some of the practical claims about differences for wildly superintelligent systems.

One crucial concern related to "what people want" is that this seems underdefined, unstable in interactions with wildly superintelligent systems, and prone to problems with scaling of values within systems where intelligence increases. By this line of reasoning, if the wildly superintelligent system is able to answer these sorts of questions "in a way I want", it very likely must already be aligned. So it feels like part of the worry was assumed away. Paraphrasing the questions about human values again, one may ask "how did you get to the state where you have this aligned wildly superintelligent system which is able to answer questions about human values, as opposed to e.g. overwriting what humans believe about themselves with its own non-human-aligned values?".

Ability to understand itself seems a special case of competence: I can imagine systems which are wildly superhuman in their ability to understand the rest of the world, but pretty mediocre at understanding themselves, e.g. due to some problems with recursion, self-references, reflections, or different kinds of computations being used at various levels of reasoning. As a result, it seems unclear whether the ability to clearly understand itself is a feature of all wildly super-human systems. (Toy counterexample: imagine a device which would connect someone in ancient Greece with our modern civilization, and our civilization dedicating about 10% of global GDP to answering questions from this guy. I would argue this device is for most practical purposes wildly superhuman compared to this individual guy in Greece, but at the same time bad at understanding itself)

Fundamentally inscrutable thoughts seem like something which you can study with present-day systems as toy models. E.g., why does AlphaZero believe something is a good go move? Why does a go grand-master believe something is a good move? What counts as a 'true explanation'? Who is the recipient of the explanation? Are you happy with an explanation of the algorithm like 'upon playing myriad games, my general function approximator estimates that the expected value of this branch of an unimaginably large choice tree is larger than for other branches'? If yes, why? If no, why not?

Inscrutable influence-seeking plans also seem to be a present problem. E.g., if there are already some complex influence-seeking patterns now, how would we notice?

Comment by Jan Kulveit (jan-kulveit) on A 'Practice of Rationality' Sequence? · 2020-02-17T23:38:27.069Z · LW · GW

Getting oriented fast in complex/messy real world situations in fields in which you are not an expert

  • For example, now, one topic to get oriented in would be COVID; I think for a good thinker, it should be achievable to have a big-picture understanding of the situation comparable to a median epidemiologist after a few days of research
      • Where the point isn't to get an accurate forecast of some global variable which is asked on Metaculus, but a gears-level model of what's going on / what are the current 'critical points' which will have outsized impact / ...
      • In my impression, compared to some of the 'LessWrong-style rationality', this is more heavily dependent on 'doing bounded rationality well' - that is, finding the most important bits / efficiently ignoring almost all information, in contrast to carefully weighing several hypotheses which you already have

Actually trying to change something in the world where the system you are interacting with has a significant level of complexity & a somewhat fast feedback loop (& it's not super-high-stakes)

  • A few examples of seemingly stupid things of this type I did
    • filed a lawsuit without the aid of a lawyer (in a low-stakes case)
    • repaired various devices with value much lower than the value of my time
    • tinkered with code in a language I don't know
    • tried to moderate a Wikipedia article on a highly controversial topic about which two groups of editors were fighting

One thing I'm a bit worried about in some versions of LW rationality, and which someone should write a post about, is something like ... 'missing opportunities to actually fight in non-super-high-stakes matters', in the martial arts metaphor.

Comment by Jan Kulveit (jan-kulveit) on We run the Center for Applied Rationality, AMA · 2019-12-22T08:20:56.719Z · LW · GW

I like the metaphor!

Just wanted to note: in my view the original LW Sequences are not functional as a stand-alone upgrade for almost any human mind, and you can observe this empirically: you can think about any LW meet-up group around the world as an experiment, and I think to a first approximation it's fair to say aspiring Rationalists running just on the Sequences do not win, and the good stuff coming out of the rationalist community was critically dependent on the presence of minds like Eliezer & others. (This is not to say the Sequences are not useful in many ways.)

Comment by Jan Kulveit (jan-kulveit) on Under what circumstances is "don't look at existing research" good advice? · 2019-12-14T15:45:56.375Z · LW · GW

I basically agree with Vanessa:

the correct rule is almost always: first think about the problem yourself, then go read everything about it that other people did, and then do a synthesis of everything you learned inside your mind.

Thinking about the problem myself first often helps me understand existing work, as it is easier to see the motivations, and solving already-solved problems is good training.

I would argue this is the case even in physics and math. (My background is in theoretical physics, and during my high-school years I took some pride in not remembering physics and re-deriving everything when needed. This stopped being a good approach for physics from ca. 1940 onwards, and it somewhat backfired.)

The mistake members of "this community" (LW/rationality/AI safety) are sometimes making is skipping the second step / bouncing off the second step if it is actually hard.

The second mistake is not doing the third step in a proper way, which leads to a somewhat strange and insular culture which may be repulsive for external experts. (E.g. people partially crediting themselves for discoveries which are known to outsiders.)

Comment by Jan Kulveit (jan-kulveit) on Autism And Intelligence: Much More Than You Wanted To Know · 2019-11-18T20:20:10.736Z · LW · GW

Epistemic status: Wild guesses based on reading Del Giudice's Evolutionary Psychopathology and two papers trying to explain autism in terms of predictive processing. Still maybe better than the "tower hypothesis".

0. Let's think in terms of a two-parameter model, where one parameter tunes something like the capacity of the brain, which can be damaged due to mutations, disease, etc., and the other parameter is explained below.

1. Some of the genes that increase the risk of autism tune some parameter of how sensory prediction is handled, specifically, making the system expect higher precision from sensory inputs / be less adaptive about it. (Let's call it parameter p.)

2. Several hypotheses: mildly increased p sounds like something which should be somewhat correlated with increased learning / higher intelligence;

  • something which can force the system to build more exact representations, notice more "rule violations", keep track of more patterns, etc.
  • (also if abstract concepts are subject to the same machinery as sensoria, it would be something like having higher precision in abstract/formal reasoning)

3. But note: tune it up even more, and the system starts to break; too much weight is put on sensory experience, and "normal world experience" becomes too surprising, which leads to seeking more repetitive behaviours and highly predictable environments. In the abstract domain, it becomes difficult to handle fluidity, rules which are vague and changing, ...

4. In the two-parameter space of capacity and something like surprisal handling, this creates a picture like this

  • the space of functional minds is white, the orange space is where things break (in practice the boundary is not sharp)
  • for functional minds, g is something like capacity c + 0.1 * p; for minds in the orange area this no longer holds, and on the contrary increasing p makes the mind work worse
  • highly intelligent people can have higher values of p and still be quite functional
  • blue dotted area is what is diagnosed as autism; this group should be expected to have on average low g
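
The picture described in these bullets can be made concrete with a minimal sketch. Only the relation g ≈ c + 0.1 * p for functional minds comes from the comment itself; the shape of the boundary (`p_limit`, assumed linear in capacity), the `penalty` slope past the boundary, and all numeric values are hypothetical choices made purely for illustration.

```python
def p_limit(c, base=1.0, slope=0.5):
    """Hypothetical boundary: the largest p that a mind of capacity c tolerates.

    The linear form and the constants are assumptions; the text only says that
    higher-capacity minds can carry higher p and that the boundary is not sharp.
    """
    return base + slope * c


def is_functional(c, p):
    """True in the 'white' region, False in the 'orange' region."""
    return p <= p_limit(c)


def g(c, p, weight=0.1, penalty=1.0):
    """Toy g as a function of capacity c and sensory-precision parameter p."""
    limit = p_limit(c)
    if p <= limit:
        # Functional region: the relation stated in the text.
        return c + weight * p
    # Past the boundary: raising p further makes the mind work worse.
    # The linear penalty is an assumption, chosen for simplicity.
    return c + weight * limit - penalty * (p - limit)


if __name__ == "__main__":
    for c in (1.0, 3.0):
        for p in (0.5, 1.5, 2.0, 2.5):
            print(f"c={c} p={p} functional={is_functional(c, p)} g={g(c, p):.2f}")
```

With these made-up constants, a mind with c = 3.0 is still in the functional region at p = 2.0, while a mind with c = 1.0 at the same p is already past the boundary, which mirrors the claim that highly intelligent people can have higher values of p and still be quite functional.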

Parts of the o.p. can be reinterpreted as

  • in this picture, some genes mean movement to the right; they are selected because of slight correlation with g
  • random mutations, infections, etc. generally mean movement down
  • overall fitness profile of right-moving genes is somewhat complex (movement to the left or right is good or bad in different parts of the graph)

Even if this is simple, it makes some predictions (in the sense that the results are likely already somewhere in the literature; I just don't know whether they hold or not when writing this)

  • What happens if you move parameter p in the opposite direction? You get a mind less grounded in sensory inputs, with a stronger influence of 'downstream' predictions. In small quantities this would manifest as e.g. seeing "clouds resembling animals" more often. Move to the left much more, and the system also breaks down, via hallucinations, everyday experience seemingly fitting arbitrary explanations despite many details not fitting, etc. This sounds like some symptoms of schizophrenia; the model predicts that mild movement in the direction of schizophrenia should decrease g a bit


With a map of brains/minds into a two-dimensional space, it is a priori obvious that it will fail to explain the original high-dimensional problem in many ways; many other dimensions are not orthogonal but actually "project" onto the space (e.g. something like "brain masculinisation" has a nonzero projection on p), and there are various regulatory systems, e.g. higher g means a better ability to compensate via metacognition or social support.

Comment by Jan Kulveit (jan-kulveit) on Minimization of prediction error as a foundation for human values in AI alignment · 2019-10-17T21:31:42.429Z · LW · GW

Based on

(For example, if subagents are assigned credit based on who's active when actual reward is received, that's going to be incredibly myopic -- subagents who have long-term plans for achieving better reward through delayed gratification can be undercut by greedily shortsighted agents, because the credit assignment doesn't reward you for things that happen later; much like political terms of office making long-term policy difficult.)

it seems to me you have in mind a different model than me (sorry if my description was confusing). In my view, you have the world-modelling, "preference aggregation" and action generation done by the "predictive processing engine". The "subagenty" parts basically extract evolutionarily relevant features of this (like: hunger level), and insert error signals not only about the current state, but also about future plans. (Like: if the p.p. engine were planning a trajectory which is harmful to the subagent, the subagent would insert an error signal.)

Overall your first part seems to assume more something like reinforcement learning where parts are assigned credit for good planning. I would expect the opposite: one planning process which is "rewarded" by a committee.
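
The contrast can be sketched as a toy program: a single planner proposes candidate trajectories, and each member of a "committee" inserts an error signal for any predicted state it dislikes, including states that lie in the future. Everything here (the names `hunger_agent`, `fatigue_agent`, `choose_plan`, the thresholds, and the example trajectories) is hypothetical and only illustrates the architecture; it is not a model of actual predictive processing.

```python
from typing import Callable, Dict, List

State = Dict[str, float]          # predicted bodily state at one time step
Trajectory = List[State]          # a plan, as a sequence of predicted states
Subagent = Callable[[Trajectory], float]


def hunger_agent(plan: Trajectory, threshold: float = 0.7) -> float:
    """Error signal for every predicted state that is too hungry."""
    return sum(max(0.0, state["hunger"] - threshold) for state in plan)


def fatigue_agent(plan: Trajectory, threshold: float = 0.8) -> float:
    """Error signal for every predicted state that is too tired."""
    return sum(max(0.0, state["fatigue"] - threshold) for state in plan)


def total_error(plan: Trajectory, committee: List[Subagent]) -> float:
    """Sum of the error signals inserted by all committee members."""
    return sum(agent(plan) for agent in committee)


def choose_plan(plans: Dict[str, Trajectory], committee: List[Subagent]) -> str:
    """One planning process picks the trajectory the committee objects to least."""
    return min(plans, key=lambda name: total_error(plans[name], committee))


# Both plans start from the same current state; they differ only in the future.
plans = {
    "snack_first": [
        {"hunger": 0.6, "fatigue": 0.3},
        {"hunger": 0.2, "fatigue": 0.4},
        {"hunger": 0.3, "fatigue": 0.5},
    ],
    "keep_working": [
        {"hunger": 0.6, "fatigue": 0.3},
        {"hunger": 0.8, "fatigue": 0.6},
        {"hunger": 1.0, "fatigue": 0.9},
    ],
}
committee = [hunger_agent, fatigue_agent]

if __name__ == "__main__":
    print(choose_plan(plans, committee))
```

Note that no subagent is assigned credit for planning here: the committee only scores trajectories, so the myopia concern about credit assignment does not arise in this form.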

parsimonious theory which matched observations well

With parsimony... predictive processing in my opinion explains a lot for a relatively simple and elegant model. On the theory side it's for example

  • how you can make a Bayesian approximator using local computations
  • how hierarchical models can grow in an evolutionarily plausible way
  • why predictions, why actions

On how things feel for humans from the inside, for example

  • some phenomena about attention
  • what is that feeling when you are e.g. missing the right word, or something seems out of place
  • what's up with psychedelics
  • & more

On the neuroscience side

  • my non-expert impression is that evidence is growing that at least the cortex follows the pattern in which neurons at higher processing stages generate predictions that bias processing at lower levels

I don't think predictive processing should try to explain everything about humans. In one direction, animals are running on predictive processing as well, but are missing some crucial ingredient. In the opposite direction, simpler organisms had older control systems (e.g. hormones); we have them as well, and p.p. must in some sense be stacked on top of them.

Comment by jan-kulveit on [deleted post] 2019-10-15T14:23:29.747Z

Why is there this stereotype that the more you can make rocket ships, the more likely you are to break down crying if the social rules about when and how you are allowed to make rocket ships are ambiguous?

This is likely sufficiently explained by the principal component of human mindspace stretching from mechanistic cognition to mentalistic cognition, and does not need more explanations.

Also, I think there are multiple stereotypes of very smart people: e.g. Feynman or Einstein.

Comment by Jan Kulveit (jan-kulveit) on Minimization of prediction error as a foundation for human values in AI alignment · 2019-10-11T11:13:48.596Z · LW · GW

These are not necessarily Gordon's views/answers in his model, but my answers are

  • yes, evolution inserts these 'false predictions'; (Friston calls them fixed priors, which I think is a somewhat unfortunate terminology choice)
  • if you put on Dennett's stance lens #3 (looking at systems as agents), these 'priors' are likely described as 'agents' extracting some features from the p.p. world-modelling apparatus and inserting errors accordingly; you correctly point out that in some architectures such parts would just get ignored, but in my view what happens in humans is more like a board of Bayesian subagents voting
  • note: it's relatively easy to turn a p.p. engine into something resembling reinforcement learning by warping it to seek 'high reward' states, where by states you should not imagine 'states of the world', but 'states of the body'; evolution designed the chemical control circuitry of hormones earlier - in some sense the predictive processing machinery is built on top of some older control systems, and is seeking goal states defined by them
  • (pure guess) consciousness and language and this style of processing is another layer, where the p.p. machinery is 'predicting' something like a stream of conscious thoughts, which somehow has its own consistency rules and can implement verbal reasoning.

Overall I'm not sure to what extent you expect clean designs from evolution. I would expect a messy design: predictive processing implemented for hierarchical world-modelling/action generation, a mess of subagents + emotions + a hacked connection to older regulatory systems to make the p.p. engine seek evolution's goals, and another interesting thing going on with language and memes.

Comment by Jan Kulveit (jan-kulveit) on Minimization of prediction error as a foundation for human values in AI alignment · 2019-10-11T10:17:33.245Z · LW · GW

I'm somewhat confused whether you are claiming something other than Friston's notion that everything the brain is doing can be described as minimizing free energy/prediction error, that this is important for understanding what human values are, and that it needs to be understood for AI alignment purposes.

If this is so, it sounds close to a restatement of my 'best guess of how minds work' with some, in my opinion, unhelpful simplifications: ignoring the signal inserted into predictive processing via interoception of bodily states, which is actually an important part of the picture; ignoring the emergent 'agenty' properties of evolutionarily encoded priors; and, in addition, calling it a theory of human values.

(I'm not sure how to state it positively, but I think it would be great if at least one person from the LW community bothered to actually understand my post, as "understanding each sentence".)

Comment by Jan Kulveit (jan-kulveit) on AIXSU - AI and X-risk Strategy Unconference · 2019-09-04T00:41:13.601Z · LW · GW

FWIW, my personal feelings about this

  • I expect this to have possibly large downside risks, being harmful for novices, and unclear value for experts
  • while some AI safety camp members considered organizing an event focused on strategy, after consulting experts, the decision was not to do it
  • things like this happening make me update on the value of the EA Hotel in the negative direction

(apologies for not providing comprehensive justification; I also won't have time for much discussion here; opinion is purely my personal feeling and not a position of any organization I'm involved with)

Comment by Jan Kulveit (jan-kulveit) on The AI Timelines Scam · 2019-07-15T01:39:57.133Z · LW · GW

[purely personal view]

It seems quite easy to imagine similarly compelling socio-political and subconscious reasons why people working on AI could be biased against short AGI timelines. For example

  • short-timeline estimates make the broader public agitated, which may lead to state regulation or similar interference [historical examples: industries trying to suppress info about risks]
  • researchers mostly want to work on technical problems, instead of thinking about nebulous future impacts of their work; putting more weight on short timelines would force some people to pause and think about responsibility, or suffer some cognitive dissonance, which may be unappealing/unpleasant for S1 reasons [historical examples: physicists working on nuclear weapons]
  • fears that claims about short timelines would get pattern-matched as doomsday fear-mongering / sensationalist / subject of scifi movies ...

While I agree motivated reasoning is a serious concern, I don't think it's clear how the incentives sum up. If anything, claims like "AGI is unrealistic or very far away, however practical applications of narrow AI will be profound" seem to capture most of the purported benefits (AI is important) and avoid the negatives (no need to think).

Comment by Jan Kulveit (jan-kulveit) on Risks from Learned Optimization: Introduction · 2019-07-03T20:44:43.126Z · LW · GW

I don't see why a portion of a system turning into an agent would be "very unlikely". From a different perspective, if the system lives in something like an evolutionary landscape, there can be various basins of attraction which lead to sub-agent emergence, not just mesa-optimisation.

Comment by Jan Kulveit (jan-kulveit) on A case for strategy research: what it is and why we need more of it · 2019-06-27T20:47:00.914Z · LW · GW

Depends on what you mean by public. While I don't think you can have good public research processes which would not run into infohazards, you can have a nonpublic process which produces good public outcomes. I don't think the examples count as something public - e.g. do you see any public discussion leading to CAIS?

Comment by Jan Kulveit (jan-kulveit) on A case for strategy research: what it is and why we need more of it · 2019-06-23T15:06:22.088Z · LW · GW


  • In my experience there are infohazard/attention hazard concerns. Public strategy likely has negative expected value - if it is good, it will run into infohazards. If it is bad, it will create confusion.
  • I would expect prudent funders will not want to create a parallel public strategy discussion.

Comment by Jan Kulveit (jan-kulveit) on Should rationality be a movement? · 2019-06-22T10:21:09.269Z · LW · GW

I had similar discussions, but I'm worried this is not a good way to think about the situation. IMO the best part of both 'rationality' and 'effective altruism' is often the overlap - people who to a large extent belong to both communities/do not see the labels as something really important for their identity.

Systematic reasons for that may be...

Rationality asks the question "How to think clearly". For many people who start to think more clearly, this leads to an update of their goals toward the question "How we can do as much good as possible (thinking rationally)", and acting on the answer, which is effective altruism.

Effective altruism asks the question "How we can do as much good as possible, thinking rationally and based on data?". For many people who actually start thinking about the question, this leads to an update "the ability to think clearly is critical when trying to answer the question". Which is rationality.

This is also to some extent predictive about failure modes. "Rationality without the EA part" can deteriorate into something like high-IQ-people discussion club and can have trouble with actions. "EA without the rationality part" can be something like a group of high-scrupulosity people who are personally very nice and donate to effective charities, but actually look away from things which are important.

This is not to say that organizations identified with either of the brands are flawless.

Also - we now have several geographically separated experiments in what the EA / LW / rationality / long-termist communities may look like outside of the Bay Area, and my feeling is that places where the core of the communities is shared are healthier/producing more good things than places where the overlap is small, and that this in turn is better than having a lot of distrust.