Posts

A starting point for making sense of task structure (in machine learning) 2024-02-24T01:51:49.227Z
Toward A Mathematical Framework for Computation in Superposition 2024-01-18T21:06:57.040Z
Grokking, memorization, and generalization — a discussion 2023-10-29T23:17:30.098Z
Crystal Healing — or the Origins of Expected Utility Maximizers 2023-06-25T03:18:25.033Z
Searching for a model's concepts by their shape – a theoretical framework 2023-02-23T20:14:46.341Z
[RFC] Possible ways to expand on "Discovering Latent Knowledge in Language Models Without Supervision". 2023-01-25T19:03:16.218Z
A gentle primer on caring, including in strange senses, with applications 2022-08-30T08:05:12.333Z
kh's Shortform 2022-07-06T21:48:03.211Z
Transferring credence without transferring evidence? 2022-02-04T08:11:48.297Z

Comments

Comment by Kaarel (kh) on kh's Shortform · 2024-04-04T14:24:16.916Z · LW · GW

The Deep Neural Feature Ansatz

@misc{radhakrishnan2023mechanism, title={Mechanism of feature learning in deep fully connected networks and kernel machines that recursively learn features}, author={Adityanarayanan Radhakrishnan and Daniel Beaglehole and Parthe Pandit and Mikhail Belkin}, year={2023}, url = { https://arxiv.org/pdf/2212.13881.pdf } }

The ansatz from the paper

Let $h_l(x)$ denote the activation vector in layer $l$ on input $x$, with the input layer being at index $0$, so $h_0(x) = x$. Let $W_{l+1}$ be the weight matrix after activation layer $l$. Let $f_l$ be the function that maps from the $l$th activation layer to the output. Then their Deep Neural Feature Ansatz says that $$W_{l+1}^\top W_{l+1} \propto \frac{1}{n}\sum_{i=1}^n \nabla f_l(h_l(x_i))\, \nabla f_l(h_l(x_i))^\top.$$ (I'm somewhat confused here about them not mentioning the loss function at all — are they claiming this is reasonable for any reasonable loss function? Maybe just MSE? MSE seems to be the only loss function mentioned in the paper; I think they leave the loss unspecified in a bunch of places though.)
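To make this concrete, here is a minimal numerical sketch (my own toy setup, not the paper's: MSE loss, a small ReLU MLP, a synthetic regression task) comparing the first-layer "neural feature matrix" $W_1^\top W_1$ to the average gradient outer product of the trained network's output with respect to its input:

```python
# Minimal sketch: compare W_1^T W_1 to the average gradient outer product (AGOP)
# of the trained network's output w.r.t. its input. All choices here (widths,
# task, optimizer, number of steps) are arbitrary assumptions for illustration.
import torch

torch.manual_seed(0)
d, n, hidden = 10, 2000, 64
X = torch.randn(n, d)
y = (X[:, 0] * X[:, 1] + 0.5 * X[:, 2] ** 2).unsqueeze(1)  # toy target

net = torch.nn.Sequential(
    torch.nn.Linear(d, hidden, bias=False),
    torch.nn.ReLU(),
    torch.nn.Linear(hidden, hidden, bias=False),
    torch.nn.ReLU(),
    torch.nn.Linear(hidden, 1, bias=False),
)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for _ in range(3000):
    opt.zero_grad()
    torch.nn.functional.mse_loss(net(X), y).backward()
    opt.step()

# AGOP of the output w.r.t. the input (the RHS of the ansatz for the first weight matrix)
Xg = X.clone().requires_grad_(True)
# summing is fine here since each sample's output depends only on its own input
grads = torch.autograd.grad(net(Xg).sum(), Xg)[0]  # (n, d)
agop = grads.T @ grads / n

W1 = net[0].weight.detach()  # (hidden, d)
nfm = W1.T @ W1              # "neural feature matrix" of the first layer

# Compare the two matrices up to an overall scale
corr = torch.corrcoef(torch.stack([nfm.flatten(), agop.flatten()]))[0, 1]
print(f"entrywise correlation between W_1^T W_1 and the AGOP: {corr:.3f}")
```

If the ansatz holds in this setup, the correlation should be high after training and noticeably lower at initialization.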

A singular vector version of the ansatz

Letting $W_{l+1} = U\Sigma V^\top$ be an SVD of $W_{l+1}$, we note that this is equivalent to $$V\Sigma^\top\Sigma V^\top \propto \frac{1}{n}\sum_{i=1}^n \nabla f_l(h_l(x_i))\, \nabla f_l(h_l(x_i))^\top,$$ i.e., that the eigenvectors of the matrix on the RHS are the right singular vectors of $W_{l+1}$. By the variational characterization of eigenvectors and eigenvalues (Courant–Fischer or whatever), this is the same as saying that the right singular vectors of $W_{l+1}$ are the successive highest orthonormal directions for the matrix on the RHS. Plugging in the definition of the RHS, this is equivalent to saying that the right singular vectors are the sequence of highest-variance directions of the data set of gradients $\{\nabla f_l(h_l(x_i))\}_{i=1}^n$.

(I have assumed here that the linearity is precise, whereas really it is approximate. It's probably true though that with some assumptions, the approximate initial statement implies an approximate conclusion too? Getting approx the same vecs out probably requires some assumption about gaps in singular values being big enough, because the vecs are unstable around equality. But if we're happy getting a sequence of orthogonal vectors that gets variances which are nearly optimal, we should also be fine without this kind of assumption. (This is guessing atm.))

Getting rid of the dependence on $W_{l+1}$ on the RHS?

Assuming there isn't an off-by-one error in the paper, we can pull some term out of the RHS maybe? This is because applying the chain rule to the Jacobians of the layer transitions (assuming ReLU activations, and sloppily — see the correction two paragraphs below) gives $\nabla f_l(h_l(x)) = W_{l+1}^\top \nabla f_{l+1}(h_{l+1}(x))$, so $$\frac{1}{n}\sum_{i=1}^n \nabla f_l(h_l(x_i))\, \nabla f_l(h_l(x_i))^\top = W_{l+1}^\top \left(\frac{1}{n}\sum_{i=1}^n \nabla f_{l+1}(h_{l+1}(x_i))\, \nabla f_{l+1}(h_{l+1}(x_i))^\top\right) W_{l+1}.$$

Wait, so the claim is just $$W_{l+1}^\top W_{l+1} \propto W_{l+1}^\top \left(\frac{1}{n}\sum_{i=1}^n \nabla f_{l+1}(h_{l+1}(x_i))\, \nabla f_{l+1}(h_{l+1}(x_i))^\top\right) W_{l+1},$$ which, assuming $W_{l+1}$ is invertible, should be the same as $$\frac{1}{n}\sum_{i=1}^n \nabla f_{l+1}(h_{l+1}(x_i))\, \nabla f_{l+1}(h_{l+1}(x_i))^\top \propto I.$$ But also, they claim that it is $\propto W_{l+2}^\top W_{l+2}$? Are they secretly approximating everything with identity matrices?? This doesn't seem to be the case from their Figure 2 though.

Oh oops I guess I forgot about activation functions here! There should be extra diagonal terms for Jacobians of preactivations->activations in the chain rule, i.e., it should really say $\nabla f_l(h_l(x)) = W_{l+1}^\top D_{l+1}(x)\, \nabla f_{l+1}(h_{l+1}(x))$, where $D_{l+1}(x)$ is the diagonal matrix of activation function derivatives at the preactivations of layer $l+1$. We now instead get $$W_{l+1}^\top W_{l+1} \propto W_{l+1}^\top \left(\frac{1}{n}\sum_{i=1}^n D_{l+1}(x_i)\,\nabla f_{l+1}(h_{l+1}(x_i))\, \nabla f_{l+1}(h_{l+1}(x_i))^\top D_{l+1}(x_i)\right) W_{l+1}.$$ This should be the same as $$\frac{1}{n}\sum_{i=1}^n D_{l+1}(x_i)\,\nabla f_{l+1}(h_{l+1}(x_i))\, \nabla f_{l+1}(h_{l+1}(x_i))^\top D_{l+1}(x_i) \propto I,$$ which, with $z_{l+1}$ denoting preactivations in layer $l+1$ and $g_{l+1}$ denoting the function from these preactivations to the output, is the same as $$\frac{1}{n}\sum_{i=1}^n \nabla g_{l+1}(z_{l+1}(x_i))\, \nabla g_{l+1}(z_{l+1}(x_i))^\top \propto I.$$ This last thing also totally works with activation functions other than ReLU — one can get this directly from the Jacobian calculation. I made the ReLU assumption earlier because I thought for a bit that one can get something further in that case; I no longer think this, but I won't go back and clean up the presentation atm.
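(Spelling out the "should be the same as" step: writing $A$ for the bracketed average, $W_{l+1}^\top W_{l+1} \propto W_{l+1}^\top A W_{l+1}$ means $W_{l+1}^\top (A - cI) W_{l+1} = 0$ for some constant $c > 0$, i.e., $(W_{l+1}u)^\top (A - cI)(W_{l+1}v) = 0$ for all $u, v$; if $W_{l+1}$ is surjective onto the preactivation space — in particular, if it is invertible — this forces $A = cI$.)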

Anyway, a takeaway is that the Deep Neural Feature Ansatz is equivalent to the (imo cleaner) ansatz that the set of gradients of the output wrt the pre-activations of any layer is close to being a tight frame (in other words, the gradients are in isotropic position; in other words still, the data matrix of the gradients is a constant times a semi-orthogonal matrix). (Note that the closeness one immediately gets isn't closeness (in, say, operator norm) to a tight frame, it's just closeness in the quantity defining the tightness of a frame, but I'd guess that if it matters, one can also conclude some kind of closeness to an actual tight frame from this (related).) This seems like a nicer fundamental condition because (1) we've intuitively canceled terms and (2) it now looks like a generic-ish condition — it looks less mysterious — though idk how to argue for this beyond some handwaving about genericness, about other stuff being independent, sth like that.

proof of the tight frame claim from the previous condition: Note that $\frac{1}{n}\sum_i g_i g_i^\top \propto I$ (writing $g_i$ for the gradient wrt the preactivations on input $x_i$) clearly implies that the mass $\sum_i \langle g_i, v\rangle^2$ in any unit direction $v$ is the same, but also the mass being the same in any direction implies the above (because then, letting the SVD of the matrix $G$ with these gradients in its columns be $G = U\Sigma V^\top$, the above is $\frac{1}{n}GG^\top = \frac{1}{n}U\Sigma\Sigma^\top U^\top = \frac{\sigma^2}{n}UU^\top = \frac{\sigma^2}{n}I$, where we used the fact that equal mass in every direction forces all singular values to equal some common $\sigma$).
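And a quick numerical check of the isotropy restatement, again in my own toy setup (train a small MLP, then look at whether the second-moment matrix of the gradients of the output with respect to one layer's preactivations is close to a multiple of the identity):

```python
# Sketch of a check of the restated ansatz: after training, the gradients of the
# output w.r.t. a layer's preactivations should be roughly isotropic, i.e. their
# second-moment matrix should be close to a multiple of the identity.
import torch

torch.manual_seed(0)
d, n, hidden = 10, 2000, 64
X = torch.randn(n, d)
y = (X[:, 0] * X[:, 1] + 0.5 * X[:, 2] ** 2).unsqueeze(1)

W1 = torch.nn.Linear(d, hidden, bias=False)
W2 = torch.nn.Linear(hidden, hidden, bias=False)
W3 = torch.nn.Linear(hidden, 1, bias=False)
params = list(W1.parameters()) + list(W2.parameters()) + list(W3.parameters())
opt = torch.optim.Adam(params, lr=1e-3)
for _ in range(3000):
    opt.zero_grad()
    out = W3(torch.relu(W2(torch.relu(W1(X)))))
    torch.nn.functional.mse_loss(out, y).backward()
    opt.step()

# gradients of the output w.r.t. the second layer's preactivations
z2 = W2(torch.relu(W1(X))).detach().requires_grad_(True)
out = W3(torch.relu(z2))
g = torch.autograd.grad(out.sum(), z2)[0]   # (n, hidden)
M = g.T @ g / n                             # second-moment matrix of these gradients
M = M / M.diagonal().mean()                 # normalize the overall scale away
dev = (M - torch.eye(hidden)).norm() / torch.eye(hidden).norm()
print(f"relative deviation of the (rescaled) second-moment matrix from identity: {dev:.3f}")
```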

Some questions

  • Can one come up with some similar ansatz identity for the left singular vectors of $W_{l+1}$? One point of tension/interest here is that an ansatz identity for $W_{l+1}W_{l+1}^\top$ would constrain the left singular vectors of $W_{l+1}$ together with its singular values, but the singular values are constrained already by the deep neural feature ansatz. So if there were another identity for $W_{l+1}W_{l+1}^\top$ in terms of some gradients, we'd get a derived identity from equality between the singular values defined in terms of those gradients and the singular values defined in terms of the Deep Neural Feature Ansatz. Or actually, there probably won't be an interesting identity here since given the cancellation above, it now feels like nothing about $W_{l+1}$ is really pinned down by 'gradients independent of $W_{l+1}$' by the DNFA? Of course, some $W_{l+1}$-dependence remains even in the gradients because the preactivations at which further gradients get evaluated are somewhat $W_{l+1}$-dependent, so I guess it's not ruled out that the DNFA constrains something interesting about $W_{l+1}$? But anyway, all this seems to undermine the interestingness of the DNFA, as well as the chance of there being an interesting similar ansatz for the left singular vectors of $W_{l+1}$.
  • Can one heuristically motivate that the preactivation gradients above should indeed be close to being in isotropic position? Can one use this reduction to provide simpler proofs of some of the propositions in the paper which say that the DNFA is exactly true in certain very toy cases?
  • The authors claim that the DNFA is supposed to somehow elucidate feature learning (indeed, they claim it is a mechanism of feature learning?). I take 'feature learning' to mean something like which neuronal functions (from the input) are created or which functions are computed in a layer in some broader sense (maybe which things are made linearly readable?) or which directions in an activation space to amplify, or maybe less precisely just the process of some internal functions (from the input to internal activations) being learned, or something like that, which happens in finite networks apparently in contrast to infinitely wide networks or NTK models or something like that which I haven't yet understood? I understand that their heuristic identity on the surface connects something about a weight matrix to something about gradients, but assuming I've not made some index-off-by-one error or something, it seems to probably not really be about that at all, since the weight matrix sorta cancels out — if it's true for one $W_{l+1}$, it would maybe also be true with any other $W_{l+1}$ replacing it, so it doesn't really pin down $W_{l+1}$? (This might turn out to be false if the isotropy of preactivation gradients is only true for a very particular choice of $W_{l+1}$.) But like, ignoring that counter, I guess their point is that the directions which get stretched most by the weight matrix in a layer are the directions along which it would be the best to move locally in that activation space to affect the output? (They don't explain it this way though — maybe I'm ignorant of some other meaning having been attributed to $W_{l+1}^\top W_{l+1}$ in previous literature or something.) But they say "Informally, this mechanism corresponds to the approach of progressively re-weighting features in proportion to the influence they have on the predictions." I guess maybe this is an appropriate description of the math if they are talking about reweighting in the purely linear sense, and they take features in the input layer to be scaleless objects or something? (Like, if we take features in the input activation space to each have some associated scale, then the right singular vector identity no longer says that the most influential features get stretched the most.) I wish they were much more precise here, or if there isn't a precise interesting philosophical thing to be deduced from their math, much more honest about that, much less PR-y.
  • So, in brief, instead of "informally, this mechanism corresponds to the approach of progressively re-weighting features in proportion to the influence they have on the predictions," it seems to me that what the math warrants would be sth more like "The weight matrix reweights stuff; after reweighting, the activation space is roughly isotropic wrt affecting the prediction (ansatz); so, the stuff that got the highest weight has most effect on the prediction now." I'm not that happy with this last statement either, but atm it seems much more appropriate than their claim.
  • I guess if I'm not confused about something major here (plausibly I am), one could probably add 1000 experiments (e.g. checking that the isotropic version of the ansatz indeed equally holds in a bunch of models) and write a paper responding to them. If you're reading this and this seems interesting to you, feel free to do that — I'm also probably happy to talk to you about the paper.

typos in the paper

indexing error in the first displaymath in Sec 2: it probably should say '', not ''

Comment by Kaarel (kh) on kh's Shortform · 2024-04-04T14:17:22.967Z · LW · GW

A thread into which I'll occasionally post notes on some ML(?) papers I'm reading

I think the world would probably be much better if everyone made a bunch more of their notes public. I intend to occasionally copy some personal notes on ML(?) papers into this thread. While I hope that the notes which I'll end up selecting for being posted here will be of interest to some people, and that people will sometimes comment with their thoughts on the same paper and on my thoughts (please do tell me how I'm wrong, etc.), I expect that the notes here will not be significantly more polished than typical notes I write for myself and my reasoning will be suboptimal; also, I expect most of these notes won't really make sense unless you're also familiar with the paper — the notes will typically be companions to the paper, not substitutes.

I expect I'll sometimes be meaner than some norm somewhere in these notes (in fact, I expect I'll sometimes be simultaneously mean and wrong/confused — exciting!), but I should just say to clarify that I think almost all ML papers/posts/notes are trash, so me being mean to a particular paper might not be evidence that I think it's worse than some average. If anything, the papers I post notes about had something worth thinking/writing about at all, which seems like a good thing! In particular, they probably contained at least one interesting idea!

So, anyway: I'm warning you that the notes in this thread will be messy and not self-contained, and telling you that reading them might not be a good use of your time :)

Comment by Kaarel (kh) on Why does generalization work? · 2024-02-22T00:12:22.541Z · LW · GW

I'd be very interested in a concrete construction of a (mathematical) universe in which, in some reasonable sense that remains to be made precise, two 'orthogonal pattern-universes' (preferably each containing 'agents' or 'sophisticated computational systems') live on 'the same fundamental substrate'. One of the many reasons I'm struggling to make this precise is that I want there to be some condition which meaningfully rules out trivial constructions in which the low-level specification of such a universe can be decomposed into a pair $(A, B)$ such that $A$ and $B$ are 'independent', everything in the first pattern-universe is a function only of $A$, and everything in the second pattern-universe is a function only of $B$. (Of course, I'd also be happy with an explanation why this is a bad question :).)

Comment by Kaarel (kh) on More Hyphenation · 2024-02-08T03:00:40.516Z · LW · GW

I find [the use of square brackets to show the merge structure of [a linguistic entity that might otherwise be confusing to parse]] delightful :)

Comment by Kaarel (kh) on Does davidad's uploading moonshot work? · 2023-11-03T19:36:23.920Z · LW · GW

I'd be quite interested in elaboration on getting faster alignment researchers not being alignment-hard — it currently seems likely to me that a research community of unupgraded alignment researchers with a hundred years is capable of solving alignment (conditional on alignment being solvable). (And having faster general researchers, a goal that seems roughly equivalent, is surely alignment-hard (again, conditional on alignment being solvable), because we can then get the researchers to quickly do whatever it is that we could do — e.g., upgrading?)

Comment by Kaarel (kh) on AI Regulation May Be More Important Than AI Alignment For Existential Safety · 2023-08-25T14:31:08.933Z · LW · GW

I was just claiming that your description of pivotal acts / of people that support pivotal acts was incorrect in a way that people that think pivotal acts are worth considering would consider very significant and in a way that significantly reduces the power of your argument as applying to what people mean by pivotal acts — I don't see anything in your comment as a response to that claim. I would like it to be a separate discussion whether pivotal acts are a good idea with this in mind.

Now, in this separate discussion: I agree that executing a pivotal act with just a narrow, safe, superintelligence is a difficult problem. That said, all paths to a state of safety from AGI that I can think of seem to contain difficult steps, so I think a more fine-grained analysis of the difficulty of various steps would be needed. I broadly agree with your description of the political character of pivotal acts, but I disagree with what you claim about associated race dynamics — it seems plausible to me that if pivotal acts became the main paradigm, then we'd have a world in which a majority of relevant people are willing to cooperate / do not want to race that much against others in the majority, and it'd mostly be a race between this group and e/acc types. I would also add, though, that the kinds of governance solutions/mechanisms I can think of that are sufficient to (for instance) make it impossible to perform distributed training runs on consumer devices also seem quite authoritarian.

Comment by Kaarel (kh) on AI Regulation May Be More Important Than AI Alignment For Existential Safety · 2023-08-25T00:34:32.759Z · LW · GW

In this comment, I will be assuming that you intended to talk of "pivotal acts" in the standard (distribution of) sense(s) people use the term — if your comment is better described as using a different definition of "pivotal act", including when "pivotal act" is used by the people in the dialogue you present, then my present comment applies less.

I think that this is a significant mischaracterization of what most (? or definitely at least a substantial fraction of) pivotal activists mean by "pivotal act" (in particular, I think this is a significant mischaracterization of what Yudkowsky has in mind). (I think the original post also uses the term "pivotal act" in a somewhat non-standard way in a similar direction, but to a much lesser degree.) Specifically, I think it is false that the primary kinds of plans this fraction of people have in mind when talking about pivotal acts involve creating a superintelligent nigh-omnipotent infallible FOOMed properly aligned ASI. Instead, the kind of person I have in mind is very interested in coming up with pivotal acts that do not use a general superintelligence, often looking for pivotal acts that use a narrow superintelligence (for instance, a narrow nanoengineer) (though this is also often considered very difficult by such people (which is one of the reasons they're often so doomy)). See, for instance, the discussion of pivotal acts in https://www.lesswrong.com/posts/7im8at9PmhbT4JHsW/ngo-and-yudkowsky-on-alignment-difficulty.

Comment by Kaarel (kh) on Polysemanticity and Capacity in Neural Networks · 2023-06-27T20:40:38.571Z · LW · GW

A few notes/questions about things that seem like errors in the paper (or maybe I'm confused — anyway, none of this invalidates any conclusions of the paper, but if I'm right or at least justifiably confused, then these do probably significantly hinder reading the paper; I'm partly posting this comment to possibly prevent some readers in the future from wasting a lot of time on the same issues):


1) The formula for  here seems incorrect:


This is because W_i is a feature corresponding to the i'th coordinate of x (this is not evident from the screenshot, but it is evident from the rest of the paper), so surely what shows up in this formula should not be W_i, but instead the i'th row of the matrix which has columns W_i (this matrix is called W later). (If one believes that W_i is a feature, then one can see this is wrong already from the dimensions in the dot product  not matching.)
 


2) Even though you say in the text at the beginning of Section 3 that the input features are independent, the first sentence below made me make a pragmatic inference that you are not assuming that the coordinates are independent for this particular claim about how the loss simplifies (in part because if you were assuming independence, you could replace the covariance claim with a weaker variance claim, since the 0 covariance part is implied by independence):

However, I think you do use the fact that the input features are independent in the proof of the claim (at least you say "because the x's are independent"):

Additionally, if you are in fact just using independence in the argument here and I'm not missing something, then I think that instead of saying you are using the moment-cumulants formula here, it would be much much better to say that independence implies that any term with an unmatched index is $0$. If you mean the moment-cumulants formula here https://en.wikipedia.org/wiki/Cumulant#Joint_cumulants , then (while I understand how to derive every equation of your argument in case the inputs are independent), I'm currently confused about how that's helpful at all, because one then still needs to analyze which terms of each cumulant are 0 (and how the various terms cancel for various choices of the matching pattern of indices), and this seems strictly more complicated than the problem before translating to cumulants, unless I'm missing something obvious.
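(Spelling out the 'unmatched index' point, assuming — as I believe is the setup there — that the inputs are independent with mean zero: if some index, say $i$, appears exactly once in a product of coordinates, then $$\mathbb{E}[x_i x_j x_k x_l] = \mathbb{E}[x_i]\,\mathbb{E}[x_j x_k x_l] = 0,$$ so, e.g., $\mathbb{E}[x_i x_j] = \delta_{ij}\,\mathbb{E}[x_i^2]$, and fourth moments vanish unless the indices match up in pairs or all coincide.)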

3) I'm pretty sure this should say x_i^2 instead of x_i x_j, and as far as I can tell the LHS has nothing to do with the RHS:

(I think it should instead say sth like that the loss term is proportional to the squared difference between the true and predictor covariance.)

Comment by Kaarel (kh) on Question for Prediction Market people: where is the money supposed to come from? · 2023-06-08T23:47:46.180Z · LW · GW

At least ignoring legislation, an exchange could offer a contract with the same return as S&P 500 (for the aggregate of a pair of traders entering a Kalshi-style event contract); mechanistically, this index-tracking could be supported by just using the money put into a prediction market to buy VOO and selling when the market settles. (I think.)
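A toy version of the mechanism I have in mind (all numbers made up, and ignoring fees, margin, and legal details):

```python
# Toy sketch of an index-backed event contract: both traders' stakes are parked in
# an index fund while the market is open, and the winner receives the stakes times
# the realized index return. All numbers here are made up for illustration.
def settle(stake_yes: float, stake_no: float, index_return: float, outcome_yes: bool) -> float:
    """Return the winner's payout when the contract settles."""
    pot = (stake_yes + stake_no) * (1.0 + index_return)  # the pot grew with the index
    return pot  # paid to the YES trader if outcome_yes else to the NO trader

# Example: each side stakes $50, the index gains 7% before settlement.
payout = settle(50.0, 50.0, 0.07, outcome_yes=True)
print(f"winner receives ${payout:.2f}")  # $107.00 instead of the usual $100.00
```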

Comment by Kaarel (kh) on kh's Shortform · 2023-03-05T10:37:12.691Z · LW · GW

An attempt at a specification of virtue ethics

I will be appropriating terminology from the Waluigi post. I hereby put forward the hypothesis that virtue ethics endorses an action iff it is what the better one of Luigi and Waluigi would do, where Luigi and Waluigi are the ones given by the posterior semiotic measure in the given situation, and "better" is defined according to what some [possibly vaguely specified] consequentialist theory thinks about the long-term expected effects of this particular Luigi vs the long-term effects of this particular Waluigi. One intuition here is that a vague specification could be more fine if we are not optimizing for it very hard, instead just obtaining a small amount of information from it per decision.

In this sense, virtue ethics literally equals continuously choosing actions as if coming from a good character. Furthermore, considering the new posterior semiotic measure after a decision, in this sense, virtue ethics is about cultivating a virtuous character in oneself. Virtue ethics is about rising to the occasion (i.e. the situation, the context). It's about constantly choosing the Luigi in oneself over the Waluigi in oneself (or maybe the Waluigi over the Luigi if we define "Luigi" as the more likely of the two and one has previously acted badly in similar cases or if the posterior semiotic measure is otherwise malign). I currently find this very funny, and, if even approximately correct, also quite cool.

Here are some issues/considerations/questions that I intend to think more about:

  1. What's a situation? For instance, does it encompass the agent's entire life history, or are we to make it more local?
  2. Are we to use the agent's own semiotic measure, or some objective semiotic measure?
  3. This grounds virtue ethics in consequentialism. Can we get rid of that? Even if not, I think this might be useful for designing safe agents though.
  4. Does this collapse into cultivating a vanilla consequentialist over many choices? Can we think of examples of prompting regimes such that collapse does not occur? The vague motivating hope I have here is that in the trolley problem case with the massive man, the Waluigi pushing the man is a corrupt psycho, and not a conflicted utilitarian.
  5. Even if this doesn't collapse into consequentialism from these kinds of decisions, I'm worried about it being stable under reflection, I guess because I'm worried about the likelihood of virtue ethics being part of an agent in reflective equilibrium. It would be sad if the only way to make this work would be to only ever give high semiotic measure to agents that don't reflect much on values.
  6. Wait, how exactly do we get Luigi and Waluigi from the posterior semiotic measure? Can we just replace this with picking the best character from the most probable few options according to the semiotic measure? Wait, is this just quantilization but funnier? I think there might be some crucial differences. And regardless, it's interesting if virtue ethics turns out to be quantilization-but-funnier.
  7. More generally, has all this been said already?
  8. Is there a nice restatement of this in shard theory language?

Comment by Kaarel (kh) on kh's Shortform · 2023-02-10T02:46:07.179Z · LW · GW

A small observation about the AI arms race in conditions of good infosec and collaboration

Suppose we are in a world where most top AI capabilities organizations are refraining from publishing their work (this could be the case because of safety concerns, or because of profit motives) + have strong infosec which prevents them from leaking insights about capabilities in other ways. In this world, it seems sort of plausible that the union of the capabilities insights of people at top labs would allow one to train significantly more capable models than the insights possessed by any single lab alone would allow one to train. In such a world, if the labs decide to cooperate once AGI is nigh, this could lead to a significantly faster increase in capabilities than one might have expected otherwise.

(I doubt this is a novel thought. I did not perform an extensive search of the AI strategy/governance literature before writing this.)

Comment by Kaarel (kh) on How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme · 2023-01-20T04:32:20.349Z · LW · GW

> First, suppose GPT-n literally just has a “what a human would say” feature and a “what do I [as GPT-n] actually believe” feature, and those are the only two consistently useful truth-like features that it represents, and that using our method we can find both of them. This means we literally only need one more bit of information to identify the model’s beliefs.

> One difference between “what a human would say” and “what GPT-n believes” is that humans will know less than GPT-n. In particular, there should be hard inputs that only a superhuman model can evaluate; on these inputs, the “what a human would say” feature should result in an “I don’t know” answer (approximately 50/50 between “True” and “False”), while the “what GPT-n believes” feature should result in a confident “True” or “False” answer.[2] This would allow us to identify the model’s beliefs from among these two options.


For $n$ such that GPT-$n$ is superhuman, I think one could alternatively differentiate between these two options by checking which is more consistent under implications, by which I mean that whenever the representation says that the propositions $p$ and $p \to q$ are true, it should also say that $q$ is true. (Here, for a language model, $p$ and $q$ could be ~whatever assertions written in natural language.) Or more generally, in addition to modus ponens, also construct new propositions with ANDs and ORs, and check against all the inference rules of zeroth-order logic, or do this for first-order logic or whatever. (Alternatively, we can also write down versions of these constraints that apply to probabilities.) Assuming [more intelligent => more consistent] (w.r.t. the same set of propositions), for a superhuman model, the model's beliefs would probably be the more consistent feature. (Of course, one could also just add these additional consistency constraints directly into the loss in CCS instead of doing a second deductive step.)
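To illustrate what a probabilistic version of the modus ponens constraint could look like as an extra loss term (this is just my own illustrative formulation, not anything from the CCS paper): if the probe assigns probabilities $p_A$, $p_{A \to B}$, $p_B$ to statements $A$, $A \to B$, $B$, coherence requires $p_B \ge p_A + p_{A \to B} - 1$, and one can penalize violations of that:

```python
# Illustrative sketch (my own formulation): a modus-ponens consistency penalty on
# probe outputs. p_a, p_imp, p_b are the probe's probabilities for statements
# A, "A implies B", and B; coherence requires P(B) >= P(A) + P(A -> B) - 1.
import torch

def modus_ponens_penalty(p_a: torch.Tensor, p_imp: torch.Tensor, p_b: torch.Tensor) -> torch.Tensor:
    """Mean squared violation of P(B) >= P(A) + P(A -> B) - 1."""
    violation = torch.relu(p_a + p_imp - 1.0 - p_b)
    return (violation ** 2).mean()

# Example usage with made-up probe outputs for a batch of (A, A->B, B) triples:
p_a = torch.tensor([0.9, 0.8, 0.3])
p_imp = torch.tensor([0.95, 0.9, 0.9])
p_b = torch.tensor([0.2, 0.85, 0.9])
print(modus_ponens_penalty(p_a, p_imp, p_b))  # only the first triple is badly inconsistent
# In a CCS-like setup, this term could be added to the usual consistency+confidence loss.
```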

I think this might even be helpful for differentiating the model's beliefs from what it models some other clever AI as believing or what it thinks would be true in some fake counterfactual world, because presumably it makes sense to devote less of one's computation to ironing out incoherence in these counterfactuals – for humans, it certainly seems computationally much easier to consistently tell the truth than to consistently talk about what would be the case in some counterfactual of similar complexity to reality (e.g. to lie).

Hmm, after writing the above, now that I think more of it, I guess it seems plausible that the feature most consistent under negations is already more likely to be the model's true beliefs, for the same reasons as what's given in the above paragraph. I guess testing modus ponens (and other inference rules) seems much stronger though, and in any case that could be useful for constraining the search.

 

(There are a bunch of people that should be thanked for contributing to the above thoughts in discussions, but I'll hopefully have a post up in a few days where I do that – I'll try to remember to edit this comment with a link to the post when it's up.)

Comment by Kaarel (kh) on Finite Factored Sets in Pictures · 2022-12-05T17:27:18.085Z · LW · GW

I think the common-cause variable here does not have to be a variable which we can observe, i.e. it is not necessarily the case that we can deterministically infer its value from the values of the two observed variables. For example, let's say the two binary variables we observe are $X$ and $Y$. We'd intuitively want to consider a causal model where some latent variable $W$ is causing both, but in a way that makes all triples of variable values have nonzero probability (which is true for these variables in practice). This is impossible if we require $W$ to be deterministic once $(X, Y)$ is known.

Comment by Kaarel (kh) on Finite Factored Sets in Pictures · 2022-12-05T13:59:21.212Z · LW · GW

I agree with you regarding 0 Lebesgue measure. My impression is that the Pearl paradigm has some [statistics -> causal graph] inference rules which basically do the job of ruling out causal graphs under which the properties seen in the data would have Lebesgue measure 0. (The inference from two variables being independent to them having no common ancestors in the underlying causal graph, stated earlier in the post, is also of this kind.) So I think it's correct to say "X has to cause Y", where this is understood as a valid inference inside the Pearl (or Garrabrant) paradigm. (But also, updating pretty close to "X has to cause Y" is correct for a Bayesian with reasonable priors about the underlying causal graphs.)

(epistemic position: I haven't read most of the relevant material in much detail)

Comment by Kaarel (kh) on Finite Factored Sets in Pictures · 2022-12-05T04:27:12.389Z · LW · GW


I don't understand why 1 is true – in general, couldn't the variable $W$ be defined on a more refined sample space? Also, I think all $4$ conditions are technically satisfied if you set $W=X$ (or well, maybe it's better to think of it as a copy of $X$).

I think the following argument works though. Note that the distribution of $X$ given $(Z,Y,W)$ is just the deterministic distribution $X = Y \oplus Z$ (this follows from the definition of $Z$). By the structure of the causal graph, the distribution of $X$ given $(Z,Y,W)$ must be the same as the distribution of $X$ given just $W$. Therefore, the distribution of $X$ given $W$ is deterministic. I strongly guess that a deterministic connection is directly ruled out by one of Pearl's inference rules.

The same argument also rules out graphs 2 and 4.


 

Comment by Kaarel (kh) on Why bet Kelly? · 2022-11-15T23:47:38.586Z · LW · GW

I took the main point of the post to be that there are fairly general conditions (on the utility function and on the bets you are offered) in which you should place each bet like your utility is linear, and fairly general conditions in which you should place each bet like your utility is logarithmic. In particular, the conditions are much weaker than your utility actually being linear, or than your utility actually being logarithmic, respectively, and I think this is a cool point. I don't see the post as saying anything beyond what's implied by this about Kelly betting vs max-linear-EV betting in general.

Comment by Kaarel (kh) on Quantum Suicide and Aumann's Agreement Theorem · 2022-11-02T13:38:08.412Z · LW · GW

(By the way, I'm pretty sure the position I outline is compatible with changing usual forecasting procedures in the presence of observer selection effects, in cases where secondary evidence which does not kill us is available. E.g. one can probably still justify [looking at the base rate of near misses to understand the probability of nuclear war instead of relying solely on the observed rate of nuclear war itself].)

Comment by Kaarel (kh) on Quantum Suicide and Aumann's Agreement Theorem · 2022-11-02T13:27:37.329Z · LW · GW

I'm inside-view fairly confident that Bob should be putting a probability of 0.01% on surviving conditional on many worlds being true, but it seems possible I'm missing some crucial considerations having to do with observer selection stuff in general, so I'll phrase the rest of this as more of a question.

What's wrong with saying that Bob should put a probability of 0.01% of surviving conditional on many-worlds being true – doesn't this just follow from the usual way that a many-worlder would put probabilities on things, or at least the simplest way for doing so (i.e. not post-normalizing only across the worlds in which you survive)? I'm pretty sure that the usual picture of Bayesianism as having a big (weighted) set of possible worlds in your head and, upon encountering evidence, discarding the ones which you found out you were not in, also motivates putting a probability of 0.01% on surviving conditional on many-worlds. (I'm assuming that for a many-worlder, weights on worlds are given by squared amplitudes or whatever.)

This contradicts a version of the conservation of expected evidence in which you only average over outcomes in which you survive (even in cases where you don't survive in all outcomes), but that version seems wrong anyway, with Leslie's firing squad seeming like an obvious counterexample to me, https://plato.stanford.edu/entries/fine-tuning/#AnthObje .

Comment by Kaarel (kh) on Superintelligent AI is necessary for an amazing future, but far from sufficient · 2022-11-02T12:29:12.096Z · LW · GW

A big chunk of my uncertainty about whether at least 95% of the future’s potential value is realized comes from uncertainty about "the order of magnitude at which utility is bounded". That is, if unbounded total utilitarianism is roughly true, I think there is a <1% chance in any of these scenarios that >95% of the future's potential value would be realized. If decreasing marginal returns in the [amount of hedonium -> utility] conversion kick in fast enough for 10^20 slightly conscious humans on heroin for a million years to yield 95% of max utility, then I'd probably give a >10% chance to strong utopia even conditional on building the default superintelligent AI. Both options seem significantly probable to me, causing my odds to vary much less between the scenarios.

This is assuming that "the future’s potential value" is referring to something like the (expected) utility that would be attained by the action sequence recommended by an oracle giving humanity optimal advice according to our CEV. If that's a misinterpretation or a bad framing more generally, I'd enjoy thinking again about the better question. I would guess that my disagreement with the probabilities is greatly reduced on the level of the underlying empirical outcome distribution.

Comment by Kaarel (kh) on Possible miracles · 2022-10-09T22:15:49.808Z · LW · GW

Great post, thanks for writing this! In the version of "Alignment might be easier than we expect" in my head, I also have the following:

  • Value might not be that fragile. We might "get sufficiently many bits in the value specification right" sort of by default to have an imperfect but still really valuable future.
    • For instance, maybe IRL would just learn something close enough to pCEV-utility from human behavior, and then training an agent with that as the reward would make it close enough to a human-value-maximizer. We'd get some misalignment on both steps (e.g. because there are systematic ways in which the human is wrong in the training data, and because of inner misalignment), but maybe this is little enough to be fine, despite fragility of value and despite Goodhart.
    • Even if deceptive alignment were the default, it might be that the AI gets sufficiently close to correct values before "becoming intelligent enough" to start deceiving us in training, such that even if it is thereafter only deceptively aligned, it will still execute a future that's fine when in deployment.
    • It doesn't seem completely wild that we could get an agent to robustly understand the concept of a paperclip by default. Is it completely wild that we could get an agent to robustly understand the concept of goodness by default?
    • Is it so wild that we could by default end up with an AGI that at least does something like putting 10^30 rats on heroin? I have some significant probability on this being a fine outcome.
    • There's some distance $\epsilon$ from the correct value specification such that stuff is fine if we get AGI with values closer than $\epsilon$. Do we have good reasons to think that $\epsilon$ is far out of the range that default approaches would give us?

(But here's some reasons not to expect this.)

Comment by Kaarel (kh) on Inferring utility functions from locally non-transitive preferences · 2022-10-07T11:13:07.906Z · LW · GW

I still disagree / am confused. If it's indeed the case that , then why would we expect ? (Also, in the second-to-last sentence of your comment, it looks like you say the former is an equality.) Furthermore, if the latter equality is true, wouldn't it imply that the utility we get from [chocolate ice cream and vanilla ice cream] is the sum of the utility from chocolate ice cream and the utility from vanilla ice cream? Isn't  supposed to be equal to the utility of ?

My current best attempt to understand/steelman this is to accept , to reject , and to try to think of the embedding as something slightly strange. I don't see a reason to think utility would be linear in current semantic embeddings of natural language or of a programming language, nor do I see an appealing other approach to construct such an embedding. Maybe we could figure out a correct embedding if we had access to lots of data about the agent's preferences (possibly in addition to some semantic/physical data), but it feels like that might defeat the idea of this embedding in the context of this post as constituting a step that does not yet depend on preference data. Or alternatively, if we are fine with using preference data on this step, maybe we could find a cool embedding, but in that case, it seems very likely that it would also just give us a one-step solution to the entire problem of computing a set of rational preferences for the agent.

A separate attempt to steelman this would be to assume that we have access to a semantic embedding pretrained on preference data from a bunch of other agents, and then to tune the utilities of the basis to best fit the preferences of the agent we are currently dealing with. That seems like a cool idea, although I'm not sure if it has strayed too far from the spirit of the original problem.

Comment by Kaarel (kh) on Continental Philosophy as Undergraduate Mathematics · 2022-10-07T08:05:50.631Z · LW · GW

The link in this sentence is broken for me: "Second, it was proven recently that utilitarianism is the “correct” moral philosophy." Unless this is intentional, I'm curious to know where it directed to.

I don't know of a category-theoretic treatment of Heidegger, but here's one of Hegel: https://ncatlab.org/nlab/show/Science+of+Logic. I think it's mostly due to Urs Schreiber, but I'm not sure – in any case, we can be certain it was written by an Absolute madlad :)


 

Comment by Kaarel (kh) on A gentle primer on caring, including in strange senses, with applications · 2022-09-30T10:35:53.795Z · LW · GW

Why should I care about similarities to pCEV when valuing people?

It seems to me that this matters in case your metaethical view is that one should do pCEV, or more generally if you think matching pCEV is evidence of moral correctness. If you don't hold such metaethical views, then I might agree that (at least in the instrumentally rational sense, at least conditional on not holding any metametalevel views that contradict these) you shouldn't care.


> Why is the first example explaining why someone could support taking money from people you value less to give to other people, while not supporting doing so with your own money? It's obviously true under utilitarianism

I'm not sure if it answers the question, but I think it's a cool consideration. I think most people are close to acting weighted-utilitarianly, but few realize how strong the difference between public and private charity is according to weighted-utilitarianism.

> It's weird to bring up having kids vs. abortion and then not take a position on the latter. (Of course, people will be pissed at you for taking a position too.)

My position is "subsidize having children, that's all the regulation around abortion that's needed". So in particular, abortion should be legal at any time. (I intended what I wrote in the post to communicate this, but maybe I didn't do a good job.)

> democracy plans for right now
I'm not sure I understand in what sense you mean this? Voters are voting according to preferences that partially involve caring about future selves. If what you have in mind is something like people being less attentive about costs policies cause 10 years into the future and this leads to discounting these more than the discount from caring alone, then I guess I could see that being possible. But that could also happen for people's individual decisions, I think? I guess one might argue that people are more aware about long-term costs of personal decisions than of policies, but this is not clear to me, especially with more analysis going into policy decisions.

> As to your framing, the difference between you-now and you-future is mathematically bigger than the difference between others-now and others-future if you use a ratio for the number of links to get to them.
> Suppose people change half as much in a year as your sibling is different from you, and you care about similarity for what value you place on someone. Thus, two years equals one link.
> After 4 years, you are now two links away from yourself-now and your sibling is 3 from you now. They are 50% more different than future you (assuming no convergence). After eight years, you are 4 links away, while they are only 5, which makes them 25% more different to you than you are.
> Alternately, they have changed by 67% more, and you have changed by 100% of how much how distant they were from you at 4 years.
> It thus seems like they have changed far less than you have, and are more similar to who they were, thus why should you treat them as having the same rate.


That's a cool observation! I guess this won't work if we discount geometrically in the number of links. I'm not sure which is more justified.


There is lots of interesting stuff in your last comment which I still haven't responded to. I might come back to this in the future if I have something interesting to say. Thanks again for your thoughts!

Comment by Kaarel (kh) on kh's Shortform · 2022-09-30T05:37:18.452Z · LW · GW

I proposed a method for detecting cheating in chess; cross-posting it here in the hopes of maybe getting better feedback than on reddit: https://www.reddit.com/r/chess/comments/xrs31z/a_proposal_for_an_experiment_well_data_analysis/  

Comment by Kaarel (kh) on A gentle primer on caring, including in strange senses, with applications · 2022-08-30T16:18:23.755Z · LW · GW

Thanks for the comments!

> In 'The inequivalence of society-level and individual charity' they list the scenarios as 1, 1, and 2 instead of A, B, C, as they later use. Later, it refers incorrectly to preferring C to A with different necessary weights when the second reference is to prefer C to B.

I agree and I published an edit fixing this just now

> The claim that money becomes utility as a log of the amount of money isn't true, but is probably close enough for this kind of use. You should add a note to the effect. (The effects of money are discrete at the very least).

I mostly agree, but I think footnote 17 covers this?

> The claim that the derivative of the log of y = 1/y is also incorrect. In general, log means either log base 10, or something specific to the area of study. If written generally, you must specify the base. (For instance, in Computer Science it is base-2, but I would have to explain that if I was doing external math with that.) The derivative of the natural log is 1/n, but that isn't true of any other log. You should fix that statement by specifying you are using ln instead of log (or just prepending the word natural).

I think the standard in academic mathematics is that $\log = \ln$, https://en.wikipedia.org/wiki/Natural_logarithm#Notational_conventions, and I guess I would sort of like to spread that standard :). I think it's exceedingly rare for someone to mean base 10 in this context, but I could be wrong. I agree that base 2 is also reasonable though. In any case, the base only changes utility by scaling by a constant, so everything in that subsection after the derivative should be true independently of the base. Nevertheless, I'm adding a footnote specifying this.
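Spelled out: $$\frac{d}{dy}\log_b y = \frac{1}{y\ln b}, \qquad \log_b y = \frac{\ln y}{\ln b},$$ so switching from $\ln$ to $\log_b$ just multiplies the utility function by the constant $1/\ln b$, which doesn't affect anything in that subsection.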

> Just plain wrong in my opinion, for instance, claiming that a weight can't be negative assumes away the existence of hate, but people do hate either themselves or others on occasion in non-instrumental ways, wanting them to suffer, which renders this claim invalid (unless they hate literally everyone).

I'm having a really hard time imagining thinking this about someone else (I can imagine hate in the sense of like... not wanting to spend time together with someone and/or assigning a close-to-zero weight), but I'm not sure – I mean, I agree there definitely are people who think they non-instrumentally want the people who killed their family or whatever to suffer, but I think that's a mistake? That said, I think I agree that for the purposes of modeling people, we might want to let weights be negative sometimes.

> I also don't see how being perfectly altruistic necessitates valuing everyone else exactly the same as you. I could still value others different amounts without being any less altruistic, especially if the difference is between a lower value for me and the others higher. Relatedly, it is possible to not care about yourself at all, but this math can't handle that.

I think it's partly that I just wanted to have some shorthand for "assign equal weight to everyone", but I also think it matches the commonsense notion of being perfectly altruistic. One argument for this is that 1) one should always assign at least as high a weight to oneself as to anyone else (also see footnote 12 here) and 2) if one assigns a lower weight to someone else than to oneself, then one is not perfectly altruistic in interactions with that person – given this, the unique option is to assign equal weight to everyone.

Comment by Kaarel (kh) on kh's Shortform · 2022-07-06T21:48:03.471Z · LW · GW

I'm updating my estimate of the return on investment into culture wars from being an epsilon fraction compared to canonical EA cause areas to epsilon+delta. This has to do with cases where AI locks in current values extrapolated "correctly" except with too much weight put on the practical (as opposed to the abstract) layer of current preferences. What follows is a somewhat more detailed status report on this change.

For me (and I'd guess for a large fraction of autistic altruistics multipliers), the general feels regarding [being a culture war combatant in one's professional capacity] seem to be that while the questions fought over have some importance, the welfare-produced-per-hour-worked from doing direct work is at least an order of magnitude smaller than the same quantities for any canonical cause area (also true for welfare/USD). I'm fairly certain one can reach this conclusion from direct object-level estimates, as I imagine e.g. OpenPhil has done, although I admit I haven't carried out such calculations with much care myself. Considering the incentives of various people involved should also support this being a lower welfare-per-hour-worked cause area (whether an argument along these lines gives substantive support to the conclusion that there is an order-of-magnitude difference appears less clear).

So anyway, until today part of my vague cloud of justification for these feels was that "and anyway, it's fine if this culture war stuff is fixed in 30 years, after we have dealt with surviving AGI". The small realization I had today was that maybe a significant fraction of the surviving worlds are those where something like corrigibility wasn't attainable but AI value extrapolation sort of worked out fine, i.e. with the values that got locked in being sort of fine, but the relative weights of object-level intuitions/preferences were kinda high compared to the weight on simplicity/[meta-level intuitions], like in particular maybe the AI training did some Bayesian-ethics-evidential-double-counting of object-level intuitions about 10^10 similar cases (I realize it's quite possible that this last clause won't make sense to many readers, but unfortunately I won't provide an explanation here; I intend to write about a few ideas on this picture of Bayesian ethics at some later time, but I want to read Beckstead's thesis first, which I haven't done yet; anyway the best I can offer is that I estimate a 75% chance of you understanding the rough idea I have in mind (which does not necessarily imply that the idea can actually be unfolded into a detailed picture that makes sense), conditional on understanding my writing in general and conditional on not having understood this clause yet, after reading Beckstead's thesis; also: woke: Bayesian ethics, bespoke: INFRABAYESIAN ETHICS, am I right folks).

So anyway, finally getting to the point of all this at the end of the tunnel, in such worlds we actually can't fix this stuff later on, because all the current opinions on culture war issues got locked in.

(One could argue that we can anyway be quite sure that this consideration matters little, because most expected value is not in such kinda-okay worlds, because even if these were 99% of the surviving worlds, assuming fun theory makes sense or simulated value-bearing minds are possible, there will be amazingly more value in each world where AGI worked out really well, as compared to a world tiled with Earth society 2030. But then again, this counterargument could be iffy to some, in sort of the same way in which fanaticism (in Bostrom's sense) or the St. Petersburg paradox feel iffy to some, or perhaps in another way. I won't be taking a further position on this at the moment.)

Comment by Kaarel (kh) on TurnTrout's shortform feed · 2022-07-06T19:22:16.354Z · LW · GW

Oops I realized that the argument given in the last paragraph of my previous comment applies to people maximizing their personal welfare or being totally altruistic or totally altruistic wrt some large group or some combination of these options, but maybe not so much to people who are e.g. genuinely maximizing the sum of their family members' personal welfares, but this last case might well be entailed by what you mean by "love", so maybe I missed the point earlier. In the latter case, it seems likely that an IQ boost would keep many parts of love intact initially, but I'd imagine that for a significant fraction of people, the unequal relationship would cause sadness over the next 5 years, which with significant probability causes falling out of love. Of course, right after the IQ boost you might want to invent/implement mental tech which prevents this sadness or prevents the value drift caused by growing apart, but I'm not sure if there are currently feasible options which would be acceptable ways to fix either of these problems. Maybe one could figure out some contract to sign before the value drift, but this might go against some deeper values, and might not count as staying in love anyway.

Comment by Kaarel (kh) on TurnTrout's shortform feed · 2022-07-06T14:37:04.616Z · LW · GW

Something that confuses me about your example's relevance is that it's like almost the unique case where it's [[really directly] impossible] to succumb to optimization pressure, at least conditional on what's good = something like coherent extrapolated volition. That is, under (my understanding of) a view of metaethics common in these corners, what's good just is what a smarter version of you would extrapolate your intuitions/[basic principles] to, or something along these lines. And so this is almost definitionally almost the unique situation that we'd expect could only move you closer to better fulfilling your values, i.e. nothing could break for any reason, and in particular not break under optimization pressure (where breaking is measured w.r.t. what's good). And being straightforwardly tautologically true would make it a not very interesting example.

editorial remark: I realized after writing the two paragraphs below that they probably do not move one much on the main thesis of your post, at least conditional on already having read Ege Erdil's doubts about your example (except insofar as someone wants to defer to opinions of others or my opinion in particular), but I decided to post anyway in large part since these family matters might be a topic of independent interest for some:

I would bet that at least 25% of people would stop loving their (current) family in <5 years (i.e. not love them much beyond how much they presently love a generic acquaintance) if they got +30 IQ. That said, I don't claim the main case of this happening is because of applying too much optimization pressure to one's values, at least not in a way that's unaligned with what's good -- I just think it's likely to be the good thing to do (or like, part of all the close-to-optimal packages of actions, or etc.). So I'm not explicitly disagreeing with the last sentence of your comment, but I'm disagreeing with the possible implicit justification of the sentence that goes through ["I would stop loving my family" being false].

The argument for it being good to stop loving your family in such circumstances is just that it's suboptimal for having an interesting life, or for [the sum over humans of interestingness of their lives] if you are altruistic, or whatever, for post-IQ-boost-you to spend a lot of time with people much dumber than you, which your family is now likely to be. (Here are 3 reasons to find a new family: you will have discussions which are more fun -> higher personal interestingness; you will learn more from these discussions -> increased productivity; and something like productivity being a convex function of IQ -- this comes in via IQs of future kids, at least assuming the change in your IQ would be such as to partially carry over to kids. I admit there is more to consider here, e.g. some stuff with good incentives, breaking norms of keeping promises -- my guess is that these considerations have smaller contributions.) 

Comment by Kaarel (kh) on Is AI Progress Impossible To Predict? · 2022-05-17T09:54:23.953Z · LW · GW

I started writing this but lost faith in it halfway through, and realized I was spending too much time on it for today. I figured it's probably a net positive to post this mess anyway although I have now updated to believe somewhat less in it than the first paragraph indicates. Also I recommend updating your expected payoff from reading the rest of this somewhat lower than it was before reading this sentence. Okay, here goes:

{I think people here might be attributing too much of the explanatory weight on noise. I don't have a strong argument for why the explanation definitely isn't noise, but here is a different potential explanation that seems promising to me. (There is a sense in which this explanation is still also saying that noise dominates over any relation between the two variables -- well, there is a formal sense in which that has to be the case since the correlation is small -- so if this formal thing is what you mean by "noise", I'm not really disagreeing with you here. In this case, interpret my comment as just trying to specify another sense in which the process might not be noisy at all.) This might be seen as an attempt to write down the "sigmoids spiking up in different parameter ranges" idea in a bit more detail.

First, note that if the performance on every task is a perfectly deterministic logistic function with midpoint x_0 and logistic growth rate k, i.e. there is "no noise", with k and x_0 being the same across tasks, then these correlations would be exactly 0. (Okay, we need to be adding an epsilon of noise here so that we are not dividing by zero when calculating the correlation, but let's just do that and ignore this point from now on.) Now as a slightly more complicated "noiseless" model, we might suppose that performance on each task is still given by a "deterministic" logistic function, but with the parameters k and x_0 being chosen at random according to some distribution. It would be cool to compute some integrals / program some sampling to check what correlation one gets when k and x_0 are both normally distributed with reasonable means and variances for this particular problem, with no noise beyond that.}

This is the point where I lost faith in this for now. I think there are parameter ranges for how k and x_0 are distributed where one gets a significant positive correlation and ranges where one gets a significant negative correlation in the % case. Negative correlations seem more likely for this particular problem. But more importantly, I no longer think I have a good explanation why this would be so close to 0. I think in logit space, the analysis (which I'm omitting here) becomes kind of easy to do by hand (essentially because the logit and logistic function are inverses), and the outcome I'm getting is that the correlation should be positive, if anything. Maybe it becomes negative if one assumes the logistic functions in our model are some other sigmoids instead, I'm not sure. It seems possible that the outcome would be sensitive to such details. One idea is that maybe if one assumes there is always eps of noise and bounds the sigmoid away from 1 by like 1%, it would change the verdict.

Anyway, the conclusion I was planning to reach here is that there is a plausible way in which all the underlying performance curves would be super nice, not noisy at all, but the correlations we are looking at would still be zero, and that I could also explain the negative correlations without noisy reversion to the mean (instead this being like a growth range somewhere decreasing the chance there is a growth range somewhere else), but the argument ended up being much less convincing than I anticipated. In general, I'm now thinking that most such simple models should have negative or positive correlation in the % case depending on the parameter range, and could be anything for logit. Maybe it's just that these correlations are swamped by noise after all. I'll think more about it.

Comment by Kaarel (kh) on The Last Paperclip · 2022-05-13T07:19:57.326Z · LW · GW

That was interesting! Thank you!

Comment by Kaarel (kh) on Various Alignment Strategies (and how likely they are to work) · 2022-05-04T01:35:50.707Z · LW · GW

There is also another way that super-intelligent AI could be aligned by definition.  Namely, if your utility function isn't "humans survive" but instead "I want the future to be filled with interesting stuff".  For all the hand-wringing about paperclip maximizers, the fact remains that any AI capable of colonizing the universe will probably be pretty cool/interesting.  Humans don't just create poetry/music/art because we're bored all the time, but rather because expressing our creativity helps us to think better.  It's probably much harder to build an AI that wipes out all humans and then colonizes space and is also super-boring, than to make one that does those things in a way people who fantasize about giant robots would find cool.

I'm not convinced that (the world with) a superintelligent AI would probably be pretty cool/interesting. Does anyone know of a post/paper/(sci-fi )book/video/etc that discusses this? (I know there's this :P and maybe this.) Perhaps let's discuss this! I guess the answer depends on how human-centered/inspired (not quite the right term, but I couldn't come up with a better one) our notion of interestingness is in this question. It would be cool to have a plot of expected interestingness of the first superintelligence (or well, instead of expectation it is better to look at more parameters, but you get the idea) as a function of human-centeredness of what's meant by "interestingness". Of course, figuring this out in detail would be complicated, but it nevertheless seems likely that something interesting could be said about it.

I think we (at least also) create poetry/music/art because of godshatter. To what extent should we expect AI to godshatter, vs do something like spending 5 minutes finding one way to optimally turn everything into paperclips and doing that for all eternity? The latter seems pretty boring. Or idk, maybe the "one way" is really an exciting enough assortment of methods that it's still pretty interesting even if it's repeated for all eternity?

Comment by Kaarel (kh) on Inferring utility functions from locally non-transitive preferences · 2022-02-12T15:42:56.756Z · LW · GW

more on 4: Suppose you have horribly cyclic preferences and you go to a rationality coach to fix this. In particular, your ice cream preferences are vanilla>chocolate>mint>vanilla. Roughly speaking, Hodge is the rationality coach that will tell you to consider the three types of ice cream equally good from now on, whereas Mr. Max Correct Pairs will tell you to switch one of the three preferences. Which coach is better? If you dislike breaking cycles arbitrarily, you should go with Hodge. If you think losing your preferences is worse than that, go with Max. Also, Hodge has the huge advantage of actually being computable in a reasonable amount of time :)
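A minimal sketch of the two coaches on this toy example. The least-squares formulation below is one standard way of taking the gradient part of the Hodge decomposition; the items, edge directions, and unit preference strengths are just the 3-cycle from the comment, not anything from the post.

```python
import numpy as np
from itertools import permutations

# Items: 0 = vanilla, 1 = chocolate, 2 = mint.
# Observed preferences (a, b, strength) meaning "a preferred to b by `strength`":
# the perfect cycle vanilla > chocolate > mint > vanilla.
prefs = [(0, 1, 1.0), (1, 2, 1.0), (2, 0, 1.0)]
n = 3

# Coach 1: Hodge. Find scores s minimizing sum over edges of
# (s[a] - s[b] - strength)^2; this is the gradient component of the
# Hodge decomposition of the preference flow.
A = np.zeros((len(prefs), n))
b = np.zeros(len(prefs))
for row, (i, j, y) in enumerate(prefs):
    A[row, i], A[row, j] = 1.0, -1.0
    b[row] = y
s, *_ = np.linalg.lstsq(A, b, rcond=None)
print("Hodge scores:", s - s.mean())  # all (numerically) zero: "equally good"

# Coach 2: Mr. Max Correct Pairs. Find the total order that disagrees with
# as few of the observed preferences as possible (brute force, fine for n=3).
def violations(order):
    pos = {item: rank for rank, item in enumerate(order)}  # rank 0 = best
    return sum(1 for a_, b_, _ in prefs if pos[a_] > pos[b_])

best = min(permutations(range(n)), key=violations)
print("Best order:", best, "violations:", violations(best))  # any order flips exactly 1
```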

Comment by Kaarel (kh) on Inferring utility functions from locally non-transitive preferences · 2022-02-12T14:27:31.451Z · LW · GW

3. Ahh okay thanks, I have a better picture of what you mean by a basis of possibility space now. I still doubt that utility interacts nicely with this linear structure though. The utility function is linear in lotteries, but this is distinct from being linear in possibilities. Like, if I understand your idea on that step correctly, you want to find a basis of possibility-space, not lottery space. (A basis on lottery space is easy to find -- just take all the trivial lotteries, i.e. those where some outcome has probability 1.) To give an example of the contrast: if the utility I get from a life with vanilla ice cream is u_1 and the utility I get from a life with chocolate ice cream is u_2, then the utility of a lottery with 50% chance of each is indeed 0.5 u_1 + 0.5 u_2. But what I think you need on that step is something different. You want to say something like "the utility of the life where I get both vanilla ice cream and chocolate ice cream is u_1+u_2". But this still seems just morally false to me. I think the mistake you are making in the derivation you give in your comment is interpreting the numerical coefficients in front of events as both probabilities of events or lotteries and as multiplication in the linear space you propose. The former is fine and correct, but I think the latter is not fine. So in particular, when you write u(2A), in the notation of the source you link, this can only mean "the utility you get from a lottery where the probability of A is 2", which does not make sense assuming you don't allow your probabilities to be >1. Or even if you do allow probabilities >1, it still won't give you what you want. In particular, if A is a life with vanilla ice cream, then in their notation, 2A does not refer to a life with twice the quantity of vanilla ice cream, or whatever. 

4. I think the gradient part of the Hodge decomposition is not (in general) the same as the ranking with the minimal number of incorrect pairs. Fun stuff

Comment by Kaarel (kh) on Inferring utility functions from locally non-transitive preferences · 2022-02-11T16:23:43.163Z · LW · GW

 I liked the post; here are some thoughts, mostly on the "The futility of computing utility" section:

1)

If we're not incredibly unlucky, we can hope to sort N-many outcomes with O(N log N) comparisons.

I don't understand why you need to not be incredibly unlucky here. There are plenty of deterministic algorithms with this runtime, no?

2) I think that in step 2, once you know the worst and the best outcome, you can skip to step 3 (i.e. the full ordering does not seem to be needed to enter step 3). So instead of sorting in n log n time, you could find min and max in linear time, and then skip to the psychophysics.

3) Could you elaborate on what you mean by a basis of possibility-space? It is not obvious to me that possibility-space has a linear structure (i.e. that it is a vector space), or that utility respects this linear structure (e.g. the utility I get from having chocolate ice cream and having vanilla ice cream is not in general approximated well by the sum of the utilities I get from having only one of these, and similarly for multiplication by scalars). Perhaps you were using these terms metaphorically, but then I currently have doubts about this being a great metaphor / would appreciate having the metaphor spelled out more explicitly. I could imagine doing something like picking some random subset of the possibilities, doing something to figure out the utilities of this subset, and then doing some linear regression (or something more complicated) on various parameters to predict the utilities of all possibilities. It seems like a more symmetric way to think about this might be to consider the subset of outcomes (with each pair, or some of the pairs, being labeled according to which one is preferred) to be the training data, and then training a neural network (or whatever) that predicts utilities of outcomes so as to minimize loss on this training data. And then to go from training data to any possibility, just unleash the same neural network on that possibility. (Under this interpretation, I guess "elementary outcomes" <-> training data, and there does not seem to be a need to assume linear structure of possibility-space. There is a minimal sketch of this after point 4 below.)

4) I think I have something cool to say about a specific and very related problem. Input: a set of outcomes and some pairwise preferences between them. Desired output: a total order on these outcomes such that the number of pairs which are ordered incorrectly is minimal. It turns out that this is NP-hard: https://epubs.siam.org/doi/pdf/10.1137/050623905. (The problem considered in this paper is the special case of the problem I stated where all the pairwise preferences are given, but well, if a special case is NP-hard, then the problem itself is also NP-hard.)
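On point 3: here is a minimal sketch of the "treat labeled pairs as training data, fit a model, then score possibilities you never compared" idea. Everything concrete is an assumption made for illustration: outcomes are represented by made-up feature vectors, the utility model is linear in those features, and the pairwise labels are fit with a Bradley-Terry-style logistic model rather than whatever model the post had in mind.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: each outcome has a feature vector, and we observe
# labelled pairs (a, b) meaning "a preferred to b". We fit a linear utility
# u(x) = w @ x under P(a preferred to b) = sigmoid(u(a) - u(b)), then use w
# to score outcomes that never appeared in any labelled pair.
n_outcomes, dim = 50, 5
features = rng.normal(size=(n_outcomes, dim))
true_w = rng.normal(size=dim)                 # unknown "true" utility weights
true_u = features @ true_w

# Training data: random pairs, labelled by the true utility (no label noise here).
pairs = rng.integers(0, n_outcomes, size=(200, 2))
pairs = pairs[pairs[:, 0] != pairs[:, 1]]
labels = (true_u[pairs[:, 0]] > true_u[pairs[:, 1]]).astype(float)

w = np.zeros(dim)
lr = 0.1
for _ in range(2000):
    diff = features[pairs[:, 0]] - features[pairs[:, 1]]
    p = 1.0 / (1.0 + np.exp(-diff @ w))       # predicted P(first element wins)
    grad = diff.T @ (p - labels) / len(pairs) # gradient of the logistic loss
    w -= lr * grad

pred_u = features @ w
print(np.corrcoef(pred_u, true_u)[0, 1])      # close to 1 on this toy data
```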

Comment by Kaarel (kh) on The Kelly Criterion · 2022-01-16T08:11:34.866Z · LW · GW

Or maybe to state a few things a bit more clearly: we first showed that E[X_n|X_{n-1}=x] <= 2px, with equality iff we bet everything on step n. Using this, note that

E[X_n] = sum_x P(X_{n-1}=x) E[X_n|X_{n-1}=x] <= sum_x P(X_{n-1}=x) * 2px = 2pE[X_{n-1}],

with equality iff we bet everything on step n conditional on any value of X_{n-1}. So regardless of what you do for the first n-1 steps, what you should do on step n is to bet everything, and this gives you the expectation E[X_n]=2pE[X_{n-1}]. Then finish as before.

Comment by Kaarel (kh) on The Kelly Criterion · 2022-01-16T07:51:23.340Z · LW · GW

If you have money x after n-1 steps, then betting a fraction f on the n'th step gives you expected money (1-f)x + 2pfx. Given p>0.5, this is maximized at f=1, i.e. betting everything, which gives the expectation 2px. So conditional on having money x after n-1 steps, to maximize expectation after n steps, you should bet everything. Let X_i be the random variable that is the amount of money you have after i steps given your betting strategy. We have E[X_n] = sum_x P(X_{n-1}=x) E[X_n|X_{n-1}=x] (one could also write down a continuous version of the same conditioning, but it is a bit easier to read if we assume that the set of possible amounts of money after n-1 steps is discrete, which is what I did here). From this formula, it follows that for any given strategy up to step n-1, hence given values for P(X_{n-1}=x), the thing to do on step n that maximizes E[X_n] is the same as the thing that maximizes E[X_n|X_{n-1}=x] for each x. So to maximize E[X_n], you should bet everything on the n'th step. If you bet everything, then the above formula gives E[X_n] = sum_x P(X_{n-1}=x) * 2px = 2pE[X_{n-1}].

To recap what we showed so far: we know that given any strategy for the first n-1 steps, the best thing to do on the last step gives E[X_n]=2pE[X_{n-1}]. It follows that the strategy with maximal E[X_n] is the one with maximal 2pE[X_{n-1}], or equivalently the one with maximal E[X_{n-1}].

Now repeat the same argument for step n-1 to conclude that one should bet everything on step n-1 to maximize the expectation after it, and so on.
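A quick numerical sanity check of the conclusion, restricted to strategies that bet a fixed fraction f every round (the values of p and n below are made up). Since the rounds are independent, the expected wealth after n rounds is ((1-f) + 2pf)^n times the starting wealth, which is maximized at f = 1, whereas the Kelly fraction 2p-1 maximizes the expected log.

```python
p, n = 0.6, 10  # win probability and number of rounds (made-up values)

def expected_wealth(f, p, n, x0=1.0):
    # Betting a fixed fraction f each round: wealth multiplies by (1 - f + 2f)
    # with probability p and by (1 - f) otherwise, so the per-round expected
    # multiplier is (1 - f) + 2*p*f, and rounds are independent.
    return x0 * ((1 - f) + 2 * p * f) ** n

for f in [0.0, 0.25, 2 * p - 1, 0.5, 0.75, 1.0]:
    print(f"f = {f:.2f}: E[X_n] = {expected_wealth(f, p, n):.3f}")
# f = 1 maximizes the expectation (it equals (2p)^n), even though that strategy
# goes broke with probability 1 - p**n.
```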

Comment by Kaarel (kh) on The Kelly Criterion · 2022-01-08T21:57:19.346Z · LW · GW

You can prove e.g. by (backwards) induction that you should bet everything every time. With the win probability being p>0.5 (and loss probability 1-p), if the expectation of whatever strategy you are using after n-1 steps is E, then the maximal expectation over all things you could do on the n'th step is 2pE (you can see this by writing the expectation as a conditional sum over the outcomes after n-1 steps), which corresponds uniquely to the strategy where you bet everything in any situation on the n'th step. It then follows that the best you can do on the (n-1)th step is also to maximize the expectation after it, and the same argument gives that you should bet everything, and so on.

(Where did you get n=10^5 from? If it came from some computer computation, then I would wager that there were some overflow/numerical issues.)

Comment by Kaarel (kh) on Why maximize human life? · 2022-01-08T09:44:23.530Z · LW · GW

"Or perhaps even: that preventing humans from being born is as bad as killing living humans."

I'm not sure if this is what you were looking for, but here are some thoughts on the "all else equal" version of the above statement. Suppose that Alice is the only person in the universe. Suppose that Alice would, conditional on you not intervening, live a really great life of 100 years. Now on the 50th birthday of Alice, you (a god-being) have the option to painlessly end Alice's life, and in place of her to create a totally new person, let's call this person Bob, who comes into existence as a 50-year old with a full set of equally happy (but totally different) memories, and who (you know) has an equally great life ahead of them as Alice would have if you choose not to intervene. (All this assumes that interpersonal comparisons of what a "great" life is make sense. I personally highly doubt one can do anything interesting in ethics without such a notion; this is just to let people know about a possible point of rejecting this argument.)

Do you think it is bad to intervene in this way? (My position is that intervening is morally neutral.) If you think it is bad to intervene, then consider intervening twice in short succession, once painlessly replacing Alice with Bob, and then painlessly replacing Bob with Alice again. Would this be even worse? Since this double-swapping process gives an essentially identical (block) universe as just doing nothing, I have a hard time seeing how anything significantly bad could have happened.


Or consider a situation in which this universe had laws of nature such that Alice was to "naturally" turn into Bob on her 50th birthday without any intervention by you. Would you then be justified in immediately swapping Bob back for Alice to prevent Alice from being "killed"?


(Of course, the usual circumstances of killing someone vs creating a new person are very much non-equivalent in practice, precisely in the various ways in which the above situation was constructed to be equivalent. Approximately no one thinks that never having a baby is as bad as having a baby and then killing them.)

Comment by Kaarel (kh) on Why maximize human life? · 2022-01-08T09:04:52.338Z · LW · GW

It could just be that a world with additional happy people is better according to my utility function, just like a world with fewer painlessly killed people per unit of time is better according to my utility function. While I agree that goodness should be "goodness for someone" in the sense that my utility function should be something like a function only of the mental states of all moral patients (at all times, etc.), I disagree with the claim that the same people have to exist in two possible worlds for me to be able to say which is better, which is what you seem to be implying in your comment. One world can be better (according to my utility function) than another because of some aggregation of the well-beings of all moral patients within it being larger. I think most people have such utility functions. Without allowing for something like this, I can't really see a way to construct an ethical model that tells essentially anything interesting about any decisions at all (at least for people who care about other people), as all decisions probably involve choosing between futures with very different sets of moral patients.

Comment by Kaarel (kh) on The Kelly Criterion · 2022-01-03T10:39:50.901Z · LW · GW

I think this comment is incorrect (in the stated generality). Here is a simple counterexample. Suppose you have a starting endowment of $1, and that you can bet any amount at 0.50001 probability of doubling your bet and 0.49999 probability of losing everything you bet. You can bet whatever amount of your money you want a total of n times. (If you lost everything in some round, we can think of this as you still being allowed to bet 0 in remaining future rounds.) The strategy that maximizes expected linear utility is the one where you bet everything every time.