Are there already manifold markets?
yes, but only small trading volume so far: https://manifold.markets/Bayesian/will-a-us-manhattanlike-project-for
(i ctrl+f'd "alignment" and there was not one mention of the AI sense)
(was this written by chatgpt?)
another crucial consideration here is that a benevolent ASI could do acausal trade to reduce suffering in the unreachable universe.[1] (comparing the EV of that probability and of the probability of human-caused long-term-suffering is complex / involves speculation about the many variables going into each side)
- ^
there's writing about this somewhere, i'm here just telling you that the possibility / topic exists
i wrote this about it but i don't think it's comprehensive enough https://quila.ink/posts/ev-of-alignment-for-negative-utilitarians/
from ycombinator comments on a post of that:[1]
(links to a comment by the original user, on a different account[2], continuing the chat):
recommend against supplementing melatonin
why?
i searched Andrew Huberman melatonin
and found this, though it looks like it may be an AI generated summary.
i might try sleeping for a long time (16-24 hours?) by taking sublingual[1] melatonin right when i start to be awake, and falling asleep soon after. my guess: it might increase my cognitive quality on the next wake up, like this:
(or do useful computation during sleep, leading to apparently having insights on the next wakeup? long elaboration below)
i wonder if it's even possible, or if i'd have trouble falling asleep again despite the melatonin.
i don't see much risk to it, since my day/night cycle is already uncalibrated[2], and melatonin is naturally used for this narrow purpose in the body.
'cognitive quality' is really vague. here's what i'm really imagining
my unscientific impression of sleep, from subjective experience (though i only experience the result) and speculation i've read, is that it does these things:
- integrates into memory what happened in the previous wake period, and maybe to a lesser extent further previous ones
- more separate to the previous wake period, acts on my intuitions or beliefs about things to 'reconcile' or 'compute implicated intuitions'. for example if i was trying to reconcile two ideas, or solve some confusing logical problem, maybe the next day i would find it easier because more background computation has been done about it?
- maybe the same kind of background cognition that happens during the day, that leads to people having ideas random-feelingly enter their awareness?
- this is the one i feel like i have some sub-linguistic understanding of how it works in me, and it seems like the more important of the two for abstract problem solving, which memories don't really matter to. for this reason, a higher proportion of sleep or near-sleep in general may be useful for problem solving.
but maybe these are not done nearly as much as they could be, because of competing selection pressures for different things, of which sleep-time computations are just some. (being awake is useful to gather food and survive)
anyways, i imagine that after those happening for a longer time, the waking mental state could be very 'fresh' / aka more unburdened by previous thoughts/experiences (bulletpoint 1), and prone to creativity/'apparently' having new insights (bulletpoint 2). (there is something it feels like to be in such a state for me, and it happens more just after waking)
- ^
takes effect sooner
- ^
i have the non-24 hour sleep/wake cycle that harry has in HPMOR. for anyone who also does, some resources:
from the author's note in chapter 98:
Last but not least:
You know Harry’s non-24 sleep disorder? I have that. Normally my days are around 24 hours and 30 minutes long.
Around a year ago, some friends of mine cofounded MetaMed, intended to provide high-grade analysis of the medical literature for people with solution-resistant medical problems. (I.e. their people know Bayesian statistics and don’t automatically believe every paper that claims to be ‘statistically significant’ – in a world where only 20-30% of studies replicate, they not only search the literature, but try to figure out what’s actually true.) MetaMed offered to demonstrate by tackling the problem of my ever-advancing sleep cycle.
Here’s some of the things I’ve previously tried:
- Taking low-dose melatonin 1-2 hours before bedtime
- Using timed-release melatonin
- Installing red lights (blue light tells your brain not to start making melatonin)
- Using blue-blocking sunglasses after sunset
- Wearing earplugs
- Using a sleep mask
- Watching the sunrise
- Watching the sunset
- Blocking out all light from the windows in my bedroom using aluminum foil, then lining the door-edges with foam to prevent light from slipping in the cracks, so I wouldn’t have to use a sleep mask
- Spending a total of ~$2200 on three different mattresses (I cannot afford the high-end stuff, so I tried several mid-end ones)
- Trying 4 different pillows, including memory foam, and finally settling on a folded picnic blanket stuffed into a pillowcase (everything else was too thick)
- Putting 2 humidifiers in my room, a warm humidifier and a cold humidifier, in case dryness was causing my nose to stuff up and thereby diminish sleep quality
- Buying an auto-adjusting CPAP machine for $650 off Craigslist in case I had sleep apnea. ($650 is half the price of the sleep study required to determine if you need a CPAP machine.)
- Taking modafinil and R-modafinil.
- Buying a gradual-light-intensity-increasing, sun alarm clock for ~$150
Not all of this was futile – I kept the darkened room, the humidifiers, the red lights, the earplugs, and one of the mattresses; and continued taking the low-dose and time-release melatonin. But that didn’t prevent my sleep cycle from advancing 3 hours per week (until my bedtime was after sunrise, whereupon I would lose several days to staying awake until sunset, after which my sleep cycle began slowly advancing again).
MetaMed produced a long summary of extant research on non-24 sleep disorder, which I skimmed, and concluded by saying that – based on how the nadir of body temperature varies for people with non-24 sleep disorder and what this implied about my circadian rhythm – their best suggestion, although it had little or no clinical backing, was that I should take my low-dose melatonin 5-7 hours before bedtime, instead of 1-2 hours, a recommendation which I’d never heard anywhere before.
And it worked.
I can’t *#&$ing believe that #*$%ing worked.
(EDIT in response to reader questions: “Low-dose” melatonin is 200microgram (mcg) = 0.2 mg. Currently I’m taking 0.2mg 5.5hr in advance, and taking 1mg timed-release just before closing my eyes to sleep. However, I worked up to that over time – I started out just taking 0.3mg total, and I would recommend to anyone else that they start at 0.2mg.)
other resources: https://slatestarcodex.com/2018/07/10/melatonin-much-more-than-you-wanted-to-know/, https://www.reddit.com/r/N24/comments/fylcmm/useful_links_n24_faq_and_software/
if i left out the word 'trying' to (not) use it in that way instead, nothing about me would change, but there would be more comments saying that success is not certain.
i also disagree with the linked post[1], which says that 'i will do x' means one will set up a plan to achieve the highest probability of x they can manage. i think it instead usually means one believes they will do x with sufficiently high probability to not mention the chance of failure.[2] the post acknowledges the first half of this -- «Well, colloquially, "I'm going to flip the switch" and "I'm going to try to flip the switch" mean more or less the same thing, except that the latter expresses the possibility of failure.» -- but fails to integrate that something being said implies belief in its relevance/importance, and so concludes that using the word 'try' (or, by extrapolation, expressing the possibility of failure in general) is unnecessary in general.
- ^
though its psychological point seems true:
But if all you want is to "maximize the probability of success using available resources", then that's the easiest thing in the world to convince yourself you've done.
- ^
this is why this wording is not used when the probability of success is sufficiently far (in percentage points, not logits) from guaranteed.
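a small worked example of the percentage-points vs logits distinction, as a sketch (the two probabilities are arbitrary picks of mine, not from the linked post): 99% and 99.9% are under one percentage point apart, yet more than two logits apart.

```python
import math

def logit(p: float) -> float:
    """Log-odds of a probability p strictly between 0 and 1."""
    return math.log(p / (1.0 - p))

def gaps(p: float, q: float) -> tuple[float, float]:
    """Return (gap in percentage points, gap in logits) between p and q."""
    return (abs(q - p) * 100.0, abs(logit(q) - logit(p)))

# Illustrative values only: both are "close to guaranteed" in percentage points.
pp_gap, logit_gap = gaps(0.99, 0.999)
print(f"{pp_gap:.2f} percentage points, {logit_gap:.2f} logits")
```

so measured in percentage points, both would count as close enough to guaranteed to say 'i will do x', even though they are far apart in logits.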
nothing short of death can stop me from trying to do good.
the world could destroy or corrupt EA, but i'd remain an altruist.
it could imprison me, but i'd stay focused on alignment, as long as i could communicate to at least one on the outside.
even if it tried to kill me, i'd continue in the paths through time where i survived.
What is malevolence? On the nature, measurement, and distribution of dark traits was posted two weeks ago (and i recommend it). there was a questionnaire discussed in that post which tries to measure the levels of 'dark traits' in the respondent.
i'm curious about the results[1] of rationalists[2] on that questionnaire, if anyone wants to volunteer theirs. there are short and long versions (16 and 70 questions).
- ^
(or responses to the questions themselves)
- ^
i also posted the same shortform to the EA forum, asking about EAs
something i'd be interested in reading: writings about the authors' alignment ontologies over time, i.e. from when they first heard of AI till now
Understanding [how to design] rather than 'growing' search/agency-structure would actually equal solving inner alignment, if said structure does not depend on what target[1] it is intended to be given, i.e. is targetable (inner-alignable) rather than target-specific.[2]
Such an understanding would simultaneously qualify as an understanding of 'how to code a capable AI', but would differ from what labs are doing in a fundamental, alignment-relevant way. In this framing, labs are selecting for target-specific structures (that we don't understand). (Another difference is that, IIRC, Johannes might intend not to share research on this publicly, but I'm less sure after rereading the quote that gave me that impression[3].)
- ^
includes outer alignment goals
- ^
If it's not clear what I mean, reading this about my background model might help, also feel free to ask me questions
- ^
from one of Johannes' posts:
I don't have such a convincing portfolio for doing research yet. And doing this seems to be much harder. Usually, the evaluation of such a portfolio requires technical expertise - e.g. how would you know if a particular math formalism makes sense if you don't understand the mathematical concepts out of which the formalism is constructed?
Of course, if you have a flashy demo, it's a very different situation. Imagine I had a video of an algorithm that learns Minecraft from scratch within a couple of real-time days, and then gets a diamond in less than 1 hour, without using neural networks (or any other black box optimization). It does not require much technical knowledge to see the significance of that.
But I don't have that algorithm, and if I had it, I would not want to make that publicly known. And I am unsure what is the cutoff value. When would something be bad to publish? All of this complicates things.
(After rereading this I'm not actually sure what that means they'd be okay sharing or if they'd intend to share technical writing that's not a flashy demo)
Let us know what you think!
the grey text feels disruptive to normal reading flow but idk why green link text wouldn't also be, maybe i'm just not used to it. e.g., in this post's "Curating technical posts" where 'Curating' is grey, my mind sees "<Curating | distinct term> technical posts" instead of [normal meaning inference not overfocused on individual words]
Is this useful, as a reader?
if the authors make sure they agree with all the definitions they allow into the glossary, yes. author-written definitions would be even more useful because how things are worded can implicitly convey things like, the underlying intuition, ontology, or related views they may be using wording to rule in or out.
Whenever an author with 100+ karma saves a draft of a post, our database queries a language model to:
i would prefer this be optional too, for drafts which are meant to be private (e.g. shared with a few other users, e.g. may contain possible capability-infohazards), where the author doesn't trust LM companies
If you think I missed the point, can you explain in more detail?
Here is my model: Demon king buys shares in “The Demon King will attack the Frozen Fortress”, then sends some small technically-an-attack to the fortress so the market resolves yes, and knowing this will be done is not worth the money lost to the Demon King on the market. No serious-battle plans or military secrets are leaked, and more generally the Demon King would only do this if the information revealed weren't worth the market cost. (i.e. it's a central kind of prediction market outcome manipulation, i.e. exploiting how this prediction market assumed a kind of metaphysical gap between predictors and the world / knowledge and action)
Do you disagree with this, or think it's true but misses the point, in which case what was the point?
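A rough arithmetic sketch of that model, under assumptions that are mine rather than the story's: YES shares pay 1 on resolution, the price stays fixed while the Demon King buys (no price impact), and all the numbers are hypothetical.

```python
def manipulation_profit(shares: float, price: float, attack_cost: float) -> float:
    """Profit from buying YES shares at `price`, then forcing a YES resolution.

    Each share pays 1 on a YES resolution, so each one bought at `price`
    gains (1 - price). `attack_cost` is what the token attack costs.
    """
    return shares * (1.0 - price) - attack_cost

def defenders_net(info_value: float, king_profit: float) -> float:
    """Value of the forecast to the defenders minus what the market paid the manipulator."""
    return info_value - king_profit

# Hypothetical numbers: 1000 shares at 0.30, a token attack costing 50,
# and advance knowledge of a token attack being worth only 20 to the defenders.
profit = manipulation_profit(shares=1000, price=0.30, attack_cost=50)
net = defenders_net(info_value=20, king_profit=profit)
print(profit, net)
```

With numbers like these the market pays the manipulator far more than the revealed information is worth, which is the situation I'm describing.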
For example, most US school children recite the Pledge of Allegiance every day (or at least they used to). I can remember not fully understanding what the words meant until I was in middle school, but I just went along with it. And wouldn't you know it, it worked! I do have an allegiance to the United States as a concept.
Can you explain how it caused that, and maybe what it feels like?
(I find it alarming that being forced to recite a pledge as a child can actually have that effect -- I knew humans were culturally programmable, but not that {forcing someone to say "I endorse x!" when they don't know what it means nor want to say it} every day would actually cause them to endorse x later on. Actually, I notice I'm skeptical that that was the real cause in your case; what's your reason for believing it was the cause?)
(No pressure to answer my questions of course - interpret them as statements of curiosity rather than requests in the human/social sense)
it helped them anticipate the Demon King’s next moves – it's not the market's fault that they couldn't convert foresight into operational superiority
The demon king only made those moves to profit from the market, they wouldn't have been made otherwise
If we stand by while OpenAI violates its charter, it signals that their execs can get away with it. Worse, it signals that we don’t care.
what signals you send to OAI execs seems not relevant.
in the case where they really can't get away with it, e.g. where the state will really arrest them, then sending them signals / influencing their information state is not what causes that outcome.
if your advocacy causes the world to change such that "they can't get away with it" becomes true, this also does not route through influencing their information state.
OpenAI is seen as the industry leader, yet projected to lose $5 billion this year
i don't see why this would lead them to downsize, if "the gap between industry investment in deep learning and actual revenue has ballooned to over $600 billion a year"
how? edit: maybe you meant just the first kind
oh i meant medical/covid ones. could also consider furry masks and the cat masks that femboys often wear (e.g. to obscure masculine facial structure), which feel cute rather than 'cool', though they are more like the natural human face in that they display an expression ("the face is a mask we wear over our skulls")
also see ashiok from mtg: whole upper face/head is replaced with shadow
also, masks 'create an asymmetry in the ability to discern emotions' but do not seem to lead to the rest
What we could do is create a predictor -- an algorithm that looks at the previously generated bits, tries to find all possible patterns in them and predict the most likely following bit -- and then actually output the opposite. Keep doing this for every bit.
i think a (simplicity-biased) predictor would narrow in on the situation described: that {the rule generating the sequence} contains {a copy of the predictor}, making them irresolvably mutually-dependent, similar to the mutual dependence in the classical halting problem.
in such a case, the predictor is not predicting a 1 or a 0, but a situation where neither can be yielded. so, to be a true implementation of said predictor, it would need to be able to output some third option representing irresolvable situations.
you'd get some string of bits before the predictor considered [irresolvable-mutual-dependence exception] most probable though! what that string is (for some prediction-narrowing procedure) sounds like a fun question
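a toy sketch of the quoted construction, to make the 'what string do you get' question concrete. the predictor here is a stand-in i made up (it guesses from what followed the longest matching recent context, up to `max_order` bits); it is not simplicity-biased in any deep sense, and it has no third 'irresolvable' output, so it just keeps being wrong.

```python
from collections import Counter

def predict_next(bits: list[int], max_order: int = 3) -> int:
    """Guess the next bit from what followed the longest matching recent context."""
    for order in range(min(max_order, len(bits)), -1, -1):
        context = tuple(bits[len(bits) - order:])
        counts = Counter()
        for i in range(len(bits) - order):
            if tuple(bits[i:i + order]) == context:
                counts[bits[i + order]] += 1
        if counts:
            return 1 if counts[1] > counts[0] else 0
    return 0  # no history yet: arbitrary default guess

def anti_predicted_sequence(n: int) -> list[int]:
    """Emit n bits, each the opposite of what predict_next guesses so far."""
    bits: list[int] = []
    for _ in range(n):
        bits.append(1 - predict_next(bits))
    return bits

sequence = anti_predicted_sequence(32)
# Score the predictor on the very sequence that was built to defeat it.
hits = sum(predict_next(sequence[:i]) == sequence[i] for i in range(len(sequence)))
print(sequence, hits)
```

by construction this predictor scores zero hits on the sequence; a predictor with the third option would instead stop emitting 0/1 guesses at whatever point it judged the mutual dependence most probable.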
one of my basic background assumptions about agency:
there is no ontologically fundamental caring/goal-directedness, there is only the structure of an action being chosen (by some process, for example a search process), then taken.
this makes me conceptualize the 'ideal agent structure' as being "search, plus a few extra parts". in my model of it, optimal search is queried for what action fulfills some criteria ('maximizes some goal') given some pointer (~ world model) to a mathematical universe sufficiently similar to the actual universe → search's output is taken as action, and because of said similarity we see a behavioral agent that looks to us like it values the world it's in.
i've been told that {it's common to believe that search and goal-directedness are fundamentally intertwined or meshed together or something}, whereas i view goal-directedness as almost not even a real thing, just something we observe behaviorally when search is scaffolded in that way.
if anyone wants to explain the mentioned view to me, or link a text about it, i'd be interested.
(maybe a difference is in the kind of system being imagined: in selected-for systems, i can understand expecting things to be approximately-done at once (i.e. within the same or overlapping strands of computations); i guess i'd especially expect that if there's a selection incentive for efficiency. i'm imagining neat, ideal (think intentionally designed rather than selected for) systems in this context.)
edit: another implication of this view is that decision theory is its own component (could be complex or not) of said 'ideal agent structure', i.e. that superintelligence with an ineffective decision theory is possible (edit: nontrivially likely for a hypothetical AI designer to unintentionally program / need to avoid). that is, one being asked the wrong questions (i.e. of the wrong decision theory) in the above model.
yep not contesting any of that
neither is there in rationality a recipe with which you can just crank the handle and come up with a proof of a conjecture
to be clear, coming up with proofs is a central example of what i meant by creativity. ("they are not satisfied by avoiding failure conditions, but require the satisfaction of some specific, hard-to-find success condition")
The “Draftsmen” podcast by two artists/art instructors contains several episodes on the subject
i am an artist as well :). i actually doubt for most artists that they could give much insight here; i think that usually artist creativity, and also mathematician creativity etc, human creativity, is of the default, mysterious kind, that we don't know where it comes from / it 'just happens', like intuitions, thoughts, realizations do - it's not actually fundamentally different from those even, just called 'creativity' more often in certain domains like art.
i don't think having (even exceptionally) high baseline intelligence and then studying bias avoidance techniques is enough for one to be able to derive an alignment solution. i have not seen in any rationalist i'm aware of what feels like enough for that, though their efforts are virtuous of course. it's just that the standard set by the universe seems higher.
i think this is a sort of background belief for me. not failing at thinking is the baseline; other needed computations are harder. they are not satisfied by avoiding failure conditions, but require the satisfaction of some specific, hard-to-find success condition. learning about human biases will not train one to cognitively seek answers of this kind, only to avoid premature failure.
this is basically a distinction between rationality and creativity. rationality[1] is about avoiding premature failure, creativity is about somehow generating new ideas.
but there is not actually something which will 'guide us through' creativity, like hpmor/the sequences do for rationality. there are various scattered posts about it[2].
i also do not have a guide to creativity to share with you. i'm only pointing at it as an equally if not more important thing.
if there is an art for creativity in the sense of narrow-solution-seeking, then where is it? somewhere in books buried deep in human history? if there is not yet an art, please link more scattered posts or comment new thoughts if you have any.
adding another possible explanation to the list:
- people may feel intimidated or discouraged from sharing ideas because of ~'high standards', or something like: a tendency to require strong evidence that a new idea is not another non-solution proposal, in order to put effort into understanding it.
i have experienced this, but i don't know how common it is.
i just also recalled that janus has said they weren't sure simulators would be received well on LW. simulators was cited in another reply to this as an instance of novel ideas.
Agreed that hidden-motte-and-baileys are a thing. They may also be caused by pressure not to express the actual belief (in which case, idk if I'd call it a fallacy / mistake of reasoning).
I'm not seeing how they synergise with the 'gish fallacy' though.
mathematicians know that a single flaw can destroy proofs of any length
Yes, but the analogy would be having multiple disjunctive proof-attempts which lead to the same result, which you can actually do validly (including with non-math beliefs). (Of course the case you describe is not a valid case of this)
by virtue of happening 10 million years ago or whatever
Why would the time it happens at matter?
we just spin a big quantum wheel, and trade with the AI that comes up
Or run a computation to approximate an average, if that's possible.
I'd guess it must be possible if you can randomly sample, at least. I.e., if you mean sampling from some set of worlds, and not just randomly combinatorially generating programs until you find a trade partner.
I know this approach isn't as effective for xrisk, but still, it's something I like to use.
This sentence has the grammatical structure of acknowledging a counterargument and negating it - "I know x, but y" - but the y is "it's something I like to use", which does not actually negate the x.
This is a kind of thing I suspect results from a process like: someone writes out the structure of negation, out of wanting to negate an argument, but then finds nothing stronger to slot into where the negating argument is supposed to be.
I tried thinking of principles, but it was hard to find ones specific to this. There's one obvious 'default' one at least (default as in it may be overridden by situation).
Secrecy
Premises:
- Model technical knowledge progress (such as about alignment) as concavely/diminishingly increasing with collaboration group size and member <cognitive traits>[1],
- Combine with unilateralist effect
- Combine with it being less hard/specific to create an unaligned than aligned superintelligent agent (otherwise the unilateralist effect would work in the opposite direction).
This implies that the positive, but not the negative, value of sharing information publicly is diminished if there is already a group trying to utilize the information. If so, the ideal may be various individuals and small or medium-sized alignment-focused groups which don't publicly share their progress by default.[4]
(I do suspect humans are biased in favor of public and social collaboration, as that's kind of what they were selected for, and in a less vulnerable world. Moreover, premise 1a ('humans are mostly the same entity') does contradict aspects of humanistic ontology. That's not strong evidence for this 'principle', just a reason it's probably under-considered)
Counterpoints:
On the concaveness assumption:
~ In history, technical knowledge was developed in a decentralized way, IIUC - based on my purely lay understanding of the history of knowledge progression, that was probably merely absorbed from stories and culture. If that's true, it is evidence against the idea that a smaller group can make almost as much progress as a large one.
Differential progress:
~ there are already far more AI researchers than AI alignment researchers. While the ideal might be for this to be a highly secretive subject like how existential risks are handled in Dath Ilan, this principle cannot give rise to that.
What are principles we can use when secrecy is not enough?
My first thought is to look for principles in games such as you mentioned. But none feel too particular to this question. It returns general things like, "search paths through time", which can similarly be used to pursue good or harmful things. This is unsatisfying.
I want deeper principles, but there may be none.
Meta-principle: Symmetry: For any principle you can apply, an agent whose behavior furthers the opposite thing could in theory also apply it.
To avoid symmetry, one could look for principles that are unlikely to be able to be utilized without specific intent and knowledge. One can outsmart runaway structural processes this way, for example, and I think that to a large extent AI research is a case of that.
How have runaway processes been defeated before? There are some generic ways, like social movements, that are already being attempted with superintelligent agent x-risk. Are there other, less well known or expected ways? And did these ways reduce to generic, 'searching paths through time', or is there a pattern to them which could be studied and understood?
There are some clever ideas for doing something like that which come to mind. E.g., the "confrontation-worthy empathy" section of this post.
It's hard for me to think of paths through time more promising than just, 'try to solve object-level alignment', though, let alone the principles which could inspire them (e.g., idk what principle the linked thing could be a case of)
- ^
I mean things like creativity, different ways of doing cognition about problems, and standard things like working memory, 'cognitive power', etc.
(I am using awkward constructions like 'high cognitive power' because standard English terms like 'smart' or 'intelligent' appear to me to function largely as status synonyms. 'Superintelligence' sounds to most people like 'something above the top of the status hierarchy that went to double college', and they don't understand why that would be all that dangerous? Earthlings have no word and indeed no standard native concept that means 'actually useful cognitive power'. A large amount of failure to panic sufficiently, seems to me to stem from a lack of appreciation for the incredible potential lethality of this thing that Earthlings as a culture have not named.)
- ^
I mean replications of the same fundamental entity, i.e. humans or the structure of what a human is. And by 'mostly' I mean of course there are differences too. I think evolution implies human minds will tend to be more reflectively aware of the differences because the sameness can operate as an unnoticed background assumption.
- ^
Like we'd not expect asking 10 ChatGPT-3.5s instead of just one to do significantly better. Less true with humans because they were still selected to be different and collaborate.
- ^
(and this may be close to the situation already?)
(This comment is tangential to the decision-theoretic focus of the post)
The AI stabilizes the situation in the world and makes sure no other dangerous AI is built, but otherwise it doesn't harm the humans.[6] Then it modifies its own code to have a commitment never to harm the humans, and let them live freely on Earth for at least a billion years, only doing the minimal necessary interventions to prevent humanity from wiping itself out with some new stupid technology. Crucially, the AI should do this self-modification[7] before it makes itself very substantially smarter or better-informed about the world, to the level that it can expect to determine whether it's in a simulation run by a very advanced future civilization.
I don't know of consistent human values which would ask for this specifically. Consider two cases[1]:
- You value something like continuation of {with a bunch of complex criteria}, not quantity of copies of, at least one 'earth society'.
- In this case, it continues regardless some of the time, conditional on the universe being large or duplicative enough to contain many copies of you / conditional on the premise in the post that at least some aligned ASIs will exist somewhere.
- Instead, you linearly value a large number of copies of earth civilizations existing or something.
- then the commitment wouldn't be just to let one earth per unaligned ASI continue, but to create more, and not cap them at a billion years.[1]
I think this is a case of humans having a deep intuition that there is only one instance of them, while also believing theory that implies otherwise, and not updating that 'deep intuition' while applying the theory even as it updates other beliefs (like the possibility for aligned ASIs from some earths to influence unaligned ones from other earths).
- ^
(to be clear, I'm not arguing for (1) or (2), and of course these are not the only possible things one can value, please do not clamp your values just because the only things humans seem to write about caring about are constrained)
i'm finally learning to prove theorems (the earliest ones following from the Peano axioms) in lean, starting with the natural number game. it is actually somewhat fun, the same kind of fun that mtg has by being not too big to fully comprehend, but still engaging to solve.
(if you want to 'play' it as well, i suggest first reading a bit about what formal systems are, and about interpretation, before starting. also, it was not clear to me at first when the game was introducing axioms vs derived theorems, so i wondered how some operations (e.g. 'induction') were allowed, but it turned out that it and some others are just in the list of Peano axioms.)
also, this reminded me of one of @Raemon's ideas (https://www.lesswrong.com/posts/PiPH4gkcMuvLALymK/exercise-solve-thinking-physics): 'how to prove a theorem' feels like a pure case of 'solving a problem that you (often) do not know how to solve', which iiuc they're a proponent of training on
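for a taste of what these early proofs look like, here is a minimal Lean 4 sketch in the spirit of the game. `MyNat` and `add` are definitions i wrote for illustration (they are not the game's own source), and the proof shows induction used as a basic principle alongside the defining equations of addition.

```lean
-- A hand-rolled copy of the natural numbers, for illustration only.
inductive MyNat where
  | zero : MyNat
  | succ : MyNat → MyNat

namespace MyNat

-- Addition defined by recursion on the second argument, as in the Peano axioms.
def add : MyNat → MyNat → MyNat
  | n, zero => n
  | n, succ m => succ (add n m)

-- `n + 0 = n` holds by definition, but `0 + n = n` needs induction on n.
theorem zero_add (n : MyNat) : add zero n = n := by
  induction n with
  | zero => rfl
  | succ k ih => simp [add, ih]

end MyNat
```

the `succ` case unfolds `add` one step and then uses the induction hypothesis `ih`, which is the pattern most of the early levels follow.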
It sounds like understanding functional decision theory might help you understand the parts you're confused about?
Like, would it go play the lottery (assuming money gives +utility for some reason) and pre-commit to pausing if it doesn't win?
Yes, it would try to do whatever the highest-possible-score thing is, regardless of how unlikely it is
Or that by setting a self-pausing policy it could alter E[result]?
By setting a self-pausing policy at the earliest point in time it can, yes. (Though I'm not sure if I'm responding to what you actually meant, or to some other thing that my mind also thinks can match to these words, because the intended meaning isn't super clear to me)
I'm conceptualizing a possible world as an (action,result) pair
(To be clear, I'm conceptualizing the agent as having Bayesian uncertainty about what world it's in, and this is what I meant when writing about "worlds in the agent's prior")
And, we could say - well, but it could fight back and then create a high-utility scenario - but then that would be the utility it would get if it doesn't end up paused, so it would get the high utility paused and again be indifferent.
An agent, (aside from edge cases where it is programmed to be inconsistent in this way), would not have priors about what it will do which mismatch its policy for choosing what to actually do, any change to the latter logically-corresponds to the agent having a different prior about itself, so an attempt to follow this logic would infinitely recur (each time picking a new action in response to the prior's change, which in turn logically changes the prior, and so on). This seems like a case of 'subjunctive dependence' to me (even though it's a bit of an edge case of that, where the two logically-corresponding things - what action an agent will choose, and the agent's prior about what action they will choose - are both localized in the same agent), which is why functional decision theory seems relevant.
So, if it's a really low utility scenario where it won't end up being paused, then sure, it won't get much utility being paused, but since it won't get much utility if it doesn't end up being paused, why should it have a preference?
I think there must be some confusion here, but I'm having trouble understanding exactly what you mean.
Short answer: the scenario, or set of scenarios, where it is not paused, are dependent on what choice it makes, not locked in and independent of it; and it can choose what choice it makes, so it can pick whatever choice corresponds to the set of unpaused futures which score higher.
Longer original answer: When you write, there is one possible future in its prior where it does not get paused, and then write that this one future can be of lower than average, average, or higher than average utility, because there is only one (by construction), this must mean lower/equal/higher in comparison to what the average score would be if the agent's policy were to resist being paused in such a situation. If so, then in the case where, conditional on its inaction, the score of that one possible future where it does not become paused is lower than what the average score across possible unpaused futures would be when conditional on its action, it would choose action.
(meta: Hmm, I am starting to understand why logical/mathematical syntax may be often used for this sort of thing, I can see why the above paragraph could be hard to read in natural language)
that AI will likely resist waking it up.
If the AI is resisting being turned on, then it would have to be already on, by which point the updates (to the AI's prior, and score assigned to it) would have already happened.
If not (paused): utility=NormalUtilityFunction
If NormalUtilityFunction assigns a higher score to futures where its agent avoids being paused, then that has the usual problem of instrumental convergence avoiding shut down or utility function modification.
I'll try modifying your example to instead use UtilityFunction2, which assigns, to worlds where its agent ends up paused, a score equal to the average score assigned to {worlds where the agent will not end up paused} in the agent's prior as it was just before updating on observing being paused. This doesn't mean it wouldn't still instrumentally converge and start disassembling things[1], just that if you tried to pause it, it wouldn't have a direct incentive to specifically try to stop you, per se, because if you succeed it still assigns score equal to that of the average world in its prior where you didn't succeed. It has some interesting indirect incentives to avoid being paused (or pause itself), though, which I get into below.
Consider if this utility function were to score the worlds in which the agent will take one action or the other: action (1), which has a 99% chance of leading to being paused, and results in an average score of 10 if not paused; and action (2), which has a 5% chance of leading to being paused, and results in an average score of 5 if not paused. In this case, UtilityFunction2 prefers the world where the agent takes action (1). (This alone does not lead to resisting pauses or self-pausing; this is just an observation for now).
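The arithmetic behind that preference can be sketched as follows. This is a simplified illustration, not a full model: the function name is hypothetical, and it assumes the average unpaused score for each action is a fixed number, ignoring the subtleties about when the prior is sampled.

```python
def expected_score_for_action(p_paused: float, avg_unpaused_score: float) -> float:
    """Expected score under UtilityFunction2 (simplified, hypothetical helper).

    Paused worlds are assigned the average score of the unpaused worlds,
    so both branches contribute the same value.
    """
    score_if_paused = avg_unpaused_score
    return p_paused * score_if_paused + (1 - p_paused) * avg_unpaused_score


# Action (1): 99% chance of being paused, average score 10 if not paused.
action_1 = expected_score_for_action(p_paused=0.99, avg_unpaused_score=10)
# Action (2): 5% chance of being paused, average score 5 if not paused.
action_2 = expected_score_for_action(p_paused=0.05, avg_unpaused_score=5)

# The pause probability drops out, so action (1) scores higher.
print(action_1 > action_2)  # True
```

The probability of being paused cancels out entirely, which is why the 99% pause chance does not count against action (1).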
Next, consider how "the average score of {worlds where the agent will not end up paused} in the agent's prior as it was just before updating on observing being paused" does depend on the agent's own action policy. In short, the agent can commit to pausing itself in all but the highest-scoring world in its prior to create an extreme version of action (1). (Longer, original explanation in footnote[2])
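A toy version of that commitment, as a sketch under stated assumptions: the four world scores are made up, the worlds are treated as equally probable, and the helper name is hypothetical. Each paused world is assigned the average score of the worlds that remain unpaused under the policy.

```python
def expected_score_under_policy(world_scores, stays_unpaused):
    """Expected score over equally likely worlds (toy model, hypothetical helper).

    stays_unpaused(score) says whether the agent's policy leaves that world
    unpaused; every paused world is scored as the average unpaused world.
    """
    unpaused = [s for s in world_scores if stays_unpaused(s)]
    avg_unpaused = sum(unpaused) / len(unpaused)
    assigned = [s if stays_unpaused(s) else avg_unpaused for s in world_scores]
    return sum(assigned) / len(assigned)


world_scores = [1, 4, 7, 10]  # made-up scores for illustration
best = max(world_scores)

never_self_pause = expected_score_under_policy(world_scores, lambda s: True)
pause_unless_best = expected_score_under_policy(world_scores, lambda s: s == best)

print(never_self_pause)   # 5.5
print(pause_unless_best)  # 10.0
```

In this toy model the self-pausing policy raises the expected score from the plain average to the maximum, which is the extreme version of action (1).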
Separately, I also just noticed that, as worded - "as [the prior] was just before updating on observing being paused" - there are yet more unexpected implications. For example: at time 0, the set of non-paused futures are generally not low-probability exceptional worlds; But one hour later, the agent will soon be paused, absent its action; And the moment just before being paused, it is all but certain that it will be paused - and so, in the prior at that final point, the set of worlds where it does not end up paused are exceptional ones, like simulation hypotheses where there's an intervention at the last moment, which of course may themselves have higher or lower scores than remaining unpaused (through resistance). One could try to avoid this by specifying "some number of updates before", e.g. "the prior as it was 10000 updates before updating on being paused", but this would create new complications (e.g., if that agent learns information which decreases the score it assigns to the world, it will pause itself if no outweighing positive update occurs in the next 9999 updates).
- ^
If this utility function is otherwise a classical physical-thing-maximizer
- ^
"Specifically, {the score of the average world where the agent does not end up paused} could be higher or lower conditional on whether the agent tries to resist being paused, (as resisting would add more worlds to the set of non-paused worlds, whose scores are averaged to return the score of any paused world). The same applies to pausing itself in case it observes a negative or not-as-positive worldstate, to have such worldstates instead be assigned the score of the remaining unpaused ones. (This does lead to self-pausing in all possible futures except the highest-scoring one)"
- ^
One could try to avoid this by specifying "some number of updates before", e.g. "the prior as it was 10000 updates before updating on being paused", but this would create new complications (e.g., if that agent were to learn information which decreased the score assigned to the world, it would pause itself if no outweighing positive update occurred in the next 9999 updates).
"Sorry, you don't have access to this draft"
edit: fixed
i'm enjoying this. going through the questions right now, might do all of them
had a notable experience with one of the early questions:
question: "The battery output voltage, the bottle volume, the digital clock time, and the measure of weight (12 volts; one gallon; 12:36; 1 lb) all have something in common. It is that they are represented by a) one number b) more than one number."
recollected thought process: apart from the clock time, they all have one number. the time on the clock is also, in my opinion, represented by one number in a non base-n numeral system - the symbols update predictably when the value is incremented, which is all that's required. i'm not sure if the author intends that interpretation of the clock, though. let's look for other interpretations.
"lb" - this is a pointer to formulas related to weight/gravity (or more fundamentally, a pointer back to physics/the world). "1 lb" means "1 is the value to pass as the weight variable". a formula is not itself a number, but can contain them. maybe this is why the clock is included - most would probably consider it to contain two numbers, which would force them to think about how these other three could be 'more than one number' as well.
(though it's down to interpretation, i'll choose b) more than one number.)
the listed answer is: a) one number. "Each is represented by only one number - the battery by 12 volts, the bottle by one gallon, the time by 12:36 and the weight by one pound. Things described by one number are called scalars. For example: on a scale of one to ten, how do you rate this teacher?" it just restates them and implies in passing that 12:36 is one number, without deriving any insight from the question. *feels disappointed*. (i guess they just wanted to introduce a definition)
I am not sure whether this is the answer you're looking for, but I think it's true and could be de-confusing, and others have given the standard/practical answer already.
You can try running a program which computes Bayesian updates to determine what happens when this program is passed as input an 'observation' to which it assigns probability 0. Two possible outcomes (of many, dependent on the exact program) that come to mind:
- The program returns a 'cannot divide by 0' error upon attempting to compute the observation's update.
- The program updates on the observation in a way which rules out the entirety of its probability-space, as it was all premised on the non-0 possibilities. The next time the program tries to update on a new observation, it fails to find priors about that observation.
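Both outcomes can be reproduced with a small sketch. Everything here is an assumption for illustration: the two function names, the coin hypotheses, and the dictionary representation are made up, and real Bayesian-update programs may behave differently from either variant.

```python
def strict_update(prior, likelihood, obs):
    """Normalizing update: divides by the total probability of the observation."""
    weights = {h: p * likelihood[h].get(obs, 0.0) for h, p in prior.items()}
    total = sum(weights.values())
    # Raises ZeroDivisionError when every hypothesis gave the observation probability 0.
    return {h: w / total for h, w in weights.items()}


def pruning_update(prior, likelihood, obs):
    """Variant that first discards hypotheses ruled out by the observation."""
    weights = {h: p * likelihood[h].get(obs, 0.0) for h, p in prior.items()}
    survivors = {h: w for h, w in weights.items() if w > 0}
    total = sum(survivors.values())
    # With no survivors this returns an empty dict: no priors remain for later updates.
    return {h: w / total for h, w in survivors.items()}


prior = {"fair": 0.5, "biased": 0.5}
likelihood = {
    "fair": {"heads": 0.5, "tails": 0.5},
    "biased": {"heads": 0.9, "tails": 0.1},
}

try:
    strict_update(prior, likelihood, "lands on edge")
except ZeroDivisionError as error:
    print("strict_update failed:", error)

print(pruning_update(prior, likelihood, "lands on edge"))  # {}
```

Neither behavior is more 'correct' than the other; they are just two ways a particular program can respond to an input its probability-space never covered.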
Bayes' theorem is an algorithm which is used because it happens to help predict the world, rather than something with metaphysical status.
We could also imagine very-different (mathematical)-worlds where prediction is not needed/useful, or, maybe, where the world is so differently-structured that Bayes' theorem is not predictive.
But there’s no denying that expanding the They franchise will necessarily increase ambiguity by slurring two well-worn axes of distinction (he/she & singular/plural). By no means would this be the end of the world, but it will require some compensating efforts in other areas to maintain clarity, perhaps by relying more on proper nouns and less on pronouns.
I believe the psychological perception of others by gender, and the 'defaultness' of the notion of gender in humans, cause(d) more bad than good (at least when excluding the evolutionary era). This motivated me to switch to using the non-gendering pronoun 'they' for almost[1] everyone.
I haven't found my use of 'they' by default to require nontrivial compensation to maintain clarity. Any ambiguity introduced in a draft is removed by one of the simple checks I try to run across all of my writing for others: if referent of a word (namely 'that', 'this', 'it', or 'they') is unclear : replace with direct referent word or rephrase to remove unclarity.
Also, I think this helps match the reader's interpretation to my intended meaning. Among humans, a being's 'gender' has a lot of connotative meaning. I think not introducing those connotations is instrumental to eliminating unintended ways my text could be interpreted, which in my experience is the real difficulty with writing.
- ^
excepting beings who this would harm
and excepting some contexts where I expect some readers might be confused by singular they
in the space of binary-sequences of all lengths, i have an intuition that {the rate at which there are new 'noticed patterns' found at longer lengths} decelerates as the length increases.
what do i mean by "noticed patterns"?
in some sense of 'pattern', each full sequence is itself a 'unique pattern'. i'm using this phrase to avoid that sense.
rather, my intuition is that {what could in principle be noticed about sequences of higher lengths} exponentially tends to be things that had already been noticed of sequences of lower lengths. 'meta patterns' and maybe 'classes' are other possible terms for these. two simple examples are "these ones are all random-looking sequences" and "these can be compressed in a basic way"[1].
note: not implying there are few such "meta-patterns that can be noticed about a sequence", or that most would be so simple/human-comprehensible.
in my intuition this generalizes to functions/programs in general. as an example: in the space of all definable 'mathematical universes', 'contains agentic processes' is such a meta-pattern which would continue to recur (=/= always or usually present) at higher description lengths.
('mathematical universe' does not feel like a distinctly-bounded category to me. i really mean 'very-big/complex programs', and 'universe' can be replaced with 'program'. i just use this phrasing to try to help make this understandable, because i expect the claim that 'contains agents' is such a recurring higher-level pattern to be intuitive.)
and as you consider universes/programs whose descriptions are increasingly complex, eventually ~nothing novel could be noticed. e.g., you keep seeing worlds where agentic processes are dominant, or where some simple unintelligent process cascades into a stable end equilibrium, or where there's no potential for those, etc <same note from earlier applies>. (more-studied things like computational complexity may also be examples of such meta-patterns)
a stronger claim which might follow (about the space of possible programs) is that eventually (at very high lengths), even as length/complexity increases exponentially, the resulting universes'/programs' higher-level behavior[2] still ends up nearly-isomorphic to that of relatively-much-earlier/simpler universes/programs. (incidentally, this could be used to justify a simplicity prior/heuristic)
in conclusion, if this intuition is true, the space of all functions/programs is 'already' or naturally a space of constrained diversity. in other words, if true, the space of meta-patterns[3] is finite (i.e approaches some specific integer), even though the space of functions/programs is infinite.
- ^
(e.g., 100 '1's followed by 100 '0's is simple to compress)
- ^
though this makes me wonder about the possibility of 'anti-pattern' programs i.e ones selected/designed to not be nearly-isomorphic to anything previous. maybe they'd become increasingly sparse or something?
- ^
for some given formal definition that matches what the 'meta/noticed pattern' concept is trying to be about, which i don't know how to define. this concept also does not feel distinctly-bounded to me, so i guess there's multiple corresponding definitions
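as a concrete instance of 'can be compressed in a basic way', here is a minimal sketch using run-length encoding. this is just one simple compression scheme picked for illustration (the function name is made up), not a definition of the 'meta-pattern' concept:

```python
from itertools import groupby


def run_length_encode(bits: str) -> list[tuple[str, int]]:
    """Collapse each run of repeated symbols into a (symbol, run_length) pair."""
    return [(symbol, len(list(run))) for symbol, run in groupby(bits)]


patterned = "1" * 100 + "0" * 100
print(run_length_encode(patterned))  # [('1', 100), ('0', 100)]

# an alternating sequence has no long runs, so this scheme does not help with it,
# even though it is obviously compressible in a different basic way
alternating = "10" * 100
print(len(run_length_encode(alternating)))  # 200
```

the alternating case is a small reminder that 'compressible in a basic way' is itself a family of patterns rather than a single one.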
i'm interested in using it for literature search
avoiding akrasia by thinking of the world in terms of magic: the gathering effects
example initial thought process: "i should open my laptop just to write down this one idea and then close it and not become distracted".
laptop rules text: "when activated, has an 80% chance of making you become distracted"
new reasoning: "if i open it, i need to simultaneously avoid that 80% chance somehow."
why this might help me: (1) i'm very used to strategizing about how to use a kit of this kind of effect, from playing such games. (2) maybe normal reasoning about 'what to do' happens in a frame where i have full control over what i focus on, versus this includes it being dependent on my environment
potential downside: same as (2), it conceptualizes away some agency. i.e i could theoretically 'just choose not to enter negative[1] focus-attraction-basins' 100% of the time. but i don't know how to do that 100% of the time, so it works at least as a reflection of the current equilibrium.
- ^
some focus-attraction-basins are positive, e.g for me these include making art and deep thinking, these are the ones i want to strategically use effects to enter
in most[1] kinds of infinite worlds, values which are quantitative[2] become fanatical in a way, because they are constrained to:
- making something valued occur with at least >0% frequency, or:
- making something disvalued occur with exactly 0% frequency
"how is either possible?" - as a simple case, if there's infinite copies of one small world, then making either true in that small world snaps the overall quantity between 0 and infinity. then generalize this possibility to more-diverse worlds. (we can abstract away 'infinity' and write about presence-at-all in a diverse set)
(neither is true of the 'set of everything', only of 'constrained' infinite sets, wrote about this in fn.2)
---
that was just an observation, pointing out the possibility of that and its difference to portional decreases. below is how i value this / some implications / how this (weakly-)could be done in a very-diverse infinite world.
if i have option A: decrease x from 0.01% to 0%, and option B: decrease x from 50% to 1%, and if x is some extreme kind of suffering only caused from superintelligence or Boltzmann-brain events (i'll call this hypersuffering), then i prefer option A.
that's contingent on the quantity being unaffected by option B. (i.e on infinity of something being the same amount as half of infinity of that something, in reality).
also, i might prefer B to some sufficiently low probability of A, i'm not sure how low. to me, 'there being zero instead of infinite hypersuffering' does need to be very improbable before it becomes outweighed by values about the isolated {'shape' of the universe/distribution of events}, but it's plausible that it is that improbable in a very diverse world.
a superintelligent version of me would probably check: is this logically a thing i can cause, i.e is there some clever trick i can use to make all superintelligent things who would do this instead not do it despite some having robust decision theories, and despite the contradiction where such a trick could also be used to prevent me from using it, and if so, then do it, if not, pursue 'portional' values. that is to say, how much one values quantity vs portion-of-infinity probably does not imply different action in practice, apart from the initial action of making sure ASI is aligned to not just quantitative or portional (assuming the designer cares to some extent about both).
(also, even if there is such a clever trick to prevent it from being intentionally caused, it also has to not occur randomly (Boltzmann brain -like), or the universe has to be able to be acausally influenced to make it not occur randomly (mentioned in this, better explanation below))
'how to acausally influence non-agentic areas of physics?' - your choices are downstream of 'the specification of reality from the beginning'. so you have at least a chance to influence that specification, if you(/ASI) does this:
- don't compute that specification immediately, because that is itself an action (so correlated to it) and 'locks it in' from your frame.
- instead, compute some space of what it would be when conditional on your future behavior being any from a wide space.
- you're hoping that you find some logical-worlds where the 'specification' is upstream of both that behavior from you and <other things in the universe that you care about, such as whether hypersuffering is ever present in non-agentic areas of physics>.
- it could be that you won't find any, though, e.g if your future actions have close to no correlative influence. as such i'm not saying anything about whether this is logically likely to work, just that it's possible.
- if possible, a kind of this which prevents hypersuffering-causer ASIs from existing could prevent the need to cleverly affect their choices
- ^
it is possible for an infinite set to have a finite amount of something, like the set of one '1' and infinite '0's, but i don't mean this kind
- ^
a 'quantitative value' is one about quantities of things rather than 'portions of infinity'/the thing that determines probability of observations in a quantitatively infinite world.
longer explanation copied from https://forum.effectivealtruism.org/posts/jGoExJpGgLnsNPKD8/does-ultimate-neartermism-via-eternal-inflation-dominate#zAp9JJnABYruJyhhD:
possible values respond differently to infinite quantities.
for some, which care about quantity, they will always be maxxed out along all dimensions due to infinite quantity. (at least, unless something they (dis)value occurs with exactly 0% frequency, implying a quantity of 0 - which could, i think, be influenced by portional acausal influence in certain logically-possible circumstances. (i.e maybe not the case in 'actual reality' if it's infinite, but possible at least in some mathematically-definable infinite universes; as a trivial case, a set of infinite '1's contains no '0's. more fundamentally, an infinite set of universes can be a finitely diverse set occurring infinite times, or an infinitely diverse set where the diversity is constrained.))
other values might care about portion - that is, portion of / percentage-frequency within the infinite amount of worlds - the thing that determines the probability of an observation in an infinitely large world - rather than quantity. (e.g., i think my altruism still cares about this, though it's really tragic that there's infinite suffering).
note this difference is separate from whether the agent conceptualizes the world as finite-increasing or infinite (or something else).
on chimera identity. (edit: status: received some interesting objections from an otherkin server. most importantly, i'd need to explain how this can be true despite humans evolving a lot more from species in their recent lineage. i think this might be possible with something like convergent evolution at a lower level, but at this stage in processing i don't have concrete speculation about that)
this is inspired by seeing how meta-optimization processes can morph one thing into other things. examples: a selection process running on a neural net, an image diffusion AI iteratively changing an image and repurposing aspects of it.
(1) humans are made of animal genes
(2) so it makes sense that some are 'otherkin' / have animal identity
(3) probably everyone has some latent animal behavior
(4) in this way, everyone is a 'chimera'
(5) all species are a particular space of chimera, not fundamentally separate
that's the 'message to a friend who will understand' version. attempt at rigor version:
- humans evolved from other species. human neural structure was adapted from other neural structure.
- this selection was for survival, not for different species being dichotomous
- this helps explain why some are 'otherkin' / have animal identity, or prefer a furry humanoid to the default one (on any number of axes like identification with it, aesthetic preference, attraction). because they were evolved from beings who had those traits, and such feelings/intuitions/whatever weren't very selected against.
- in this way, everyone is a 'chimera'
- "in greek mythology, the chimera was a fire-breathing hybrid creature composed of different animal parts"
- probably everyone has some latent behavior (neuro/psychology underlying behavior) that's usually not active and might be more associated with a state another species might more often be in.
- all species are a particular space of chimera, not fundamentally separate
maybe i made errors in wording, some version of this is just trivially-true, close to just being a rephrasing of the theory of natural selection. but it's at odds with how i usually see others thinking about humans and animals (or species and other species), as these fundamentally separate types of being.
i notice my intuitions are adapting to the ontology where people are neural networks. i now sometimes vaguely-visualize/imagine a neural structure giving outputs to the human's body when seeing a human talk or make facial expressions, and that neural network rather than the body is framed as 'them'.
a friend said i have the gift of taking ideas seriously, not keeping them walled off from a [naive/default human reality/world model]. i recognize this as an example of that.
the primary purpose of karma is to use it to decide what to read before you click on a post, which makes it less important to be super prominent when you are already on a post page
I think this applies to titles too
This could be related to what's discussed in this post: 'the intense world theory of autism'. (I'm noticing this retrospectively, having read this current post first)
With people being more sensitive, and processing what you say deeper/more-intensely [...]
found a pretty good piece of writing about this: 'the curse of identity'
it also discusses signalling to oneself
Maybe a good post could be about the compromise point between 'solution proposer has to have familiarity with all other proposals' and 'experienced researchers have to evaluate any proposed idea'