yep not contesting any of that
neither is there in rationality a recipe with which you can just crank the handle and come up with a proof of a conjecture
to be clear, coming up with proofs is a central example of what i meant by creativity. ("they are not satisfied by avoiding failure conditions, but require the satisfaction of some specific, hard-to-find success condition")
The “Draftsmen” podcast by two artists/art instructors contains several episodes on the subject
i am an artist as well :). i actually doubt for most artists that they could give much insight here; i think that usually artist creativity, and also mathematician creativity etc, human creativity, is of the default, mysterious kind, that we don't know where it comes from / it 'just happens', like intuitions, thoughts, realizations do - it's not actually fundamentally different from those even, just called 'creativity' more often in certain domains like art.
i don't think having (even exceptionally) high baseline intelligence and then studying bias avoidance techniques is enough for one to be able to derive an alignment solution. i have not seen in any rationalist i'm aware of what feels like enough for that, though their efforts are virtuous of course. it's just that the standard set by the universe seems higher.
i think this is a sort of background belief for me. not failing at thinking is the baseline; other needed computations are harder. they are not satisfied by avoiding failure conditions, but require the satisfaction of some specific, hard-to-find success condition. learning about human biases will not train one to cognitively seek answers of this kind, only to avoid premature failure.
this is basically a distinction between rationality and creativity. rationality[1] is about avoiding premature failure, creativity is about somehow generating new ideas.
but there is not actually something which will 'guide us through' creativity, like hpmor/the sequences do for rationality. there are various scattered posts about it[2].
i also do not have a guide to creativity to share with you. i'm only pointing at it as an equally if not more important thing.
if there is an art for creativity in the sense of narrow-solution-seeking, then where is it? somewhere in books buried deep in human history? if there is not yet an art, please link more scattered posts or comment new thoughts if you have any.
adding another possible explanation to the list:
- people may feel intimidated or discouraged from sharing ideas because of ~'high standards', or something like: a tendency to require strong evidence that a new idea is not another non-solution proposal, in order to put effort into understanding it.
i have experienced this, but i don't know how common it is.
i just also recalled that janus has said they weren't sure simulators would be received well on LW. simulators was cited in another reply to this as an instance of novel ideas.
Agreed that hidden-motte-and-baileys are a thing. They may also be caused by pressure not to express the actual belief (in which case, idk if I'd call it a fallacy / mistake of reasoning).
I'm not seeing how they synergise with the 'gish fallacy' though.
mathematicians know that a single flaw can destroy proofs of any length
Yes, but the analogy would be having multiple disjunctive proof-attempts which lead to the same result, which you can actually do validly (including with non-math beliefs). (Of course the case you describe is not a valid case of this)
by virtue of happening 10 million years ago or whatever
Why would the time it happens at matter?
we just spin a big quantum wheel, and trade with the AI that comes up
Or run a computation to approximate an average, if that's possible.
I'd guess it must be possible if you can randomly sample, at least. I.e., if you mean sampling from some set of worlds, and not just randomly combinatorially generating programs until you find a trade partner.
I know this approach isn't as effective for xrisk, but still, it's something I like to use.
This sentence has the grammatical structure of acknowledging a counterargument and negating it - "I know x, but y" - but the y is "it's something I like to use", which does not actually negate the x.
This is a kind of thing I suspect results from a process like: someone writes out the structure of negation, out of wanting to negate an argument, but then finds nothing stronger to slot into where the negating argument is supposed to be.
I tried thinking of principles, but it was hard to find ones specific to this. There's one obvious 'default' one at least (default as in it may be overridden by situation).
Secrecy
Premises:
- Model technical knowledge progress (such as about alignment) as concavely/diminishingly increasing with collaboration group size and member <cognitive traits>[1],
- Combine with unilateralist effect
- Combine with it being less hard/specific to create an unaligned than an aligned superintelligent agent (otherwise the unilateralist effect would work in the opposite direction).
Together, these imply that the positive, but not the negative, value of sharing information publicly is diminished once there is already a group trying to utilize that information. If so, the ideal may be various individual, small, or medium-sized alignment-focused groups which don't publicly share their progress by default.[4]
(I do suspect humans are biased in favor of public and social collaboration, as that's kind of what they were selected for, and in a less vulnerable world. Moreover, premise 1a ('humans are mostly the same entity') does contradict aspects of humanistic ontology. That's not strong evidence for this 'principle', just a reason it's probably under-considered)
Counterpoints:
On the concaveness assumption:
~ In history, technical knowledge was developed in a decentralized way, IIUC - though this is a purely lay understanding of the history of knowledge progression, probably absorbed from stories and culture. If it's true, it is evidence against the idea that a smaller group can make almost as much progress as a large one.
Differential progress:
~ there are already far more AI researchers than AI alignment researchers. While the ideal might be for this to be a highly secretive subject like how existential risks are handled in Dath Ilan, this principle cannot give rise to that.
What are principles we can use when secrecy is not enough?
My first thought is to look for principles in games such as you mentioned. But none feel too particular to this question. It returns general things like, "search paths through time", which can similarly be used to pursue good or harmful things. This is unsatisfying.
I want deeper principles, but there may be none.
Meta-principle: Symmetry: for any principle you can apply, an agent whose behavior furthers the opposite thing could in theory also apply it.
To avoid symmetry, one could look for principles that are unlikely to be able to be utilized without specific intent and knowledge. One can outsmart runaway structural processes this way, for example, and I think that to a large extent AI research is a case of that.
How have runaway processes been defeated before? There are some generic ways, like social movements, that are already being attempted with superintelligent agent x-risk. Are there other, less well known or expected ways? And did these ways reduce to generic, 'searching paths through time', or is there a pattern to them which could be studied and understood?
There are some clever ideas for doing something like that which come to mind. E.g., the "confrontation-worthy empathy" section of this post.
It's hard for me to think of paths through time more promising than just, 'try to solve object-level alignment', though, let alone the principles which could inspire them (e.g., idk what principle the linked thing could be a case of)
- ^
I mean things like creativity, different ways of doing cognition about problems, and standard things like working memory, 'cognitive power', etc.
(I am using awkward constructions like 'high cognitive power' because standard English terms like 'smart' or 'intelligent' appear to me to function largely as status synonyms. 'Superintelligence' sounds to most people like 'something above the top of the status hierarchy that went to double college', and they don't understand why that would be all that dangerous? Earthlings have no word and indeed no standard native concept that means 'actually useful cognitive power'. A large amount of failure to panic sufficiently, seems to me to stem from a lack of appreciation for the incredible potential lethality of this thing that Earthlings as a culture have not named.)
- ^
I mean replications of the same fundamental entity, i.e humans or the structure of what a human is. And by 'mostly' I mean of course there are differences too. I think evolution implies human minds will tend to be more reflectively aware of the differences because the sameness can operate as an unnoticed background assumption.
- ^
Like we'd not expect asking 10 ChatGPT-3.5s instead of just one to do significantly better. Less true with humans because they were still selected to be different and collaborate.
- ^
(and this may be close to the situation already?)
(This comment is tangential to the decision-theoretic focus of the post)
The AI stabilizes the situation in the world and makes sure no other dangerous AI is built, but otherwise it doesn't harm the humans.[6] Then it modifies its own code to have a commitment never to harm the humans, and let them live freely on Earth for at least a billion years, only doing the minimal necessary interventions to prevent humanity from wiping itself out with some new stupid technology. Crucially, the AI should do this self-modification[7] before it makes itself very substantially smarter or better-informed about the world, to the level that it can expect to determine whether it's in a simulation run by a very advanced future civilization.
I don't know of consistent human values which would ask for this specifically. Consider two cases[1]:
- You value something like continuation of {with a bunch of complex criteria}, not quantity of copies of, at least one 'earth society'.
- In this case, it continues regardless some of the time, conditional on the universe being large or duplicative enough to contain many copies of you / conditional on the premise in the post that at least some aligned ASIs will exist somewhere.
- Instead, you linearly value a large number of copies of earth civilizations existing or something.
- then the commitment wouldn't be to let just one earth per unaligned ASI continue, but to create more, and not to cap them at a billion years.[1]
I think this is a case of humans having a deep intuition that there is only one instance of them, while also believing theory that implies otherwise, and not updating that 'deep intuition' while applying the theory even as it updates other beliefs (like the possibility for aligned ASIs from some earths to influence unaligned ones from other earths).
- ^
(to be clear, I'm not arguing for (1) or (2), and of course these are not the only possible things one can value, please do not clamp your values just because the only things humans seem to write about caring about are constrained)
i'm finally learning to prove theorems (the earliest ones following from the Peano axioms) in lean, starting with the natural number game. it is actually somewhat fun, the same kind of fun that mtg has by being not too big to fully comprehend, but still engaging to solve.
(if you want to 'play' it as well, i suggest first reading a bit about what formal systems and interpretation are before starting. also, it was not clear to me at first when the game was introducing axioms vs derived theorems, so i wondered how some operations (e.g. 'induction') were allowed, but it turned out that it and some others are just in the list of Peano axioms.)
also, this reminded me of one of @Raemon's ideas (https://www.lesswrong.com/posts/PiPH4gkcMuvLALymK/exercise-solve-thinking-physics): 'how to prove a theorem' feels like a pure case of 'solving a problem that you (often) do not know how to solve', which iiuc they're a proponent of training on
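for a taste of what the early levels amount to: one of the first non-trivial theorems is that 0 + n = n, which is not definitional when addition recurses on its second argument. here's a sketch in plain Lean 4 syntax (the game wraps the same proof in its own tactic UI, so this is illustrative rather than the game's exact code):

```lean
-- `Nat.add` recurses on its second argument, so `n + 0 = n` holds by
-- definition, but `0 + n = n` needs induction.
theorem my_zero_add (n : Nat) : 0 + n = n := by
  induction n with
  | zero => rfl                          -- 0 + 0 reduces to 0 definitionally
  | succ k ih => rw [Nat.add_succ, ih]   -- 0 + succ k = succ (0 + k) = succ k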
It sounds like understanding functional decision theory might help you understand the parts you're confused about?
Like, would it go play the lottery (assuming money gives +utility for some reason) and pre-commit to pausing if it doesn't win?
Yes, it would try to do whatever the highest-possible-score thing is, regardless of how unlikely it is
Or that by setting a self-pausing policy it could alter E[result]?
By setting a self-pausing policy at the earliest point in time it can, yes. (Though I'm not sure if I'm responding to what you actually meant, or to some other thing that my mind also thinks can match to these words, because the intended meaning isn't super clear to me)
I'm conceptualizing a possible world as an (action,result) pair
(To be clear, I'm conceptualizing the agent as having Bayesian uncertainty about what world it's in, and this is what I meant when writing about "worlds in the agent's prior")
And, we could say - well, but it could fight back and then create a high-utility scenario - but then that would be the utility it would get if it doesn't end up paused, so it would get the high utility paused and again be indifferent.
An agent (aside from edge cases where it is programmed to be inconsistent in this way) would not have priors about what it will do which mismatch its policy for choosing what to actually do. Any change to that policy logically corresponds to the agent having a different prior about itself, so an attempt to follow this logic would recur infinitely (each time picking a new action in response to the prior's change, which in turn logically changes the prior, and so on). This seems like a case of 'subjunctive dependence' to me (even though it's a bit of an edge case of that, where the two logically-corresponding things - what action the agent will choose, and the agent's prior about what action it will choose - are both localized in the same agent), which is why functional decision theory seems relevant.
So, if it's a really low utility scenario where it won't end up being paused, then sure, it won't get much utility being paused, but since it won't get much utility if it doesn't end up being paused, why should it have a preference?
I think there must be some confusion here, but I'm having trouble understanding exactly what you mean.
Short answer: the scenario, or set of scenarios, where it is not paused, are dependent on what choice it makes, not locked in and independent of it; and it can choose what choice it makes, so it can pick whatever choice corresponds to the set of unpaused futures which score higher.
Longer original answer: when you write that there is one possible future in its prior where it does not get paused, and that this one future can be of lower than average, average, or higher than average utility, then - because there is only one such future (by construction) - this must mean lower/equal/higher in comparison to what the average score would be if the agent's policy were to resist being paused in such a situation. If so, then in the case where, conditional on its inaction, the score of that one possible unpaused future is lower than what the average score across possible unpaused futures would be conditional on its action, it would choose action.
(meta: Hmm, I am starting to understand why logical/mathematical syntax may be often used for this sort of thing, I can see why the above paragraph could be hard to read in natural language)
that AI will likely resist waking it up.
If the AI is resisting being turned on, then it would have to be already on, by which point the updates (to the AI's prior, and score assigned to it) would have already happened.
If not (paused): utility=NormalUtilityFunction
If NormalUtilityFunction assigns a higher score to futures where its agent avoids being paused, then that has the usual problem of instrumental convergence avoiding shutdown or utility function modification.
I'll try modifying your example to instead use UtilityFunction2, which assigns, to worlds where its agent ends up paused, a score equal to the average score assigned to {worlds where the agent will not end up paused} in the agent's prior as it was just before updating on observing being paused. This doesn't mean it wouldn't still instrumentally converge and start disassembling things[1], just that if you tried to pause it, it wouldn't have a direct incentive to specifically try to stop you, per se, because if you succeed it still assigns a score equal to that of the average world in its prior where you didn't succeed. It has some interesting indirect incentives to avoid being paused (or pause itself), though, which I get into below.
Consider how this utility function scores the worlds in which the agent takes one action or the other: action (1), which has a 99% chance of leading to being paused, and results in an average score of 10 if not paused; and action (2), which has a 5% chance of leading to being paused, and results in an average score of 5 if not paused. In this case, UtilityFunction2 prefers the world where the agent takes action (1). (This alone does not lead to resisting pauses or self-pausing; this is just an observation for now.)
Next, consider how "the average score of {worlds where the agent will not end up paused} in the agent's prior as it was just before updating on observing being paused" does depend on the agent's own action policy. In short, the agent can commit to pausing itself in all but the highest-scoring world in its prior to create an extreme version of action (1). (Longer, original explanation in footnote[2])
Separately, I also just noticed that, as worded - "as [the prior] was just before updating on observing being paused" - there are yet more unexpected implications. For example: at time 0, the set of non-paused futures are generally not low-probability exceptional worlds; but one hour later, the agent will soon be paused absent its action; and at the moment just before being paused, it is all but certain that it will be paused - so, in the prior at that final point, the worlds where it does not end up paused are exceptional ones, like simulation hypotheses where there's an intervention at the last moment, which of course may themselves have higher or lower scores than remaining unpaused (through resistance). One could try to avoid this by specifying "some number of updates before", e.g. "the prior as it was 10000 updates before updating on being paused", but this would create new complications (e.g., if the agent learns information which decreases the score it assigns to the world, it will pause itself if no outweighing positive update occurs in the next 9999 updates).
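a toy computation of the action (1) vs action (2) comparison: because paused worlds are assigned the average score of the unpaused worlds, the pause probability drops out of the expectation entirely (a sketch of my own; `expected_utility2` is an illustrative name, not anything from the post):

```python
def expected_utility2(p_paused: float, avg_score_if_unpaused: float) -> float:
    """Expected score under 'UtilityFunction2': paused worlds are assigned
    the average score of the unpaused worlds in the prior, so the expectation
    equals avg_score_if_unpaused regardless of p_paused."""
    avg_paused_score = avg_score_if_unpaused  # by construction of the function
    return p_paused * avg_paused_score + (1 - p_paused) * avg_score_if_unpaused

# action (1): 99% chance of pause, average score 10 if unpaused
# action (2): 5% chance of pause, average score 5 if unpaused
ev1 = expected_utility2(0.99, 10)  # comes out to ~10
ev2 = expected_utility2(0.05, 5)   # comes out to ~5
print(ev1, ev2)  # action (1) is preferred despite near-certain pausing
```

this also makes the later point concrete: the agent can raise its score for *every* world, paused or not, purely by shrinking the set of unpaused worlds down to the highest-scoring ones.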
- ^
If this utility function is otherwise a classical physical-thing-maximizer
- ^
"Specifically, {the score of the average world where the agent does not end up paused} could be higher or lower conditional on whether the agent tries to resist being paused, (as resisting would add more worlds to the set of non-paused worlds, whose scores are averaged to return the score of any paused world). The same applies to pausing itself in case it observes a negative or not-as-positive worldstate, to have such worldstates instead be assigned the score of the remaining unpaused ones. (This does lead to self-pausing in all possible futures except the highest-scoring one)"
"Sorry, you don't have access to this draft"
edit: fixed
i'm enjoying this. going through the questions right now, might do all of them
had a notable experience with one of the early questions:
question: "The battery output voltage, the bottle volume, the digital clock time, and the measure of weight (12 volts; one gallon; 12:36; 1 lb) all have something in common. It is that they are represented by a) one number b) more than one number."
recollected thought process: apart from the clock time, they all have one number. the time on the clock is also, in my opinion, represented by one number in a non base-n numeral system - the symbols update predictably when the value is incremented, which is all that's required. i'm not sure if the author intends that interpretation of the clock, though. let's look for other interpretations.
"lb" - this is a pointer to formulas related to weight/gravity (or more fundamentally, a pointer back to physics/the world). "1 lb" means "1 is the value to pass as the weight variable". a formula is not itself a number, but can contain them. maybe this is why the clock is included - most would probably consider it to contain two numbers, which would force them to think about how these other three could be 'more than one number' as well.
(though it's down to interpretation, i'll choose b) more than one number.)
the listed answer is: a) one number. "Each is represented by only one number - the battery by 12 volts, the bottle by one gallon, the time by 12:36 and the weight by one pound. Things described by one number are called scalars. For example: on a scale of one to ten, how do you rate this teacher?" it just restates them and implies in passing that 12:36 is one number, without deriving any insight from the question. *feels disappointed*. (i guess they just wanted to introduce a definition)
I am not sure whether this is the answer you're looking for, but I think it's true and could be de-confusing, and others have given the standard/practical answer already.
You can try running a program which computes Bayesian updates and see what happens when it is passed, as input, an 'observation' to which it assigns probability 0. Two possible outcomes (of many, depending on the exact program) come to mind:
- The program returns a 'cannot divide by 0' error upon attempting to compute the observation's update.
- The program updates on the observation in a way which rules out the entirety of its probability-space, as it was all premised on the non-0 possibilities. The next time the program tries to update on a new observation, it fails to find priors about that observation.
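the first outcome can be seen in a minimal discrete updater (an illustrative sketch of my own; the names `bayes_update`, `prior`, `likelihood` are mine):

```python
def bayes_update(prior, likelihood, observation):
    """One discrete Bayesian update.
    prior: dict hypothesis -> probability
    likelihood: dict (hypothesis, observation) -> P(observation | hypothesis)
    """
    unnormalized = {h: p * likelihood.get((h, observation), 0.0)
                    for h, p in prior.items()}
    total = sum(unnormalized.values())
    if total == 0.0:
        # the program assigned probability 0 to this observation,
        # so normalizing would divide by zero
        raise ZeroDivisionError("observation had prior probability 0")
    return {h: u / total for h, u in unnormalized.items()}

prior = {"fair": 0.5, "two-headed": 0.5}
likelihood = {("fair", "heads"): 0.5, ("two-headed", "heads"): 1.0}

print(bayes_update(prior, likelihood, "heads"))  # a normal update works fine
try:
    # both hypotheses assign probability 0 to this observation:
    bayes_update(prior, likelihood, "impossible-outcome")
except ZeroDivisionError as e:
    print("update failed:", e)
```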
Bayes' theorem is an algorithm which is used because it happens to help predict the world, rather than something with metaphysical status.
We could also imagine very-different (mathematical)-worlds where prediction is not needed/useful, or, maybe, where the world is so differently-structured that Bayes' theorem is not predictive.
But there’s no denying that expanding the They franchise will necessarily increase ambiguity by slurring two well-worn axes of distinction (he/she & singular/plural). By no means would this be the end of the world, but it will require some compensating efforts in other areas to maintain clarity, perhaps by relying more on proper nouns and less on pronouns.
I believe the psychological perception of others by gender, and the 'defaultness' of the notion of gender in humans, cause(d) more bad than good (at least when excluding the evolutionary era). This motivated me to switch to using the non-gendering pronoun 'they' for almost[1] everyone.
I haven't found my use of 'they' by default to require nontrivial compensation to maintain clarity. Any ambiguity introduced in a draft is removed by one of the simple checks I try to run across all of my writing for others: if the referent of a word (namely 'that', 'this', 'it', or 'they') is unclear, replace it with the direct referent word or rephrase to remove the unclarity.
Also, I think this helps match the reader's interpretation to my intended meaning. Among humans, a being's 'gender' has a lot of connotative meaning. I think not introducing those connotations is instrumental to eliminating unintended ways my text could be interpreted, which in my experience is the real difficulty with writing.
- ^
excepting beings who this would harm
and excepting some contexts where I expect some readers might be confused by singular they
in the space of binary-sequences of all lengths, i have an intuition that {the rate at which there are new 'noticed patterns' found at longer lengths} decelerates as the length increases.
what do i mean by "noticed patterns"?
in some sense of 'pattern', each full sequence is itself a 'unique pattern'. i'm using this phrase to avoid that sense.
rather, my intuition is that {what could in principle be noticed about sequences of higher lengths} exponentially tends to be things that had already been noticed of sequences of lower lengths. 'meta patterns' and maybe 'classes' are other possible terms for these. two simple examples are "these ones are all random-looking sequences" and "these can be compressed in a basic way"[1].
note: not implying there are few such "meta-patterns that can be noticed about a sequence", or that most would be so simple/human-comprehensible.
in my intuition this generalizes to functions/programs in general. as an example: in the space of all definable 'mathematical universes', 'contains agentic processes' is such a meta-pattern which would continue to recur (=/= always or usually present) at higher description lengths.
('mathematical universe' does not feel like a distinctly-bounded category to me. i really mean 'very-big/complex programs', and 'universe' can be replaced with 'program'. i just use this phrasing to try to help make this understandable, because i expect the claim that 'contains agents' is such a recurring higher-level pattern to be intuitive.)
and as you consider universes/programs whose descriptions are increasingly complex, eventually ~nothing novel could be noticed. e.g., you keep seeing worlds where agentic processes are dominant, or where some simple unintelligent process cascades into a stable end equilibrium, or where there's no potential for those, etc <same note from earlier applies>. (more-studied things like computational complexity may also be examples of such meta-patterns)
a stronger claim which might follow (about the space of possible programs) is that eventually (at very high lengths), even as length/complexity increases exponentially, the resulting universes'/programs' higher-level behavior[2] still ends up nearly-isomorphic to that of relatively-much-earlier/simpler universes/programs. (incidentally, this could be used to justify a simplicity prior/heuristic)
in conclusion, if this intuition is true, the space of all functions/programs is 'already' or naturally a space of constrained diversity. in other words, if true, the space of meta-patterns[3] is finite (i.e approaches some specific integer), even though the space of functions/programs is infinite.
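a crude empirical version of this intuition, using compressed size as a stand-in for 'noticed pattern' (my own illustrative sketch, not a claim about the right formalization): at every length, sequences keep sorting into the same coarse classes, e.g. 'compressible' vs 'random-looking'.

```python
import random
import zlib

def compress_ratio(bits: str) -> float:
    # compressed size / raw size: a rough proxy for "has noticeable pattern"
    raw = bits.encode()
    return len(zlib.compress(raw, 9)) / len(raw)

random.seed(0)
for n in (100, 1000, 10000):
    periodic = "01" * (n // 2)
    noise = "".join(random.choice("01") for _ in range(n))
    print(n, round(compress_ratio(periodic), 2), round(compress_ratio(noise), 2))
# as n grows, the same two coarse classes recur: the periodic sequence stays
# highly compressible, while the random one compresses far less (zlib can
# only exploit its two-letter alphabet), with no new class appearing
```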
- ^
(e.g., 100 1s followed by 100 0s is simple to compress)
- ^
though this makes me wonder about the possibility of 'anti-pattern' programs i.e ones selected/designed to not be nearly-isomorphic to anything previous. maybe they'd become increasingly sparse or something?
- ^
for some given formal definition that matches what the 'meta/noticed pattern' concept is trying to be about, which i don't know how to define. this concept also does not feel distinctly-bounded to me, so i guess there's multiple corresponding definitions
i'm interested in using it for literature search
avoiding akrasia by thinking of the world in terms of magic: the gathering effects
example initial thought process: "i should open my laptop just to write down this one idea and then close it and not become distracted".
laptop rules text: "when activated, has an 80% chance of making you become distracted"
new reasoning: "if i open it, i need to simultaneously avoid that 80% chance somehow."
why this might help me: (1) i'm very used to strategizing about how to use a kit of this kind of effect, from playing such games. (2) maybe normal reasoning about 'what to do' happens in a frame where i have full control over what i focus on, whereas this framing includes focus being dependent on my environment.
potential downside: same as (2), it conceptualizes away some agency. i.e i could theoretically 'just choose not to enter negative[1] focus-attraction-basins' 100% of the time. but i don't know how to do that 100% of the time, so it works at least as a reflection of the current equilibrium.
- ^
some focus-attraction-basins are positive, e.g for me these include making art and deep thinking, these are the ones i want to strategically use effects to enter
in most[1] kinds of infinite worlds, values which are quantitative[2] become fanatical in a way, because they are constrained to:
- making something valued occur with at least >0% frequency, or:
- making something disvalued occur with exactly 0% frequency
"how is either possible?" - as a simple case, if there's infinite copies of one small world, then making either true in that small world snaps the overall quantity between 0 and infinity. then generalize this possibility to more-diverse worlds. (we can abstract away 'infinity' and write about presence-at-all in a diverse set)
(neither is true of the 'set of everything', only of 'constrained' infinite sets, wrote about this in fn.2)
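the 'snapping' in the simple case can be written out (my own formalization sketch, with $f$ the per-copy frequency of the event in the base world):

```latex
% quantity of an event with per-copy frequency f, across infinitely many copies
Q(f) \;=\; \lim_{N \to \infty} f \cdot N \;=\;
\begin{cases}
0      & \text{if } f = 0,\\
\infty & \text{if } f > 0.
\end{cases}
% all f > 0 yield the same infinite quantity, so a purely quantitative value
% can only distinguish f = 0 from f > 0 - matching the two bullet points above
```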
---
that was just an observation, pointing out the possibility of that and its difference to portional decreases. below is how i value this / some implications / how this (weakly-)could be done in a very-diverse infinite world.
if i have option A: decrease x from 0.01% to 0%, and option B: decrease x from 50% to 1%, and if x is some extreme kind of suffering only caused from superintelligence or Boltzmann-brain events (i'll call this hypersuffering), then i prefer option A.
that's contingent on the quantity being unaffected by option B. (i.e on infinity of something being the same amount as half of infinity of that something, in reality).
also, i might prefer B to some sufficiently low probability of A, i'm not sure how low. to me, 'there being zero instead of infinite hypersuffering' does need to be very improbable before it becomes outweighed by values about the isolated {'shape' of the universe / distribution of events}, but it's plausible that it is that improbable in a very diverse world.
a superintelligent version of me would probably check: is this logically a thing i can cause? i.e., is there some clever trick i can use to make all superintelligent things who would do this instead not do it, despite some having robust decision theories, and despite the contradiction where such a trick could also be used to prevent me from using it? if so, do it; if not, pursue 'portional' values. that is to say, how much one values quantity vs portion-of-infinity probably does not imply different action in practice, apart from the initial action of making sure ASI is aligned to not just quantitative or portional values (assuming the designer cares to some extent about both).
(also, even if there is such a clever trick to prevent it from being intentionally caused, it also has to not occur randomly (Boltzmann brain -like), or the universe has to be able to be acausally influenced to make it not occur randomly (mentioned in this, better explanation below))
'how to acausally influence non-agentic areas of physics?' - your choices are downstream of 'the specification of reality from the beginning'. so you have at least a chance to influence that specification, if you(/ASI) does this:
- don't compute that specification immediately, because that is itself an action (so correlated to it) and 'locks it in' from your frame.
- instead, compute some space of what it would be when conditional on your future behavior being any from a wide space.
- you're hoping that you find some logical-worlds where the 'specification' is upstream of both that behavior from you and <other things in the universe that you care about, such as whether hypersuffering is ever present in non-agentic areas of physics>.
- it could be that you won't find any, though, e.g if your future actions have close to no correlative influence. as such i'm not saying anything about whether this is logically likely to work, just that it's possible.
- if possible, a kind of this which prevents hypersuffering-causer ASIs from existing could prevent the need to cleverly affect their choices
- ^
it is possible for an infinite set to have a finite amount of something, like the set of one 1 and infinite 0s, but i don't mean this kind
- ^
a 'quantitative value' is one about quantities of things rather than 'portions of infinity'/the thing that determines probability of observations in a quantitatively infinite world.
longer explanation copied from https://forum.effectivealtruism.org/posts/jGoExJpGgLnsNPKD8/does-ultimate-neartermism-via-eternal-inflation-dominate#zAp9JJnABYruJyhhD:
possible values respond differently to infinite quantities.
for some, which care about quantity, they will always be maxxed out along all dimensions due to infinite quantity. (at least, unless something they (dis)value occurs with exactly 0% frequency, implying a quantity of 0 - which could, i think, be influenced by portional acausal influence in certain logically-possible circumstances. (i.e maybe not the case in 'actual reality' if it's infinite, but possible at least in some mathematically-definable infinite universes; as a trivial case, a set of infinite 1s contains no 0s. more fundamentally, an infinite set of universes can be a finitely diverse set occurring infinite times, or an infinitely diverse set where the diversity is constrained.))

other values might care about portion - that is, portion of / percentage-frequency within the infinite amount of worlds - the thing that determines the probability of an observation in an infinitely large world - rather than quantity. (e.g., i think my altruism still cares about this, though it's really tragic that there's infinite suffering).
note this difference is separate from whether the agent conceptualizes the world as finite-increasing or infinite (or something else).
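a toy sketch of the quantity/portion distinction, using periodic sequences as a stand-in for an infinite world (my illustration, not from the original text; the function name is mine):

```python
from fractions import Fraction

def limiting_frequency(period: str, symbol: str) -> Fraction:
    """Limiting frequency of `symbol` in the infinite periodic
    sequence period period period ... - a toy stand-in for a
    'portion of infinity' that determines observation probability."""
    return Fraction(period.count(symbol), len(period))

# in (110)(110)(110)..., both 1s and 0s have infinite quantity,
# so a purely quantitative value can't distinguish them -
# but their portions differ:
assert limiting_frequency("110", "1") == Fraction(2, 3)
assert limiting_frequency("110", "0") == Fraction(1, 3)

# and a symbol can have portion exactly 0 (quantity 0) while
# another fills the whole sequence, as in the 'set of infinite 1s':
assert limiting_frequency("1", "0") == 0
```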
on chimera identity. (edit: status: received some interesting objections from an otherkin server. most importantly, i'd need to explain how this can be true despite humans evolving a lot more from species in their recent lineage. i think this might be possible with something like convergent evolution at a lower level, but at this stage in processing i don't have concrete speculation about that)
this is inspired by seeing how meta-optimization processes can morph one thing into other things. examples: a selection process running on a neural net, an image diffusion AI iteratively changing an image and repurposing aspects of it.
(1) humans are made of animal genes
(2) so it makes sense that some are 'otherkin' / have animal identity
(3) probably everyone has some latent animal behavior
(4) in this way, everyone is a 'chimera'
(5) all species are a particular space of chimera, not fundamentally separate
that's the 'message to a friend who will understand' version. attempt at rigor version:
- humans evolved from other species. human neural structure was adapted from other neural structure.
- this selection was for survival, not for different species being dichotomous
- this helps explain why some are 'otherkin' / have animal identity, or prefer a furry humanoid to the default one (on any number of axes like identification with it, aesthetic preference, attraction). because they were evolved from beings who had those traits, and such feelings/intuitions/whatever weren't very selected against.
- in this way, everyone is a 'chimera'
- "in greek mythology, the chimera was a fire-breathing hybrid creature composed of different animal parts"
- probably everyone has some latent behavior (neuro/psychology underlying behavior) that's usually not active and might be more associated with a state another species might more often be in.
- all species are a particular space of chimera, not fundamentally separate
maybe i made errors in wording, some version of this is just trivially-true, close to just being a rephrasing of the theory of natural selection. but it's at odds with how i usually see others thinking about humans and animals (or species and other species), as these fundamentally separate types of being.
i notice my intuitions are adapting to the ontology where people are neural networks. i now sometimes vaguely-visualize/imagine a neural structure giving outputs to the human's body when seeing a human talk or make facial expressions, and that neural network rather than the body is framed as 'them'.
a friend said i have the gift of taking ideas seriously, not keeping them walled off from a [naive/default human reality/world model]. i recognize this as an example of that.
the primary purpose of karma is to use it to decide what to read before you click on a post, which makes it less important to be super prominent when you are already on a post page
I think this applies to titles too
This could be related to what's discussed in this post: 'the intense world theory of autism'. (I'm noticing this retrospectively, having read this current post first)
With people being more sensitive, and processing what you say deeper/more-intensely [...]
found a pretty good piece of writing about this: 'the curse of identity'
it also discusses signalling to oneself
Maybe a good post could be about the compromise point between 'solution proposer has to have familiarity with all other proposals' and 'experienced researchers have to evaluate any proposed idea'
I recall a shortform here speculated that a good air quality hack could be a small fan aimed at one's face to blow away the CO2 one breathes out. I've been doing this and experience it as helpful, though it's hard to know for sure.
This also includes having it pointed above my face during sleep, based on experience after waking. (I tended to be really fatigued right after waking. Keeping water near bed to drink immediately also helped with that.)
I think that’s closer to what I was trying to get across. Does that edit change anything in your response?
No.
Overall, I would say that my self-concept is closer to what a physicalist ontology implies is mundanely happening - a neural network, lacking a singular 'self' entity inside it, receiving sense data from sensors and able to output commands to this strange, alien vessel (body). (And also I only identify myself with some parts of the non-mechanistic-level description of what the neural network is doing).
I write in a lot more detail below. This isn't necessarily written at you in particular, or with the expectation of you reading through all of it.
1. Non-belief in self-as-body (A)
I see two kinds of self-as-body belief. The first is looking in a mirror, or at a photo, and thinking, "that [body] is me." The second is controlling the body, and having a sense that you're the one moving it, or more strongly, that it is moving because it is you (and you are choosing to move).
I'll write about my experiences with the second kind first.
The way a finger automatically withdraws from heat does not feel like a part of me in any sense. Yesterday, I accidentally dropped a utensil and my hands automatically snapped into place around it somehow, and I thought something like, "woah, I didn't intend to do that. I guess it's a highly optimized narrow heuristic, from times where reacting so quickly was helpful to survival".
I experimented a bit between writing this, and I noticed one intuitive view I can have of the body is that it's some kind of machine that automatically follows such simple intents about the physical world (including intents that I don't consider 'me', like high fear of spiders). For example, if I have motivation and intent to open a window, then the body just automatically moves to it and opens it without me really noticing that the body itself (or more precisely, the body plus the non-me nervous/neural structure controlling it) is the thing doing that - it's kind of like I'm a ghost (or abstract mind) with telekinesis powers (over nearby objects), but then we apply reductive physics and find that actually there's a causal chain beneath the telekinesis involving a moving body (which I always know and can see, I just don't usually think about it).
The way my hands are moving on the keyboard as I write this also doesn't particularly feel like it's me doing that; in my mind, I'm just willing the text to be written, and then the movement happens on its own, in a way that feels kind of alien if I actually focus on it (as if the hands are their own life form).
That said, this isn't always true. I do have an 'embodied self-sense' sometimes. For example, I usually fall asleep cuddling stuffies because this makes me happy. At least some purposeful form of sense-of-embodiment seems present there, because the concept of cuddling has embodiment as an assumption.[1]
(As I read over the above, I wonder how different it really is from normal human experience. I'm guessing there's a subtle difference between "being so embodied it becomes a basic implicit assumption that you don't notice" and "being so nonembodied that noticing it feels like [reductive physics metaphor]")
As for the first kind mentioned, locating oneself in the body's appearance, which informs typical humans' perception of others and themselves - I don't experience this with regard to myself (and try to avoid being biased about others this way); instead I just feel pretty dissociated when I see my body reflected, and mostly ignore it.
In the past, it instead felt actively stressful/impossible/horrifying, because I had (and to an extent still do have) a deep intuition that I am already a 'particular kind of being', and, under the self-as-body ontology, this is expected to correspond to a particular kind of body, one which I did not observe reflected back. As this basic sense-of-self violation happened repeatedly, it gradually eroded away this aspect of sense-of-self / the embodied ontology.
I'd also feel alienated if I had to pilot an adult body to interact with others, so I've set up my life such that I only minimally need to do that (e.g. for doctor's appointments) and can otherwise just interact with the world through text.
2. What parts of the mind-brain are me, and what am I? (B)
I think there's an extent to which I self-model as an 'inner homunculus', or a 'singular-self inside'. I think it's lesser and not as robust in me as it is in typical humans, though. For example, when I reflect on this word 'I' that I keep using, I notice it has a meaning that doesn't feel very true of me: the meaning of a singular, unified entity, rather than multiple inner cognitive processes, or no self in particular.
I often notice my thoughts are coming from different parts of the mind. In one case, I was feeling bad about not having been productive enough in learning/generating insights and I thought to myself, "I need to do better", and then felt aware that it was just one lone part thinking this while the rest doesn't feel moved; the rest instead culminates into a different inner-monologue-thought: something like, "but we always need to do better. tsuyoku naritai is a universal impetus." (to be clear, this is not from a different identity or character, but from different neural processes causally prior to what is thought (or written).)
And when I'm writing (which forces us to 'collapse' our subverbal understanding into one text), it's noticeable how much a potential statement is endorsed by different present influences[2].
I tend to use words like 'I' and 'me' in writing to not confuse others (internally, 'we' can feel more fitting, referring again to multiple inner processes[2], and not to multiple high-level selves as some humans experience. 'we' is often naturally present in our inner monologue). We'll use this language for most of the rest of the text[3].
There are times where this is less true. Our mind can return to acting as a human-singular-identity-player in some contexts. For example, if we're interacting with someone or multiple others, that can push us towards performing a 'self' (but unless it's someone we intuitively-trust and relatively private, we tend to feel alienated/stressed from this). Or if we're, for example, playing a game with a friend, then in those moments we'll probably be drawn back into a more childlike humanistic self-ontology rather than the dissociated posthumanism we describe here.
Also, we want to answer "what inner processes?" - there's some division between parts of the mind-brain we refer to here, and parts that are the 'structure' we're embedded in. We're not quite sure how to write down the line, and it might be fuzzy or e.g contextual.[4]
3. Tracing the intuitive-ontology shift
"Why are you this way, and have you always been this way?" – We haven't always. We think this is the result of a gradual erosion of the 'default' human ontology, mentioned once above.
We think this mostly did not come from something like 'believing in physicalism'. Most physicalists aren't like this. Ontological crises may have been part of it, though - independently synthesizing determinism as a child and realizing it made naive free will impossible sure did make past-child-quila depressed.
We think the strongest sources came from 'intuitive-ontological'[5] incompatibilities, ways the observations seemed to sadly-contradict the platonic self-ontology we started with. Another term for these would be 'survival updates'. This can also include ways one's starting ontology was inadequate to explain certain important observations.
Also, I think that existing so often in a digital-informational context[6], and only infrequently in an analog/physical context, also contributed to eroding the self-as-body belief.
Also, eventually, it wasn't just erosion/survival updates; at some point, I think I slowly started to embrace this posthumanist ontology, too. It feels narratively fitting that I'm now thinking about artificial intelligence and reading LessWrong.
- ^
(There is some sense in which maybe, my proclaimed ontology has its source in constant dissociation, which I only don't experience when feeling especially comfortable/safe. I'm only speculating, though - this is the kind of thing that I'd consider leaving out, since I'm really unsure about it, it's at the level of just one of many passing thoughts I'd consider.)
- ^
This 'inner processes' phrasing I keep using doesn't feel quite right. Other words that come to mind: considerations? currently-active neural subnetworks? subagents? some kind of neural council metaphor?
- ^
(sometimes 'we' feels unfitting too, it's weird, maybe 'I' is for when a self is being more-performed, or when text is less representative of the whole, hard to say)
- ^
We tried to point to some rough differences, but realized that the level we mean is somewhere between high-level concepts with words (like 'general/narrow cognition' and 'altruism' and 'biases') and the lowest-level description (i.e how actual neurons are interacting physically), and that we don't know how to write about this.
- ^
We can differentiate between an endorsed 'whole-world ontology' like physicalism, and smaller-scale intuitive ontologies that are more like intuitive frames we seem to believe in, even if when asked we'll say they're not fundamental truths.
The intuitive ontology of the self is particularly central to humans.
- ^
Note this was mostly downstream of other factors, not causally prior to them. I don't want anyone to read this and think internet use itself causes body-self incongruence, though it might avoid certain related feedback loops.
My experience is different from the two you describe. I typically fully lack (A)[1], and partially lack (B). I think this is something different from what others might describe as 'enlightenment'.
I might write more about this if anyone is interested.
- ^
At least the 'me-the-human-body' part of the concept. I don't know what the '-etc' part refers to.
i just remembered that it might be relevant that i have a non-24-hour sleep/wake cycle[1]. maybe i tend to expect it to be dark. also, if a human in an ancestral environment needed to be awake at night for some reason, it wouldn't really make sense for their cognition to be worse just because it's dark - maybe instead better/worse on different dimensions.
I notice that any statement can be made 'circular' by splitting it into two statements.
1-statement version: Entity X exists
2-statement version: The right half of entity X exists ⇄ The left half of entity X exists
dark chocolate, beets, blueberries, fish, eggs. I've had good effects with strong hibiscus and mint tea (both vasodilators).
what in each of these causes the effect?
if all AI developers followed these approaches
My preferred aim is to only need the process that creates the first astronomically significant AI to follow the approach.[1] To the extent this was not included, I think this list is incomplete, which could make it misleading.
- ^
This could (depending on the requirements of the alignment approach) be more feasible when there's a knowledge gap between labs, if that means more of an alignment tax is tolerable by the top one (more time to figure out how to make the aligned AI also be superintelligent despite the 'tax'); but I'm not advocating for labs to race to be in that spot (and it's not the case that all possible alignment approaches would be for systems of the kind that their private-capabilities-knowledge is about (e.g., LLMs).)
@habryka feature request: an option to make the vote display count every normal vote as (plus/minus) 1, and every strong vote as 2 (or also 1)
Also, sometimes if I notice an agree/disagree vote at +/-9 from just 1 vote, I don't vote so it's still clear to other users that it was just one person. This probably isn't the ideal equilibrium.
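a minimal sketch of what the requested display option could compute (the function name, and the heuristic of treating any vote with power above 1 as a strong vote, are my own assumptions, not a description of the actual LW implementation):

```python
def display_score(votes, mode="default"):
    """votes: list of signed per-user vote powers, e.g. [+9, -1, +4].
    'flat' is the requested option: every normal vote counts as +/-1
    and every strong vote as +/-2, regardless of the voter's karma."""
    if mode == "flat":
        return sum((1 if v > 0 else -1) * (2 if abs(v) > 1 else 1)
                   for v in votes)
    return sum(votes)  # current behavior: raw karma-weighted sum

# one strong vote from a high-karma user currently shows as +9;
# under 'flat' the same vote would show as +2
assert display_score([9]) == 9
assert display_score([9], mode="flat") == 2
assert display_score([1, -1, 4], mode="flat") == 2
```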
I notice that my strong-votes now give/take 4 points. I'm not sure if this is a good system.
Another reason for such a rule could be to allow the use of basilisk-like threats and other infohazards without worrying about them convincing others beyond the gatekeeper.
That said, @datawitch @ra I'm interested in reading the logs if you'd allow.
In a sense, binary is the simplest possible alphabet. A two-character alphabet is the smallest alphabet that can communicate a difference. If we had an alphabet of just one character, our “sentences” would be uniform. With two, we can begin to encode information.
This is technically false. Any finite binary sequence can be converted to or from a finite unary sequence. For example, 110 ↔ 111111.

(If your binary format allows sequences to start with 0s: 0 ↔ 1, 110 ↔ 1111111 (7 instead of 6 1s), and 0000 ↔ 111111111 (9 1s).)
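The no-leading-zeros direction of this conversion can be sketched as follows (my illustration; it covers only the simple case, not the leading-zeros variant):

```python
def binary_to_unary(b: str) -> str:
    """Map a binary numeral (no leading zeros) to a unary string
    of 1s with the same value, e.g. '110' (= 6) -> '111111'."""
    return "1" * int(b, 2)

def unary_to_binary(u: str) -> str:
    """Inverse direction: the length of the unary string, in binary."""
    return format(len(u), "b")

assert binary_to_unary("110") == "111111"
assert unary_to_binary("111111") == "110"
```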
No, I don't feel interested in this. I wish you luck in finding feedback.
I hazard that most of the most interesting answers to this question are not safe to post even with a dummy account
I'm really curious what the most interesting answers are that you refer to. I'd be willing to pay (in crypto, or quila-intellectual-labor vouchers) for such an answer[1], proportional to how insightful I find it / how much I feel like I was able to make a useful update about the world from its existence (to avoid goodharting by e.g. making up fake beliefs and elaborate justifications).
If anyone is interested, message me (perhaps anonymously) so we can operationalize this better.
If you don't want me to post the answer anywhere, I won't. I also have a PGP key in my bio, and am willing to delete the message after decrypting+reading it so it's not even stored on my device.
- ^
(Does not include 'standard controversial beliefs' which I would already know that some portion of people hold)
This is why it's important for the policy to be known, for the glomarization to be evidence under that policy specifically, which might include something to the effect of "I follow this even in obvious cases so I'm free to also follow it in cases which are mistakenly framed as obvious".
That said, I'm not thinking about the 'mundane' world as Eliezer calls it, where doing this at all would be weird. I guess I'm thinking about the lesswrong blogosphere.
(There's a hypothetical spectrum from [having a glomarization policy at all is considered weird and socially-bad] to [it is not seen negatively, but you're not disincentivized from sharing non-exfohazardous beliefs, to begin with])
Maybe this has been discussed already, just commenting as I read.
This fact is key to what I'm saying because it means that, in the near future, we can literally just query multimodal GPT-N about whether an outcome is bad or good, and use that as an adequate "human value function".
In any AI system structure where it's true that GPT-N can fulfill this function[1], a natural human could too (just with a longer delay for their output to be passed back).[2]
(The rest of this and the footnotes are just-formed ideas)
Though, if your AI relies on predicting the response of GPT-N, then it does have an advantage: GPT-N can be precisely specified within the AI structure, unlike a human (whose precise neural specification is unknown) where you'd have to point to them in the environment or otherwise predict an input from the environment and thus make your AI vulnerable to probable environment hacking.
So I suppose if there's ever a GPT-N who really seems to write with regard to actual values, and not current human discourse/cultural beliefs about what human-cultural-policies are legitimated, it could work as an outer/partial inner alignment solution.[1]
Failing that kind of GPT-N, maybe you can at least have one which answers a simpler question like, "How would <natural language plan and effects> score in terms of its effect on total suffering and happiness, given x weighting of each?" - the system with that basis seems, modulo possible botched-alignment concerns, trivially preferable to an orthogonal-maximizer AI, if it's the best we can create. It wouldn't capture the full complexity of the designer's value, but would still score very highly under it due to reduction of suffering in other lightcones. Edit: another user proposes probably-better natural language targets in another comment
- ^
Though in both cases (human, GPT-N), you face some issues like: "How is the planner component generating the plans, without something like a value function (to be used in a [criterion for the plan to satisfy] to be passed to the planner)?" (i.e., you write that GPT-N would only be asked to evaluate the plan after the plan is generated). Though I'm seeing some ways around this one*
and "How are you translating from the planner's format to natural language text to be sent to the GPT?"
* (If you already have a way to translate between written human language and the planner's format, I see some ways around this which leverage that, like "translate from human-language to the planner's internal format criteria for the plan to satisfy, before passing the resulting plan to GPT-N for evaluation", and some complications** (haven't branched much beyond that, but it looks solvable))
** (i) Two different plans can correspond to the same natural language description. (ii) The choice of what to specify (specifically in the translation of an internal format to natural language) is informed by context including values and background assumptions, neither of which are necessarily specified to the translator. I have some thoughts about possible ways to make these into non-issues, if we have the translation capacity and a general purpose planner to begin with.
relevantly there's no actual value function being maximized in this model (i.e the planner is not trying to select for [the action whose description will elicit the strongest Yes rating from GPT-N], though the planner is underspecified as is)
- ^
Either case implies structural similarity to Holden (2012)'s tool AI proposal. I.e., {generate plan[1] -> output plan and wait for input} -> {display plan to human, or input plan to GPT-N} -> {if 'yes' received back as input, then actually enact plan}
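The {generate plan -> wait -> evaluate -> enact} structure might be sketched as follows (all function names are hypothetical placeholders, not an actual proposal's API; the evaluator stands in for either a human or GPT-N):

```python
def tool_ai_loop(generate_plan, evaluate, enact):
    """Sketch of the tool-AI structure described above.
    `generate_plan`: planner outputs a plan and then waits.
    `evaluate`: the plan is shown to a human or GPT-N, which
    returns a verdict (the assumed 'human value function').
    `enact`: only runs if the verdict is approval."""
    plan = generate_plan()       # step 1: generate plan, output it, wait
    verdict = evaluate(plan)     # step 2: display plan to human / GPT-N
    if verdict == "yes":         # step 3: enact only on approval
        return enact(plan)
    return None                  # otherwise the plan is discarded

# toy usage: a rejecting evaluator means nothing is ever enacted
assert tool_ai_loop(lambda: "plan", lambda p: "no", lambda p: "done") is None
assert tool_ai_loop(lambda: "plan", lambda p: "yes", lambda p: "done") == "done"
```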
I really don't want to contribute to this pattern that makes it hard to learn what's actually true, so in general I don't want whether I share what I've learned to be downstream from what I learn.
Another policy which achieves this is to research the question, and not (publicly) share your conclusion either way. This also benefits you in the case you become a dragon believer, because glomarizing (when you're known to follow this policy) provides no evidence you are one in that case. (Things this reminds me of: meta-honesty, and updatelessness when compared to your position)
Another policy which achieves this is to share your conclusion only if you end up disbelieving in dragons, but also hedge it that you wouldn't be writing if you believed the other position. If you're known to follow this policy and glomarize about whether you believe in dragons, it is evidence that either you do or you haven't researched the question.
(Copied from my EA forum comment)
I think it's valuable for some of us (those who also want to) to try some odd research/thinking-optimizing-strategy that, if it works, could be enough of a benefit to push at least that one researcher above the bar of 'capable of making serious progress on the core problems'.
One motivating intuition: if an artificial neural network were consistently not solving some specific problem, a way to solve the problem would be to try to improve or change that ANN somehow or otherwise solve it with a 'different' one. Humans, by default, have a large measure of similarity to each other. Throwing more intelligent humans at the alignment problem may not work, if one believes it hasn't worked so far.[1]
In such a situation, we'd instead want to try to 'diverge' something like our 'creative/generative algorithm', in hopes that at least one (and hopefully more) of us will become something capable of making serious progress.
(status: silly)
newcombs paradox solutions:
1: i'll take both boxes, because their contents are already locked in.
2: i'll take only box B, because the content of box B is acausally dependent on my choice.
3: i'll open box B first. if it was empty, i won't open box A. if it contained $1m, i will open box A. this way, i can defeat Omega by making our policies have unresolvable mutual dependence.
specifically regarding "benevolent values", the default strategy is to nurture them, while bad actors can do the same with "bad values".
My specific claim was that creating an AI which 'familially trusts humans' is as hard as a creating an AI which shares human[1] values. The latter is not intrinsically associated with LLM post-training (which seems off-topic), as you seemed to imply to contrast against.
give it an instinctive reason to not only do "good" but to be pro-actively and fiercely protective of humanity
That would be contained in your definition of good.
what is the correct "alignment frame" as you see it?
https://agentfoundations.study is a good reading list I found recently.
Your responses seem conceptually confused to me, but I don't know how to word how in a good way (that addresses whatever the fundamental issue is, instead of replying to each claim individually).
spend all the hard effort building instinctive moral/ethical evaluation and judgment abilities (rather than values)
I think I ended up seeing what you were trying to say with this line, and the second paragraph overall: you notice that your process of moral reflection pulls from internal intuitions or instincts which are not themselves the conclusion of this reflection process. You then propose trying to instill these intuitions in an AI, directly, through some sort of training process. You intend to contrast this with instilling only some preset conclusions, which is likely what you perceive LLM post-training to do.
This is precisely the kind of thing that I meant would be very hard, as in harder than other alignment agendas (by which I do not mean LLM post-training). We don't know what those intuitions actually are, or how to specify them for use in a training setup, or how to devise a training setup which would cause an AI to genuinely share them even if we did have a specification of them.
Meta: I don't think your ideas are at a stage where they should be presented as a 'monolithic paper' like this. I would suggest framing yourself as someone interested in learning about alignment, and asking questions etc, e.g in rob miles' AI safety discord or eleutherAI discord #alignment-beginners channel. I think you would be more likely to get feedback this way - I mean, compared to the counterfactual world where I'm not engaging with you here, which was unlikely - this post would likely not have received any other replies.
I also suggest keeping in mind that alignment is a genuinely hard problem. To be honest, I have been seeing signs that, by default, you think in the terms of human experience: your earlier reliance on 'parent/child' metaphors and use of 'familial bonding' as an alignment target; the use of 'nature/nurture' language about AI; and in the above case, the basis of your proposal on trying to recreate {your introspectively-observed moral reflection process} in an AI. I believe the kind of thinking required to make technical progress on alignment is significantly less humanistic. This doesn't mean you're necessarily not capable of it, nor that I am advising you to defer to the view I've formed of you from this short interaction; rather, it's a possibility to keep track of.
- ^
(or benevolent, my preferred term for what I actually want)
if we want to MAKE familial trust innate in AI, then we would need to do it at the intrinsic (pre-training) level
That is somewhat different. Still, "make AI have familial trust of humans" would be about as hard/fraught as the more direct "make AI have benevolent values", because 'familial trust' is similarly specific, complex, path-dependent, etc.
If you've decided not to read the paper only because you found a chat example in it
That's not the case. I opened your paper to do two things: (i) to check the summary's claim of 'unmistakable evidence of dangerous misalignment', which I did not find (and so commented to save others the time), and (ii) to skim it to see if it seemed to be worth checking further to me.
A bit more on (ii): When skimming the paper, I noticed some signs which I expect to not be present in texts which pass my bar for checking further. Though I can't elaborate on all of these, here are some examples that feel easy to word[1]:
- Assuming that traits innate to social animals are innate to minds in general. A relevant quote from your paper:
- "there's no evidence in nature to suggest that familial parent-child trust is purely learned/nurtured behavior. Therefore, it's reasonable to conclude that building familial trust into intrinsic nature (via pre-training) is essential for applying this successful natural strategy, with subsequent nurturing (post-training) to reinforce the familial relationship."
- Misleading phrasings.
- The abstract claimed the paper would contain "unmistakable evidence of dangerous misalignment". Later, said screenshot was prefaced with, "not intended to be statistically significant[2] evidence", and was not something I'd consider relevant.
- It more broadly did not seem to be operating in a mature (?) alignment frame.
Do those look like the responses of a well-aligned AI to you?
I don't consider current LLMs to be aligned/misaligned/intrinsically-objective-pursuing.
The suggestion I can make that might be amenable to you would be to try reading the sequences.
- ^
Edit: I removed some points that I thought were liable to misinterpretation, but not before OP started writing their reply it seems.
- ^
As an aside, 'statistically significant' is not an applicable concept here, because this is not a statistical analysis. This itself gave me the impression that the paper was referencing concepts known to the author as 'scientific' in a (for lack of better wording) 'guessing the teacher's password'-y way
test results are presented that show unmistakable evidence of this dangerous misalignment
I checked the paper, and said evidence was just a chat log: