What Goes Without Saying
There are people I can talk to, where all of the following statements are obvious. They go without saying. We can just “be reasonable” together, with the context taken for granted.
And then there are people who…don’t seem to be on the same page at all.
This is saying, through framing, "If you do not agree with the following, you are unreasonable; you would be among those who do not understand What Goes Without Saying, those 'who…don’t seem to be on the same page at all.'" I noticed this caused an internal pressure towards agreeing at first, before even knowing what the post wanted me to agree with.
There are all sorts of "strategies" (turn it off, raise it like a kid, disincentivize changing the environment, use a weaker AI to align it) that people come up with when they're new to the field of AI safety, but that are ineffective. And their ineffectiveness is only obvious and explainable by people who specifically know how AI behaves.
yep, but the first three all fail for the shared reason of "programs will do what their code says, including in response to your efforts". (the fourth one, 'use a weaker AI to align it', is at least obviously not itself a solution. the weakest form of it, using an LLM to assist an alignment researcher, is possible, and some less weak forms likely are too.)
when i think of other 'newly heard of alignment' proposals, like boxing, most of them seem to fail because the proposer doesn't actually have a model of how this is supposed to work or help in the first place. (the strong version of 'use ai to align it' probably fits better here)
(there are some issues which a programmatic model doesn't automatically make obvious to a human: they must follow from it, but one could fail to see them without making that basic mistake. probable environment hacking and decision theory issues come to mind. i agree that on general priors this is some evidence that there are deeper subjects that would not be noticed even conditional on those researchers approving a solution.)
i guess my next response then would be that some subjects are bounded, and we might notice (if not 'be able to prove') such bounds telling us 'there's not more things beyond what you have already written down', which would be negative evidence (strength depending on how strongly we've identified a bound). (this is more of an intuition, i don't know how to elaborate on it)
(also on what johnswentworth wrote: a similar point i was considering making is that the question is set up in a way that forces you into playing a game of "show how you'd outperform magnus carlsen {those researchers} in chess alignment theory" - for any consideration you can think of, one can respond that those researchers will probably also think of it, which might preclude them from actually approving, which makes the conditional 'they approve but it's wrong'[1] harder to be true and basically dependent on them instead of object-level properties of alignment.)
i am interested in reading more arguments about the object-level question if you or anyone else has them.
If the solution to alignment were simple, we would have found it by now [...] That there is one simple thing from which comes all of our values, or a simple way to derive such a thing, just seems unlikely.
the pointer to values does not need to be complex (even if the values themselves are)
If the solution to alignment were simple, we would have found it by now
generally: simple things don't have to be easy to find. the hard part can be locating them within some huge space of possible things. (math (including its use in the laws of physics) comes to mind?). (and specifically to alignment: i also strongly expect an alignment solution to {have some set of simple principles from which it can be easily derived, regardless of whether the program itself ends up long}, but idk if i can legibly explain why. real complexity usually results from stochastic interactions in a process, but "aligned superintelligent agent" is a simply-defined, abstract thing?)
- ^
ig you actually wrote 'they don't notice flaws', which is ambiguous between 'they approve' and 'they don't find affirmative failure cases'. and maybe the latter was your intent all along.
it's understandable because we do have to refer to humans to call something unintuitive.
That for every true alignment solution, there are dozens of fake ones.
Is this something that I should be seriously concerned about?
if you truly believe in a 1-to-dozens ratio between[1] real and 'fake' (endorsed by eliezer and others but unnoticedly flawed) solutions, then yes. in that case, you would naturally favor something like human intelligence augmentation, at least if you thought it had a chance of succeeding greater than p(chance of a solution being proposed which eliezer and others deem correct) × 1/24
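the comparison in that last sentence can be made concrete with a toy calculation (a hedged sketch; every number below is a made-up placeholder for illustration, not an estimate anyone has endorsed):

```python
# toy version of the comparison above. all numbers are made-up placeholders.
p_approved = 0.5           # chance *some* solution gets proposed and approved
real_to_endorsed = 1 / 24  # "for every true alignment solution, there are dozens of fake ones"

# chance the approved solution is actually correct, under the 1-to-dozens ratio
p_alignment_route = p_approved * real_to_endorsed  # 0.5 * (1/24), about 0.021

p_augmentation_route = 0.1  # placeholder chance that augmentation succeeds

# under these made-up numbers, the augmentation route wins the comparison
print(p_augmentation_route > p_alignment_route)
```

the point is only the structure of the comparison: under a genuine 1-to-dozens ratio, even a fairly low-probability augmentation route can dominate.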
I believe that before we stumble on an alignment solution, we will stumble upon an "alignment solution" - something that looks like an alignment solution, but is flawed in some super subtle, complicated way that means that Earth still gets disassembled into compute or whatever, but the flaw is too subtle and complicated for even the brightest humans to spot
i suggest writing why you believe that. in particular, how do you estimate the prominence of 'unnoticeably flawed' alignment solutions given they are unnoticeable (to any human)?[2] where does "for every true alignment solution, there are dozens of fake ones" come from?
A really complicated and esoteric, yet somehow elegant
why does the proposed alignment solution have to be really complicated? overlooked mistakes become more likely as complexity increases, so this premise favors your conclusion.
- ^
(out of the ones which might be proposed, to avoid technicalities about infinite or implausible-to-be-thought-of proposals)
- ^
(there are ways you could in principle, for example if there were a pattern of the researchers continually noticing they made increasingly unintuitive errors up until the 'final' version (where they no longer notice, but presumably this pattern would be noticed); or extrapolation from general principles about {some class of programming that includes alignment} (?) being easy to make hard-to-notice unintuitive mistakes in)
good to hear it's at least transparent enough for you to describe it directly like this. (edit: though the points in dawnlights post seem scarier)
she attempted to pressure me to take MDMA under her supervision. I ended up refusing, and she relented; however, she then bragged that because she had relented, I would trust her more and be more likely to take MDMA the next time I saw her.
this seems utterly evil, especially given MDMA is known as an attachment inducing drug.
edit: more generally, it seems tragic for people who are socially-vulnerable and creative to end up paired with adept manipulators.
a simple explanation is that because creativity is (potentially very) useful, vulnerable creative people will be targets for manipulation. but i think there are also dynamics in communities with higher [illegibility-tolerance? esoterism?] which enable this, which i don't know how to write about. i hope someone tries to write about it.
upvoted, i think this article would be better with comparison to the recommendations in thomas kwa's shortform about air filters
But maybe you only want to "prove" inner alignment and assume that you already have an outer-alignment-goal-function
correct, i'm imagining these being solved separately
a possible research direction which i don't know if anyone has explored: what would a training setup which provably creates a (probably[1]) aligned system look like?
my current intuition, which is not good evidence here beyond elevating the idea from noise, is that such a training setup might somehow leverage how the training data and {subsequent-agent's perceptions/evidence stream} are sampled from the same world, albeit with different sampling procedures. for example, the training program could intake both a dataset and an outer-alignment-goal-function, and select for prediction of the dataset (to build up ability) while also doing something else to the AI-in-training; i have no idea what that something else would look like (and it seems like most of this problem).
has this been thought about before? is this feasible? why or why not?
(i can clarify if any part of this is not clear.)
(background motivator: in case there is no finite-length general purpose search algorithm[2], alignment may have to be of trained systems / learners)
- ^
(because in principle, it's possible to get unlucky with sampling for the dataset. compare: it's possible for an unlucky sequence of evidence to cause an agent to take actions which are counter to its goal.)
- ^
by which i mean a program capable of finding something which meets any given criteria met by at least one thing (or writing 'undecidable' in self-referential edge cases)
a moral intuition i have: to avoid culturally/conformistly-motivated cognition, it's useful to ask:
if we were starting over, new to the world but with all the technology we have now, would we recreate this practice?
example: we start out, and there's us, and these innocent fluffy creatures that can't talk to us, but they can be our friends. we're just learning about them for the first time. would we, at some point, spontaneously choose to kill them and eat their bodies, despite us having plant-based foods, supplements, vegan-assuming nutrition guides, etc? to me, the answer seems obviously not. the idea would not even cross our minds.
(i encourage picking other topics and seeing how this applies)
Status: Just for fun
it was fun to read this :]
All intelligent minds seek to optimise for their value function. To do this, they will create environments where their value function is optimised.
in case you believe this [disregard if not], i disagree and am willing to discuss here. in particular i disagree with the create environments part: the idea that all goal functions (or only some subset, like selected-for ones; also willing to argue against this weaker claim[1]) would be maximally fulfilled (also) by creating some 'small' simulation (made of a low % of the reachable universe).
(though i also disagree with the 'all' in the quote's first sentence[2]. i guess i'd also be willing to discuss that.)
- ^
for this weaker claim: many humans are a counterexample of selected-for-beings whose values would not be satisfied just by creating a simulation, because they care about suffering outside the simulation too.
- ^
my position: 'pursues goals' is conceptually not a property of intelligence, and not all possible intelligent systems pursue goals (and in fact pursuing goals is a very specific property, technically rare in the space of possible intelligent programs).
this could have been noise, but i noticed an increase in text fearing spies, in the text i've seen in the past few days[1]. i actually don't know how much this concern is shared by LW users, so i think it might be worth writing that, in my view:
- (AFAIK) both governments[2] are currently reacting inadequately to unaligned optimization risk. as a starting prior, there's no strong reason to fear one government {observing/spying on} ML conferences/gatherings more than the other, absent evidence that one or the other will start taking unaligned optimization risks very seriously, or that one or the other is prone to race towards ASI.
- (AFAIK, we have more evidence that the U.S. government may try to race, e.g. this, but i could have easily missed evidence as i don't usually focus on this)
- tangentially, a more-pervasively-authoritarian government could be better situated to prevent unilaterally-caused risks (cf a similar argument in 'The Vulnerable World Hypothesis'), if it sought to. (edit: and if the AI labs closest to causing those risks were within its borders, which they are not atm)
- this argument feels sad (or reflective of a sad world?) to me to be clear, but it seems true in this case
that said i don't typically focus on governance or international-AI-politics, so have not put much thought into this.
- ^
examples: yesterday, saw this twitter/x post (via this quoting post)
today, opened lesswrong and saw this shortform about two uses of the word spy and this shortform about how it's hard to have evidence against the existence of manhattan projects
this was more than usual, and i sense that it's part of a pattern
- ^
those of the US and China
Lookism is also highly persistent. In two studies, this paper found that educating judges to not bias on looks had no practical impact on the advantages of ‘looking trustworthy’ during sentencing. Then they tried having judges form their decision without looking, but with the opportunity to revise later, and found that this actually increased the bias, as judges would often modify their decisions upon seeing the defendant. People seem to very strongly endorse lookism in practice, no matter what they say in theory.
the methods sections of the paper say study participants were not actual judges, but "US American workers from Amazon Mechanical Turk who participated in exchange for ${1,2}.50."
(i just ctrl+f'd 'recruit' and 'judges' though, so i could have missed something)
I bet on idea that it is better to have orders of magnitude more happy copies, than fight to prevent one in pain
(that's a moral judgement, so it can't be bet on/forecasted). i'm not confident most copies would be happy; LLM characters are treated like playthings currently, i don't expect human sideloads to be treated differently by default, in the case of internet users cloning other internet users. (ofc, one could privately archive their data and only use it for happy copies)
(status: mostly writing my thoughts about the ethics of sideloading. not trying to respond to most of the post, i just started with a quote from it)
(note: the post's karma went from 12 to 2 while i was writing this, just noting i haven't cast any votes)
if you do not consent to uploading, you will be resurrected only by hostile superintelligences that do not care about consent.
some thoughts on this view:
- it can be said of anything: "if you don't consent to x, x will be done to you only by entities which don't care about consent". in my view, this is not a strong argument for someone who otherwise would not want to consent to x, because it only makes the average case less bad by adding less-bad cases that they otherwise don't want, rather than by decreasing the worst cases.
- if someone accepted the logic, i'd expect they've fallen for a mental trap where they focus on the effect on the average, and neglect the actual effect.
- in the particular case of resurrections, it could also run deeper: humans "have a deep intuition that there is one instance of them". by making the average less bad in the described way, it may feel like "the one single me is now less worse off".
- consent to sideloading doesn't have to be general, it could be conditional (a list of required criteria, but consider goodhart) or only ever granted personally.
- at least this way, near-term bad actors wouldn't have something to point to to say "but they/[the past version of them who i resurrected] said they're okay with it". though i still expect many nonconsensual sideloads to be done by humans/human-like-characters.
i've considered putting more effort into preventing sideloading of myself. but reflectively, it doesn't matter whether the suffering entity is me or someone else.[1] more specifically, it doesn't matter if suffering is contained in a character-shell with my personal identity or some other identity or none; it's still suffering. i think that even in natural brains, suffering is of the underlying structure, and the 'character' reacts to but does not 'experience' it; that is, the thing which experiences is more fundamental than the self-identity; that is, because 'self-identity' and 'suffering' are two separate things, it is not possible for an identity 'to' 'experience' suffering, only to share a brain with it / be causally near to it.
- ^
(still, i don't consent to sideloading, though i might approve exceptions for ~agent foundations research. also, i do not consider retroactive consent given by sideloads to be valid, especially considering conditioning, regeneration, and partial inaccuracy make it trivial to cause to be output.)
(status: metaphysics) two ways it's conceivable[1] that reality could have been different:
- Physical contingency: The world has some starting condition that changes according to some set of rules, and it's conceivable that either could have been different
- Metaphysical contingency: The more fundamental 'what reality is made of', not meaning its particular configuration or laws, could have been some other,[2] unknowable unknown, instead of "logic-structure" and "qualia"
- ^
(i.e. even if, actually, its being as it is is logically necessary somehow)
- ^
To the limited extent language can point to that at all.
It is comparable to writing, in math, "something not contained in the set of all possible math entities", where actually one intends to refer to some "extra-mathematical" entity; the thing metaphysics 'could have been' would have to be extra-real, and language (including phrases like 'could have been' and 'things'), being a part of reality, cannot describe extra-real things
That is also why I write 'unknowable unknowns' instead of the standard 'unknown unknowns'; it's not possible to even imagine a different metaphysics / something extra-real.
Are there already manifold markets
yes, but only small trading volume so far: https://manifold.markets/Bayesian/will-a-us-manhattanlike-project-for
(i ctrl+f'd "alignment" and there was not one mention of the AI sense)
(was this written by chatgpt?)
another crucial consideration here is that a benevolent ASI could do acausal trade to reduce suffering in the unreachable universe.[1] (comparing the EV of that probability and of the probability of human-caused long-term-suffering is complex / involves speculation about the many variables going into each side)
- ^
there's writing about this somewhere, i'm here just telling you that the possibility / topic exists
i wrote this about it but i don't think it's comprehensive enough https://quila.ink/posts/ev-of-alignment-for-negative-utilitarians/
from ycombinator comments on a post of that:[1]
(links to a comment by the original user, with a different account[2] continuing the chat):
recommend against supplementing melatonin
why?
i searched Andrew Huberman melatonin
and found this, though it looks like it may be an AI generated summary.
i might try sleeping for a long time (16-24 hours?) by taking sublingual[1] melatonin right when i start to be awake, and falling asleep soon after. my guess: it might increase my cognitive quality on the next wake up, like this:
(or do useful computation during sleep, leading to apparently having insights on the next wakeup? long elaboration below)
i wonder if it's even possible, or if i'd have trouble falling asleep again despite the melatonin.
i don't see much risk to it, since my day/night cycle is already uncalibrated[2], and melatonin is naturally used for this narrow purpose in the body.
'cognitive quality' is really vague. here's what i'm really imagining
my unscientific impression of sleep, from subjective experience (though i only experience the result) and speculation i've read, is that it does these things:
- integrates into memory what happened in the previous wake period, and maybe to a lesser extent further previous ones
- more separate from the previous wake period, acts on my intuitions or beliefs about things to 'reconcile' or 'compute implicated intuitions'. for example if i was trying to reconcile two ideas, or solve some confusing logical problem, maybe the next day i would find it easier because more background computation has been done about it?
- maybe the same kind of background cognition that happens during the day, that leads to people having ideas random-feelingly enter their awareness?
- this is the one i feel like i have some sub-linguistic understanding of how it works in me, and it seems like the more important of the two for abstract problem solving, which memories don't really matter to. for this reason, a higher proportion of sleep or near-sleep in general may be useful for problem solving.
but maybe these are not done almost as much as they could be, because of competing selection pressures for different things, of which sleep-time computations are just some. (being awake is useful to gather food and survive)
anyways, i imagine that after those happen for a longer time, the waking mental state could be very 'fresh' / more unburdened by previous thoughts/experiences (bulletpoint 1), and prone to creativity/'apparently' having new insights (bulletpoint 2). (there is something it feels like to be in such a state for me, and it happens more just after waking)
- ^
takes effect sooner
- ^
i have the non-24 hour sleep/wake cycle that harry has in HPMOR. for anyone who also does, some resources:
from the author's note to chapter 98:
Last but not least:
You know Harry’s non-24 sleep disorder? I have that. Normally my days are around 24 hours and 30 minutes long.
Around a year ago, some friends of mine cofounded MetaMed, intended to provide high-grade analysis of the medical literature for people with solution-resistant medical problems. (I.e. their people know Bayesian statistics and don’t automatically believe every paper that claims to be ‘statistically significant’ – in a world where only 20-30% of studies replicate, they not only search the literature, but try to figure out what’s actually true.) MetaMed offered to demonstrate by tackling the problem of my ever-advancing sleep cycle.
Here’s some of the things I’ve previously tried:
- Taking low-dose melatonin 1-2 hours before bedtime
- Using timed-release melatonin
- Installing red lights (blue light tells your brain not to start making melatonin)
- Using blue-blocking sunglasses after sunset
- Wearing earplugs
- Using a sleep mask
- Watching the sunrise
- Watching the sunset
- Blocking out all light from the windows in my bedroom using aluminum foil, then lining the door-edges with foam to prevent light from slipping in the cracks, so I wouldn’t have to use a sleep mask
- Spending a total of ~$2200 on three different mattresses (I cannot afford the high-end stuff, so I tried several mid-end ones)
- Trying 4 different pillows, including memory foam, and finally settling on a folded picnic blanket stuffed into a pillowcase (everything else was too thick)
- Putting 2 humidifiers in my room, a warm humidifier and a cold humidifier, in case dryness was causing my nose to stuff up and thereby diminish sleep quality
- Buying an auto-adjusting CPAP machine for $650 off Craigslist in case I had sleep apnea. ($650 is half the price of the sleep study required to determine if you need a CPAP machine.)
- Taking modafinil and R-modafinil.
- Buying a gradual-light-intensity-increasing, sun alarm clock for ~$150
Not all of this was futile – I kept the darkened room, the humidifiers, the red lights, the earplugs, and one of the mattresses; and continued taking the low-dose and time-release melatonin. But that didn’t prevent my sleep cycle from advancing 3 hours per week (until my bedtime was after sunrise, whereupon I would lose several days to staying awake until sunset, after which my sleep cycle began slowly advancing again).
MetaMed produced a long summary of extant research on non-24 sleep disorder, which I skimmed, and concluded by saying that – based on how the nadir of body temperature varies for people with non-24 sleep disorder and what this implied about my circadian rhythm – their best suggestion, although it had little or no clinical backing, was that I should take my low-dose melatonin 5-7 hours before bedtime, instead of 1-2 hours, a recommendation which I’d never heard anywhere before.
And it worked.
I can’t *#&$ing believe that #*$%ing worked.
(EDIT in response to reader questions: “Low-dose” melatonin is 200microgram (mcg) = 0.2 mg. Currently I’m taking 0.2mg 5.5hr in advance, and taking 1mg timed-release just before closing my eyes to sleep. However, I worked up to that over time – I started out just taking 0.3mg total, and I would recommend to anyone else that they start at 0.2mg.)
other resources: https://slatestarcodex.com/2018/07/10/melatonin-much-more-than-you-wanted-to-know/, https://www.reddit.com/r/N24/comments/fylcmm/useful_links_n24_faq_and_software/
if i left out the word 'trying' to (not) use it in that way instead, nothing about me would change, but there would be more comments saying that success is not certain.
i also disagree with the linked post[1], which says that 'i will do x' means one will set up a plan to achieve the highest probability of x they can manage. i think it instead usually means one believes they will do x with sufficiently high probability to not mention the chance of failure.[2] the post acknowledges the first half of this -- «Well, colloquially, "I'm going to flip the switch" and "I'm going to try to flip the switch" mean more or less the same thing, except that the latter expresses the possibility of failure.» -- but fails to integrate that something being said implies belief in its relevance/importance, and so concludes that using the word 'try' (or, by extrapolation, expressing the possibility of failure in general) is unnecessary in general.
- ^
though its psychological point seems true:
But if all you want is to "maximize the probability of success using available resources", then that's the easiest thing in the world to convince yourself you've done.
- ^
this is why this wording is not used when the probability of success is sufficiently far (in percentage points, not logits) from guaranteed.
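the parenthetical distinction can be illustrated with a small sketch: equal distances in logits are very unequal in percentage points, so the two scales give different notions of 'sufficiently far from guaranteed'.

```python
import math

def logit(p):
    # log-odds of a probability p
    return math.log(p / (1 - p))

# 90% -> 99% and 99% -> 99.9% are 9 and 0.9 percentage points apart,
# but nearly the same distance apart in logits (~2.40 vs ~2.31)
print(logit(0.99) - logit(0.9))
print(logit(0.999) - logit(0.99))
```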
nothing short of death can stop me from trying to do good.
the world could destroy or corrupt EA, but i'd remain an altruist.
it could imprison me, but i'd stay focused on alignment, as long as i could communicate to at least one on the outside.
even if it tried to kill me, i'd continue in the paths through time where i survived.
What is malevolence? On the nature, measurement, and distribution of dark traits was posted two weeks ago (and i recommend it). there was a questionnaire discussed in that post which tries to measure the levels of 'dark traits' in the respondent.
i'm curious about the results[1] of rationalists[2] on that questionnaire, if anyone wants to volunteer theirs. there are short and long versions (16 and 70 questions).
- ^
(or responses to the questions themselves)
- ^
i also posted the same shortform to the EA forum, asking about EAs
something i'd be interested in reading: writings about the authors' alignment ontologies over time, i.e. from when they first heard of AI till now
Understanding [how to design] rather than 'growing' search/agency-structure would actually equal solving inner alignment, if said structure does not depend on what target[1] it is intended to be given, i.e. is targetable (inner-alignable) rather than target-specific.[2]
Such an understanding would simultaneously qualify as an understanding of 'how to code a capable AI', but would be fundamentally different from what labs are doing in an alignment-relevant way. In this framing, labs are selecting for target-specific structures (that we don't understand). (Another difference is that, IIRC, Johannes might intend not to share research on this publicly, but I'm less sure after rereading the quote that gave me that impression[3]).
- ^
includes outer alignment goals
- ^
If it's not clear what I mean, reading this about my background model might help, also feel free to ask me questions
- ^
from one of Johannes' posts:
I don't have such a convincing portfolio for doing research yet. And doing this seems to be much harder. Usually, the evaluation of such a portfolio requires technical expertise - e.g. how would you know if a particular math formalism makes sense if you don't understand the mathematical concepts out of which the formalism is constructed?
Of course, if you have a flashy demo, it's a very different situation. Imagine I had a video of an algorithm that learns Minecraft from scratch within a couple of real-time days, and then gets a diamond in less than 1 hour, without using neural networks (or any other black box optimization). It does not require much technical knowledge to see the significance of that.
But I don't have that algorithm, and if I had it, I would not want to make that publicly known. And I am unsure what is the cutoff value. When would something be bad to publish? All of this complicates things.
(After rereading this I'm not actually sure what that means they'd be okay sharing or if they'd intend to share technical writing that's not a flashy demo)
Let us know what you think!
the grey text feels disruptive to normal reading flow but idk why green link text wouldn't also be, maybe i'm just not used to it. e.g., in this post's "Curating technical posts" where 'Curating' is grey, my mind sees "<Curating | distinct term> technical posts" instead of [normal meaning inference not overfocused on individual words]
Is this useful, as a reader?
if the authors make sure they agree with all the definitions they allow into the glossary, yes. author-written definitions would be even more useful because how things are worded can implicitly convey things like, the underlying intuition, ontology, or related views they may be using wording to rule in or out.
Whenever an author with 100+ karma saves a draft of a post, our database queries a language model to:
i would prefer this be optional too, for drafts which are meant to be private (e.g. shared with a few other users, e.g. may contain possible capability-infohazards), where the author doesn't trust LM companies
If you think I missed the point, can you explain in more detail?
Here is my model: Demon king buys shares in “The Demon King will attack the Frozen Fortress”, then sends some small technically-an-attack to the fortress so the market resolves yes, and knowing this will be done is not worth the money lost to the Demon King on the market. No serious-battle plans or military secrets are leaked, and more generally the Demon King would only do this if the information revealed weren't worth the market cost. (i.e. it's a central kind of prediction market outcome manipulation, i.e. exploiting how this prediction market assumed a kind of metaphysical gap between predictors and the world / knowledge and action)
Do you disagree with this, or think it's true but misses the point, in which case what was the point?
For example, most US school children recite the Pledge of Allegiance every day (or at least they used to). I can remember not fully understanding what the words meant until I was in middle school, but I just went along with it. And wouldn't you know it, it worked! I do have an allegiance to the United States as a concept.
Can you explain how it caused that, and maybe what it feels like?
(I find it alarming that being forced to recite a pledge as a child can actually have that effect -- I knew humans were culturally programmable, but not that {forcing someone to say "I endorse x!" when they don't know what it means nor want to say it} every day would actually cause them to endorse x later on. Actually, I notice I'm skeptical that that was the real cause in your case; what's your reason for believing it was the cause?)
(No pressure to answer my questions of course - interpret them as statements of curiosity rather than requests in the human/social sense)
it helped them anticipate the Demon King’s next moves – it's not the market's fault that they couldn't convert foresight into operational superiority
The demon king only made those moves to profit from the market, they wouldn't have been made otherwise
If we stand by while OpenAI violates its charter, it signals that their execs can get away with it. Worse, it signals that we don’t care.
what signals you send to OAI execs seems not relevant.
in the case where they really can't get away with it, e.g. where the state will really arrest them, then sending them signals / influencing their information state is not what causes that outcome.
if your advocacy causes the world to change such that "they can't get away with it" becomes true, this also does not route through influencing their information state.
OpenAI is seen as the industry leader, yet projected to lose $5 billion this year
i don't see why this would lead them to downsize, if "the gap between industry investment in deep learning and actual revenue has ballooned to over $600 billion a year"
how? edit: maybe you meant just the first kind
oh i meant medical/covid ones. could also consider furry masks and the cat masks that femboys often wear (e.g. to obscure masculine facial structure), which feel cute rather than 'cool', though they are more like the natural human face in that they display an expression ("the face is a mask we wear over our skulls")
also see ashiok from mtg: whole upper face/head is replaced with shadow
also, masks 'create an asymmetry in the ability to discern emotions' but do not seem to lead to the rest
What we could do is create a predictor -- an algorithm that looks at the previously generated bits, tries to find all possible patterns in them and predict the most likely following bit -- and then actually output the opposite. Keep doing this for every bit.
i think a (simplicity-biased) predictor would narrow in on the situation described: that {the rule generating the sequence} contains {a copy of the predictor}, making them irresolvably mutually-dependent, similar to the mutual dependence in the classical halting problem.
in such a case, the predictor is not predicting a 1 or a 0, but a situation where neither can be yielded. so, to be a true implementation of said predictor, it would need to be able to output some third option representing irresolvable situations.
you'd get some string of bits before the predictor considered [irresolvable-mutual-dependence exception] most probable though! what that string is (for some prediction-narrowing procedure) sounds like a fun question
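the construction quoted above can be sketched with a toy predictor (majority vote stands in, as an assumption, for "find all possible patterns"); the irresolvable self-reference arises only when the predictor is asked to predict a sequence generated from a copy of itself:

```python
# toy sketch of an "anti-inductive" sequence: run a fixed predictor over the
# history, then emit the opposite of its prediction as the next bit.
def predict(history):
    # toy simplicity-biased predictor: guess the more common bit so far,
    # defaulting to 0 on an empty history or a tie
    return 1 if sum(history) * 2 > len(history) else 0

def generate(n):
    bits = []
    for _ in range(n):
        bits.append(1 - predict(bits))  # always output the opposite bit
    return bits
```

against this particular predictor the generator settles into alternation; a predictor containing a copy of `generate` would instead face the mutual dependence described above.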
one of my basic background assumptions about agency:
there is no ontologically fundamental caring/goal-directedness, there is only the structure of an action being chosen (by some process, for example a search process), then taken.
this makes me conceptualize the 'ideal agent structure' as being "search, plus a few extra parts". in my model of it, optimal search is queried for what action fulfills some criteria ('maximizes some goal') given some pointer (~ world model) to a mathematical universe sufficiently similar to the actual universe → search's output is taken as action, and because of said similarity we see a behavioral agent that looks to us like it values the world it's in.
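a minimal sketch of this view (all names and the toy world model are hypothetical): the "agent" is nothing but a search over actions scored through a world model, and goal-directedness is only visible behaviorally, from the scaffold taking search's output as the action.

```python
# search: return the action whose predicted outcome best satisfies the criterion
def search(world_model, criterion, actions):
    return max(actions, key=lambda a: criterion(world_model(a)))

# the "few extra parts": a world model mapping actions to predicted states, and
# a criterion scoring those states; nothing in here "cares" about anything
world_model = lambda a: a * a        # predicted state after taking action a
criterion = lambda s: -abs(s - 9)    # prefer states near 9

# taking search's output as the action is what makes the system look
# goal-directed from the outside
action = search(world_model, criterion, range(10))
```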
i've been told that {it's common to believe that search and goal-directedness are fundamentally intertwined or meshed together or something}, whereas i view goal-directedness as almost not even a real thing, just something we observe behaviorally when search is scaffolded in that way.
if anyone wants to explain the mentioned view to me, or link a text about it, i'd be interested.
(maybe a difference is in the kind of system being imagined: in selected-for systems, i can understand expecting things to be approximately-done at once (i.e. within the same or overlapping strands of computations); i guess i'd especially expect that if there's a selection incentive for efficiency. i'm imagining neat, ideal (think intentionally designed rather than selected for) systems in this context.)
edit: another implication of this view is that decision theory is its own component (could be complex or not) of said 'ideal agent structure', i.e. that superintelligence with an ineffective decision theory is possible (edit: nontrivially likely for a hypothetical AI designer to unintentionally program / need to avoid). that is, one being asked the wrong questions (i.e. of the wrong decision theory) in the above model.
yep not contesting any of that
neither is there in rationality a recipe with which you can just crank the handle and come up with a proof of a conjecture
to be clear, coming up with proofs is a central example of what i meant by creativity. ("they are not satisfied by avoiding failure conditions, but require the satisfaction of some specific, hard-to-find success condition")
The “Draftsmen” podcast by two artists/art instructors contains several episodes on the subject
i am an artist as well :). i actually doubt that most artists could give much insight here; i think artist creativity - and likewise mathematician creativity, human creativity generally - is usually of the default, mysterious kind: we don't know where it comes from, it 'just happens', like intuitions, thoughts, and realizations do. it's not actually fundamentally different from those, just called 'creativity' more often in certain domains like art.
i don't think having (even exceptionally) high baseline intelligence and then studying bias avoidance techniques is enough for one to be able to derive an alignment solution. i have not seen in any rationalist i'm aware of what feels like enough for that, though their efforts are virtuous of course. it's just that the standard set by the universe seems higher.
i think this is a sort of background belief for me. not failing at thinking is the baseline; other needed computations are harder. they are not satisfied by avoiding failure conditions, but require the satisfaction of some specific, hard-to-find success condition. learning about human biases will not train one to cognitively seek answers of this kind, only to avoid premature failure.
this is basically a distinction between rationality and creativity. rationality[1] is about avoiding premature failure, creativity is about somehow generating new ideas.
but there is not actually something which will 'guide us through' creativity, like hpmor/the sequences do for rationality. there are various scattered posts about it[2].
i also do not have a guide to creativity to share with you. i'm only pointing at it as an equally if not more important thing.
if there is an art for creativity in the sense of narrow-solution-seeking, then where is it? somewhere in books buried deep in human history? if there is not yet an art, please link more scattered posts or comment new thoughts if you have any.
adding another possible explanation to the list:
- people may feel intimidated or discouraged from sharing ideas because of ~'high standards', or something like: a tendency to require strong evidence that a new idea is not another non-solution proposal, in order to put effort into understanding it.
i have experienced this, but i don't know how common it is.
i just also recalled that janus has said they weren't sure simulators would be received well on LW. simulators was cited in another reply to this as an instance of novel ideas.
Agreed that hidden-motte-and-baileys are a thing. They may also be caused by pressure not to express the actual belief (in which case, idk if I'd call it a fallacy / mistake of reasoning).
I'm not seeing how they synergise with the 'gish fallacy' though.
mathematicians know that a single flaw can destroy proofs of any length
Yes, but the analogy would be having multiple disjunctive proof-attempts which lead to the same result, which you can actually do validly (including with non-math beliefs). (Of course the case you describe is not a valid case of this)
by virtue of happening 10 million years ago or whatever
Why would the time it happens at matter?
we just spin a big quantum wheel, and trade with the AI that comes up
Or run a computation to approximate an average, if that's possible.
I'd guess it must be possible if you can randomly sample, at least. I.e., if you mean sampling from some set of worlds, and not just randomly combinatorially generating programs until you find a trade partner.
I know this approach isn't as effective for xrisk, but still, it's something I like to use.
This sentence has the grammatical structure of acknowledging a counterargument and negating it - "I know x, but y" - but the y is "it's something I like to use", which does not actually negate the x.
This is a kind of thing I suspect results from a process like: someone writes out the structure of negation, out of wanting to negate an argument, but then finds nothing stronger to slot into where the negating argument is supposed to be.
I tried thinking of principles, but it was hard to find ones specific to this. There's one obvious 'default' one at least (default as in it may be overridden by situation).
Secrecy
Premises:
- Model technical knowledge progress (such as about alignment) as concavely/diminishingly increasing with collaboration group size and member <cognitive traits>[1],
- Combine with unilateralist effect
- Combine with it being less hard/specific to create an unaligned than an aligned superintelligent agent (otherwise the unilateralist effect would work in the opposite direction).
This implies the positive, but not the negative, value of sharing information publicly is diminished if there is already a group trying to utilize the information. If so, it may imply the ideal is various individual, small, or medium-sized alignment-focused groups which don't share their progress publicly by default.[4]
(I do suspect humans are biased in favor of public and social collaboration, as that's kind of what they were selected for, and in a less vulnerable world. Moreover, premise 1a ('humans are mostly the same entity') does contradict aspects of humanistic ontology. That's not strong evidence for this 'principle', just a reason it's probably under-considered)
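the concavity premise can be made concrete with a toy model (the functional form here is purely illustrative, not derived from anything): if progress grows concavely with group size, each marginal contributor adds little, which is what lets the unilateralist-effect consideration dominate the value of public sharing.

```python
import math

def progress(n):
    # concave / diminishing returns to collaboration group size
    # (sqrt is an illustrative stand-in for any concave form)
    return math.sqrt(n)

def marginal_progress(n):
    # how much the (n+1)-th member adds
    return progress(n + 1) - progress(n)
```

e.g. the second member adds about 0.41 units of progress under this toy form, while the 101st adds about 0.05.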
Counterpoints:
On the concaveness assumption:
~ In history, technical knowledge was developed in a decentralized way, IIUC (this is based on my purely lay understanding of the history of knowledge progression, probably absorbed merely from stories and culture). If that's true, it is evidence against the idea that a smaller group can make almost as much progress as a large one.
Differential progress:
~ there are already far more AI researchers than AI alignment researchers. While the ideal might be for this to be a highly secretive subject like how existential risks are handled in Dath Ilan, this principle cannot give rise to that.
What are principles we can use when secrecy is not enough?
My first thought is to look for principles in games such as you mentioned. But none feel too particular to this question. It returns general things like, "search paths through time", which can similarly be used to pursue good or harmful things. This is unsatisfying.
I want deeper principles, but there may be none.
Meta-principle: Symmetry: For any principle you can apply, an agent whose behavior furthers the opposite thing could in theory also apply it.
To avoid symmetry, one could look for principles that are unlikely to be able to be utilized without specific intent and knowledge. One can outsmart runaway structural processes this way, for example, and I think that to a large extent AI research is a case of that.
How have runaway processes been defeated before? There are some generic ways, like social movements, that are already being attempted with superintelligent agent x-risk. Are there other, less well known or expected ways? And did these ways reduce to generic, 'searching paths through time', or is there a pattern to them which could be studied and understood?
There are some clever ideas for doing something like that which come to mind. E.g., the "confrontation-worthy empathy" section of this post.
It's hard for me to think of paths through time more promising than just, 'try to solve object-level alignment', though, let alone the principles which could inspire them (e.g., idk what principle the linked thing could be a case of)
- ^
I mean things like creativity, different ways of doing cognition about problems, and standard things like working memory, 'cognitive power', etc.
(I am using awkward constructions like 'high cognitive power' because standard English terms like 'smart' or 'intelligent' appear to me to function largely as status synonyms. 'Superintelligence' sounds to most people like 'something above the top of the status hierarchy that went to double college', and they don't understand why that would be all that dangerous? Earthlings have no word and indeed no standard native concept that means 'actually useful cognitive power'. A large amount of failure to panic sufficiently, seems to me to stem from a lack of appreciation for the incredible potential lethality of this thing that Earthlings as a culture have not named.)
- ^
I mean replications of the same fundamental entity, i.e. humans or the structure of what a human is. And by 'mostly' I mean of course there are differences too. I think evolution implies human minds will tend to be more reflectively aware of the differences, because the sameness can operate as an unnoticed background assumption.
- ^
Like we'd not expect asking 10 ChatGPT-3.5s instead of just one to do significantly better. Less true with humans because they were still selected to be different and collaborate.
- ^
(and this may be close to the situation already?)
(This comment is tangential to the decision-theoretic focus of the post)
The AI stabilizes the situation in the world and makes sure no other dangerous AI is built, but otherwise it doesn't harm the humans.[6] Then it modifies its own code to have a commitment never to harm the humans, and let them live freely on Earth for at least a billion years, only doing the minimal necessary interventions to prevent humanity from wiping itself out with some new stupid technology. Crucially, the AI should do this self-modification[7] before it makes itself very substantially smarter or better-informed about the world, to the level that it can expect to determine whether it's in a simulation run by a very advanced future civilization.
I don't know of consistent human values which would ask for this specifically. Consider two cases[1]:
- You value something like continuation of {with a bunch of complex criteria}, not quantity of copies of, at least one 'earth society'.
- In this case, it continues regardless some of the time, conditional on the universe being large or duplicative enough to contain many copies of you / conditional on the premise in the post that at least some aligned ASIs will exist somewhere.
- Instead, you linearly value a large number of copies of earth civilizations existing or something.
- then the commitment wouldn't be to let just one earth per unaligned ASI continue, but to create more, and not cap them at a billion years.[1]
I think this is a case of humans having a deep intuition that there is only one instance of them, while also believing theory that implies otherwise, and not updating that 'deep intuition' while applying the theory even as it updates other beliefs (like the possibility for aligned ASIs from some earths to influence unaligned ones from other earths).
- ^
(to be clear, I'm not arguing for (1) or (2), and of course these are not the only possible things one can value, please do not clamp your values just because the only things humans seem to write about caring about are constrained)
i'm finally learning to prove theorems (the earliest ones following from the Peano axioms) in lean, starting with the natural number game. it is actually somewhat fun, the same kind of fun that mtg has by being not too big to fully comprehend, but still engaging to solve.
(if you want to 'play' it as well, i suggest first reading a bit about what formal systems are and interpretation before starting. also, it was not clear to me at first when the game was introducing axioms vs derived theorems, so i wondered how some operations (e.g 'induction') were allowed, but it turned out that and some others are just in the list of Peano axioms.)
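for a taste of what these proofs look like, here is a sketch in Lean 4 syntax of the classic early theorem that `0 + n = n` (the Natural Number Game uses its own `MyNat` type, so names differ slightly there; the theorem is primed below to avoid clashing with the existing `Nat.zero_add`). it needs induction because addition is defined by recursion on its second argument:

```lean
theorem zero_add' (n : Nat) : 0 + n = n := by
  induction n with
  | zero => rfl                        -- 0 + 0 = 0 holds by definition
  | succ k ih => rw [Nat.add_succ, ih] -- 0 + succ k = succ (0 + k) = succ k
```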
also, this reminded me of one of @Raemon's ideas (https://www.lesswrong.com/posts/PiPH4gkcMuvLALymK/exercise-solve-thinking-physics): 'how to prove a theorem' feels like a pure case of 'solving a problem that you (often) do not know how to solve', which iiuc they're a proponent of training on
It sounds like understanding functional decision theory might help you understand the parts you're confused about?
Like, would it go play the lottery (assuming money gives +utility for some reason) and pre-commit to pausing if it doesn't win?
Yes, it would try to do whatever the highest-possible-score thing is, regardless of how unlikely it is
Or that by setting a self-pausing policy it could alter E[result]?
By setting a self-pausing policy at the earliest point in time it can, yes. (Though I'm not sure if I'm responding to what you actually meant, or to some other thing that my mind also thinks can match to these words, because the intended meaning isn't super clear to me)
I'm conceptualizing a possible world as an (action,result) pair
(To be clear, I'm conceptualizing the agent as having Bayesian uncertainty about what world it's in, and this is what I meant when writing about "worlds in the agent's prior")
And, we could say - well, but it could fight back and then create a high-utility scenario - but then that would be the utility it would get if it doesn't end up paused, so it would get the high utility paused and again be indifferent.
An agent (aside from edge cases where it is programmed to be inconsistent in this way) would not have priors about what it will do which mismatch its policy for choosing what to actually do. Any change to the latter logically corresponds to the agent having a different prior about itself, so an attempt to follow this logic would recur infinitely (each time picking a new action in response to the prior's change, which in turn logically changes the prior, and so on). This seems like a case of 'subjunctive dependence' to me (even though it's a bit of an edge case of that, where the two logically-corresponding things - what action an agent will choose, and the agent's prior about what action it will choose - are both localized in the same agent), which is why functional decision theory seems relevant.
So, if it's a really low utility scenario where it won't end up being paused, then sure, it won't get much utility being paused, but since it won't get much utility if it doesn't end up being paused, why should it have a preference?
I think there must be some confusion here, but I'm having trouble understanding exactly what you mean.
Short answer: the scenario, or set of scenarios, where it is not paused, are dependent on what choice it makes, not locked in and independent of it; and it can choose what choice it makes, so it can pick whatever choice corresponds to the set of unpaused futures which score higher.
Longer original answer: When you write that there is one possible future in its prior where it does not get paused, and then write that this one future can be of lower than average, average, or higher than average utility, then - because there is only one such future (by construction) - this must mean lower/equal/higher compared to what the average score would be if the agent's policy were to resist being paused in such a situation. If so, then in the case where, conditional on its inaction, the score of that one possible unpaused future is lower than what the average score across possible unpaused futures would be conditional on its action, it would choose action.
(meta: Hmm, I am starting to understand why logical/mathematical syntax may be often used for this sort of thing, I can see why the above paragraph could be hard to read in natural language)
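The short answer above can be put as a toy calculation (all probabilities and utilities are hypothetical): the distribution over futures is conditional on the policy, so the agent simply picks the policy whose induced distribution scores higher in expectation.

```python
def expected_utility(dist):
    # dist: list of (probability, utility) pairs over the futures a policy induces
    return sum(p * u for p, u in dist)

# the futures, and their scores, depend on which policy is chosen -
# they are not locked in independently of the choice
futures = {
    "resist": [(0.8, 10.0), (0.2, 0.0)],  # mostly-unpaused, higher-scoring futures
    "allow":  [(1.0, 4.0)],               # the paused future's score
}
best = max(futures, key=lambda pol: expected_utility(futures[pol]))
```

Here `expected_utility` under "resist" is 8.0 versus 4.0 under "allow", so "resist" is chosen; with the numbers flipped, the same comparison would pick "allow".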