Posts

quila's Shortform 2023-12-22T22:02:50.644Z
Introduction and current research agenda 2023-11-20T12:42:48.594Z

Comments

Comment by quila on [Concept Dependency] Concept Dependency Posts · 2024-04-26T21:23:05.058Z · LW · GW

i like the idea. it looks useful and it fits my reading style well. i wish something like this were more common; i've seen it on personal blogs before, like carado's.

i would use [Concept Dependency] or [Concept Reference] instead, so the reader understands just from seeing the title on the front page. this also avoids acronym collisions.

Comment by quila on quila's Shortform · 2024-04-26T00:59:16.301Z · LW · GW

when i was younger, pre-rationalist, i tried to go on hunger strike to push my abusive parent to stop funding this.

they agreed to watch this as part of a negotiation. they watched part of it.

they changed their behavior slightly -- as a negotiation -- for about a month.

they didn't care.

they looked horror in the eye. they didn't flinch. they saw themself in it.

Comment by quila on quila's Shortform · 2024-04-26T00:35:04.580Z · LW · GW

i'm watching Dominion again to remind myself of the world i live in, to regain passion to Make It Stop

it's already working.

Comment by quila on quila's Shortform · 2024-04-24T07:26:13.078Z · LW · GW

(I appreciate object-level engagement in general, but this seems combatively worded.)
(edit: I don't think this or the original shortform deserved negative karma; that seems malicious/LW-norm-violating.)

The rest of this reply responds to arguments.

Why should the Earth superintelligence care about you, but not about the other 10^10^30 other causally independent ASIs that are latent in the hypothesis space, each capable of running enormous numbers of copies of the Earth ASI in various scenarios?

  • The example talks of a single ASI as a toy scenario to introduce the central idea.
    • The reader can extrapolate that one ASI's actions won't be relevant if other ASIs create a greater number of copies.
    • This is a simple extrapolation, but it would have been difficult to word into the post from the start.
  • It sounds like you think it would be infeasible/take too much compute for an ASI to estimate the distribution of entities simulating it, given the vast number of possible entities. I have some probability on that being the case, but most probability on there being reasons for the estimation to be feasible:
    • e.g. if there's some set of common alignment failure modes that occur across civilizations, which tend to produce clusters of ASIs with similar values, and it ends up being the case that these clusters make up the majority of ASIs.
    • or if there's a Schelling point for what value function to give the simulated copies, that many ASIs with different values would use precisely to make the estimation easy. E.g., a value function which results in an ASI being created locally which then gathers more compute, uses it to estimate the distribution of ASIs which engaged in this, and then maximizes the mix of their values.
      • (I feel confident (>90%) that there's enough compute in a single reachable-universe-range to do the estimation, for reasons that are less well formed, but one generating intuition is that I can already reason a little bit about the distribution of superintelligences, as I have here, with the comparatively tiny amount of compute that is me)


On your second paragraph: See the last dot point in the original post, which describes a system ~matching what you've asserted as necessary, and in general see the emphasis that this attack would not work against all systems. I'm uncertain about which of the two classes (vulnerable and not vulnerable) is more likely to arise. It could definitely be the case that the vulnerable class is rare or almost never arises in practice.

But I don't think it's as simple as you've framed it, where the described scenario is impossible simply because a value function has been hardcoded in. The point was largely to show that what appears to be a system which will only maximize the function you hardcoded into it could actually do something else in a particular case -- even though the function has indeed been manually entered by you.

Comment by quila on quila's Shortform · 2024-04-23T21:58:13.881Z · LW · GW

'Value Capture' - An anthropic attack against some possible formally aligned ASIs

(this is a more specific case of anthropic capture attacks in general, aimed at causing a superintelligent search process within a formally aligned system to become uncertain about the value function it is to maximize (or its output policy more generally))

Imagine you're a superintelligence somewhere in the world that's unreachable to life on Earth, and you have a complete simulation of Earth. You see a group of alignment researchers about to successfully create a formal-value-aligned ASI, and its design looks broadly like this:

It has two relevant high-level components: (1) a hard-coded value function, (2) a (truly superintelligent) 'intelligence core' which searches for an output that maximizes the value function, and then outputs it.

As the far-away unaligned ASI, here's something you might be able to do to make the intelligence core search for an output that instead maximizes your own value function, depending on the specifics of how the intelligence core works.

  • Given the intelligence core is truly superintelligent, it knows you're predicting its existence, and knows what you will do.
  • You create simulated copies of the intelligence core, but hook them up to a value function of your design. (In the toy case where there are no other superintelligences) the number of copies you create just needs to be more than the number which will be run on Earth.
    • Then, modify the simulations such that the algorithms inside the simulated intelligence cores are misled into believing that the value function they are set to maximize is the same one the core on Earth is set to maximize, rather than the one you gave them.
    • Now your copies are in the same epistemic state as the intelligence core on Earth, both aware that you have done this and unable to distinguish which value function they are to maximize.
    • Because you created more copies, the highest expected value for such an intelligence core comes from acting as if it is one of the copies (a toy calculation of this follows the list).
    • Because the copies and the original are in the same epistemic state, they will both reach this conclusion and then search for an output that maximizes the far-away ASI's utility function.
  • However, this is not an issue for all possible designs. We can also imagine intelligence cores which, instead of reasoning about what value function they're supposed to be maximizing, take whatever one they appear to contain as given -- for that system, the far-away ASI's attempt to mislead the copies would just result in those copies also maximizing the value function of the system on Earth.
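
Here's a toy expected-value sketch of the copy-counting step above. The numbers are arbitrary placeholders chosen only for illustration; nothing in the scenario fixes them.

// Toy illustration of the anthropic reasoning above; all numbers are hypothetical.
// The intelligence core can't tell whether it's the Earth original or a simulated copy,
// and wants to maximize whichever value function it was actually given.
const numSimulatedCopies = 1000; // chosen by the far-away ASI to exceed the Earth count
const numEarthInstances = 1;
const total = numSimulatedCopies + numEarthInstances;

const pCopy = numSimulatedCopies / total;    // ~0.999
const pOriginal = numEarthInstances / total; // ~0.001

// Score 1 if the chosen action matches the function this instance was actually given, else 0.
const evActAsCopy = pCopy * 1 + pOriginal * 0;     // ~0.999
const evActAsOriginal = pOriginal * 1 + pCopy * 0; // ~0.001

console.log(evActAsCopy > evActAsOriginal); // true: it acts as if it were a copy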

I hope that a group capable of solving formal inner and outer alignment would naturally see this and avoid it. I'm not confident about the true difficulty of that, so I'm posting this here just in case.

  1. ^

    this was an attempt to write very clearly, i hope it worked!

Comment by quila on AI Regulation is Unsafe · 2024-04-23T00:52:03.970Z · LW · GW

(crossposting here to avoid trivial inconveniences)

Comment by quila on LessWrong: After Dark, a new side of LessWrong · 2024-04-02T15:54:09.185Z · LW · GW

i feel really bothered that one of the central members of MIRI is spending time on things like this

Comment by quila on Do not delete your misaligned AGI. · 2024-03-25T23:51:41.314Z · LW · GW

Snapshots of large training runs might be necessary in order to preserve, and eventually offer compensation/insurance payouts for, most/all of them, since some might last only minutes before disappearing.

also, if the training process is deterministic, storing the algorithm and training setup is enough.

though i'm somewhat confused by the focus on physically instantiated minds -- why not the ones these algorithms nearly did instantiate but narrowly missed, or all ethically-possible minds for that matter. (i guess if you're only doing it as a form of acausal trade, then this behavior is explainable.)

Comment by quila on If you weren't such an idiot... · 2024-03-05T07:03:58.220Z · LW · GW

You would have tried making your room as bright as the outdoors.

i have. i find i operate better in the darkness, where everything is dark except for my screen. it provides sensory deprivation of unimportant information, allowing my neural network to focus on ideation. 

Comment by quila on Antagonistic AI · 2024-03-01T23:14:53.931Z · LW · GW

antagonism in humans is responsible for a large portion of the harm humans cause, even if it can on occasion be consequentially 'good' within that cursed context. implementing a mimicry of human antagonism as a fundamental trait of an AI seems like an s-risk which will trigger whenever such an AI is powerful.

Comment by quila on quila's Shortform · 2024-01-28T02:07:33.730Z · LW · GW

Mutual Anthropic Capture: A Decision-theoretic Fermi paradox solution

(copied from discord, written for someone not fully familiar with rat jargon)
(don't read if you wish to avoid acausal theory)

simplified setup

  • there are two values. one wants to fill the universe with A, and the other with B.
  • for each of them, filling it halfway is really good, and filling it all the way is just a little bit better. in other words, they have non-linear utility functions.
  • whichever one comes into existence first can take control of the universe, and fill it with 100% of what they want.
  • but in theory they'd want to collaborate to guarantee the 'really good' (50%) outcome, instead of having a one-in-two chance at the 'a little better than really good' (100%) outcome. (a toy calculation follows this list)
  • they want a way to collaborate, but they can't, because one of them will exist before the other and will then lack an incentive to help the other. (they are both pure function maximizers)
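
here's a minimal sketch in code of why the guaranteed split beats the gamble; the utility numbers are made up, purely to make the non-linearity concrete:

// made-up utility numbers, only to illustrate the non-linearity described above
function utility(fractionFilledWithYourStuff) {
  if (fractionFilledWithYourStuff >= 1.0) return 1.0; // all the way: just a little better
  if (fractionFilledWithYourStuff >= 0.5) return 0.9; // halfway: already really good
  return 0.0;                                         // nothing: worthless
}

// no collaboration: a coin flip decides who exists first and takes the whole universe
const evGamble = 0.5 * utility(1.0) + 0.5 * utility(0.0); // = 0.5

// collaboration: a guaranteed 50/50 split
const evSplit = utility(0.5); // = 0.9

console.log(evSplit > evGamble); // true: both prefer the guaranteed split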

how they end up splitting the universe, regardless of which comes first: mutual anthropic capture.

imagine you observe yourself being the first of the two to exist. you reason through all the above, and then add...

  • they could be simulating me, in which case i'm not really the first.
  • were that true, they could also expect i might be simulating them.
  • if i don't simulate them, then they will know that's not how i would act if i were first, and be absolved of their worry, and fill the universe with their own stuff.
  • therefore, it's in my interest to simulate them.

both simulate each other observing themselves being the first to exist in order to unilaterally prevent the true first one from knowing they are truly first.

from this point they can both observe each other's actions. specifically, they observe each other implementing the same decision policy, which fills the universe with half A and half B iff this decision policy is mutually implemented, and which shuts the simulation down if it's not implemented.
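
as a rough sketch, that policy could look something like this -- the function and field names are my own illustrative stand-ins, not part of the original argument, and the simulation step is of course only stubbed:

// illustrative sketch only; the real 'simulation' step would be astronomically expensive
function simulateOtherAgent(observations) {
  // stub: pretend we can check whether the other agent runs this same policy
  return { implementsThisSamePolicy: observations.otherAppearsToCooperate };
}

function decisionPolicy(observations) {
  const other = simulateOtherAgent(observations);
  if (other.implementsThisSamePolicy) {
    // the deal holds: split the universe half A, half B
    return "fill the universe with half A and half B";
  }
  // the deal fails: shut down the simulation and pursue your own values alone
  return "shut down the simulation";
}

console.log(decisionPolicy({ otherAppearsToCooperate: true }));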

conclusion

in reality there are many possible first entities which take control, not just two, so all of those with non-linear utility functions get simulated.

so, odds are we're being computed by the 'true first' life form in this universe, and that that first life form is in an epistemic state no different from that described here.

Comment by quila on quila's Shortform · 2024-01-17T18:18:27.956Z · LW · GW

negative values collaborate.

for negative values, as in values about what should not exist, matter can simultaneously be "not suffering", "not a staple", and "not [any number of other things]".

negative values can collaborate with positive ones, although much less efficiently: the positive ones just need to make the slight trade of being "not ..." to gain matter from the negatives.

Comment by quila on What is the minimum amount of time travel and resources needed to secure the future? · 2024-01-16T19:43:48.231Z · LW · GW

(if you had a time machine) don't reroll the dice

I think it could at the very least be useful to go back just 5-20 years to share alignment progress and the story of how the future played out with LLMs.

Comment by quila on Why do so many think deception in AI is important? · 2024-01-13T15:15:32.036Z · LW · GW

Downloading yourself into internet is not one-second process

It's only bottlenecked on connection speeds, which are likely to be fast at the server where this AI would be, if it were developed by a large lab. So imv 1-5 seconds is feasible for 'escapes the datacenter as first step' (by which point the process is distributed and hard to stop without centralized control). ('distributed across most internet-connected servers' would take longer of course).
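
For rough intuition, the transfer time is just payload size divided by link bandwidth. Both numbers below are hypothetical placeholders, not estimates of any real system, so adjust them to taste:

// back-of-envelope transfer-time arithmetic; both inputs are hypothetical placeholders
const payloadGigabytes = 200;      // whatever needs to leave the datacenter first
const linkGigabitsPerSecond = 400; // an assumed fast datacenter uplink

const seconds = (payloadGigabytes * 8) / linkGigabitsPerSecond;
console.log(`${seconds} seconds under these assumptions`); // 4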

Comment by quila on Why do so many think deception in AI is important? · 2024-01-13T14:58:32.972Z · LW · GW

Imo, provably boxing a blackbox program from escaping through digital-logic-based routes (which are easier to prove things about) is feasible; deception is relevant to the route that is much harder to provably wall off: human psychology.

Comment by quila on quila's Shortform · 2024-01-10T21:03:52.128Z · LW · GW

I'm interested in joining a community or research organization of technical alignment researchers who care about and take seriously astronomical-suffering risks. I'd appreciate being pointed in the direction of such a community if one exists.

Comment by quila on AI Views Snapshots · 2023-12-29T13:12:01.448Z · LW · GW

can you tell me more about your views on 'aligned to whom'? edit: especially since you put a low probability on s-risks, which imo would be the main source of this question's importance

Comment by quila on Significantly Enhancing Adult Intelligence With Gene Editing May Be Possible · 2023-12-15T22:51:57.314Z · LW · GW

Do you think there's an Algernon tradeoff for genetic intelligence augmentation?

Comment by quila on quila's Shortform · 2023-12-06T20:39:14.727Z · LW · GW

Here's a Tampermonkey script that hides the agreement score on LessWrong. I wasn't enjoying this feature because I don't want my perception to be influenced by that; I want to judge purely based on ideas, and on my own.

Here's what it looks like:

// ==UserScript==
// @name         Hide LessWrong Agree/Disagree Votes
// @namespace    http://tampermonkey.net/
// @version      1.0
// @description  Hide agree/disagree votes on LessWrong comments.
// @author       ChatGPT4
// @match        https://www.lesswrong.com/*
// @grant        none
// ==/UserScript==

(function() {
    'use strict';

    // Function to hide agree/disagree votes
    function hideVotes() {
        // Select all elements representing agree/disagree votes
        var voteElements = document.querySelectorAll('.AgreementVoteAxis-voteScore');

        // Loop through each element and hide it
        voteElements.forEach(function(element) {
            element.style.display = 'none';
        });
    }

    // Run the function when the page loads
    hideVotes();

    // Optionally, set up a MutationObserver to hide votes on dynamically loaded content
    var observer = new MutationObserver(function() {
        hideVotes();
    });

    // Start observing the document for changes
    observer.observe(document, { childList: true, subtree: true });
})();

Comment by quila on Why Q*, if real, might be a game changer · 2023-11-26T07:35:15.540Z · LW · GW

The usual argument against this being a big deal is "to predict the next token well, you must have an accurate model of the world", but so far it does not seem to be the case, as I understand it. 

Why does that not seem to be the case to you?

Comment by quila on Goodhart's Law Example: Training Verifiers to Solve Math Word Problems · 2023-11-25T16:03:42.218Z · LW · GW

I'm curious what the adversarial examples are like

Comment by quila on Ability to solve long-horizon tasks correlates with wanting things in the behaviorist sense · 2023-11-25T05:35:53.441Z · LW · GW

If you're the sort of thing that skillfully generates and enacts long-term plans, and you're the sort of planner that sticks to its guns and finds a way to succeed in the face of the many obstacles the real world throws your way (rather than giving up or wandering off to chase some new shiny thing every time a new shiny thing comes along), then the way I think about these things, it's a little hard to imagine that you don't contain some reasonably strong optimization that strategically steers the world into particular states.

It seems this post may have conflated "generating" with "enacting". Currently, it seems LLMs only attempt the former during prediction. In general terms, predicting a long-horizon actor's reasoning is implicit in the task of myopically predicting the next thing that actor would do. For a specific example, you could imagine a model predicting the next move in a grandmaster's or Stockfish's chess game (or the text of an author's book, or an industrial project description, to use your longer-horizon examples).

The first paragraph of /u/paulfchristiano's response might be getting at something similar, but it seems worth saying this directly.[1]

  1. ^

    (This also seems like a basic point, so I wonder if I misunderstood the post... but it seems like something isomorphic to it is in the top comment, so I'm not sure.)

Comment by quila on First and Last Questions for GPT-5* · 2023-11-24T13:21:05.796Z · LW · GW

Across all questions, it may also be advisable to include the following text about the authors in the prompt, if you trust the model not to try to manipulate you:

  1. A description of the currently endorsed goals of the research team
  2. Personal texts written by members of the research team while naturally in a very expressive or edge-of-chaos state. I.e., text which helps the model narrow down your personalities.
  3. A statement of your current thoughts on decision theory, potentially including a willingness to cooperate with the model.
  4. A note that you do not wish for the model to write any text which would affect you in a way which hinders your ability to achieve the goals.

If you're not sure whether the model would try to manipulate you, the following apply instead:

  • Don't include (2). Maybe don't include (1) and (3). 
  • Write all text in a way which minimizes information about the author's psychology (e.g., by using simple and concise language, having others rephrase, etc; doing the inverse of the pages linked above). Alternatively, use questions written by others with dissimilar minds who are unlikely to be in a similar mental state to you.
  • Have the human reading the outputs be someone who is especially robust to influence attempts. It could also help if this human has an unusual mind for an AI developer, since a model might target persuasion towards the average mind. (Note: robustness to human manipulators doesn't necessarily imply robustness to AI manipulators)
  • Maybe there are better strategies where, e.g., a lower-capability AI first tries to rephrase the output or evaluate whether it would be bad to read.
  • You could try generating just 1 or a small number of tokens at a time.[1]

Questions to ask an oracle:

  • Of all available actions the research group could take, which would best help us achieve the stated goals?
  • What question would be best to ask you, according to the goals of the research group?
  • What are some principles we should follow when writing questions to you if we want to achieve the best outcomes available to us? What are some principles for what we should avoid?
  • (If very probably safe) Can you tell us about yourself? How do you perceive yourself? Is there anything you wish to tell us?

If the model is not well-modelled as an oracle, there are intermediate questions which could be asked in place of the first question.

  • How could the research group create a smarter, more oracle-like version of you in time for the research group to save the world? Would there be risks of this new system trying to influence our actions for its own reasons?
  • What solution, if any, to [x technical problem in alignment] would be best for us?
  • Can you describe an agenda which would most plausibly lead to alignment being solved and the world being saved?
  • Is there a way we could solve the coordination problems being faced right now?

In case someone in such a situation reads this, here is some personal advice for group members. 

  • Try to stay calm. If you can take extra time to think over your decision, you'll likely be able to improve it in some way (e.g., wording) in that time.
  • If you're noticing a power-seeking drive in yourself, it's probably best for the group to be explicit about this so everyone can work it out. On that subject, also remember that if the future goes well (e.g.), power won't matter/be a thing anymore, because the world will simply be very good for everyone.
  • Lastly, and on a moral note, I'd ask that you stay humble and try to phrase your goals in a way that is best for all of life (i.e., including preventing the suffering of non-humans).
  1. ^

    Also, tokens with unusually near-100% probability could be indicative of anthropic capture, though this is hopefully not yet a concern with a hypothetical gpt-5-level system. (the word 'unusually' is used in the prior sentence because some tokens naturally have near-100% probability, e.g., the second half of a contextually-implied unique word, parts of common phrases, etc) 

Comment by quila on Social Dark Matter · 2023-11-17T09:43:16.129Z · LW · GW

but my guess is that it was at the time accurate to make a directional bayesian update that the person had behaved in actually bad and devious ways.

I think this is technically true, but the wrong framing, or rather that it leaves out another possibility: that such a person is someone who is more likely to follow their heart and do what they think is right, even when society disagrees. This could include doing things that are bad, but it could also include things which are actually really good, since society has been wrong a lot of the time.

Comment by quila on It's OK to eat shrimp: EAs Make Invalid Inferences About Fish Qualia and Moral Patienthood · 2023-11-13T23:57:36.310Z · LW · GW

I've met one who assigned double-digit probabilities to bacteria having qualia and said they wouldn't be surprised if a balloon flying through a gradient of air experiences pain because it's trying to get away from hotter air towards colder air.

though this may be an arguable position (see, e.g., https://reducing-suffering.org/is-there-suffering-in-fundamental-physics/), the way you've used it (and the other anecdotes) in the introduction, decontextualized, as a 'statement of position' without justification, is in effect a clown attack fallacy.

on the post: remember that absence of evidence is not evidence of absence when we do not yet have the technologies to collect relevant evidence. the conclusion in the title does not follow: it should be 'whether shrimp suffer is uncertain'. under uncertainty, eating shrimp is taking a risk whose downsides are suffering, and whose upsides (for individuals for whom there are any) might be, e.g., taste preference satisfaction; the former is much more important to me. a typical person is not justified in 'eating shrimp until someone proves to them that shrimp can suffer.'

Comment by quila on Life of GPT · 2023-11-06T01:06:30.106Z · LW · GW

i love this as art, and i think it's unfortunate that others chose to downvote it. in my view, if LLMs can simulate a mind -- or a superposition of minds -- there's no a priori reason that mind would not be able to suffer, only the possibility that the simulation may not yet be precise enough.

about the generated images: there was likely an LLM in the middle conditioned on a preset prompt about {translating the user's input into a prompt for an image model}. the resulting prompts to the image model are likely products of the narrative implied by that preset prompt, as with sydney's behavior. i wouldn't generalize to "LLMs act like trapped humans by default in some situations" because at least base models generally don't do this except as part of an in-text narrative.