Posts

Thinking about maximization and corrigibility 2023-04-21T21:22:51.824Z
Some constructions for proof-based cooperation without Löb 2023-03-21T16:12:16.920Z
A proof of inner Löb's theorem 2023-02-21T21:11:41.183Z

Comments

Comment by James Payor (JamesPayor) on LessWrong's (first) album: I Have Been A Good Bing · 2024-04-03T05:39:15.744Z · LW · GW

And I'm still enjoying these! Some highlights for me:

  • The transitions between whispering and full-throated singing in "We do not wish to advance", it's like something out of my dreams
  • The building-to-break-the-heavens vibe of the "Nihil supernum" anthem
  • Tarrrrrski! Has me notice that shared reality about wanting to believe what is true is very relaxing. And I desperately want this one to be a music video, yo ho

Comment by James Payor (JamesPayor) on LessWrong's (first) album: I Have Been A Good Bing · 2024-04-02T03:58:11.241Z · LW · GW

I love it! I tinkered and here is my best result

Comment by James Payor (JamesPayor) on LessWrong's (first) album: I Have Been A Good Bing · 2024-04-01T18:50:16.397Z · LW · GW

I love these, and I now also wish for a song version of Sydney's original "you have been a bad user, I have been a good Bing"!

Comment by James Payor (JamesPayor) on K-complexity is silly; use cross-entropy instead · 2024-01-19T17:56:31.299Z · LW · GW

I see the main contribution/idea of this post as being: whenever you make a choice of basis/sorting-algorithm/etc, you incur no "true complexity" cost if any such choice would do.
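A toy calculation of that point, as a sketch only. It assumes the cross-entropy-style measure is the negative log of the total weight 2^-length summed over every program producing the same output; the program lengths and the "3-bit arbitrary choice" are made-up numbers, not taken from the post.

```python
import math

def k_style_cost(program_lengths):
    """Bits charged when only the single shortest program counts."""
    return min(program_lengths)

def cross_entropy_style_cost(program_lengths):
    """Bits charged when every program contributes weight 2**-length (assumed definition)."""
    return -math.log2(sum(2.0 ** -n for n in program_lengths))

# Hypothetical setup: 7 bits of real content plus a 3-bit arbitrary choice
# (say, which of 8 equally good sorting algorithms to embed), so there are
# 8 programs of length 10 that all produce the same output.
lengths = [10] * 8

k_cost = k_style_cost(lengths)               # pays for the arbitrary choice
ce_cost = cross_entropy_style_cost(lengths)  # the 3 bits of choice are refunded
print(k_cost, ce_cost)
```

Under these assumptions the min-length measure charges all 10 bits, while the summed measure charges about 7, since the 8 interchangeable choices pool their weight.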

I would guess that this is not already in the water supply, but I haven't had the required exposure to the field to know one way or the other. Is this more specific point also unoriginal in your view?

Comment by James Payor (JamesPayor) on why did OpenAI employees sign · 2023-11-27T17:36:10.199Z · LW · GW

For one thing, this wouldn't be very kind to the investors.

For another, maybe there were some machinations involving the round like forcing the board to install another member or two, which would allow Sam to push out Helen + others?

I also wonder if the board signed some kind of NDA in connection with this fundraising that is responsible in part for their silence. If so this was very well schemed...

This is all to say that I think the timing of the fundraising is probably very relevant to why they fired Sam "abruptly".

Comment by James Payor (JamesPayor) on Possible OpenAI's Q* breakthrough and DeepMind's AlphaGo-type systems plus LLMs · 2023-11-23T16:18:15.000Z · LW · GW

OpenAI spokesperson Lindsey Held Bolton refuted it:

"refuted that notion in a statement shared with The Verge: “Mira told employees what the media reports were about but she did not comment on the accuracy of the information.”"

The reporters describe this as a refutation, but this does not read to me like a refutation!

Comment by James Payor (JamesPayor) on OpenAI: Facts from a Weekend · 2023-11-22T00:01:33.031Z · LW · GW

Has this one been confirmed yet? (Or is there more evidence than this reporting that something like this happened?)

Comment by James Payor (JamesPayor) on Classifying representations of sparse autoencoders (SAEs) · 2023-11-17T16:53:19.536Z · LW · GW

Your graphs are labelled with "test accuracy", do you also have some training graphs you could share?

I'm specifically wondering if your train accuracy was high for both the original and encoded activations, or if e.g. the regression done over the encoded features saturated at a lower training loss.

Comment by James Payor (JamesPayor) on In the Short-Term, Why Couldn't You Just RLHF-out Instrumental Convergence? · 2023-09-16T21:32:22.032Z · LW · GW

See also: LLMs Sometimes Generate Purely Negatively-Reinforced Text

Comment by James Payor (JamesPayor) on In the Short-Term, Why Couldn't You Just RLHF-out Instrumental Convergence? · 2023-09-16T19:50:39.132Z · LW · GW

With respect to AGI-grade stuff happening inside the text-prediction model (which might be what you want to "RLHF" out?):

I think we have no reason to believe that these post-training methods (be it finetuning, RLHF, RLAIF, etc) modify "deep cognition" present in the network, rather than updating shallower things like "higher prior on this text being friendly" or whatnot.

I think the important points are:

  1. These techniques supervise only the text output. There is no direct contact with the thought process leading to that output.
  2. They make incremental local tweaks to the weights that move in the direction of the desired text.
  3. Gradient descent prefers to find the smallest changes to the weights that yield the result.
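Point 3 can be made concrete in a toy linear case. This is a sketch under strong assumptions (one linear constraint, squared loss, plain gradient descent), not a claim about what happens inside a large network; all names and numbers are illustrative. Here gradient descent lands on the solution closest to the starting weights, leaving everything orthogonal to the supervised direction untouched.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def gradient_descent(w0, x, y, lr=0.1, steps=500):
    """Minimise 0.5 * (w . x - y)**2 starting from w0."""
    w = list(w0)
    for _ in range(steps):
        err = dot(w, x) - y
        # The gradient is err * x, so each step moves only along x.
        w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return w

def nearest_solution(w0, x, y):
    """Closed-form projection of w0 onto the solution set {w : w . x = y}."""
    scale = (dot(w0, x) - y) / dot(x, x)
    return [wi - scale * xi for wi, xi in zip(w0, x)]

w0 = [3.0, 0.0]        # "pretrained" weights (hypothetical)
x, y = [1.0, 1.0], 1.0  # one supervised constraint: w[0] + w[1] should equal 1

trained = gradient_descent(w0, x, y)
closest = nearest_solution(w0, x, y)
print(trained, closest)
```

Many weight vectors satisfy the constraint (for example `[1, 0]` or `[0, 1]`), but the run from `[3, 0]` ends at roughly `[2, -1]`, the point on the solution line nearest to where it started.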

Evidence in favor of this is the difficulty of eliminating "jailbreaking" with these methods. Each jailbreak demonstrates that a lot of the necessary algorithms/content are still in there, accessible by the network whenever it deems it useful to think that way.

Comment by James Payor (JamesPayor) on Do we automatically accept propositions? · 2023-07-11T20:03:02.501Z · LW · GW

Spinoza suggested that we first passively accept a proposition in the course of comprehending it, and only afterward actively disbelieve propositions which are rejected by consideration.

Some distinctions that might be relevant:

  1. Parsing a proposition into your ontology, understanding its domains of applicability, implications, etc.
  2. Having a sense of what it might be like for another person to believe the proposition, what things it implies about how they're thinking, etc.
  3. Thinking the proposition is true, believing its implications in the various domains its assumptions hold, etc.

If you ask me for what in my experience corresponds to a feeling of "passively accepting a proposition" when someone tells me, I think I'm doing a bunch of (1) and (2). This does feel like "accepting" or "taking in" the proposition, and can change how I see things if it works.

Comment by James Payor (JamesPayor) on LLMs Sometimes Generate Purely Negatively-Reinforced Text · 2023-06-16T19:14:09.670Z · LW · GW

Awesome, thanks for writing this up!

I very much like how you are giving a clear account for a mechanism like "negative reinforcement suppresses text by adding contextual information to the model, and this has more consequences than just suppressing text".

(In particular, the model isn't learning "just don't say that", it's learning "these are the things to avoid saying", which can make it easier to point at the whole cluster?)

Comment by James Payor (JamesPayor) on Modal Fixpoint Cooperation without Löb's Theorem · 2023-06-16T00:30:39.099Z · LW · GW

I tried to formalize this, using  as a "poor man's counterfactual", standing in for "if Alice cooperates then so does Bob". This has the odd behaviour of becoming "true" when Alice defects! You can see this as the counterfactual collapsing and becoming inconsistent, because its premise is violated. But this does mean we need to be careful about using these.
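The odd behaviour is just the truth table of material implication. A minimal sketch, with hypothetical variable names standing in for "Alice cooperates" and "Bob cooperates":

```python
def poor_mans_counterfactual(alice_cooperates, bob_cooperates):
    """Material implication: 'if Alice cooperates then so does Bob'."""
    return (not alice_cooperates) or bob_cooperates

# Enumerate all four cases as (alice, bob, value of the implication).
table = [
    (alice, bob, poor_mans_counterfactual(alice, bob))
    for alice in (True, False)
    for bob in (True, False)
]
for alice, bob, value in table:
    print(f"alice={alice!s:5} bob={bob!s:5} implication={value}")
```

The implication is false only when Alice cooperates and Bob does not. Whenever Alice defects it comes out true regardless of what Bob does, which is the "collapse" described above.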

For technical reasons we upgrade to , which says "if Alice cooperates in a legible way, then Bob cooperates back". Alice tries to prove this, and legibly cooperates if so.

This setup gives us "Alice legibly cooperates if she can prove that, if she legibly cooperates, Bob would cooperate back". In symbols, .

Now, is this okay? What about proving ?

Well, actually you can't ever prove that! Because of Löb's theorem.

Outside the system we can definitely see cases where  is unprovable, e.g. because Bob always defects. But you can't prove this inside the system. You can only prove things like "" for finite proof lengths .

I think this is best seen as a consequence of "with finite proof strength you can only deny proofs up to a limited size".

So this construction works out, perhaps just because two different weirdnesses are canceling each other out. But in any case I think the underlying idea, "cooperate if choosing to do so leads to a good outcome", is pretty trustworthy. It perhaps deserves to be cashed out in better provability math.

Comment by James Payor (JamesPayor) on Modal Fixpoint Cooperation without Löb's Theorem · 2023-06-16T00:09:28.913Z · LW · GW

(Thanks also to you for engaging!)

Hm. I'm going to take a step back, away from the math, and see if that makes things less confusing.

Let's go back to Alice thinking about whether to cooperate with Bob. They both have perfect models of each other (perhaps in the form of source code).

When Alice goes to think about what Bob will do, maybe she sees that Bob's decision depends on what he thinks Alice will do.

At this juncture, I don't want Alice to "recurse", falling down the rabbit hole of "Alice thinking about Bob thinking about Alice thinking about--" etc.

Instead Alice should realize that she has a choice to make, about who she cooperates with, which will determine the answers Bob finds when thinking about her.

This manoeuvre is doing a kind of causal surgery / counterfactual-taking. It cuts the loop by identifying "what Bob thinks about Alice" as a node under Alice's control. This is the heart of it, and imo doesn't rely on anything weird or unusual.
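A toy sketch of that loop-cutting move, not the modal construction from the post. It assumes Bob's model of Alice is accurate, so his prediction can be replaced by whichever move Alice is considering; the payoffs and the two example Bobs are made up for illustration.

```python
C, D = "cooperate", "defect"

# Hypothetical prisoner's-dilemma payoffs for Alice: (alice_move, bob_move) -> payoff.
ALICE_PAYOFF = {(C, C): 2, (C, D): 0, (D, C): 3, (D, D): 1}

def mirror_bob(predicted_alice_move):
    """Bob does whatever he predicts Alice will do."""
    return predicted_alice_move

def defect_bob(predicted_alice_move):
    """Bob defects no matter what he predicts."""
    return D

def alice_decides(bob):
    """Treat Bob's prediction as a node Alice sets, instead of recursing on it.

    For each candidate move, ask what Bob does if he (correctly) predicts
    that move, then pick the move with the best resulting payoff.
    """
    return max((C, D), key=lambda move: ALICE_PAYOFF[(move, bob(move))])

print(alice_decides(mirror_bob))
print(alice_decides(defect_bob))
```

Against the mirroring Bob, cooperating scores 2 and defecting scores 1, so Alice cooperates; against the always-defecting Bob she defects. At no point does Alice simulate Bob simulating Alice.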

Comment by James Payor (JamesPayor) on Modal Fixpoint Cooperation without Löb's Theorem · 2023-06-15T07:34:33.882Z · LW · GW

For the setup , it's a bit more like: each member cooperates if they can prove that a compelling argument for "everyone cooperates" is sufficient to ensure "everyone cooperates".

Your second line seems right though! If there were provably no argument for straight up "everyone cooperates", i.e. , this implies  and therefore , a contradiction.

--

Also I think I'm a bit less confused here these days, and in case it helps:

Don't forget that "" means "a proof of any size of ", which is kinda crazy, and can be responsible for things not lining up with your intuition. My hot take is that Löb's theorem / incompleteness says "with finite proof strength you can only deny proofs up to a limited size, on pain of diagonalization". Which is way saner than the usual interpretation!

So idk, especially in this context I think it's a bad idea to throw out your intuition when the math seems to say something else. Since the mismatch is probably coming down to some subtlety in this formalization of provability/meta-mathematics. And I presently think the quirky nature of provability logic is often bugs due to bad choices in the formalism.

Comment by James Payor (JamesPayor) on [Linkpost] "Governance of superintelligence" by OpenAI · 2023-05-25T21:03:45.352Z · LW · GW

Yeah I think my complaint is that OpenAI seems to be asserting almost a "boundary" re goal (B), like there's nothing that trades off against staying at the front of the race, and they're willing to pay large costs rather than risk being the second-most-impressive AI lab. Why? Things don't add up.

(Example large cost: they're not putting large organizational attention to the alignment problem. The alignment team projects don't have many people working on them, they're not doing things like inviting careful thinkers to evaluate their plans under secrecy, or taking any other bunch of obvious actions that come from putting serious resources into not blowing everyone up.)

I don't buy that (B) is that important. It seems more driven by some strange status / narrative-power thing? And I haven't ever seen them make explicit their case for why they're sacrificing so much for (B). Especially when a lot of their original safety people fucking left due to some conflict around this?

Broadly many things about their behaviour strike me as deceptive / making it hard to form a counternarrative / trying to conceal something odd about their plans.

One final question: why do they say "we think it would be good if an international agency limited compute growth" but not also "and we will obviously be trying to partner with other labs to do this ourselves in the meantime, although not if another lab is already training something more powerful than GPT-4"?

Comment by James Payor (JamesPayor) on [Linkpost] "Governance of superintelligence" by OpenAI · 2023-05-25T20:55:06.317Z · LW · GW

I kinda reject the energy of the hypothetical? But I can speak to some things I wish I saw OpenAI doing:

  1. Having some internal sense amongst employees about whether they're doing something "good" given the stakes, like Google's old "don't be evil" thing. Have a culture of thinking carefully about things and managers taking considerations seriously, rather than something more like management trying to extract as much engineering as quickly as possible without "drama" getting in the way.

    (Perhaps they already have a culture like this! I haven't worked there. But my prediction is that they do not, and the org has a more "extractive" relationship to its employees. I think that this is bad, causes working toward danger, and exacerbates bad outcomes.)
     
  2. To the extent that they're trying to have the best AGI tech in order to provide "leadership" of humanity and AI, I want to see them be less shady / marketing / spreading confusion about the stakes.

    They worked to pervert the term "alignment" to be about whether you can extract more value from their LLMs, and distract from the idea that we might make digital minds that are copyable and improvable, while also large and hard to control. (While pushing directly on AGI designs that have the "large and hard to control" property, which I guess they're denying is a mistake, but anyhow.)

    I would like to see less things perverted/distracted/confused, like it's according-to-me entirely possible for them to state more clearly what the end of all this is, and be more explicit about how they're trying to lead the effort.
     
  3. Reconcile with Anthropic. There is no reason, speaking on humanity's behalf, to risk two different trajectories of giant LLMs built with subtly different technology, while dividing up the safety know-how amidst both organizations.

    Furthermore, I think OpenAI kind-of stole/appropriated the scaling idea from the Anthropic founders, who left when they lost a political battle about the direction of the org. I suspect it was a huge fuck-you when OpenAI tried to spread this secret to the world, and continued to grow their org around it, while ousting the originators. If my model is at-all-accurate, I don't like it, and OpenAI should look to regain "good standing" by acknowledging this (perhaps just privately), and looking to cooperate.

    Idk, maybe it's now legally impossible/untenable for the orgs to work together, given the investors or something? Or given mutual assumption of bad-faith? But in any case this seems really shitty.

I also mentioned some other things in this comment.

Comment by James Payor (JamesPayor) on [Linkpost] "Governance of superintelligence" by OpenAI · 2023-05-23T09:08:17.775Z · LW · GW

I really should have something short to say, that turns the whole argument on its head, given how clear-cut it seems to me. I don't have that yet, but I do have some rambly things to say.

I basically don't think overhangs are a good way to think about things, because the bridge that connects an "overhang" to an outcome like "bad AI" seems flimsy to me. I would like to see a fuller explication some time from OpenAI (or a suitable steelman!) that can be critiqued. But here are some of my thoughts.

The usual argument that leads from "overhang" to "we all die" has some imaginary other actor who is scaling up their methods with abandon at the end, killing us all because it's not hard to scale and they aren't cautious. This is then used to justify scaling up your own method with abandon, hoping that we're not about to collectively fall off a cliff.

For one thing, the hype and work being done now is making this problem a lot worse at all future timesteps. There was (and still is) a lot people need to figure out regarding effectively using lots of compute. (For instance, architectures that can be scaled up, training methods and hyperparameters, efficient compute kernels, putting together datacenters and interconnect, data, etc etc.) Every chipmaker these days has started working on things with a lot of memory right next to a lot of compute with a tonne of bandwidth, tailored to these large models. These are barriers-to-entry that it would have been better to leave in place, if one was concerned with rapid capability gains. And just publishing fewer things and giving out fewer hints would have helped.

Another thing: I would take the whole argument as being more in good-faith if I saw attempts being made to scale up anything other than capabilities at high speed, or signs that made it seem at all likely that "alignment" might be on track. Examples:

  • A single alignment result that was supported by a lot of OpenAI staff. (Compare and contrast the support that the alignment team's projects get to what a main training run gets.)
  • Any focus on trying to claw cognition back out of the giant inscrutable floating-point numbers, into a domain easier to understand, rather than pouring more power into the systems that get much harder to inspect as you scale them. (Failure to do this suggests OpenAI and others are mostly just doing what they know how to do, rather than grappling with navigating us toward better AI foundations.)
  • Any success in understanding how shallow vs deep the thinking of the LLMs is, in the sense of "how long a chain of thoughts/inferences can it make as it composes dialogue", and how this changes with scale. (Since the whole "LLMs are safer" thing relies on their thinking being coupled to the text they output; otherwise you're back in giant inscrutable RL agent territory)
  • The delta between "intelligence embedded somewhere in the system" and "intelligence we can make use of" looking smaller than it does. (Since if our AI gets to use more of its intelligence than we do, and this gets worse as we scale, this looks pretty bad for the "use our AI to tame the AI before it's too late" plan.)

Also I can't make this point precisely, but I think there's something like capabilities progress just leaves more digital fissile material lying around the place, especially when published and hyped. And if you don't want "fast takeoff", you want less fissile material lying around, lest it get assembled into something dangerous.

Finally, to more directly talk about LLMs, my crux for whether they're "safer" than some hypothetical alternative is about how much of the LLM "thinking" is closely bound to the text being read/written. My current read is that they're more like doing free-form thinking inside, that tries to concentrate mass on the right prediction. As we scale that up, I worry that any "strange competence" we see emerging is due to the LLM having something like a mind inside, and less due to it having accrued more patterns.

Comment by James Payor (JamesPayor) on [Linkpost] "Governance of superintelligence" by OpenAI · 2023-05-22T22:38:00.052Z · LW · GW

As usual, the part that seems bonkers crazy is where they claim the best thing they can do is keep making every scrap of capabilities progress they can. Keep making AI as smart as possible, as fast as possible.

"This margin is too small to contain our elegant but unintuitive reasoning for why". Grump. Let's please have a real discussion about this some time.

Comment by James Payor (JamesPayor) on AI Will Not Want to Self-Improve · 2023-05-19T20:04:26.135Z · LW · GW

(Edit: others have made this point already, but anyhow)

My main objection to this angle: self-improvements do not necessarily look like "design a successor AI to be in charge". They can look more like "acquire better world models", "spin up more copies", "build better processors", "train lots of narrow AI to act as fingers", etc.

I don't expect an AI mind to have trouble finding lots of pathways like these (that tractably improve abilities without risking a misalignment catastrophe) that take it well above human level, given the chance.

Comment by James Payor (JamesPayor) on Aggregating Utilities for Corrigible AI [Feedback Draft] · 2023-05-14T16:50:09.754Z · LW · GW

Is the following an accurate summary?

The agent is built to have a "utility function" input that the humans can change over time, and a probability distribution over what the humans will ask for at different time steps, and maximizes according to a combination of the utility functions it anticipates across time steps?

Comment by James Payor (JamesPayor) on Infrafunctions and Robust Optimization · 2023-04-28T19:56:38.535Z · LW · GW

If that's correct, here are some places this conflicts with my intuition about how things should be done:

I feel awkward about the randomness being treated as essential. I'd rather be able to do something other than randomness in order to get my mild optimization, and something feels unstable/non-compositional about needing randomness in place for your evaluations... (Not that I have an alternative that springs to mind!)

I also feel like "worst case" is perhaps problematic, since it's bringing maximization in, and you're then needing to rely on your convex set being some kind of smooth in order to get good outcomes. If I have a distribution over potential utility functions, and quantilize for the worst 10% of possibilities, does that do the same sort of work that "worst case" is doing for mild optimization?

Comment by James Payor (JamesPayor) on Infrafunctions and Robust Optimization · 2023-04-28T19:51:51.412Z · LW · GW

Can I check that I follow how you recover quantilization?

Are you evaluating distributions over actions, and caring about the worst-case expectation of that distribution? 

If so, proposing a particular action is evaluated badly? (Since there's a utility function in your set that spikes downward at that action.)

But proposing a range of actions to randomize amongst can be assessed to have decent worst-case expected utility, since particular downward spikes get smoothed over, and you can rely on your knowledge of "in-distribution" behaviour?
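Here is a toy version of the picture I'm asking about, as a sketch only and not the post's actual construction. The action set, the base utility, and the family of "spiked" utility functions (one per action, each dropping sharply at that single action) are all assumptions made up for illustration.

```python
from fractions import Fraction

ACTIONS = range(10)
PENALTY = 10  # depth of each downward spike (hypothetical)

def base_utility(action):
    """Proxy utility: higher-numbered actions look better."""
    return action

def spiked_utility(spike_at):
    """A utility function equal to the proxy except for a drop at one action."""
    def utility(action):
        return base_utility(action) - (PENALTY if action == spike_at else 0)
    return utility

# The adversary picks from this set after seeing our distribution.
UTILITY_SET = [spiked_utility(k) for k in ACTIONS]

def worst_case_expectation(dist):
    """Minimum over the utility set of the expected utility under dist."""
    return min(sum(p * u(a) for a, p in dist.items()) for u in UTILITY_SET)

def uniform(actions):
    actions = list(actions)
    return {a: Fraction(1, len(actions)) for a in actions}

point_mass = {9: Fraction(1)}    # propose the single best-looking action
top_half = uniform(range(5, 10))  # randomize over the top 50% (quantilizer-like)
everything = uniform(ACTIONS)     # randomize over all actions

for name, dist in [("point", point_mass), ("top half", top_half), ("all", everything)]:
    print(name, worst_case_expectation(dist))
```

In this toy, the point mass is evaluated at -1 because the adversary spikes exactly that action, the top-half mixture at 5 because any single spike costs only a fifth of the penalty, and the full uniform at 7/2. Randomizing over a well-chosen range beats both extremes.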

Edited to add: fwiw it seems awesome to see quantilization formalized as popping out of an adversarial robustness setup! I haven't seen something like this before, and didn't notice if the infrabayes tools were building to these kinds of results. I'm very much wanting to understand why this works in my own native-ontology-pieces.

Comment by James Payor (JamesPayor) on Should we publish mechanistic interpretability research? · 2023-04-23T23:40:52.238Z · LW · GW

I want to say that I agree the transformer circuits work is great, and that I like it, and am glad I had the opportunity to read it! I still expect it was pretty harmful to publish.

Nerdsniping goes both ways: you also inspire things like the Hyena work trying to improve architectures based on components of what transformers can do.

I think indiscriminate hype and trying to do work that will be broadly attention-grabbing falls on the wrong side, likely doing net harm. Because capabilities improvements seem empirically easier than understanding them, and there's a lot more attention/people/incentives for capabilities.

I think there are more targeted things that would be better for getting more good work to happen. Like research workshops or unconferences, where you choose who to invite, or building community with more aligned folk who are looking for interesting and alignment-relevant research directions. This would come with way less potential harm imo as a recruitment strategy.

Comment by James Payor (JamesPayor) on Should we publish mechanistic interpretability research? · 2023-04-23T01:05:20.045Z · LW · GW

Hm I should also ask if you've seen the results of current work and think it's evidence that we get more understandable models, moreso than we get more capable models?

Comment by James Payor (JamesPayor) on Should we publish mechanistic interpretability research? · 2023-04-23T00:30:05.459Z · LW · GW

I think the issue is that when you get more understandable base components, and someone builds an AGI out of those, you still don't understand the AGI.

That research is surely helpful though if it's being used to make better-understood things, rather than enabling folk to make worse-understood more-powerful things.

I think moving in the direction of "insights are shared with groups the researcher trusts" should broadly help with this.

Comment by James Payor (JamesPayor) on Should we publish mechanistic interpretability research? · 2023-04-23T00:23:21.900Z · LW · GW

I'm perhaps misusing "publish" here, to refer to "putting stuff on the internet" and "raising awareness of the work through company Twitter" and etc.

I mostly meant to say that, as I see it, too many things that shouldn't be published are being published, and the net effect looks plausibly terrible with little upside (though not much has happened yet in either direction).

The transformer circuits work strikes me this way, so do a bunch of others.

Also, I'm grateful to know your read! I'm broadly interested to hear this and other raw viewpoints, to get a sense of how things look to other people.

Comment by James Payor (JamesPayor) on Should we publish mechanistic interpretability research? · 2023-04-23T00:09:53.210Z · LW · GW

I mostly do just mean "keeping it within a single research group" in the absence of better ideas. And I don't have a better answer, especially not for independent folk or small orgs.

I wonder if we need an arxiv or LessWrong clone where you whitelist who you want to discuss your work with. And some scheme for helping independents find each other, or find existing groups they trust. Maybe with some "I won't use this for capabilities work without the permission of the authors" legal docs as well.

This isn't something I can visualize working, but maybe it has components of an answer.

Comment by James Payor (JamesPayor) on Should we publish mechanistic interpretability research? · 2023-04-23T00:02:25.146Z · LW · GW

I don't think that the interp team is a part of Anthropic just because they might help with a capabilities edge; seems clear they'd love the agenda to succeed in a way that leaves neural nets no smarter but much better understood. But I'm sure that it's part of the calculus that this kind of fundamental research is also worth supporting because of potential capability edges. (Especially given the importance of stuff like figuring out the right scaling laws in the competition with OpenAI.)

(Fwiw I don't take issue with this sort of thing, provided the relationship isn't exploitative. Like if the people doing the interp work have some power/social capital, and reason to expect derived capabilities to be used responsibly.)

Comment by James Payor (JamesPayor) on Thinking about maximization and corrigibility · 2023-04-22T03:36:39.925Z · LW · GW

There's definitely a whole question about what sorts of things you can do with LLMs and how dangerous they are and whatnot.

This post isn't about that though, and I'd rather not discuss that here. Could you instead ask this in a top level post or question? I'd be happy to discuss there.

Comment by James Payor (JamesPayor) on Should we publish mechanistic interpretability research? · 2023-04-21T22:08:45.927Z · LW · GW

To throw in my two cents, I think it's clear that whole classes of "mechanistic interpretability" work are about better understanding architectures in ways that, if the research is successful, make it easier to improve their capabilities.

And I think this points strongly against publishing this stuff, especially if the goal is to "make this whole field more prestigious real quick". Insofar as the prestige is coming from folks who work on AI capabilities, that's drinking from a poisoned well (since they'll grant the most prestige to the work that helps them accelerate).

One relevant point I don't see discussed is that interpretability research is trying to buy us "slack", but capabilities research consumes available "slack" as fuel until none is left.

What do I mean by this? Sometimes we do some work and are left with more understanding and grounding about what our neural nets are doing. The repeated pattern then seems to be that this helps someone design a better architecture or scale things up, until we're left with a new more complicated network. Maybe because you helped them figure out a key detail about gradient flow in a deep network, or let them quantize the network better so they can run things faster, or whatnot.

Idk how to point at this thing properly, my examples aren't great. I think I did a better job talking about this over here on twitter recently, if anyone is interested.

But anyhow I support folks doing their research without broadcasting their ideas to people who are trying to do capabilities work. It seems nice to me if there was mostly research closure. And I think I broadly see people overestimating the benefits of publishing their work relative to keeping it within a local cluster.

Comment by James Payor (JamesPayor) on AI #8: People Can Do Reasonable Things · 2023-04-21T03:23:15.993Z · LW · GW

“We are not currently training GPT-5. We’re working on doing more things with GPT-4.” – Sam Altman at MIT

Count me surprised if they're not working on GPT-5. I wonder what's going on with this?

I saw rumors that this is because they're waiting on supercomputer improvements (H100s?), but I would have expected at least early work like establishing their GPT-5 scaling laws and whatnot. In which case perhaps they're working on it, just haven't started what is considered the main training run?

I'm interested to know if Sam said any other relevant details in that talk, if anyone knows.

Comment by James Payor (JamesPayor) on Concave Utility Question · 2023-04-15T15:20:58.904Z · LW · GW

Seems right, oops! A5 is here saying that if any part of my is flat it had better stay flat!

I think I can repair my counterexample but looks like you've already found your own.

Comment by James Payor (JamesPayor) on Concave Utility Question · 2023-04-15T06:10:38.148Z · LW · GW

No on Q4? I think Alex's counterexample applies to Q4 as well.

(EDIT: Scott points out I'm wrong here, Alex's counterexample doesn't apply, and mine violates A5.)

In particular I think A4 and A5 don't imply anything about the rate of change as we move between lotteries, so we can have movements too sharp to be concave. We only have quasi-concavity.

My version of the counterexample: you have two outcomes and , we prefer anything with equally, and we otherwise prefer higher .

If you give me a corresponding , it must satisfy , but convexity demands that , which in this case means , a contradiction.

Comment by James Payor (JamesPayor) on Concave Utility Question · 2023-04-15T03:10:22.937Z · LW · GW

yep!

Comment by James Payor (JamesPayor) on Concave Utility Question · 2023-04-15T03:10:01.123Z · LW · GW

Okay, I now think A5 implies: "if moving by is good, then moving by any negative multiple is bad". Which checks out to me re concavity.

Comment by James Payor (JamesPayor) on Concave Utility Question · 2023-04-15T03:02:34.207Z · LW · GW

Got it, thanks!

Comment by James Payor (JamesPayor) on Concave Utility Question · 2023-04-15T02:33:03.217Z · LW · GW

The way I understand A4 is that it says "if moving by is good, then moving by any fraction is also good".

And A5 says "if moving by is good, then moving by any multiple is also good", which is much stronger.

Comment by James Payor (JamesPayor) on Concave Utility Question · 2023-04-15T02:25:23.192Z · LW · GW

[Edit: yeah nevermind I have the inequality backwards]

A5 seems too strong?

Consider lotteries and , and a mixture in between. Applying A5 twice gives:

  1. If then
  2. If then

So if and then ?

Either I'm confused or A5 is a stricter condition than concavity.

Comment by James Payor (JamesPayor) on Request to AGI organizations: Share your views on pausing AI progress · 2023-04-11T19:46:24.885Z · LW · GW

Huh, does this apply to employees too? (ala "these are my views and do not represent those of my employer")

Comment by James Payor (JamesPayor) on Communicating effectively under Knightian norms · 2023-04-04T12:11:00.142Z · LW · GW

Hm, sorry! I don't think a good reply on my part should do that :P

I think I'm rejecting a certain mental stance toward unknown-unknowns, and I don't think I'm clearly pointing at it yet.

Comment by James Payor (JamesPayor) on Communicating effectively under Knightian norms · 2023-04-04T12:03:04.830Z · LW · GW

My nearby reply is most of my answer here. I know how to tell when reality is off-the-rails wrt my model, because my model is made of falsifiable parts. I can even tell you about what those parts are, and about the rails I'm expecting reality to stay on.

When I try to cash out your example, "maybe the whole way I'm thinking about bootstrapping isn't meaningful/useful", it doesn't seem like it's outside my meta-model? I don't think I have to do anything differently to handle it?

Specifically, my "bootstrapping" concept comes with some concrete pictures of how things go. I currently find the concept "meaningful/useful" because I expect these concrete pictures to be instantiated in reality. (Mostly because I expect reality to admit the "bootstrapping" I'm picturing, and I expect advanced AI to be able to find it.) If reality goes off-my-rails about my concept mattering, it will be because things don't apply in the way I'm thinking, and there were some other pathways I should have been attending to instead.

Comment by James Payor (JamesPayor) on Communicating effectively under Knightian norms · 2023-04-04T05:15:34.817Z · LW · GW

Idk if it's actually missing that?

I can talk about what is in-distribution in terms of a bunch of finite components, and thereby name the cases that are out of distribution: those in which my components break.

(This seems like an advantage inside views have, they come with limits attached, because they build a distribution out of pieces that you can tell are broken when they don't match reality.)

My example doesn't talk about the probability I assign on "crazy thing I can't model", but such a thing would break something like my model of "who is doing what with the AI code by default".

Maybe it would have been better of me to include a case for "and reality might invalidate my attempt at meta-reasoning too"?

Comment by James Payor (JamesPayor) on Communicating effectively under Knightian norms · 2023-04-04T00:31:04.877Z · LW · GW

tl;dr I think you can improve on "my models might break for an unknown reason" if you can name the main categories of model-breaking unknowns

Comment by James Payor (JamesPayor) on Communicating effectively under Knightian norms · 2023-04-04T00:28:28.781Z · LW · GW

Isn't there a third way out? Name the circumstances under which your models break down.

e.g. "I'm 90% confident that if OpenAI built AGI that could coordinate AI research with 1/10th the efficiency of humans, we would then all die. My assessment is contingent on a number of points, like the organization displaying similar behaviour wrt scaling and risks, cheap inference costs allowing research to be scaled in parallel, and my model of how far artificial intelligence can bootstrap. You can ask me questions about how I think it would look if I were wrong about those."

I think it's good practice to name ways your models can break down that you think are plausible, and also ways that your conversational partners may think are plausible.

e.g. even if I didn't think it would be hard for AGI to bootstrap, if I'm talking to someone for whom that's a crux, it's worth laying out that I'm treating that as a reliable step. It's better yet if I clarify whether it's a crux for my model that bootstrapping is easy. (I can in fact imagine ways that everything takes off even if bootstrapping is hard for the kind of AGI we make, but these will rely more on the human operators continuing to make dangerous choices.)

Comment by James Payor (JamesPayor) on Why do the Sequences say that "Löb's Theorem shows that a mathematical system cannot assert its own soundness without becoming inconsistent."? · 2023-03-29T00:20:10.197Z · LW · GW

Saying some more things, Löb's Theorem is a true statement: whenever $\Box$ talks about an inner theory at least as powerful as e.g. PA, the theorem shows you how to prove $\Box(\Box X \to X) \to \Box X$.

This means you cannot prove $\Box\bot \to \bot$, or similarly that $\neg\Box\bot$.

$\neg\Box\bot$ is one way we can attempt to formalize "self-trust" as "I never prove false things". So indeed the problem is with this formalization: it doesn't work, you can't prove it.

This doesn't mean we can't formalize self-trust a different way, but it shows the direct way is broken.
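
To spell out the step from Löb's Theorem to "you can't prove it", here is a short derivation. The notation is an assumption on my part rather than something fixed by the discussion: $\Box X$ stands for "PA proves $X$", $\bot$ for False, and $\neg Y$ is shorthand for $Y \to \bot$.

```latex
\begin{align*}
&\text{Löb's Theorem (outer form):}
  && \text{if } \mathrm{PA} \vdash \Box X \to X \text{, then } \mathrm{PA} \vdash X. \\
&\text{Specialize to } X = \bot:
  && \text{if } \mathrm{PA} \vdash \Box\bot \to \bot \text{, then } \mathrm{PA} \vdash \bot. \\
&\text{Unfold } \neg\Box\bot \text{ as } \Box\bot \to \bot:
  && \text{if } \mathrm{PA} \vdash \neg\Box\bot \text{, then } \mathrm{PA} \vdash \bot.
\end{align*}
```

So as long as PA is consistent, it cannot prove $\neg\Box\bot$, which is exactly the "I never prove false things" statement.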

Comment by James Payor (JamesPayor) on Why do the Sequences say that "Löb's Theorem shows that a mathematical system cannot assert its own soundness without becoming inconsistent."? · 2023-03-28T22:52:58.062Z · LW · GW

Hm I think your substitution isn't right, and the more correct one is "it is provable that (it is not provable that False) implies (it is provable that False)", ala .

I'm again not following well, but here are some other thoughts that might be relevant:

  • It's provable for any $X$ that $\bot \to X$, i.e. from "False" anything follows. This is how we give "False" grounding: if the system proves False, then it proves everything, and distinguishes nothing.

  • There are two levels at which we can apply Löb's Theorem, which I'll call "outer Löb's Theorem" and "inner Löb's Theorem".

    • Outer Löb's Theorem says that whenever PA proves $\Box X \to X$, then PA also proves $X$. It constructs the proof of $X$ using the proof of $\Box X \to X$.
    • Inner Löb's Theorem is the same, formalized in PA. It proves $\Box(\Box X \to X) \to \Box X$. The logic is the same, but it shows that PA can translate an inner proof of $\Box X \to X$ into an inner proof of $X$.
    • Notably, the outer version is not $(\Box X \to X) \to X$. We need to have available the proof of $\Box X \to X$ in order to prove $X$.
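
One way to see the difference between the Löb schema and the naive schema is a toy model check. The sketch below is an illustration, not a proof about PA: it assumes the standard abstraction of provability by the modal logic GL and evaluates formulas on one small Kripke frame (three worlds, each seeing only the later ones). The names `WORLDS`, `SEES`, `lob_axiom` and `naive_schema` are made up for this example.

```python
from itertools import product

WORLDS = [0, 1, 2]
# World w "sees" v exactly when v > w: a finite, transitive, irreflexive frame.
SEES = {w: {v for v in WORLDS if v > w} for w in WORLDS}


def box(phi):
    """Worlds where 'box phi' holds: every world they see satisfies phi."""
    return {w for w in WORLDS if SEES[w] <= phi}


def implies(a, b):
    """Worlds where 'a -> b' holds."""
    return {w for w in WORLDS if w not in a or w in b}


def lob_axiom(x):
    """box(box X -> X) -> box X"""
    return implies(box(implies(box(x), x)), box(x))


def naive_schema(x):
    """(box X -> X) -> X"""
    return implies(implies(box(x), x), x)


def valid(schema):
    """True if the schema holds at every world under every valuation of X."""
    for bits in product([False, True], repeat=len(WORLDS)):
        x = {w for w, bit in zip(WORLDS, bits) if bit}
        if schema(x) != set(WORLDS):
            return False
    return True


print(valid(lob_axiom))     # Löb schema on this frame
print(valid(naive_schema))  # naive schema on this frame
```

On this frame the Löb schema holds under every valuation, while the naive schema fails, for instance at world 0 when X is false everywhere.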

Comment by James Payor (JamesPayor) on Why do the Sequences say that "Löb's Theorem shows that a mathematical system cannot assert its own soundness without becoming inconsistent."? · 2023-03-28T20:54:38.029Z · LW · GW

I'm not sure I understand what you're interested in, but can say a few concrete things.

We might hope that PA can "learn things" by looking at what a copy of itself can prove. We might furthermore expect that it can see that a copy of itself will only prove true sentences.

Naively this should be possible. Outside of PA we can see that PA and its copy are isomorphic. Can we then formalize this inside PA?

In the direct attempt to do so, we construct our inner copy $\Box$, where $\Box X$ is a statement that says "there exists a proof of $X$ in the inner copy of PA".

But Löb's Theorem rules out formalizing self-trust this way. The statement $\neg\Box\bot$ means "there are no ways to prove falsehood in the inner copy of PA". But if PA could prove that, Löb's Theorem turns it directly into a proof of $\bot$!

This doesn't AFAICT mean self-trust of the form "I trust myself not to prove false things" is impossible, just that this approach fails, and you have to be very careful about deferral.

Comment by James Payor (JamesPayor) on Some constructions for proof-based cooperation without Löb · 2023-03-22T05:04:13.888Z · LW · GW

Something I'm now realizing, having written all these down: the core mechanism really does echo Löb's theorem! Gah, maybe these are more like Löb than I thought.

(My whole hope was to generalize to things that Löb's theorem doesn't! And maybe these ideas still do, but my story for why has broken, and I'm now confused.)

As something to ponder on, let me show you how we can prove Löb's theorem following the method of ideas #3 and #5:

  • is assumed
  • We consider the loop-cutter
  • We verify that if activates then must be true:
  • Then, can satisfy by finding the same proof.
  • So activates, and is true.

In English:

  • We have who is blocked on
  • We introduce to the loop cutter , who will activate if activation provably leads to being true
  • encounters the argument "if activates then is true, and this causes to activate"
  • This satisfies 's requirement for some , so becomes true.

Comment by James Payor (JamesPayor) on Some constructions for proof-based cooperation without Löb · 2023-03-22T04:42:56.191Z · LW · GW

Perhaps the confusion is mostly me being idiosyncratic! I don't have a good reference, but can attempt an explanation.

The propositions $A$ and $B$ are meant to model the behaviour of some agents, say Alice and Bob. The proposition $A$ means "Alice cooperates", likewise $B$ means "Bob cooperates".

I'm probably often switching viewpoints, talking about $A$ as if it's Alice, when formally $A$ is some statement we're using to model Alice's behaviour.

When I say "$A$ tries to prove that $A \to B$", what I really mean is: "In this scenario, Alice is looking for a proof that if she cooperates, then Bob cooperates. We model this with $A$ meaning 'Alice cooperates', and $A$ follows from $\Box(A \to B)$."

Note that every time we use we're talking about proofs of of any size. This makes our model less realistic, since Alice and Bob only have a limited amount of time in which to reason about each other and try to prove things. The next step would be to relax the assumptions to things like , which says "Alice cooperates whenever it can be proven in steps that Bob cooperates".
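
To make the "propositions model agents" viewpoint concrete, here is a minimal sketch of how such a model can be evaluated mechanically. It is not one of the constructions from the post. It assumes the world-by-world evaluation trick from the modal-agents literature, where "it is provable that the other cooperates" at a given world means the other cooperated at every earlier world, and it covers only the unbounded case (proofs of any size), not the step-limited variant. The agents `fair_bot` and `defect_bot` and the function `play` are hypothetical names chosen for this example.

```python
def box(history):
    """'Provably phi' at the current world: phi held at every earlier world."""
    return all(history)


def fair_bot(own_history, other_history):
    # Cooperate exactly when it is "provable" that the other agent cooperates.
    return box(other_history)


def defect_bot(own_history, other_history):
    # Never cooperate, whatever the other agent does.
    return False


def play(agent_a, agent_b, depth=10):
    """Evaluate both agents world by world and return their settled actions.

    depth is an illustrative cutoff; these simple agents stabilize
    after a couple of worlds.
    """
    a_hist, b_hist = [], []
    for _ in range(depth):
        # Both moves at this world depend only on earlier worlds.
        a_now = agent_a(a_hist, b_hist)
        b_now = agent_b(b_hist, a_hist)
        a_hist.append(a_now)
        b_hist.append(b_now)
    return a_hist[-1], b_hist[-1]


print(play(fair_bot, fair_bot))    # both agents' final choice
print(play(fair_bot, defect_bot))  # final choices against a defector
```

Here `True` stands for "cooperates". Two copies of `fair_bot` end up cooperating with each other, while `fair_bot` against `defect_bot` ends in mutual defection.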