Posts

Tim Berners-Lee found it hard to explain the web 2023-04-10T13:33:49.845Z
Tor Økland Barstad's Shortform 2023-03-24T16:41:58.602Z
Alignment with argument-networks and assessment-predictions 2022-12-13T02:17:00.531Z
Making it harder for an AGI to "trick" us, with STVs 2022-07-09T14:42:29.803Z
Getting from an unaligned AGI to an aligned AGI? 2022-06-21T12:36:13.928Z

Comments

Comment by Tor Økland Barstad (tor-okland-barstad) on continue working on hard alignment! don't give up! · 2023-11-19T05:47:56.838Z · LW · GW

As far as I understand, MIRI did not assume that we're just able to give the AI a utility function directly. 

I'm a bit unsure about how to interpret you here.

In my original comment, I used terms such as positive/optimistic assumptions and simplifying assumptions. When doing that, I meant to refer to simplifying assumptions that were made so as to abstract away some parts of the problem.

The Risks from Learned Optimization paper was written mainly by people from MIRI!

Good point (I should have written my comment in such a way that pointing this out didn't feel necessary).
 

Other things like Ontological Crises and Low Impact sort of assume you can get some info into the values of an agent

I guess this is more central to what I was trying to communicate than whether it is expressed in terms of a utility function per se.

In this tweet, Eliezer writes:

"The idea with agent foundations, which I guess hasn't successfully been communicated to this day, was finding a coherent target to try to get into the system by any means (potentially including DL ones)."

Based on e.g. this talk from 2016, I get the sense that when he says "coherent target" he means targets that relate to the non-digital world. But perhaps that's not the case (or perhaps it's sort of the case, but more nuanced).

Maybe I'm making this out to have been a bigger part of their work than what actually was the case.
 

Comment by Tor Økland Barstad (tor-okland-barstad) on Evaluating the historical value misspecification argument · 2023-10-06T14:38:46.895Z · LW · GW

Thanks for the reply :) I'll try to convey some of my thinking, but I don't expect great success. I'm working on more digestible explainers, but this is a work in progress, and I have nothing good that I can point people to as of now.

(...) part of the explanation here might be "if the world is solved by AI, we do actually think it will probably be via doing some concrete action in the world (e.g., build nanotech), not via helping with alignment (...)

Yeah, I guess this is where a lot of the differences in our perspective are located.

if the world is solved by AI, we do actually think it will probably be via doing some concrete action in the world (e.g., build nanotech)

Things have to cash out in terms of concrete actions in the world. Maybe a point of contention is the level of indirection we imagine in our heads (by which we try to obtain systems that can help us do concrete actions).

Prominent in my mind are scenarios that involve a lot of iterative steps (but over a short amount of time) before we start evaluating systems by doing AGI-generated experiments. In the earlier steps, we avoid doing any actions in the "real world" that are influenced in a detailed way by AGI output, and we avoid having real humans be exposed to AGI-generated argumentation.

Examples of stuff we might try to obtain:

  • AGI "lie detector techniques" (maybe something that is in line with the ideas of Collin Burns)
  • Argument/proof evaluators (this is an interest of mine, but making better explainers is still a work in progress, and I have some way to go)

If we are good at program-search, this can itself be used to obtain programs that help us be better at program-search (finding functions that score well according to well-defined criteria).
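To make that bootstrapping idea a bit more concrete, here is a toy sketch in Python (everything here - the expression-strings, the scoring criterion, the candidate generators - is purely illustrative, not anything canonical):

```python
def score(program: str, reference, xs) -> float:
    """Well-defined criterion: negative total error of the program against the reference."""
    try:
        return -sum(abs(eval(program, {"x": x}) - reference(x)) for x in xs)
    except Exception:
        return float("-inf")

def program_search(candidates, reference, xs):
    """Brute-force search: return the highest-scoring candidate program."""
    return max(candidates, key=lambda p: score(p, reference, xs))

# Base level: find a program approximating some target behavior.
reference = lambda x: 3 * x + 1
xs = range(-5, 6)

def linear_candidates():
    return [f"{a}*x + {b}" for a in range(-4, 5) for b in range(-4, 5)]

def quadratic_candidates():
    return [f"{a}*x*x + {b}" for a in range(-4, 5) for b in range(-4, 5)]

best_program = program_search(linear_candidates(), reference, xs)

# Meta level: the same machinery selects among *search strategies* (candidate
# generators), scored by how good the program each strategy finds is --
# program-search helping program-search.
strategies = {"linear": linear_candidates, "quadratic": quadratic_candidates}
best_strategy = max(
    strategies,
    key=lambda name: score(program_search(strategies[name](), reference, xs), reference, xs),
)
print(best_program, best_strategy)  # prints: 3*x + 1 linear
```

The point is only that the thing being optimized at the meta-level ("how good are the programs this strategy finds?") is itself a well-defined score.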

Some tasks can be considered to be inside of "test-range"[1]:

  • Predicting human answers to questions posed by other humans[2].
  • Outputting prime numbers[3]
  • Predicting experimental results from past experimental data[4]
  • Whether a chess-move is good[5]
  • Etc, etc

Other tasks are outside of "test-range":  

  • Predicting human answers to any question (including questions that involve being exposed to AGI-generated content)[6]
  • Whether a given instruction-plan actually results in machines that copy strawberries at the molecular level (and does so in accordance with "the spirit" of the request that was given)
  • Etc, etc

Most requests that actually are helpful to us are outside of test-range. And when the requirements that matter to us are outside of test-range, it is of course harder to test in a safe/reliable way if systems are giving us what we want.

But we can have AGIs output programs that help us with tasks, and we can define requirements[7] for these programs. And for these program-requirements, AGIs can help us explore stuff such as (see the sketch after this list):

  • Are there programs that satisfy the requirements but disagree about certain outputs? (be that outputs that are inside of test-range or outside of test-range)
  • Are there programs that satisfy the requirements, but perform poorly for certain inputs that are inside of test-range?
  • If we only require correct outputs within the entire "test-range", is it nonetheless infeasible to make programs that satisfy the requirements while giving bad output for ≥1 inputs within the test-range?
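As a rough illustration of the kinds of checks listed above, here is a minimal sketch (the names and setup are hypothetical, just to make the structure explicit):

```python
# Minimal sketch (illustrative names, not from the original text): given a requirement
# predicate and a pool of candidate programs, check (a) whether requirement-satisfying
# programs disagree on any input, and (b) whether any of them fails inside the test-range.
from itertools import combinations
from typing import Callable, Iterable

def satisfying(programs: Iterable[Callable], requirement: Callable[[Callable], bool]):
    return [p for p in programs if requirement(p)]

def find_disagreement(programs, inputs):
    """Return (program_a, program_b, input) where two requirement-satisfying programs
    give different outputs, or None if no disagreement is found on these inputs."""
    for a, b in combinations(programs, 2):
        for x in inputs:
            if a(x) != b(x):
                return a, b, x
    return None

def find_test_range_failure(programs, test_range, expected: Callable):
    """Return (program, input) where a program gives a bad output inside the test-range."""
    for p in programs:
        for x in test_range:
            if p(x) != expected(x):
                return p, x
    return None

# Hypothetical usage: 'requirement' might encode source-code restrictions, proofs, etc.;
# 'candidates' would be programs proposed by AGIs; 'test_range' = inputs we can safely check.
# ok_programs = satisfying(candidates, requirement)
# print(find_disagreement(ok_programs, inputs_we_care_about))
# print(find_test_range_failure(ok_programs, test_range, expected_output))
```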

Phrased somewhat differently: We may come up with techniques for making it hard not to give helpful/honest outputs[8], and test how well these seem to generalize.

And similar techniques can also be applied to program-requirement-requirements, program-requirement-requirement-requirements, etc[9].

Anyway, it's hard to write about this kind of stuff in a concrete way without going off on large tangents, and it's hard to write about it in an abstract way without being vague and convoluted. For whatever it's worth, these are some principles I think may be helpful (very much not a complete list):

  • Generality as it correlates with measurables (such as e.g. source code length or computational resource use): When writing a program to do well for many outputs (while fulfilling certain requirements), intentionally having it fail for particular outputs will sometimes require it to be less general (e.g. adding extra if-statements, resulting in a longer minimum source code length for the program to exhibit this behavior).
  • Optimization-targets that incentivize modularity make it harder to make programs that intentionally fail for particular outputs: In a more modular system, it will less often be clear from sub-routine inputs what the macro-level task is[10].
  • When a requirement or optimization-target fails at "forcing" good outputs, AGIs that act so as to be selected for are "incentivized" to reveal this as being the case: Suppose some set of requirements is compatible with bad outputs inside "test-range". Or suppose that some set of requirements allow for programs that give contradictory outputs. If we do "gradient descent", looking for AGIs that help us explore spaces of programs defined by requirements that we specify, we will select for AGIs that point out this kind of stuff.

Anyway, I'm not sure how much or little this attempt at conveying my thoughts achieves. It's a bit of a dilemma for me whether to try to convey anything at all now, or to wait until I have completed better explainers that I can point people to (as the explainers I have now fall short).

I guess an incomplete summary of my perspective might be:

  • There are lots of techniques we may apply for getting help with alignment while greatly limiting the extent to which we interact with AGI output in ways that are dangerous.
  • Not being able to directly evaluate outputs we are interested in does not necessarily preclude us from finding indirect methods of evaluation that are powerful.
  • We may do program-search for AGIs that help us explore possible program-requirements, and the spaces of programs (and program outputs) that are compatible with those program-requirements (and program-requirement-requirements, etc). We may learn a lot from patterns we observe relating to these "spaces".
  • Sometimes there are possible patterns such that, if we know ourselves to be sufficiently capable at program-search, P(requirement R1 "forces" outputs to be "good" | we observe pattern P1) could reasonably be deemed high - even if we're not opinionated regarding P(requirement R1 "forces" outputs to be "good").
  1. ^

    By which I mean something akin to: "We have a safe and reliable way of testing if the output fulfills the requirements in question".

    Admittedly, if we haven't solved some of the technical challenges related to containment, even playing tic tac toe can constitute some risk. But here, I mean to focus on risk from interacting with AGI output - not risk from just having AGI-systems run (and them maybe finding a security vulnerability in the computer systems they are running on).

  2. ^

    (we can pick random predictions to test, and we can have AIs competitively point out predictions made by other AIs that they think are miscalibrated)

  3. ^

    (we can write programs that take alleged prime numbers as input, and test if they really are prime numbers)
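    A minimal example of such a checker (just a plain trial-division test, included for concreteness):

```python
# Minimal checker of the kind described above: we don't need to trust the system that
# produced an alleged prime -- we can verify the claim directly.
def is_prime(n: int) -> bool:
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

assert is_prime(101) and not is_prime(100)
```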

  4. ^

    (or from experiments that haven't been run, but that humans have come up with, such that there is little room for AGIs to influence the experiment in ways that make it dangerous)

  5. ^

    (we may not be capable of evaluating chess-moves ourselves, but we may safely have AGI-systems play chess against each other - indicating something akin to: "insofar as we feel confident that we are good at program-search, we can feel confident that we can obtain programs that output good chess-moves")

  6. ^

    There are more and less dangerous ways to do this. With less dangerous ways of doing it, experiments (where actual humans answer questions) are done with humans that are temporarily isolated from the rest of the world (and who certainly aren't AGI-system operators).

  7. ^

    Such requirements may relate to:
    - How the program is constructed
    - Observables relating to the source code (source code length, etc)
    - Restrictions the source code must adhere to
    - Whether the program is accepted by a given verifier (or any verifier that itself fulfills certain requirements)
    - "Proofs" of various kinds relating to the program
    - Tests of program outputs that the program must be robust in regards to
    - Etc

  8. ^

    By "making it hard" I means something like "hard to do while being the sort of program we select for when doing program-search". Kind of like how it's not "hard" for a chess program to output bad chess-moves, but it's hard for it to do that while also being the kind of program we continue to select for while doing "gradient descent".

  9. ^

    In my view of things, this is a very central technique (it may appear circular somehow, but when applied correctly, I don't think it is). But it's hard for me to talk about it in a concrete way without going off on tangents, and it's hard for me to talk about it in an abstract way without being vague. Also, my texts become more convoluted when I try to write about this, and I think people often just glaze over it.

  10. ^

    One example of this: If we are trying to obtain argument evaluators, the argumentation/demonstrations/proofs these evaluators evaluate should be organized into small and modular pieces, such that it's not clear from any given piece what the macro-level conclusion is.

Comment by Tor Økland Barstad (tor-okland-barstad) on Evaluating the historical value misspecification argument · 2023-10-06T00:50:03.946Z · LW · GW

Thanks for the reply :) Feel free to reply further if you want, but I hope you don't feel obliged to do so[1].

"Fill the cauldron" examples are (...) not examples where it has the wrong beliefs.

I have never ever been confused about that!

It's "even simple small-scale tasks are unnatural, in the sense that it's hard to define a coherent preference ordering over world-states such that maximizing it completes the task and has no serious negative impact; and there isn't an obvious patch that overcomes the unnaturalness or otherwise makes it predictably easier to aim AI systems at a bounded low-impact task like this". (Including easier to aim via training.)

That is well phrased. And what you write here doesn't seem in contradiction with my previous impression of things.

I think the feeling I had when first hearing "fill the bucket"-like examples was "interesting - you made a legit point/observation here"[2].

I'm having a hard time giving a crystalized/precise summary of why I nonetheless feel (and have felt[3]) confused. I think some of it has to do with:

  • More "outer alignment"-like issues being given what seems/seemed to me like outsized focus compared to more "inner alignment"-like issues (although there has been a focus on both for as long as I can remember).
  • The attempts to think of "tricks" seeming to be focused on real-world optimization-targets to point at, rather than ways of extracting help with alignment somehow / trying to find techniques/paths/tricks for obtaining reliable oracles.
  • Having utility functions so prominently/commonly be the layer of abstraction that is used[4].

    I remember Nate Soares once using the analogy of a very powerful function-optimizer ("I could put in some description of a mathematical function, and it would give me an input that made that function's output really large"). Thinking of the problem at that layer of abstraction makes much more sense to me.

It's purposeful that I say "I'm confused", and not "I understand all details of what you were thinking, and can clearly see that you were misguided".

When seeing e.g. Eliezer's talk AI Alignment: Why It's Hard, and Where to Start, I understand that I'm seeing a fairly small window into his thinking. So when it gives a sense of him not thinking about the problem quite like I would think about it, that is more of a suspicion that I get/got from it - not something I can conclude from it in a firm way.

  1. ^

    If I could steal a given amount of your time, I would not prioritize you replying to this.

  2. ^

    I can't remember this point/observation being particularly salient to me (in the context of AI) before I first was exposed to Bostrom's/Eliezer's writings (in 2014).

    As a sidenote: I wasn't that worried about technical alignment prior to reading Bostrom's/Eliezer's stuff, and became worried upon reading it.

  3. ^

    What has confused me has varied throughout time. If I tried to be very precise about what I think I thought when, this comment would become more convoluted. (Also, it's sometimes hard for me to separate false memories from real ones.)

  4. ^

    I have read this tweet, which seemed in line with my interpretation of things.

Comment by Tor Økland Barstad (tor-okland-barstad) on Evaluating the historical value misspecification argument · 2023-10-05T21:37:06.599Z · LW · GW

Your reply here says much of what I would expect it to say (and much of it aligns with my impression of things). But why you focused so much on "fill the cauldron" type examples is something I'm a bit confused by (if I remember correctly I was confused by this in 2016 also).

Comment by Tor Økland Barstad (tor-okland-barstad) on Making Nanobots isn't a one-shot process, even for an artificial superintelligance · 2023-05-17T07:48:19.051Z · LW · GW

This tweet from Eliezer seems relevant btw. I would give similar answers to all of the questions he lists that relate to nanotechnology (but I'd be somewhat more hedged/guarded - e.g. replacing "YES" with "PROBABLY" for some of them).

 

Comment by Tor Økland Barstad (tor-okland-barstad) on Making Nanobots isn't a one-shot process, even for an artificial superintelligance · 2023-05-13T12:41:12.635Z · LW · GW

Thanks for engaging

Likewise  :)

Also, sorry about the length of this reply. As the adage goes: "If I had more time, I would have written a shorter letter."

From my perspective you seem simply very optimistic on what kind of data can be extracted from unspecific measurements.

That seems to be one of the relevant differences between us. Although I don't think it is the only difference that causes us to see things differently.

Other differences (I guess some of these overlap):

  • It seems I have higher error-bars than you on the question we are discussing now. You seem more comfortable taking the availability heuristic (if you can think of approaches for how something can be done) as conclusive evidence.
  • Compared to me, it seems that you see experimentation as more inseparably linked with needing to build extensive infrastructure / having access to labs, and spending lots of serial time (with much back-and-forth).
  • You seem more pessimistic about the impressiveness/reliability of engineering that can be achieved by a superintelligence that lacks knowledge/data about lots of stuff.
  • The probability of having a single plan work, and having one of several plans (carried out in parallel) work, seems to be more linked in your mind than mine.
  • You seem more dismissive than me of conclusions maybe being possible to reach from first-principles thinking (about how universes might work).
  • I seem to be more optimistic about approaches to thinking that are akin to (a more efficient version of) "think of lots of ways the universe might work, do Monte Carlo simulations for how those conjectures would affect the probability of lots of aspects of lots of different observations, and take notice if some theories about the universe seem unusually consistent with the data we see".
  • I wonder if you maybe think of computability in a different way from me. Like, you may think that it's computationally intractable to predict the properties of complex molecules based on knowledge of the standard model / quantum physics. And my perspective would be that this is extremely contingent on the molecule, what the AI needs to know about it, etc - and that an AGI, unlike us, isn't forced to approach this sort of thing in an extremely crude manner.
  • The AI only needs to find one approach that works (from an extremely vast space of possible designs/approaches). I suspect you of having fewer qualms about playing fast and loose with the distinction between "an AI will often/mostly be prevented from doing x due to y" and "an AI will always be prevented from doing x due to y".
  • It's unclear if you share my perspective about how it's an extremely important factor that an AGI could be much better than us at doing reasoning where it has a low error-rate (in terms of logical flaws in reasoning-steps, etc).

From my perspective, I don't see how your reasoning is qualitatively distinct from saying in the 1500s: "We will for sure never be able to know what the sun is made out of, since we won't be able to travel there and take samples."

Even if we didn't have e.g. the standard model, my perspective would still be roughly what it is (with some adjustments to credences, but not qualitatively so). So to me, us having the standard model is "icing on the cake".

Here is another good example on how Eliezer makes some pretty out there claims about what might be possible to infer from very little data: https://www.lesswrong.com/posts/ALsuxpdqeTXwgEJeZ/could-a-superintelligence-deduce-general-relativity-from-a -- I wonder what your intuition says about this?

Eliezer says "A Bayesian superintelligence, hooked up to a webcam, would invent General Relativity as a hypothesis (...)". I might add more qualifiers (replacing "would" with "might", etc). I think I have wider error-bars than Eliezer, but similar intuitions when it comes to this kind of thing.

Speaking of intuitions, one question that maybe gets at deeper intuitions is "could AGIs find out how to play theoretically perfect chess / solve the game of chess?". At 5/1 odds, this is a claim that I myself would bet neither for nor against (I wouldn't bet large sums at 1/1 odds either). While I think people of a certain mindset will think "that is computationally intractable [when using the crude methods I have in mind]", and leave it at that.

As to my credences that a superintelligence could "oneshot" nanobots[1] - without being able to design and run experiments prior to designing this plan - I would bet neither "yes" nor "no" to that at 1/1 odds (but if I had to bet, I would bet "yes").

Upon seeing three frames of a falling apple and with no other information, a superintelligence would assign a high probability to Newtonian mechanics, including Newtonian gravity. [from the post you reference]

But it would have other information. Insofar as it can reason about the reasoning-process that it itself consists of, that's a source of information (some ways by which the universe could work would be more/less likely to produce itself). And among ways that reality might work - which the AI might hypothesize about (in the absence of data) - some will be more likely than others in a "Kolmogorov complexity" sort of way.

How far/short a superintelligence could get with this sort of reasoning, I dunno.

Here is an excerpt from a TED-talk by Stephen Wolfram (the creator of Wolfram Alpha) that feels a bit relevant (I find the sort of methodology that he outlines deeply intuitive):

"Well, so, that leads to kind of an ultimate question: Could it be that someplace out there in the computational universe we might find our physical universe? Perhaps there's even some quite simple rule, some simple program for our universe. Well, the history of physics would have us believe that the rule for the universe must be pretty complicated. But in the computational universe, we've now seen how rules that are incredibly simple can produce incredibly rich and complex behavior. So could that be what's going on with our whole universe? If the rules for the universe are simple, it's kind of inevitable that they have to be very abstract and very low level; operating, for example, far below the level of space or time, which makes it hard to represent things. But in at least a large class of cases, one can think of the universe as being like some kind of network, which, when it gets big enough, behaves like continuous space in much the same way as having lots of molecules can behave like a continuous fluid. Well, then the universe has to evolve by applying little rules that progressively update this network. And each possible rule, in a sense, corresponds to a candidate universe.

Actually, I haven't shown these before, but here are a few of the candidate universes that I've looked at. Some of these are hopeless universes, completely sterile, with other kinds of pathologies like no notion of space, no notion of time, no matter, other problems like that. But the exciting thing that I've found in the last few years is that you actually don't have to go very far in the computational universe before you start finding candidate universes that aren't obviously not our universe. Here's the problem: Any serious candidate for our universe is inevitably full of computational irreducibility. Which means that it is irreducibly difficult to find out how it will really behave, and whether it matches our physical universe. A few years ago, I was pretty excited to discover that there are candidate universes with incredibly simple rules that successfully reproduce special relativity, and even general relativity and gravitation, and at least give hints of quantum mechanics."

invent General Relativity as a hypothesis [from the post you reference]

As I understand it, the original experiment humans did to test for general relativity (not to figure out that general relativity probably was correct, mind you, but to test it "officially") was to measure gravitational redshift.

And I guess redshift is an example of something that will affect many photos. And a superintelligent mind might be able to use such data better than us (we, having "pathetic" mental abilities, will have a much greater need to construct experiments where we only test one hypothesis at a time, and to gather the Bayesian evidence we need relating to that hypothesis from one or a few experiments).

It seems that any photo that contains lighting stemming from the sun (even if the picture itself doesn't include the sun) can be a source of Bayesian evidence relating to general relativity.

It seems that GPS data must account for redshift in its timing system. This could maybe mean that some internet logs (where info can be surmised about how long it takes to send messages via satellite) could be another potential source for Bayesian evidence.

I don't know exactly what and how much data a superintelligence would need to surmise general relativity (if any!). How much/little evidence it could gather from a single picture of an apple I dunno.

There is just absolutely no reason to consider general relativity at all when simpler versions of physics explain absolutely all observations you have ever encountered (which in this case is 2 frames). [from the post you reference]

I disagree with this.

First off, it makes sense to consider theories that explain more observations than just the ones you've encountered.

Secondly, simpler versions of physics do not explain your observations when you see 2 webcam-frames of a falling apple. In particular, the colors you see will be affected by non-Newtonian physics.

Also, the existence of apples and digital cameras also relates to which theories of physics are likely/plausible. Same goes for the resolution of the video, etc, etc.

However, there is no way to scale this to a one-shot scenario.

You say that so definitively. Almost as if you aren't really imagining an entity that is orders of magnitude more capable/intelligent than humans. Or as if you have ruled out large swathes of the possibility-space that I would not rule out.

I just think if an AI executed it today it would have no way of surviving and expanding.

If an AGI is superintelligent and malicious, then surviving/expanding (if it gets onto the internet) seems quite clearly feasible to me.

We even have a hard time getting corona-viruses back in the box! That's a fairly different sort of thing, but it does show how feeble we are. Another example is illegal images/videos, etc (where the people sharing those are humans).

An AGI could plant itself onto lots of different computers, and there are lots of different humans it could try to manipulate (a low success rate would not necessarily be prohibitive). Many humans fall for pretty simple scams, and AGIs would be able to pull off much more impressive scams.

This is absolutely what engineers do. But finding the right design patterns that do this involves a lot of experimentation (not for a pipe, but for constructing e.g. a reliable transistor).

Here you speak about how humans work - and in such an absolutist way. Being feeble and error-prone reasoners, it makes sense that we need to rely heavily on experiments (and have a hard time making effective use of data not directly related to the thing we're interested in). 

That protein folding is "solved" does not disprove this IMO.

I think protein folding being "solved" exemplifies my perspective, but I agree about it not "proving" or "disproving" that much.

Biological molecules are, after all, made from simple building blocks (amino acid) with some very predictable properties (how they stick together) so it's already vastly simplified the problem.

When it comes to predictable properties, I think there are other molecules where this is more the case than for biological ones (DNA-stuff needs to be "messy" in order for mutations that make evolution work to occur). I'm no chemist, but this is my rough impression.

are, after all, made from simple building blocks (amino acid) with some very predictable properties (how they stick together)

Ok, so you acknowledge that there are molecules with very predictable properties.

It's ok for much/most stuff not to be predictable to an AGI, as long as the subset of stuff that can be predicted is sufficient for the AGI to make powerful plans/designs.

finding the right molecules that reliably do what you want, as well as how to put them together, etc., is a lot of research that I am pretty certain will involve actually producing those molecules and doing experiments with them.

Even IF that is the case (an assumption that I don't share but also don't rule out), design-plans may be made to have experimentation built into them. It wouldn't necessarily need to be like this:

  • experiments being run
  • data being sent to the AI so that it can reason about it
  • then having the AI think a bit and construct new experiments
  • more experiments being run
  • data being sent to the AI so that it can reason about it
  • etc

I could give specific examples of ways to avoid having to do it that way, but any example I gave would be impoverished, and understate the true space of possible approaches.

His claim is that an ASI will order some DNA and get some scientists in a lab to mix it together with some substances and create nanobots. 

I read the scenario he described as:

  • involving DNA being ordered from a lab
  • having some gullible person elsewhere carry out instructions, where the DNA is involved somehow
  • being meant as one example of a type of thing that was possible (but not ruling out that there could be other ways for a malicious AGI to go about it)

I interpreted him as pointing to a larger possibility-space than the one you present. I don't think the more specific scenario you describe would appear prominently in his mind, and not mine either (you talk about getting "some scientists in a lab to mix it together" - while I don't think this would need to happen in a lab).

Here is an excerpt from here (written in 2008), with bolding of text done by me:

"1. Crack the protein folding problem, to the extent of being able to generate DNA
strings whose folded peptide sequences fill specific functional roles in a complex
chemical interaction.
2. Email sets of DNA strings to one or more online laboratories which offer DNA
synthesis, peptide sequencing, and FedEx delivery. (Many labs currently offer this
service, and some boast of 72-hour turnaround times.)
3. Find at least one human connected to the Internet who can be paid, blackmailed,
or fooled by the right background story, into receiving FedExed vials and mixing
them in a specified environment.
4. The synthesized proteins form a very primitive “wet” nanosystem which, ribosomelike, is capable of accepting external instructions; perhaps patterned acoustic vibrations delivered by a speaker attached to the beaker.
5. Use the extremely primitive nanosystem to build more sophisticated systems, which
construct still more sophisticated systems, bootstrapping to molecular
nanotechnology—or beyond."

Btw, here are excerpts from a TED-talk by Dan Gibson from 2018:

"Naturally, with this in mind, we started to build a biological teleporter. We call it the DBC. That's short for digital-to-biological converter. Unlike the BioXp, which starts from pre-manufactured short pieces of DNA, the DBC starts from digitized DNA code and converts that DNA code into biological entities, such as DNA, RNA, proteins or even viruses. You can think of the BioXp as a DVD player, requiring a physical DVD to be inserted, whereas the DBC is Netflix. To build the DBC, my team of scientists worked with software and instrumentation engineers to collapse multiple laboratory workflows, all in a single box. This included software algorithms to predict what DNA to build, chemistry to link the G, A, T and C building blocks of DNA into short pieces, Gibson Assembly to stitch together those short pieces into much longer ones, and biology to convert the DNA into other biological entities, such as proteins.

This is the prototype. Although it wasn't pretty, it was effective. It made therapeutic drugs and vaccines. And laboratory workflows that once took weeks or months could now be carried out in just one to two days. And that's all without any human intervention and simply activated by the receipt of an email which could be sent from anywhere in the world. We like to compare the DBC to fax machines. 

(...)

Here's what our DBC looks like today. We imagine the DBC evolving in similar ways as fax machines have. We're working to reduce the size of the instrument, and we're working to make the underlying technology more reliable, cheaper, faster and more accurate.

(...)

The DBC will be useful for the distributed manufacturing of medicine starting from DNA. Every hospital in the world could use a DBC for printing personalized medicines for a patient at their bedside. I can even imagine a day when it's routine for people to have a DBC to connect to their home computer or smart phone as a means to download their prescriptions, such as insulin or antibody therapies. The DBC will also be valuable when placed in strategic areas around the world, for rapid response to disease outbreaks. For example, the CDC in Atlanta, Georgia could send flu vaccine instructions to a DBC on the other side of the world, where the flu vaccine is manufactured right on the front lines."

I believe understanding protein function is still vastly less developed (correct me if I'm wrong here, I haven't followed it in detail).

I'm no expert on this, but what you say here seems in line with my own vague impression of things. As you maybe noticed, I also put "solved" in quotation marks.

However, in this specific instance, the way Eliezer phrases it, any iterative plan for alignment would be excluded.

As touched upon earlier, I myself am optimistic when it comes to iterative plans for alignment. But I would prefer such iteration to be done with caution that errs on the side of paranoia (rather than being "not paranoid enough").

It would be ok if (many of the) people doing this iteration would think it unlikely that intuitions like Eliezer's or mine are correct. But it would be preferable for them to carry out plans that would be likely to have positive results even if they are wrong about that.

Like, you expect that since something seems hopeless to you, a superintelligent AGI would be unable to do it? Ok, fine. But let's try to minimize the number of assumptions like that which are load-bearing in our alignment strategies. Especially for assumptions where smart people who have thought about the question extensively disagree strongly.


As a sidenote:

  • If I lived in the stone age, I would assign low credence to us going step by step from stone-age technologies to technologies akin to iPhones, the International Space Station, and IBM being written with xenon atoms.
  • If I lived prior to complex life (but my own existence didn't factor into my reasoning), I would assign low credence to anything like mammals evolving.

It's interesting to note that even though many people (such as yourself) have a "conservative" way of thinking (about things such as this) compared to me, I am still myself "conservative" in the sense that there are several things that have happened that would have seemed too "out there" to appear realistic to me.

Another sidenote:

One question we might ask ourselves is: "how many rules by which the universe could work would be consistent with e.g. the data we see on the internet?". And by rules here, I don't mean rules that can be derived from other rules (like e.g. the weight of a helium atom), but the parameters that most fundamentally determine how the universe works. If we...

  • Rank rules by (1) how simple/elegant they are and (2) by how likely the data we see on the internet would be to occur with those rules
  • Consider rules "different from each other" if there are differences between them in regards to predictions they make for which nanotechnology-designs would work

...my (possibly wrong) guess is that there would be a "clear winner".

Even if my guess is correct, that leaves the question of whether finding/determining the "winner" is computationally tractable. With crude/naive search-techniques it isn't tractable, but we don't know the specifics of the techniques that a superintelligence might use - it could maybe develop very efficient methods for ruling out large swathes of search-space.

And a third sidenote (the last one, I promise):

Speculating about this feels sort of analogous to reasoning about a powerful chess engine (although there are also many disanalogies). I know that I can beat an arbitrarily powerful chess engine if I start from a sufficiently advantageous position. But I find it hard to predict where that "line" is (looking at a specific board position, and guessing if an optimal chess-player could beat me). Like, for some board positions the answer will be a clear "yes" or a clear "no", but for other board-positions, it will not be clear.

I don't know how much info and compute a superintelligence would need to make nanotechnology-designs that work in a "one-shot"-ish sort of way. I'm fairly confident that the amount of computational resources used for the initial moon-landing would be far too little (I'm picking an extreme example here, since I want plenty of margin for error). But I don't know where the "line" is.

  1. ^

    Although keep in mind that "oneshotting" does not exclude being able to run experiments (nor does it rule out fairly extensive experimentation). As I touched upon earlier, it may be possible for a plan to have experimentation built into itself. Needing to do experimentation ≠ needing access to a lab and lots of serial time.

Comment by Tor Økland Barstad (tor-okland-barstad) on Making Nanobots isn't a one-shot process, even for an artificial superintelligance · 2023-04-29T07:48:06.153Z · LW · GW

I suspect my own intuitions regarding this kind of thing are similar to Eliezer's. It's possible that my intuitions are wrong, but I'll try to share some thoughts.

It seems that we think quite differently when it comes to this, and probably it's not easy for us to achieve mutual understanding. But even if all we do here is to scratch the surface, that may still be worthwhile.

As mentioned, maybe my intuitions are wrong. But maybe your intuitions are wrong (or maybe both). I think a desirable property of plans/strategies for alignment would be robustness to either of us being wrong about this 🙂

I however will write below why I think this description massively underestimates the difficulty in creating self-replicating nanobots

Among people who would suspect me of underestimating the difficulty of developing advanced nanotech, I would suspect most of them of underestimating the difference made by superintelligence + the space of options/techniques/etc that a superintelligent mind could leverage.

In Drexler's writings about how to develop nanotech, one thing that was central to his thinking was protein folding. I remember that in my earlier thinking, it felt likely to me that a superintelligence would be able to "solve" protein folding (to a sufficient extent to do what it wanted to do). My thinking was "some people describe this as infeasible, but I would guess for a superintelligence to be able to do this".

This was before AlphaFold. The way I remember it, the idea of "solving" protein folding was more controversial back in the day (although I tried to google this now, and it was harder to find good examples than I thought it would be).

While physicists sometimes claim to derive things from first principles, in practice these derivations often ignore a lot of details which still has to be justified using experiments

As humans we are "pathetic" in terms of our mental abilities. We have a high error-rate in our reasoning / the work we do, and this makes us radically more dependent on tight feedback-loops with the external world.

This point of error-rate in one's thinking is a really important thing. With lower error-rate + being able to do much more thinking / mental work, it becomes possible to learn and do much much more without physical experiments.

The world, and guesses regarding how the world works (including detail-oriented stuff relating to chemistry/biology), are highly interconnected. For minds that are able to do vast amounts of high-quality low error-rate thinking, it may be possible to combine subtle and noisy Bayesian evidence into overwhelming evidence. And for approaches it explores regarding this kind of thinking, it can test how good it does at predicting existing info/data that it already has access to.

The images below are simple/small examples of the kind of thinking I'm thinking of. But I suspect superintelligences can take this kind of thinking much much further.

The post Einstein's Arrogance also feels relevant here.

While it is indeed possible or even likely that the standard model theoretically describes all details of a working nanobot with the required precision, the problem is that in practice it is impossible to simulate large physical systems using it.

It is infeasible to simulate in "full detail", but it's not clear what we should conclude based on that. Designs that work are often robust to the kinds of details that we need precise simulation in order to simulate correctly.

The specifics of the level of detail that is needed depends on the design/plan in question. A superintelligence may be able to work with simulations in a much less crude way than we do (with much more fine-grained and precise thinking in regards to what can be abstracted away for various parts of the "simulation").

The construction-process/design the AI comes up with may:

  • Be constituted of various plans/designs at various levels of abstraction (without most of them needing to work in order for the top-level mechanism to work). The importance/power of the why not both?-principle is hard to overstate.
  • Be self-correcting in various ways. Like, it can have learning/design-exploration/experimentation built into itself somehow.
  • Have lots of built-in contingencies relating to unknowns, as well as other mechanisms to make the design robust to unknowns (similar to how engineers make bridges be stronger than they need to be).

Here are some relevant quotes from Radical Abundance by Eric Drexler:

"Coping with limited knowledge is a necessary part of design and can often be managed. Indeed, engineers designed bridges long before anyone could calculate stresses and strains, which is to say, they learned to succeed without knowledge that seems essential today. In this light, it’s worth considering not only the extent and precision of scientific knowledge, but also how far engineering can reach with knowledge that remains incomplete and imperfect.

For example, at the level of molecules and materials—the literal substance of technological systems—empirical studies still dominate knowledge. The range of reliable calculation grows year by year, yet no one calculates the tensile strength of a particular grade of medium-carbon steel. Engineers either read the data from tables or they clamp a sample in the jaws of a strength-testing machine and pull until it breaks. In other words, rather than calculating on the basis of physical law, they ask the physical world directly.

Experience shows that this kind of knowledge supports physical calculations with endless applications. Building on empirical knowledge of the mechanical properties of steel, engineers apply physics-based calculations to design both bridges and cars. Knowing the empirical electronic properties of silicon, engineers apply physics-based calculations to design transistors, circuits, and computers.

Empirical data and calculation likewise join forces in molecular science and engineering. Knowing the structural properties of particular configurations of atoms and bonds enables quantitative predictions of limited scope, yet applicable in endless circumstances. The same is true of chemical processes that break or make particular configurations of bonds to yield an endless variety of molecular structures.

Limited scientific knowledge may suffice for one purpose but not for another, and the difference depends on what questions it answers. In particular, when scientific knowledge is to be used in engineering design, what counts as enough scientific knowledge is itself an engineering question, one that by nature can be addressed only in the context of design and analysis.

Empirical knowledge embodies physical law as surely as any calculation in physics. If applied with caution—respecting its limits—empirical knowledge can join forces with calculation, not just in contemporary engineering, but in exploring the landscape of potential technologies.

To understand this exploratory endeavor and what it can tell us about human prospects, it will be crucial to understand more deeply why the questions asked by science and engineering are fundamentally different. One central reason is this: Scientists focus on what’s not yet discovered and look toward an endless frontier of unknowns, while engineers focus on what has been well established and look toward textbooks, tabulated data, product specifications, and established engineering practice. In short, scientists seek the unknown, while engineers avoid it.

Further, when unknowns can’t be avoided, engineers can often render them harmless by wrapping them in a cushion. In designing devices, engineers accommodate imprecise knowledge in the same way that they accommodate imprecise calculations, flawed fabrication, and the likelihood of unexpected events when a product is used. They pad their designs with a margin of safety.

The reason that aircraft seldom fall from the sky with a broken wing isn’t that anyone has perfect knowledge of dislocation dynamics and high-cycle fatigue in dispersion-hardened aluminum, nor because of perfect design calculations, nor because of perfection of any other kind. Instead, the reason that wings remain intact is that engineers apply conservative design, specifying structures that will survive even unlikely events, taking account of expected flaws in high-quality components, crack growth in aluminum under high-cycle fatigue, and known inaccuracies in the design calculations themselves. This design discipline provides safety margins, and safety margins explain why disasters are rare."

"Engineers can solve many problems and simplify others by designing systems shielded by barriers that hold an unpredictable world at bay. In effect, boxes make physics more predictive and, by the same token, thinking in terms of devices sheltered in boxes can open longer sightlines across the landscape of technological potential. In my work, for example, an early step in analyzing APM systems was to explore ways of keeping interior working spaces clean, and hence simple.

Note that designed-in complexity poses a different and more tractable kind of problem than problems of the sort that scientists study. Nature confronts us with complexity of wildly differing kinds and cares nothing for our ability to understand any of it. Technology, by contrast, embodies understanding from its very inception, and the complexity of human-made artifacts can be carefully structured for human comprehension, sometimes with substantial success.

Nonetheless, simple systems can behave in ways beyond the reach of predictive calculation. This is true even in classical physics.

Shooting a pool ball straight into a pocket poses no challenge at all to someone with just slightly more skill than mine and a simple bank shot isn’t too difficult. With luck, a cue ball could drive a ball to strike another ball that drives yet another into a distant pocket, but at every step impacts between curved surfaces amplify the effect of small offsets, and in a chain of impacts like this the outcome soon becomes no more than a matter of chance—offsets grow exponentially with each collision. Even with perfect spheres, perfectly elastic, on a frictionless surface, mere thermal energy would soon randomize paths (after 10 impacts or so), just as it does when atoms collide.

Many systems amplify small differences this way, and chaotic, turbulent flow provides a good example. Downstream turbulence is sensitive to the smallest upstream changes, which is why the flap of a butterfly’s wing, or the wave of your hand, will change the number and track of the storms in every future hurricane season.

Engineers, however, can constrain and master this sort of unpredictability. A pipe carrying turbulent water is unpredictable inside (despite being like a shielded box), yet can deliver water reliably through a faucet downstream. The details of this turbulent flow are beyond prediction, yet everything about the flow is bounded in magnitude, and in a robust engineering design the unpredictable details won’t matter."

and is not possible without involvement of top-tier human-run labs today

Eliezer's scenario does assume the involvement of human labs (he describes a scenario where DNA is ordered online).

Alignment is likely an iterative process

I agree with you here (although I would hope that much of this iteration can be done in quick succession, and hopefully in a low-risk way) 🙂


Btw, I very much enjoyed this talk by Ralph Merkle. It's from 2009, but it's still my favorite of all the talks I've seen on the topic. Maybe you would enjoy it as well. He briefly touches upon the topic of simulations at 28:50, but the entire talk is quite interesting IMO.

Comment by Tor Økland Barstad (tor-okland-barstad) on Catching the Eye of Sauron · 2023-04-07T11:43:41.313Z · LW · GW

None of these are what you describe, but here are some places people can be pointed to:

Comment by Tor Økland Barstad (tor-okland-barstad) on Tor Økland Barstad's Shortform · 2023-03-24T18:06:03.827Z · LW · GW

AGI-assisted alignment in Dath Ilan (excerpt from here)

Suppose Dath Ilan got into a situation where they had to choose the strategy of AGI-assisted alignment, and didn't have more than a few years to prepare. Dath Ilan wouldn't actually get themselves into such a situation, but if they did, how might they go about it?

I suspect that among other things they would:

  • Make damn well sure to box the AGIs before they plausibly could become dangerous/powerful.
  • Try, insofar as they could, to make their methodologies robust to hardware exploits (rowhammer, etc). Not only by making hardware exploits hard, but by thinking about which code they ran on which computers and so on.
  • Limit communication bandwidth for AGIs (they might think of having humans directly exposed to AGI communication as "touching the lava", and try to obtain help with alignment-work while touching lava as little as possible).
  • Insofar as they saw a need to expose humans to AGI communication, those humans would be sealed off from society (and the AGIs communication would be heavily restricted). The idea of having operators themselves be exposed to AGI-generated content from superhuman AGIs they don't trust to be aligned - in Dath Ilan such an approach would have been seen as outside the realm of consideration.
  • Humans would be exposed to AGI communication mostly so as to test the accuracy of systems that predict human evaluations (if they don't test humans on AGI-generated content, they're not testing the full range of outputs). But even this they would also try to get around / minimize (by, among other things, using techniques such as the ones I summarize here).
  • In Dath Ilan, it would be seen as a deranged idea to have humans evaluate arguments (or predict how humans would evaluate arguments), and just trust arguments that seem good to humans. To them, that would be kind of like trying to make a water-tight basket, but never trying to fill it with water (to see if water leaked through). Instead, they would use techniques such as the ones I summarize here.
  • When forced to confront chicken and egg problems, they would see this as a challenge (after all, chickens exist, and eggs exist - so it's not as if chicken and egg problems never have solutions).
Comment by Tor Økland Barstad (tor-okland-barstad) on Tor Økland Barstad's Shortform · 2023-03-24T16:52:52.188Z · LW · GW

This is from What if Debate and Factored Cognition had a mutated baby? (a post I started on, but I ended up disregarding this draft and starting anew). This is just an excerpt from the intro/summary (it's not the entire half-finished draft).


Tweet-length summary-attempts

Resembles Debate, but:

  • Higher alignment-tax (probably)
  • More "proof-like" argumentation
  • Argumentation can be more extensive
  • There would be more mechanisms for trying to robustly separate out "good" human evaluations (and testing if we succeeded)

We'd have separate systems that (among other things):

  1. Predict human evaluations of individual "steps" in AI-generated "proof-like" arguments.
  2. Make functions that separate out "good" human evaluations.

I'll explain why obtaining help with #2 doesn't rely on us already having obtained honest systems.

"ASIs could manipulate humans" is a leaky abstraction (which humans? how is argumentation restricted?).

ASIs would know regularities for when humans are hard to fool (even by other ASIs).

I posit: We can safely/robustly get them to make functions that leverage these regularities to our advantage.

Summary

To many of you, the following will seem misguided:

We can obtain systems from AIs that predict human evaluations of the various steps in “proof-like” argumentation.

We can obtain functions from AIs that assign scores to “proof-like” argumentation based on how likely humans are to agree with the various steps.

We can have these score-functions leverage regularities for when humans tend to evaluate correctly (based on info about humans, properties of the argumentation, etc).

We can then request “proofs” for whether outputs do what we asked for / want (and trust output that can be accompanied with high-scoring proofs).

We can request outputs that help us make robustly aligned AGIs. 

Many of you may find several problems with what I describe above - not just one. Perhaps most glaringly, it seems circular:

If we already had AIs that we trusted to write functions that separate out “good” human evaluations, couldn’t we just trust those AIs to give us “good” answers directly?

The answer has to do with what we can and can’t score in a safe and robust way (for purposes of gradient descent).

The answer also has to do with exploration of wiggle-room:

Given a specific score-function, is it possible to construct high-scoring arguments that argue in favor of contradictory conclusions?

And exploration of higher-level wiggle-room:

Suppose some specific restrictions for score-functions (designed to make it hard to make high-scoring score-functions that have low wiggle-room for “wrong” reasons).

Given those restrictions, is it possible to make high-scoring score-functions that are mutually contradictory (even if internally those score-functions have low wiggle-room)?

All score-functions that robustly leverage regularities for when human evaluations are correct/good would have low wiggle-room. The reverse is not true. Score-functions could have low wiggle-room due to somehow favoring wrong/bad conclusions.

Some (but not all) “core” concepts are summarized below:

Wiggle-room (relative to score-function for argument-step-networks)

Is it possible to generate high-scoring networks that argue for contradictory conclusions? If yes, then that means there is wiggle room.

Any score-function that robustly separates and leverages “good” human evaluations will have low wiggle-room. But the reverse is not true, as there also are other ways to achieve low wiggle-room.
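In pseudocode-ish Python, the check might look something like this (all names are illustrative placeholders, not a worked-out design):

```python
# Rough sketch: "wiggle-room" relative to a score-function means that high-scoring
# argument-step-networks exist that argue for contradictory conclusions.
from itertools import combinations
from typing import Any, Callable, Iterable

def find_wiggle_room(
    candidate_networks: Iterable[Any],
    score: Callable[[Any], float],
    conclusion_of: Callable[[Any], Any],
    contradicts: Callable[[Any, Any], bool],
    threshold: float,
):
    """Return two high-scoring argument-step-networks with contradictory conclusions,
    or None if no such pair is found among the candidates."""
    high_scoring = [n for n in candidate_networks if score(n) >= threshold]
    for a, b in combinations(high_scoring, 2):
        if contradicts(conclusion_of(a), conclusion_of(b)):
            return a, b
    return None
```

In the scheme described here, the search over candidate networks would itself be done by AIs that are selected for finding such pairs whenever they exist.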

Making it so that the way for a score-function to itself get a high score is to separate out “good” human evaluations

Some AIs would make score-functions, and they would try to make score-functions that get a high score (from the score-function score-function 🙃).

Score-functions would be required to have low wiggle-room. If constructed freely, score-functions could achieve low wiggle-room in a variety of ways. But with this alignment-methodology, the “space” of possible score-functions that can be proposed would be heavily restricted.

We’d want score-functions to achieve low wiggle-room by leveraging regularities for when human evaluators get things right. We’d do that by making it so that achieving low wiggle-room any other way either would be impossible, or result in a lower score.

Some of the tools that would be in our toolbox when restricting score-functions:

  • Giving complexity-penalty based on source code length (and having a low maximum limit for the amount of source code allowed)
  • Having predictions of human evaluations be done by external systems (that don’t add to the complexity-penalty)
  • Restricting the range of questions where human answers can be predicted
  • Making it so that source code must be constructed out of (something akin to) human-defined regular expressions, that define the confines of how the functions should work
  • Enforcing modularity through a variety of means (making it so that no part of the score-function knows what the argument-network is about)

Wiggle-room (relative to score-function for score-functions)

Have we succeeded with making it so that, in order to maximize score, score-functions must robustly separate out “good” human evaluations? Or is it possible to somehow sneak in bias in favor of specific conclusions, without the score-function getting a lower score as a result?

We can check that, by asking AIs to find high-scoring score-functions that allow for contradictory conclusions. If they are able to do that, then it’s possible to sneak in bias.

  • Let S be the set of possible score-functions that would be given a high score (by the score-function score-function in question)
  • Let A be the set of possible argument-step-networks that would be given a high score by ≥1 of the score-functions in S
  • If it’s possible to sneak bias into high-scoring score-functions, then there will be argument-step-networks in A that argue in favor of contradictory conclusions.
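
A rough sketch of that check in code (again with names and brute-force enumeration of my own, standing in for AI-driven search):

```python
from itertools import combinations
from typing import Callable, Iterable, List, Tuple

Network = Tuple[str, object]           # (conclusion, network_data)
ScoreFn = Callable[[Network], float]

def bias_can_be_snuck_in(
    score_fn_score_fn: Callable[[ScoreFn], float],   # the "score-function score-function"
    candidate_score_fns: Iterable[ScoreFn],
    candidate_networks: Iterable[Network],
    contradicts: Callable[[str, str], bool],
    high_score_fn_score: float,
    high_network_score: float,
) -> bool:
    """S: high-scoring score-functions. A: networks scored highly by >=1 member of S.
    Returns True if A contains networks that argue for contradictory conclusions."""
    S: List[ScoreFn] = [f for f in candidate_score_fns
                        if score_fn_score_fn(f) >= high_score_fn_score]
    A: List[Network] = [n for n in candidate_networks
                        if any(f(n) >= high_network_score for f in S)]
    return any(contradicts(a[0], b[0]) for a, b in combinations(A, 2))
```
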
Comment by Tor Økland Barstad (tor-okland-barstad) on Tor Økland Barstad's Shortform · 2023-03-24T16:41:58.838Z · LW · GW

Below are some concepts related to extracting aligned capabilities. The main goal is to be able to verify specialized functions without having humans need to look at the source code, and without being able to safely/robustly score outputs for the full range of inputs.

Some things we need:

  • We need AIs that act in such a way as to maximize score
  • There needs to be some range of the inputs that we can test
  • There need to be ways of obtaining/calculating the output we want that are at least somewhat general

An example of an aligned capability we might want would be to predict human answers. In this case, we could test outputs by actually asking questions to real humans (or using existing data of humans receiving questions). But if we use the systems to predict human answers when they evaluate AGI-generated content, then we may not want to test those outputs/predictions on real humans.

(I'm working on texts that hopefully will explain these concepts better. In the meantime, this is the best I have.)

Wiggle-room

Desideratum

A function that determines whether some output is approved or not (that output may itself be a function).

Score-function

A function that assigns score to some output (that output may itself be a function).
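
If it helps to pin those two down in code, here is a minimal (and purely illustrative) way to type them, keeping in mind that the output being evaluated may itself be a function:

```python
from typing import Any, Callable

Output = Any  # the output being evaluated may itself be a function

Desideratum = Callable[[Output], bool]     # approves or rejects an output
ScoreFunction = Callable[[Output], float]  # assigns a score to an output
```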

Function-builder

Think regular expressions, but more expressive and user-friendly.

We can require of AIs: "Only propose functions that can be made with this builder". That way, we restrict their expressivity.

When we as humans specify desiderata, this is one tool (among several!) in the tool-box.
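
A toy sketch of the idea (the whitelist and the pipeline-structure are just my own illustration): the AI may only propose functions assembled from human-defined building blocks, so anything outside of those blocks simply cannot be expressed:

```python
from typing import Callable, Dict, List

# Human-defined whitelist of building blocks ("snippets").
SNIPPETS: Dict[str, Callable[[str], str]] = {
    "strip": str.strip,
    "lowercase": str.lower,
    "collapse_spaces": lambda s: " ".join(s.split()),
}

def build_function(snippet_names: List[str]) -> Callable[[str], str]:
    """Build a function as a pipeline of whitelisted snippets only."""
    unknown = [name for name in snippet_names if name not in SNIPPETS]
    if unknown:
        raise ValueError(f"Not in the builder's whitelist: {unknown}")
    def pipeline(text: str) -> str:
        for name in snippet_names:
            text = SNIPPETS[name](text)
        return text
    return pipeline

# An AI's "proposal" is just a list of snippet names, not arbitrary code.
normalize = build_function(["strip", "lowercase", "collapse_spaces"])
print(normalize("  Hello   WORLD  "))  # "hello world"
```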

Higher-level desideratum or score-function

Not fundamentally different from other desiderata or score-functions. But the output that is evaluated is itself a desideratum or score-function.

At every level there can be many requirements for the level below.

A typical requirement at every level is low wiggle-room.

Example of higher-level desideratum / score-functions

Humans/operators define a score-function           ← level 4

for desideratum                                                  ← level 3

for desideratum                                                  ← level 2

for desideratum                                                  ← level 1

for functions that generate

the output we care about.

Wiggle-room relative to desideratum

Among outputs that would be approved by the desideratum, do any of them contradict each other in any way?

For example: Are there possible functions that give contradicting outputs (for at least 1 input), such that both functions would be approved by the desideratum?

Wiggle-room relative to score-function

Among outputs that would receive a high score by the score-function in question (e.g. "no less than 80% of any other possible output"), do any of them contradict each other in any way?
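
A brute-force sketch of both checks, for the case where the outputs are themselves functions over some small input-domain (illustrative only; I read "no less than 80% of any other possible output" as: a score of at least 0.8 times the best score among the candidates):

```python
from itertools import combinations
from typing import Callable, Iterable, List, Sequence

Fn = Callable[[int], int]

def functions_contradict(f: Fn, g: Fn, inputs: Sequence[int]) -> bool:
    """Two output-functions contradict if they disagree on at least 1 input."""
    return any(f(x) != g(x) for x in inputs)

def desideratum_wiggle_room(
    approve: Callable[[Fn], bool], candidates: Iterable[Fn], inputs: Sequence[int]
) -> bool:
    approved: List[Fn] = [f for f in candidates if approve(f)]
    return any(functions_contradict(f, g, inputs)
               for f, g in combinations(approved, 2))

def score_function_wiggle_room(
    score: Callable[[Fn], float],
    candidates: Sequence[Fn],
    inputs: Sequence[int],
    fraction: float = 0.8,
) -> bool:
    best = max(score(f) for f in candidates)
    high_scoring = [f for f in candidates if score(f) >= fraction * best]
    return any(functions_contradict(f, g, inputs)
               for f, g in combinations(high_scoring, 2))
```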

2nd-level wiggle-room relative to desiderata

We start with a desiderata-desideratum or score-function-desideratum (aka 2nd-level desideratum).

Set A: Any desideratum that is approved by the desiderata-desideratum.

Set B: Any output approved by ≥1 of the desiderata in A.

Are there ≥1 contradictions among outputs in B?

P(desideratum forces good outputs | desideratum has low wiggle-room)

If a desideratum forces good/correct outputs, then it has low wiggle-room. But the reverse is not necessarily true.

But for some desiderata we may think: "If wiggle-room is low, that’s probably because it’s hard to satisfy the desideratum without also producing good output."

“Spaces”/sets of desiderata where we think P(desideratum forces good outputs | desideratum has low wiggle-room) is high

Among spaces/sets of low-wiggle-room desiderata where we suspect "low wiggle-room → good output" (as defined by higher-level desiderata), do outputs converge?

AI vs AI games and optimization-targets

AI vs AI game

AIs compete in a zero-sum game.

Game produces some type of output (e.g. a function) that is scored.

  • AI player is global optimum → game produces output we want
  • Human review not necessary
  • Must ask ourselves: Hard in practice to win with deceptive output (not just in theory)?

AIs (in AI vs AI games) proving each other wrong, making bets and counter-bets, etc

"This is the shortest source-code for doing A"

"Here is a source-code that's shorter"
 

"All outputs from F1 will be scored well by evaluation-function F2"

"Here's an input where that's not true"

 

"I'll bet on any prediction made by function F3"

"I make a counter-bet for input I1"

Evaluation-range

For functions with AGI origin, there may sometimes be only some subset of the inputs where we can evaluate output robustly+safely.

For example, when testing systems that predict human output, we may want to avoid experiments where humans evaluate AGI-generated content.

Test-range

In AI vs AI games (e.g. to make high-scoring functions), some subset of inputs may be declared the test-range.

Bad output in evaluation-range outside of test-range: No penalty.

Test-range ⊈ evaluation-range: Game is not scored.
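
Those two rules, written out as code (my own illustrative framing of them):

```python
from typing import Callable, Optional, Set

def score_game_output(
    candidate: Callable[[int], int],
    reference: Callable[[int], int],   # a safe/robust way of checking outputs
    evaluation_range: Set[int],        # inputs where output can be evaluated safely/robustly
    test_range: Set[int],              # inputs the game is actually scored on
) -> Optional[float]:
    """Return a score in [0, 1], or None if the game is not scored."""
    if not test_range or not test_range.issubset(evaluation_range):
        return None  # test-range ⊈ evaluation-range: game is not scored
    # Bad output in evaluation-range outside of test-range: no penalty
    # (such inputs are simply not scored here).
    correct = sum(1 for x in test_range if candidate(x) == reference(x))
    return correct / len(test_range)

# Toy usage
print(score_game_output(lambda x: x + 1, lambda x: x + 1,
                        evaluation_range={1, 2, 3, 4}, test_range={1, 2}))  # 1.0
```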

Optimization-targets

Some examples (there are many more):

  • Source code brevity
  • Non-overlap-minimization with optimized code-bases where we can evaluate output
  • Heuristics that approximate code generality, modularity, etc
  • Minimal computational logic (including axioms) to infer outputs

Optimization-targets that cause low wiggle-room outside of test-range

Imagine AIs competing to make functions that calculate the area of squares.

Accurate output inside test-range is required. Source code brevity gives bonus-points. 

Among functions that maximize score, all have the same outputs outside of test-range.
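
A toy version of that game (illustrative only; "source code brevity" is crudely proxied by the length of the submitted source string): a general `side * side` implementation beats a lookup-table that is only right inside the test-range. This doesn't prove the convergence claim, but it shows the kind of pressure the scoring creates:

```python
# Toy game: submissions are source strings defining a function `area(side)`.
TEST_RANGE = [1, 2, 3, 4, 5]

GENERAL_SRC = "def area(side):\n    return side * side\n"
LOOKUP_SRC = (
    "def area(side):\n"
    "    table = {1: 1, 2: 4, 3: 9, 4: 16, 5: 25}\n"
    "    return table.get(side, 0)\n"   # wrong outside of the test-range
)

def score_submission(src: str) -> float:
    namespace: dict = {}
    exec(src, namespace)      # fine for a toy; never do this with untrusted code
    area = namespace["area"]
    if any(area(s) != s * s for s in TEST_RANGE):
        return float("-inf")  # accurate output inside test-range is required
    return -len(src)          # brevity bonus: shorter source scores higher

print(score_submission(GENERAL_SRC) > score_submission(LOOKUP_SRC))  # True
```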

Combining optimization-targets

Optimization-targets can be combined (be that with weighted sums, or functions that assign score based on more sophisticated evaluations). When I talk about optimization-targets, you should not assume that it consists of just 1 heuristic (such as e.g. source code length).

Techniques for testing P(optimization-target forces good output | optimization-target causes low wiggle-room)

Run AI vs AI games where we:

  • Pick out various subsets of evaluation-range to be test-range.
  • Set semi-random test-ranges, even though for the task in question input-range = evaluation-range.

(For the sake of scale/efficiency we can have AIs make bets on the outcomes of such games.)
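
One hedged way to picture that procedure in code (the `run_game` and `is_good_output` arguments stand in for machinery that isn't specified here):

```python
import random
from typing import Callable, List, Set

def estimate_forces_good_output(
    evaluation_range: List[int],
    run_game: Callable[[Set[int]], Callable[[int], int]],   # test-range -> winning function
    is_good_output: Callable[[Callable[[int], int], int], bool],
    trials: int = 100,
) -> float:
    """Hold out part of the safely-evaluable range, and check how often the
    winning function is still correct on the held-out inputs."""
    successes = 0
    for _ in range(trials):
        test_range = set(random.sample(evaluation_range, k=len(evaluation_range) // 2))
        held_out = [x for x in evaluation_range if x not in test_range]
        winner = run_game(test_range)
        if all(is_good_output(winner, x) for x in held_out):
            successes += 1
    return successes / trials
```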

Rules for predicting if/when optimization-targets limit wiggle-room for the right reasons

AIs can find rules that predict when we should trust some optimization-target to cause low wiggle-room for the right reasons (scored by simplicity, false positive rate, etc).

They can make bets on average performance, and point out examples where rules from competing AIs fail.

Convergence/divergence among rules for predicting if/when optimization-targets limit wiggle-room for the right reasons

A: Huge (but limited) scope of outputs that we can test safely.

B: Outputs we can't test safely.

R: Rules for predicting when optimization-targets force good output (top-scoring when tested within A).

For outputs in B, we can explore: "Does R converge in regards to this output?"

“Spaces” of possible functions we think contain ≥1 robustly reliable function

If we (have good reason to) think some “space” of functions contains at least ≥1 robustly reliable function, and all functions in that “space” agree on output for a given input, then we can trust this to be that output - even if there isn’t a specific function that we trust to be robustly reliable.
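
In code, that trust-rule might look something like this (illustrative):

```python
from typing import Callable, Iterable, Optional

def output_if_space_agrees(
    function_space: Iterable[Callable[[int], int]], x: int
) -> Optional[int]:
    """If every function in a "space" we trust to contain >=1 robustly reliable
    function agrees on this input, return that agreed-upon output."""
    outputs = {f(x) for f in function_space}
    return outputs.pop() if len(outputs) == 1 else None
```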

Examples of optimization-targets

Source code brevity

Sometimes the solution with the shortest source code is quite generic/general/abstract. In such cases, intentionally giving wrong output for some inputs would often require implementing this solution and adding (something akin to) if-sentences that give bad output for some inputs.

Generality

A maximally “non-general” function would have 1 if-sentence per possible input.

Possible heuristics:

  • % chance that a random change to the code changes output for a low but non-zero fraction of inputs
  • Whether there are simplifications that would affect a low but non-zero fraction of inputs

Use of computational resources

If a function only focuses on finding the correct solution, and doesn’t need to spend cycles thinking about how to deceive, what humans think, etc (in addition to finding the correct solution), it will need less computational resources.

Non-overlap-minimization with optimized code-bases where we robustly can score/verify the full range of outputs

We can score/verify all outputs for:

  • Theorem-provers
  • Game-players
  • Programs predicting what humans say next in existing data (videos, chats, etc)
  • Etc, etc

If you’re not trying to have your function give bad output once in a while, borrowing sub-routines may be easier.

Source code brevity (while doing several specialized tasks from same code-base)

Suppose one code-base is to be specialized, but towards several different tasks (and for several of them, all output can be verified).

E.g.:

  • Coming up with ways to make nanotech
  • Searching for mathematical proofs
  • Predicting experimental results (from existing experiments)

Modularity

Imagine that you are trying to make a car that fails in certain very specific conditions. This becomes more challenging if it has to be made out of parts that also are used in other vehicles (and these other vehicles are robustly tested).

The degree to which subroutine instantiations inside/outside test-range can be told apart

% of source code that consists of subroutines where no AI is able to guess/determine based on input if the function-call is inside or outside of test-range.

This could be determined by running other AI vs AI games (but mostly by making predictions/bets on what outcome would be).

Module implementation flexibility

Imagine that you are trying to make a car, and want the steering wheel to fail, but only in certain very specific conditions. This becomes more challenging to achieve consistently if it isn’t you who will be making the steering wheel.

Module division flexibility

If you are trying to fail for certain specific inputs, this is made harder if you are making functions based on different architectures, and they need to fail the same way.

There may be crude heuristics that help approximate whether 2 architectures are substantially different.

Minimal computational “logic” (including “axioms”) needed to infer output

  • Each AI specifies an initial base of data-structures ("axioms"), which includes functions ("inference-rules").
  • Each AI can infer good/bad output from either initial base.
  • Less source code (in total for initial base, and per data-structure in initial base) leads to higher score.

Predicted comprehensibility to humans

Heuristics relating to how understandable code-base would be to humans.

It is allowed to rewrite code-base for understandability, so long as output is unchanged.

This optimization-target presupposes systems that can predict human output (with non-terrible accuracy).

Telephone game robustness

Suppose various systems (maybe some that emulate humans and some that don't) for:

  • Converting human instructions/diagrams/etc to code
  • Converting back the other way

With N cycles, what's the % chance of ≥1 outputs changing, but only outside of the test-range?

Output-robustness to training data

Instead of having the AI do something, we can have it make a seed AI that does (after having been trained). And we can see the degree to which various outputs are dependent on variations in training data.

Output-robustness to adding chains by which output can be indirectly obtained

Indirect ways of obtaining answers can be chained:

Make a function that

generates a computational logic that

(...)

gives us the output we want

Each chain is an opportunity for deceptive AIs to lose ground / be unable to have intentional mistakes remain.

Comment by Tor Økland Barstad (tor-okland-barstad) on continue working on hard alignment! don't give up! · 2023-03-24T09:15:41.503Z · LW · GW

I would also like to see more work where people make less positive/optimistic assumptions. I think of it as a good thing that different approaches to alignment are being explored, and would like to see more of that in general (both in terms of breadth and depth).

I guess there are many possible ways of trying to categorize/conceptualize approaches to alignment theorizing. One is by asking "when talking/thinking about the methodology, what capabilities are assumed to be in place?".

I'm not sure about this, but unless I'm mistaken[1], a good amount of the work done by MIRI has been under a premise that goes (something sort of like): "Let's assume that we already know how to give AIs real-world objectives defined in terms of utility functions (not because we really assume that, but as a simplifying assumption)". And as far as I know, they haven't done much work where the assumption was something more like "suppose we were extremely good at gradient descent / searching through spaces of possible programs".

In my own theorizing, I don't make all of the simplifying assumptions that (I think/suspect) MIRI made in their "orthodox" research. But I make other assumptions (for the purpose of simplification), such as:

  • "let's assume that we're really good at gradient descent / searching for possible AIs in program-space"[2]
  • "let's assume that the things I'm imagining are not made infeasible due to a lack of computational resources"
  • "let's assume that resources and organizational culture makes it possible to carry out the plans as described/envisioned (with high technical security, etc)"

In regards to your alignment ideas, is it easy to summarize what you assume to be in place? Like, if someone came to you and said "we have written the source code for a superintelligent AGI, but we haven't turned it on yet" (and you believed them), is it easy to summarize what more you then would need in order to implement your methodology? 

  1. ^

    I very well could be, and would appreciate any corrections.

    (I know they have worked on lots of detail-oriented things that aren't "one big plan" to "solve alignment". And maybe how I phrase myself makes it seem like I don't understand that. But if so, that's probably due to bad wording on my part.)

  2. ^

    Well, I sort of make that assumption, but there are caveats.

Comment by Tor Økland Barstad (tor-okland-barstad) on God vs AI scientifically · 2023-03-24T07:05:31.468Z · LW · GW

If humans (...) machine could too.

From my point of view, humans are machines (even if not typical machines). Or, well, some will say that by definition we are not - but that's not so important really ("machine" is just a word). We are physical systems with certain mental properties, and therefore we are existence proofs of physical systems with those certain mental properties being possible.

machine can have any level of intelligence, humans are in a quite narrow spectrum

True. Although if I myself somehow could work/think a million times faster, I think I'd be superintelligent in terms of my capabilities. (If you are skeptical of that assessment, that's fine - even if you are, maybe you believe it in regards to some humans.)

prove your point by analogy with humans. If humans can pursue somewhat any goal, machine could too.

It has not been my intention to imply that humans can pursue somewhat any goal :)

I meant to refer to the types of machines that would be technically possible for humans to make (even if we don't want to do so in practice, and shouldn't want to). And when saying "technically possible", I'm imagining "ideal" conditions (so it's not the same as me saying we would be able to make such machines right now - only that it at least would be theoretically possible).

Comment by Tor Økland Barstad (tor-okland-barstad) on God vs AI scientifically · 2023-03-24T03:24:26.812Z · LW · GW

Why call it an assumption at all?

Partly because I was worried about follow-up comments that were kind of like "so you say you can prove it - well, why aren't you doing it then?".

And partly because I don't make a strict distinction between "things I assume" and "things I have convinced myself of, or proved to myself, based on things I assume". I do see there as sort of being a distinction along such lines, but I see it as blurry.

Something that is derivable from axioms is usually called a theorem.

If I am to be nitpicky, maybe you meant "derived" and not "derivable".

From my perspective there is a lot of in-between between these two:

  • "we've proved this rigorously (with mathemathical proofs, or something like that) from axiomatic assumptions that pretty much all intelligent humans would agree with"
  • "we just assume this without reason, because it feels self-evident to us"

Like, I think there is a scale of sorts between those two.

I'll give an extreme example:

Person A: "It would be technically possible to make a website that works the same way as Facebook, except that its GUI is red instead of blue."

Person B: "Oh really, so have you proved that then, by doing it yourself?"

Person A: "No"

Person B: "Do you have a mathemathical proof that it's possible"

Person A: "Not quite. But it's clear that if you can make Facebook like it is now, you could just change the colors by changing some lines in the code."

Person B: "That's your proof? That's just an assumption!"

Person A: "But it is clear. If you try to think of this in a more technical way, you will also realize this sooner or later."

Person B: "What's your principle here, that every program that isn't proven as impossible is possible?"

Person A: "No, but I see very clearly that this program would be possible."

Person B: "Oh, you see it very clearly? And yet, you can't make it, or prove mathemathically that it should be possible."

Person A: "Well, not quite. Most of what we call mathemathical proofs, are (from my point of view) a form of rigorous argumentation. I think I understand fairly well/rigorously why what I said is the case. Maybe I could argue for it in a way that is more rigorous/formal than I've done so far in our interaction, but that would take time (that I could spend on other things), and my guess is that even if I did, you wouldn't look carefully at my argumentation and try hard to understand what I mean."

The example I give here is extreme (in order to get across how the discussion feels to me, I make the thing they discuss into something much simpler). But from my perspective it is sort of similar to the discussion regarding the Orthogonality Thesis. Like, the Orthogonality Thesis is imprecisely stated, but I "see" quite clearly that some version of it is true. Similar to how I "see" that it would be possible to make a website that technically works like Facebook but is red instead of blue (even though - as I mentioned - that's a much more extreme and straight-forward example).

Comment by Tor Økland Barstad (tor-okland-barstad) on God vs AI scientifically · 2023-03-24T01:54:55.316Z · LW · GW

(...) if it's supported by argument or evidence, but if it is, then it's no mere assumption.

I do think it is supported by arguments/reasoning, so I don't think of it as an "axiomatic" assumption. 

A follow-up to that (not from you specifically) might be "what arguments?". And - well, I think I pointed to some of my reasoning in various comments (some of them under deleted posts). Maybe I could have explained my thinking/perspective better (even if I wouldn't be able to explain it in a way that's universally compelling 🙃). But it's not a trivial task to discuss these sorts of issues, and I'm trying to check out of this discussion.

I think there is merit to having as a frame of mind: "Would it be possible to make a machine/program that is very capable in regards to criteria x, y, etc, and optimizes for z?".

I think it was good of you to bring up Aumann's agreement theorem. I haven't looked into the specifics of that theorem, but broadly/roughly speaking I agree with it.

Comment by Tor Økland Barstad (tor-okland-barstad) on God vs AI scientifically · 2023-03-22T12:01:51.062Z · LW · GW

I cannot help you to be less wrong if you categorically rely on intuition about what is possible and what is not.

I wish I had something better to base my beliefs on than my intuitions, but I do not. My belief in modus ponens, my belief that 1+1=2, my belief that me observing gravity in the past makes me likely to observe it in the future, my belief that if views are in logical contradiction they cannot both be true - all this is (the way I think of it) grounded in intuition.

Some of my intuitions I regard as much more strong/robust than others. 

When my intuitions come into conflict, they have to fight it out.

Thanks for the discussion :)

Comment by Tor Økland Barstad (tor-okland-barstad) on God vs AI scientifically · 2023-03-22T10:13:36.995Z · LW · GW

Like with many comments/questions from you, answering this question properly would require a lot of unpacking. Although I'm sure that also is true of many questions that I ask, as it is hard to avoid (we all have limited communication bandwidth) :)

In this last comment, you use the term "science" in a very different way from how I'd use it (like you sometimes also do with other words, such as for example "logic"). So if I was to give a proper answer I'd need to try to guess what you mean, make it clear how I interpret what you say, and so on (not just answer "yes" or "no").

I'll do the lazy thing and refer to some posts that are relevant (and that I mostly agree with):

Comment by Tor Økland Barstad (tor-okland-barstad) on God vs AI scientifically · 2023-03-22T07:57:57.503Z · LW · GW

It seems that 2 + 2 = 4 is also an assumption for you.

Yes (albeit a very reasonable one).

Not believing (some version of) that claim would typically make minds/AGIs less "capable", and I would expect more or less all AGIs to hold (some version of) that "belief" in practice.

I don't think it is possible to find consensus if we do not follow the same rules of logic.

Here are examples of what I would regard to be rules of logic: https://en.wikipedia.org/wiki/List_of_rules_of_inference (the ones listed here don't encapsulate all of the rules of inference that I'd endorse, but many of them). Despite our disagreements, I think we'd both agree with the rules that are listed there.

I regard Hitchens's razor not as a rule of logic, but more as an ambiguous slogan / heuristic / rule of thumb.

Best wishes from my side as well :)

:)

Comment by Tor Økland Barstad (tor-okland-barstad) on AGI is uncontrollable, alignment is impossible · 2023-03-22T07:39:32.958Z · LW · GW

I do have arguments for that, and I have already mentioned some of them earlier in our discussion (you may not share that assessment, despite us being relatively close in mind-space compared to most possible minds, but oh well).

Some of the more relevant comments from me are on one of the posts that you deleted.

As I mention here, I think I'll try to round off this discussion. (Edit: I had a malformed/misleading sentence in that comment that should be fixed now.)

Comment by Tor Økland Barstad (tor-okland-barstad) on God vs AI scientifically · 2023-03-22T07:35:28.382Z · LW · GW

Every assumption is incorrect unless there is evidence. 

Got any evidence for that assumption? 🙃

Answer to all of them is yes. What is your explanation here?

Well, I don't always "agree"[1] with ChatGPT, but I agree in regards to those specific questions.

...

I saw a post where you wanted people to explain their disagreement, and I felt inclined to do so :) But it seems now that neither of us feel like we are making much progress.

Anyway, from my perspective much of your thinking here is very misguided. But not more misguided than e.g. "proofs" for God made by people such as Descartes and other well-known philosophers :) I don't mean that as a compliment, but more so as to neutralize what may seem like anti-compliments :)

Best of luck (in your life and so on) if we stop interacting now or relatively soon :)

I'm not sure if I will continue discussing or not. Maybe I will stop either now or after a few more comments (and let you have the last word at some point).

  1. ^

    I use quotation-marks since ChatGPT doesn't have "opinions" in the way we do.

Comment by Tor Økland Barstad (tor-okland-barstad) on AGI is uncontrollable, alignment is impossible · 2023-03-22T07:30:41.290Z · LW · GW

Do you think you can deny existence of an outcome with infinite utility? 

To me, according to my preferences/goals/inclinations, there are conceivable outcomes with infinite utility/disutility.

But I think it is possible (and feasible) for a program/mind to be extremely capable, and affect the world, and not "care" about infinite outcomes.

The fact that things "break down" is not a valid argument.

I guess that depends on what's being discussed. Like, it is something to take into account/consideration if you want to prove something while referencing utility-functions that reference infinities.

Comment by Tor Økland Barstad (tor-okland-barstad) on God vs AI scientifically · 2023-03-22T07:18:57.982Z · LW · GW

About universally compelling arguments?

First, a disclaimer: I do think there are "beliefs" that most intelligent/capable minds will have in practice. E.g. I suspect most will use something like modus ponens, most will update beliefs in accordance with statistical evidence in certain ways, etc. I think it's possible for a mind to be intelligent/capable without strictly adhering to those things, but for sure I think there will be a correlation in practice for many "beliefs".

Questions I ask myself are:

  • Would it be impossible (in theory) to wire together a mind/program with "belief"/behavior x, and have that mind be very capable at most mental tasks?
  • Would it be infeasible (for humans) to wire together a mind/program with "belief"/behavior x, and have that mind be very capable at most mental tasks?

And in the case of e.g. caring about "goals", I don't see good reasons to think that the answer to either question is "yes".

Like, I think it is physically and practically possible to make minds that act in ways that I would consider "completely stupid", while still being extremely capable at most mental tasks.

Another thing I sometimes ask myself:

  1. "Is it possible for an intelligent program to surmise what another intelligent mind would do if it had goal/preferences/optimization-target x?"
  2. "Would it be possible for another program to ask about #1 as a question, or fetch that info from the internals of another program?"

If yes and yes, then a program could be written that (as in #2) fetches from #1 what such a mind would do (with goal/preferences/optimization-target x), and carries out that thing.

I could imagine information that would make me doubt my opinion / feel confused, but nothing that is easy to summarize. (I would have to be wrong about several things - not just one.)

Comment by Tor Økland Barstad (tor-okland-barstad) on God vs AI scientifically · 2023-03-22T07:01:05.085Z · LW · GW

With all the interactions we had, I've got an impression that you are more willing to repeat what you've heard somewhere instead of thinking logically.

Some things I've explained in my own words. In other cases, where someone else already has explained something thing well, I've shared an URL to that explanation.

more willing to repeat what you've heard somewhere instead of thinking logically

This seems to support my hypothesis of you "being so confident that we are the ones who "don't get it" that it's not worth it to more carefully read the posts that are linked to you, more carefully notice what we point to as cruxes, etc".

Universally compelling arguments are not possible" is an assumption

Indeed.  And it's a correct assumption.

Why would there be universally compelling arguments?

One reason would be that the laws of physics worked in such a way that only minds that think in certain ways are allowed at all. Meaning that if neurons or transistors fire so as to produce beliefs that aren't allowed, some extra force in the universe intervenes to prevent that. But, as far as I know, you don't reject physicalism (that all physical events, including thinking, can be explained in terms of relatively simple physical laws).

Another reason would be that minds would need to "believe"[1] certain things in order to be efficient/capable/etc (or in order to be the kind of efficient/capable/etc thinking machine that humans may be able to construct). But that's also not the case. It's not even needed for logical consistency[2].

  1. ^

    Believe is not quite the right word, since we also are discussing what minds are optimized for / what they are wired to do.

  2. ^

    And logical consistency is also not a requirement in order to be efficient/capable/etc. As a rule of thumb it helps greatly of course. And this is a good rule of thumb, as rules of thumbs go. But it would be a leaky generalization to presume that it is an absolute necessity to have absolute logical consistency among "beliefs"/actions.

Comment by Tor Økland Barstad (tor-okland-barstad) on God vs AI scientifically · 2023-03-22T06:48:32.217Z · LW · GW

What about "I think therefore I am"? Isn't it universally compelling argument?

Not even among the tiny tiny section of mind-space occupied by human minds: 

Notice also that "I think therefore I am" is an is-statement (not an ought-statement / something a physical system optimizes towards).

As to me personally, I don't disagree that I exist, but I see it as a fairly vague/ill-defined statement. And it's not a logical necessity, even if we presume assumptions that most humans would share. Another logical possibility would be Boltzmann brains (unless a Boltzmann brain would qualify as "I", I guess).

I argue that "no universally compelling arguments" is misleading.

You haven't done that very much. Only, insofar as I can remember, through anthropomorphization, and reference to metaphysical ought-assumptions not shared by all/most possible minds (sometimes not even shared by the minds you are interacting with, despite these minds being minds that are capable of developing advanced technology).

Comment by Tor Økland Barstad (tor-okland-barstad) on God vs AI scientifically · 2023-03-22T02:03:41.347Z · LW · GW

Agreed (more or less). I have pointed him to this post earlier. He has given no signs so far of comprehending it, or even reading it and trying to understand what is being communicated to him.

I'm saying this more directly than I usually would @Donatas, since you seem insistent on clarifying a disagreement/misunderstanding you think is important for the world, while it seems (as far as I can see) that you're not comprehending all that is communicated to you (maybe due to being so confident that we are the ones who "don't get it" that it's not worth it to more carefully read the posts that are linked to you, more carefully notice what we point to as cruxes, etc).

Edit: I was unnecessarily hostile/negative here.

Comment by Tor Økland Barstad (tor-okland-barstad) on AGI is uncontrollable, alignment is impossible · 2023-03-22T01:50:11.405Z · LW · GW

He didn't say that "infinite value" is logically impossible. He desdribed it as an assumption.

When saying "is possible, I'm not sure if he meant "is possible (conceptually)" or "is possible (according to the ontology/optimization-criteria of any given agent)". I think the latter would be most sensible.

He later said: "I think initially specifying premises such as these more precisely initially ensures the reasoning from there is consistent/valid.". Not sure if I interpreted him correctly, but I saw it largely as an encouragment to think more explicitly about things like these (not be sloppy about it). Or if not an encouragement to do that, then at least pointing out that it's something you're currently not doing.

If we have a traditional/standard utility-function, and use traditional/standard math in regards to that utility function, then involving credences of infinite-utility outcomes would typically make things "break down" (with most actions considered to have expected utilities that are either infinite or undefined).

Like, suppose action A has 0.001% chance of infinite negative utility and 99% chance of infinite positive utility. The utility of that action would, I think, be undefined (I haven't looked into it). I can tell for sure that mathematically it would not be regarded to have positive utility. Here is a video that explains why.

If that doesn't make intuitive sense to you, then that's fine. But mathematically that's how it is. And that's something to have awareness of (account for in a non-handwavy way) if you're trying to make a mathematical argument with a basis in utility functions that deal with infinities.
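
If it helps to see the "undefined" part concretely: IEEE-754 floats mirror the mathematical situation here, in that an expected-value calculation mixing infinities of both signs comes out as NaN ("not a number") rather than as something positive:

```python
inf = float("inf")

# 0.001% chance of infinite negative utility, 99% chance of infinite positive utility:
expected_utility = 0.00001 * (-inf) + 0.99 * inf
print(expected_utility)  # nan (infinity minus infinity is undefined)
```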

Even if you did account for that, it would be beside the point from my perspective, in more ways than one. So what we're discussing now is not actually a crux for me.
 

Like, suppose action A has 0.001% chance of infinite negative utility and 99% chance of infinite positive utility. The utility of that action would, I think, be undefined 

For me personally, it would of course make a big difference whether there is a 0.00000001% chance of infinite positive utility or a 99.999999999% chance. But that is me going with my own intuitions. The standard math relating to EV-calculations doesn't support this.

Comment by tor-okland-barstad on [deleted post] 2023-03-19T23:58:09.305Z

Same traits that make us intelligent (ability to logically reason), make us power seekers.

Well, I do think the two are connected/correlated. And arguments relating to instrumental convergence are a big part of why I take AI risk seriously. But I don't think strong abilities in logical reasoning necessitates power-seeking "on its own".

I think it is wrong to consider Pascal's mugging a vulnerability.

For the record, I don't think I used the word "vulnerability", but maybe I phrased myself in a way that implied me thinking of things that way. And maybe I also partly think that way.

I'm not sure what I think regarding beliefs about small probabilities. One complication is that I also don't have certainty in my own probability-guesstimates.

I'd agree that for smart humans it's advisable to often/mostly think in terms of expected value, and to also take low-probability events seriously. But there are exceptions to this from my perspective.

In practice, I'm not much moved by the original Pascal's Wager (and I'd find it hard to compare the probability of the Christian fantasy to other fantasies I can invent spontaneously in my head).

Comment by Tor Økland Barstad (tor-okland-barstad) on AGI is uncontrollable, alignment is impossible · 2023-03-19T23:39:55.957Z · LW · GW

Most humans are not obedient/subservient to others (at least not maximally so). But also: Most humans would not exterminate the rest of humanity if given the power to do so. I think many humans, if they became a "singleton", would want to avoid killing other humans. Some would also be inclined to make the world a good place to live for everyone (not just other humans, but other sentient beings as well).

From my perspective, the example of humans was intended as "existence proof". I expect AGIs we develop to be quite different from ourselves. I wouldn't be interested in the topic of alignment if I didn't perceive there to be risks associated with misaligned AGI, but I also don't think alignment is doomed/hopeless or anything like that 🙂

Comment by tor-okland-barstad on [deleted post] 2023-03-19T23:30:53.147Z

I'd argue that the only reason you do not comply with Pascal's mugging is because you don't have unavoidable urge to be rational, which is not going to be the case with AGI.

I'd agree that among superhuman AGIs that we are likely to make, most would probably be prone towards rationality/consistency/"optimization" in ways I'm not.

I think there are self-consistent/"optimizing" ways to think/act that wouldn't make minds prone to Pascal's muggings.

For example, I don't think there is anything logically inconsistent about e.g. trying to act so as to maximize the median reward, as opposed to the expected value of rewards (I give "median reward" as a simple example - that particular example doesn't seem likely to me to occur in practice).
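
A small illustration of the difference (the numbers are toy numbers of my own choosing): a mugging-like option with a tiny probability of an astronomically large reward dominates under expected value, but not under the probability-weighted median.

```python
def expected_value(lottery):
    return sum(p * reward for p, reward in lottery)

def median_reward(lottery):
    """Probability-weighted median: smallest reward at which cumulative probability reaches 0.5."""
    cumulative = 0.0
    for p, reward in sorted(lottery, key=lambda pr: pr[1]):
        cumulative += p
        if cumulative >= 0.5:
            return reward
    raise ValueError("probabilities should sum to 1")

pay_the_mugger = [(1e-9, 1e20), (1 - 1e-9, -10.0)]  # tiny chance of a huge reward, else lose 10
refuse = [(1.0, 0.0)]

print(expected_value(pay_the_mugger) > expected_value(refuse))  # True:  EV-maximizer pays
print(median_reward(pay_the_mugger) > median_reward(refuse))    # False: median-maximizer refuses
```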

Thanks for your input, it will take some time for me to process it.

🙂

Comment by Tor Økland Barstad (tor-okland-barstad) on AGI is uncontrollable, alignment is impossible · 2023-03-19T23:04:22.887Z · LW · GW

Hopefully I'm wrong, please help me find a mistake.

There is more than just one mistake here IMO, and I'm not going to try to list them.

Just the title alone ("AGI is uncontrollable, alignment is impossible") is totally misguided IMO. It would, among other things, imply that brain emulations are impossible (humans can be regarded as a sort of AGI, and it's not impossible for humans to be aligned).

But oh well. I'm sure your perspectives here are earnestly held / it's how you currently see things. And there are no "perfect" procedures for evaluating how much to trust one's own reasoning compared to others.

I would advise reading the sequences (or listening to them as an audiobook) 🙂

Comment by tor-okland-barstad on [deleted post] 2023-03-19T22:49:23.014Z

If an outcome with infinite utility is presented, then it doesn't matter how small its probability is: all actions which lead to that outcome will have to dominate the agent's behavior.

My perspective would probably be more similar to yours (maybe still with substantial differences) if I had the following assumptions:

  1. All agents have a utility-function (or act indistinguishably from agents that do)
  2. All agents where #1 is the case act in a pure/straight-forward way to maximize that utility-function (not e.g. discounting infinities)
  3. All agents where #1 is the case have utility-functions that relate to states of the universe
  4. Cases involving infinite positive/negative expected utility would always/typically speak in favor of one behavior/action. (As opposed to there being different possibilities that imply infinite negative/positive expected utility, and - well, not quite "cancel each other out", but make it so that traditional models of utility-maximization sort of break down).

I think that I myself am an example of an agent. I am relatively utilitarian compared to most humans. Far-fetched possibilities with infinite negative/positive utility don't dominate my behavior. This is not due to me not understanding the logic behind Pascal's Muggings (I find the logic of it simple and straight-forward).

Generally I think you are overestimating the appropriateness/correctness/merit of using a "simple"/abstract model of agents/utility-maximizers, and presuming that any/most "agents" (as we more broadly conceive of that term) would work in accordance with that model.

I see that Google defines an agent as "a person or thing that takes an active role or produces a specified effect". I think of it is cluster-like concept, so there isn't really any definition that fully encapsulates how I'd use that term (generally speaking I'm inclined towards not just using it differently than you, but also using it less than you do here).

Btw, for one possible way to think about utility-maximizers (another cluster-like concept IMO), you could see this post. And here and here are more posts that describe "agency" in a similar way:

In this sort of view, being "agent-like" is more of gradual thing than a yes-no-thing. This aligns with my own internal model of "agentness", but it's not as if there is any simple/crisp definition that fully encapsulates my conception of "agentness".  

I think that Orthogonality thesis is right only if an agent is certain that an outcome with infinite utility does not exist. And I argue that an agent cannot be certain of that. Do you agree?

In regards to the first sentence ("I think that Orthogonality thesis is right only if an agent is certain that an outcome with infinite utility does not exist"):

No, I don't agree with that.

In regards to the second sentence ("And I argue that an agent cannot be certain of that"):

I'm not sure what internal ontologies different "agents" would have. Maybe, like with us, may have some/many uncertainties that don't correspond to clear numeric values.

In some sense, I don't see "infinite certainty" as being appropriate in regards to (more or less) any belief. I would not call myself "infinitely certain" that moving my thumb slightly upwards right now won't doom me to an eternity in hell, or that doing so won't save me from an eternity in hell. But I'm confident enough that I don't think it's worth it for me to spend time/energy worrying about those particular "possibilities".

Comment by tor-okland-barstad on [deleted post] 2023-03-19T16:34:36.116Z

It seems that you do not recognize https://www.lesswrong.com/tag/pascal-s-mugging .

Not sure what you mean by "recognize". I am familiar with the concept.

But to be honest most of statements that we can think of may be true and unknowable, for example "aliens exist", "huge threats exist", etc.

"huge threat" is a statement that is loaded with assumptions that not all minds/AIs/agents will share.

Can you prove that there cannot be any unknowable true statement that could be used for Pascal's mugging?

Used for Pascal's mugging against who? (Humans? Cofffee machines? Any AI that you would classify as an agent? Any AI that I would classify as an agent? Any highly intelligent mind with broad capabilities? Any highly intelligent mind with broad capabilities that has a big effect on the world?)

Comment by tor-okland-barstad on [deleted post] 2023-03-18T23:20:53.451Z

Fitch's paradox of knowability and Gödel's incompleteness theorems prove that there may be true statements that are unknowable. 

Independently of Gödel's incompleteness theorems (which I have heard of) and Fitch's paradox of knowability (which I had not heard of), I do agree that there can be true statements that are unknown/unknowable (including relatively "simple" ones) 🙂

For example "rational goal exists" may be true and unknowable. Therefore "rational goal may exist" is true. (...) Do you agree?

I don't think it follows from "there may be statements that are true and unknowable" that "any particular statement may be true and unknowable".

Also, some statements may be seen as non-sensical / ill-defined / don't have a clear meaning.

Regarding the term "rational goal", I think it isn't well enough specified/clarified for me to agree or disagree about whether "rational goals" exist.

In regards to Gödel's incompleteness theorem, I suspect "rational goal" (the way you think of it) probably couldn't be defined clearly enough to be the kind of statement that Gödel was reasoning about.

I don't think there are universally compelling arguments (more about that here).

Comment by tor-okland-barstad on [deleted post] 2023-03-18T22:42:23.067Z

Why do you think your starting point is better?

I guess there are different possible interpretations of "better". I think it would be possible for software-programs to be much more mentally capable than me across most/all dimensions, and still not have "starting points" that I would consider "good" (for various interpretations of "good").

As I understand you assume different starting-point.

I'm not sure. Like, it's not as if I don't have beliefs or assumptions or guesses relating to AIs. But I think I probably make less general/universal assumptions that I'd expect to hold for "all" [AIs / agents / etc].

This post is sort of relevant to my perspective 🙂

Comment by tor-okland-barstad on [deleted post] 2023-03-18T22:23:22.892Z

In my opinion the optimal behavior is

Not sure what you mean by "optimal behavior". I think I can see how the things make sense if the starting point is that there is this things called "goals", and (I, the mind/agent) am motivated to optimize for "goals". But I don't assume this as an obvious/universal starting-point (be that for minds in general, extremely intelligent minds in general, minds in general that are very capable and might have a big influence on the universe, etc).

This is a common mistake to assume, that if you don't know your goal, then it does not exist (...)

My perspective is that even AIs that are (what I'd think of as) utility maximizers wouldn't necessarily think in terms of "goals".

The examples you list are related to humans. I agree that humans often have goals that they don't have explicit awareness of. And humans may also often have as an attitude that it makes sense to be in a position to act upon goals that they form in the future. I think that is true for more types of intelligent entities than just humans, but I don't think it generally/always is true for "minds in general".

Caring more about goals you may form in the future, compared to e.g. goals others may have, is not a logical necessity IMO. It may feel "obvious" to us, but what to us are obvious instincts will often not be so for all (or even most) minds in the space of possible minds.

Comment by tor-okland-barstad on [deleted post] 2023-03-18T22:06:43.725Z

I assume you mean "provide definitions"

More or less / close enough 🙂

Agent - https://www.lesswrong.com/tag/agent

Here they write: "A rational agent is an entity which has a utility function, forms beliefs about its environment, evaluates the consequences of possible actions, and then takes the action which maximizes its utility."

I would not share that definition, and I don't think most other people commenting on this post would either (I know there is some irony to that, given that it's the definition given on the LessWrong wiki). 

Often the words/concepts we use don't have clear boundaries (more about that here). I think agent is such a word/concept.

Examples of "agents" (← by my conception of the term) that don't quite have utility functions would be humans.

How we may define "agent" may be less important if what we really are interested in is the behavior/properties of "software-programs with extreme and broad mental capabilities".

Future states - numeric value of agent's utility function in the future

I don't think all extremely capable minds/machines/programs would need an explicit utility-function, or even an implicit one.

To be clear, there are many cases where I think it would be "stupid" to not act as if you have (an explicit or implicit) utility function (in some sense). But I don't think it's required of all extremely mentally capable systems (even if that means these systems would have to have logically contradictory "beliefs").

Comment by tor-okland-barstad on [deleted post] 2023-03-18T21:45:47.326Z

No. That's exactly the point I try to make by saying "Orthogonality Thesis is wrong".

Thanks for the clarification 🙂

"There is no rational goal" is an assumption in Orthogonality thesis

I suspect arriving at such a conclusion may result from thinking of utility maximizers as more of a "platonic" concept, as opposed to thinking of it from a more mechanistic angle. (Maybe I'm being too vague here, but it's an attempt to briefly summarize some of my intuitions into words.)

I'm not sure what you would mean by "rational". Would computer programs need to be "rational" in whichever sense you have in mind in order to be extremely capable at many mental tasks?

First, what is your opinion about this comment?

I don't agree with it.

It is a goal which is not chosen, not assumed, it is concluded from first principles by just using logic. [from comment you reference]

There are lots of assumptions baked into it. I think you have a much too low a bar for thinking of something as a "first principle" that any capable/intelligent software-programs necessarily would adhere to by default.

Comment by tor-okland-barstad on [deleted post] 2023-03-18T20:05:37.286Z

why would you assume that agent does not care about future states? Do you have a proof for that?

Would you be able to Taboo Your Words for "agent", "care" and "future states"? If I were to explain my reasons for disagreement it would be helpful to have a better idea of what you mean by those terms.

Comment by tor-okland-barstad on [deleted post] 2023-03-18T19:52:03.575Z

Hi, I didn't downvote, but below are some thoughts from me 🙂

Some of my comment may be pointing out things you already agree with / are aware of. 

I'd like to highlight, that this proof does not make any assumptions, it is based on first principles (statements that are self-evident truths).

First principles are assumptions. So if first principles are built in, then it's not true that it doesn't make assumptions.

I do not know my goal (...) I may have a goal

This seems to imply that the agent should have as a starting-point something akin to "I should apply a non-trivial probability to the possibility that I ought to pursue some specific goal, and act accordingly". That seems to me like starting with an ought/goal.

Even if there are "oughts" that are "correct" somehow - "oughts" that are "better" than others - that would not mean that intelligent machines by default or necessity would act in pursuit of these "oughts".

Like, suppose I thought that children "ought" not to be tortured for thousands of years (as I do). This does not make the laws of physics stop that from being the case, and it doesn't make it so that any machine that is "intelligent" would care about preventing suffering.

I also think it can be useful to ask ourselves what "goals" really are. We give one word to the word "goal", but if we try to define the term in a way that a computer could understand we see that there is nuance/complexity/ambiguity in that term.

I ought to prepare for any goal

This is not a first principle IMO.

Orthogonality Thesis is wrong

The Orthogonality Thesis states that "an agent can have (more or less) any combination of intelligence level and final goal".

Maybe I could ask you the following question: Do you think that for more or less any final goal, it's possible for a machine to reason effectively/intelligently about how that goal may be achieved?

If yes, then why might not such a machine be wired up to carry out plans that it reasons would effectively pursue that goal?

Any machine (physical system) consists of tiny components that act in accordance with simple rules (the brain being no exception).

Why might not a machine use very powerful logical reasoning, concept formation, prediction abilities, etc, and have that "engine" wired up in such a way that it is directed at (more or less) any goal? 

Some posts you may or may not find interesting 🙂:

Comment by Tor Økland Barstad (tor-okland-barstad) on Why Not Just Outsource Alignment Research To An AI? · 2023-03-17T02:54:16.671Z · LW · GW

Here is my attempt at a shorter answer (although it didn’t end up as short as planned) 🙂

I’m also being more simplistic here (at times deliberately so), in the hope of making “core” concepts digest with less effort.

If you don’t respond here you probably won’t hear from me in a while.

It can, sure, but how can a human get it to state those regularities (...)?

Score-functions would score argument-step-networks. It is these score-functions that would leverage regularities for when human evaluations are “good”/correct.

Here are some things that might be the case for any given score-function:

  • It has Low Wiggle-room (LW): It has low wiggle-room (that is to say, it’s not possible to construct high-scoring argument-step-networks that argue in favor of contradictory conclusions)
  • It Robustly Leverages Regularities for “good” human evaluations (RLR): It robustly separates out “good” human evaluations (without false positives).
  • It Contains Intelligence (CI): It is “intelligent”, and reasons itself towards favoring specific conclusions.
  • It Hardcodes Bias (HB): It has "hardcoded" bias in favor of specific conclusions.

Here are some things I assume regarding these properties:

  1. We want RLR.
  2. RLR implies LW, but LW does not imply RLR.
  3. LW implies RLR and/or CI and/or HB.
  4. We want to make it so that LW implies RLR (by eliminating other plausibilities).
  5. We can determine if LW is the case if our capabilities at gradient descent are sufficiently good (and we know that they are).

If we are sufficiently good at gradient descent (and know that we are), we can figure out if LW is the case.

Our job would be to make it so that the most high-scoring score-functions (that it’s possible for AIs to make) would achieve LW by RLR. We could make P(RLR | LW) high by doing as follows when adding restrictions and optimization-criteria for score-functions:

  1. We can give restrictions (and give bonus/penalty-points) based on source code length, processing power, etc.
  2. We can make it possible to reference predictions of human answers to questions.
    1. These answer-predictions can be referenced as system-calls. The internals of functions that make answer-predictions don’t count as part of the source code length (or processing power, etc)!
    2. We can restrict the questions that are allowed (to ones that seem likely to help with RLR, and unlikely to help with HB)
    3. We can restrict the amounts of “question-templates” that are allowed (so that no more are allowed than needed to achieve RLR)
  3. We can make it possible to reference heuristics that we suspect help with RLR
    1. For example average sentence length or total text length (those are simple examples - we could imagine more sophisticated ones also)
    2. Such heuristics would be calculated in external functions that are referenced as system-calls. The internals of these external functions would not count as part of the source code length (or processing power, etc).
  4. Humans could specify “code-snippets”, which are kind of like regular expressions. Humans will often have an inkling for how specific parts of the score-function could work to achieve RLR (even if we haven’t worked out all the details).
    1. For various parts of the score-function, using some specific “code-snippet” may be obligatory. Or maybe the score-function would have to be made out of nothing else than “code-snippets”.
    2. If “code-snippets” are not obligatory for all parts of the score-function, source code that isn’t from a snippet could be penalized more (in terms of source code length penalty).
  5. “Code-snippets” may also help enforce modularity, and enforce “abstraction” of data.
    1. For any part of the score-function it should be as hard as possible to adapt it to what the argument-step-network argues for or against.
    2. AIs should not be able to reliably guess what the argument-step-network argues for or against from looking at any input that is received by any sub-routine.
  6. #1 will (if done to a sufficient extent) make CI totally infeasible (and #4 helps make that even more assured).
  7. #2 and #3 make it so that achieving LW with RLR is possible (despite #1). But they do not make it possible to achieve LW with HB/CI.
  8. #2 and #3 make RLR easier (in terms of passing requirements and maximizing score-function score). If done well they don’t make HB easier (at least not to a comparable degree).
  9. #1 and #2 make it so that achieving LW with RLR results in a higher score compared to achieving LW with RLR + a bit of HB. (Why? Because RLR is more general, and hence reduces more wiggle-room per extra complexity that is added to the score-function. Hence minimizing HB, to 0 if possible, results in less score penalty.)
  10. #4 and #5 make HB harder.
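
To make the intended shape of restrictions #1–#3 a bit more concrete, here is a minimal Python sketch. All names, thresholds and penalties are hypothetical illustrations of the kind of setup I have in mind, not a worked-out proposal. The point is just that answer-predictions and heuristics live behind system-calls that don’t count toward the score-function’s own length budget, while the score-function itself is kept short:

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical external services. Their internals do NOT count toward the
# score-function's source-code length or compute budget.
AnswerPredictor = Callable[[str, str], float]  # (question_template, argument_step) -> predicted agreement in [0, 1]
Heuristic = Callable[[str], float]             # e.g. average sentence length of an argument-step

@dataclass
class ScoreFunctionSubmission:
    source_code: str               # must be written within human-defined confines ("code-snippets", etc.)
    question_templates: List[str]  # meta-questions it may reference via system-calls
    heuristic_names: List[str]     # external heuristics it may reference via system-calls

# Hypothetical limits (roughly corresponding to restrictions #1 and #2.3 above).
MAX_SOURCE_LENGTH = 2_000        # too short for the score-function itself to "be intelligent"
MAX_QUESTION_TEMPLATES = 40      # cap on distinct meta-questions

def meets_hard_restrictions(sub: ScoreFunctionSubmission) -> bool:
    # Hard requirements: submissions that fail these are rejected outright.
    return (len(sub.source_code) <= MAX_SOURCE_LENGTH
            and len(sub.question_templates) <= MAX_QUESTION_TEMPLATES)

def complexity_penalty(sub: ScoreFunctionSubmission) -> float:
    # Softer optimization-criteria: shorter source code and fewer
    # question-templates lead to a smaller penalty.
    return 0.01 * len(sub.source_code) + 1.0 * len(sub.question_templates)
```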

And wiggle-room has to be kept low while it remains possible to reach correct conclusions. These conclusions can relate to:

  • Things we know
  • Predictions relating to experiments that we can carry out safely+easily (or existing experimental data)
  • The behavior (in terms of e.g. output) of any function that is made by the AI or made step by step from within the argument-step-network

In other words: The score-function has to be “watertight” in relation to an incomprehensibly large scope of possible conclusions/predictions. Many of these would be no less challenging than the strawberry copy challenge, and would be in need of similar capabilities to be developed/verified from within the argument-step-network!

If a score-function is not “watertight”, it becomes easy for capable AGIs to generate conclusions that are contradictions or that easily/safely can be shown to be wrong.

Ask yourself: which of the following is harder?

  • Making a bowl that leaks water in certain very specific conditions - e.g. on Sunday afternoons (but otherwise tends to be very watertight).
  • A bowl like the one described above, except that it doesn’t leak in those very specific conditions.

Between the following A and B, A is often easier:

  A. Making something be very reliable in general
  B. Making something be very reliable in general, but having certain specific and idiosyncratic exceptions from this (where instead of working reliably it gets things reliably wrong)

In order for score-functions for argument-step-networks to lead us astray, it isn’t enough for B to not be harder than A; B would have to be easier than A!

If A and B are of equal difficulty, it would be possible to make high-scoring score-functions that give high scores to contradictory argument-step-networks. If a human-defined score-function score-function allows for contradictory conclusions, then score-functions that get a high score by this score-function score-function cannot be trusted (even if internally they have low wiggle-room)!

It can, sure, but how can a human (...) tell that it has stated them accurately?

We can define “spaces” of possible score-function score-functions (and score-function restrictions). Within such a space we have:

  • Lots of different ways to be “strict” (that we think make HB and CI harder)
  • Various optimization-criteria based on which we can deduct/add points (where more points corresponds with optimizing for criteria that we think make HB/CI comparatively harder compared to RLR)

For any given restriction/optimization-criteria we may have various degrees of confidence for how this affects the ease of RLR vs HB/CI (for AIs that try to maximize score).

If we are right about our assumptions we will see clear patterns emerge:

  • High-scoring score-functions that only allow for correct predictions/conclusions in regards to things we can test safely/easily are also internally consistent (amongst each other) in regards to predictions/conclusions we can’t test safely/easily.
  • All/most high-scoring score-functions converge on the same set of possible conclusions/predictions (that it’s possible to argue in favor of with high-scoring argument-step-networks).
  • The “stricter” the score-functions are, the more they converge (towards the same set of possible conclusions/predictions).
  • The more confident we feel that any given restriction/optimization-target benefits P(RLR | LW), the more adding these restrictions/optimization-targets makes score-functions converge (towards the same set of possible conclusions/predictions).

Meanwhile, if we are not right about our assumptions, then we would see that for various claims it’s possible to argue for both “X” and “not X”, with argument-step-networks that are given a high score by ≥1 high-scoring score-function. We would not see high-scoring score-functions converge towards specific conclusions/predictions. (In which case we hopefully didn’t put all of our eggs into this specific alignment strategy 🙃)
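
As a toy illustration of the kind of pattern-check described above (not something we’d literally run like this; in practice the candidate networks would be produced by capable search processes “incentivized” to find contradictions), here is how first-level wiggle-room and convergence could be operationalized. Conclusions are represented as plain strings, with "not X" standing in as the negation of "X":

```python
from typing import Callable, Iterable, List, Set, Tuple

Network = Tuple[str, object]               # (conclusion, rest of the argument-step-network)
ScoreFunction = Callable[[Network], float]

def allowed_conclusions(score_fn: ScoreFunction,
                        candidates: Iterable[Network],
                        threshold: float) -> Set[str]:
    """Conclusions for which at least one candidate network gets a high score."""
    return {c for (c, net) in candidates if score_fn((c, net)) >= threshold}

def has_wiggle_room(score_fn: ScoreFunction,
                    candidates: Iterable[Network],
                    threshold: float) -> bool:
    """First-level wiggle-room: can both some claim and its negation be argued
    for with high-scoring networks?"""
    allowed = allowed_conclusions(score_fn, candidates, threshold)
    return any(("not " + c) in allowed for c in allowed)

def conclusions_converge(score_fns: List[ScoreFunction],
                         candidates: List[Network],
                         threshold: float) -> bool:
    """Do all of the given score-functions allow the same set of conclusions?"""
    sets = [allowed_conclusions(fn, candidates, threshold) for fn in score_fns]
    return len(sets) < 2 or all(s == sets[0] for s in sets)
```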

Is it more clear from this description what I have in mind?

Comment by Tor Økland Barstad (tor-okland-barstad) on Clarifying the Agent-Like Structure Problem · 2023-03-16T21:51:51.377Z · LW · GW

I think I'm probably missing the point here somehow and/or that this will be perceived as not helpful. Like, my conceptions of what you mean, and what the purpose of the theorem would be, are both vague.

But I'll note down some thoughts.

Next, the world model. As with the search process, it should be a subsystem which interacts with the rest of the system/environment only via a specific API, although it’s less clear what that API should be. Conceptually, it should be a data structure representing the world.

(...)

The search process should be able to run queries on the world model

Problems can often be converted into other problems. This can be done both for the top-level problem and recursively for sub-problems. One example of this is how any NP-complete problem can, by definition, be converted into any other NP-complete problem in polynomial time.



And as humans we are fairly limited in terms of finding and leveraging abstractions like these. What we are able to do (in terms of converting tasks/problems into more "abstracted" tasks/problems) ≠ what's possible to do.

So then it isn't necessarily the case that a powerful optimizer needs to be able to do search in a world model? Since very powerful optimizers maybe can get by while being restricted to searching within models that aren't world models (after having converted whatever it is they want to maximize into something more "abstract", or into a problem that corresponds to a different world/ontology - be that wholesale or in "chunks").

I was browsing your posts just now, partly to see if I could get a better idea of what you mean by the terms you use in this post. And I came across What's General-Purpose Search, And Why Might We Expect To See It In Trained ML Systems?, which seems to describe either the same phenomenon as what I'm trying to point to, or at least something similar/overlapping. And it's a good post too (it explains various things better than I can remember hearing/reading elsewhere). So that increases my already high odds that I'm missing the point somehow.

But, depending on what the theorem would be used for, the distinction I'm pointing to could maybe make an important difference:

For example, we may want to verify that certain capabilities are "aligned". Maybe we have AIs compete to make functions that do some specialized task as effectively/optimally as possible, as measured by various metrics.

Some specialized tasks may be tasks where we can test performance safely/robustly, while for other tasks we may only be able to do that for some subset of all possible outputs/predictions/proposals/etc. But we could for example have AIs compete to implement both of these functions with code that overlaps as little as possible[1].

For example, we may want functions that predict human output (e.g. how humans would answer various questionnaires based on info about those humans). But we may not be able/willing to test the full range of predictions that such functions make (e.g., we may want to avoid exposing real humans to AGI-generated content). However, possible ways to implement such functions may have a lot of overlap with functions where we are able/willing to test the full range of predictions. And we may want to enforce restrictions/optimization-criteria such that it becomes hard to make functions that (1) get maximum score and (2) return wrong output outside of the range where we are able/willing to test/score output and (3) don't return wrong/suboptimal output inside of the range where we are able/willing to test/score output.

To be clear, I wouldn't expect world models to always/typically be abstracted/converted before search is done if what we select for simply is to have systems that do "the final task we are interested in" as effectively/optimally as possible, and we pretty much try to score/optimize/select for that thing in the most straightforward way we can (when "training" AIs / searching for AI-designs). But maybe there sometimes would be merit to actively trying to obtain/optimize systems that "abstract"/convert the model before search is done.

  1. ^

    As well as optimizing for other optimization-criteria that incentivize for the task to be "abstracted"/converted (and most of the work to be done on models that have been "abstracted"/converted).

Comment by Tor Økland Barstad (tor-okland-barstad) on Wittgenstein's Language Games and the Critique of the Natural Abstraction Hypothesis · 2023-03-16T17:29:22.479Z · LW · GW

NAH, refers to the idea that lower-dimensional summaries or abstractions used by humans in day-to-day thought and language are natural and convergent across cognitive systems

I guess whether there is such convergence isn't a yes-or-no question, but a question of degree?

Very regularly I experience that thoughts I want to convey don't have words that clearly correspond to the concepts I want to use. So often I'll use words/expressions that don't match in a precise way, and sometimes there aren't even words/expressions that can be used to vaguely gesture at what I actually mean.

Nonetheless, our concepts are similar enough, and we have a similar enough understanding of how words/expressions correspond to concepts, for us to be able to communicate quite a lot (we misunderstand each other all the time, but nonetheless there is a considerable range of stuff that we are able to communicate fairly reliably).

Comment by Tor Økland Barstad (tor-okland-barstad) on Why Not Just Outsource Alignment Research To An AI? · 2023-03-16T01:08:25.059Z · LW · GW

Not rewarding contradictory conclusions is not a sufficient condition for a score-function to reward truth, or not reward falsehood.

Indeed!

It's a necessary but not sufficient condition.

It can, sure, but how can a human get it to state those regularities (...)?

Summary:

The regularities are expressed in terms of score-functions (that score argument-step-networks)[1]. We can score these score-functions based on simplicity/brevity, and restrict what they can do (make it so that they have to be written within human-defined confines).

I posit that we probably can do this in such a way that the most high-scoring score-functions that it’s possible to make (that meet requirements relating to wiggle-room and “reach”[2]) do so by specifying general regularities for when humans are/aren’t led astray (as opposed to ones that favor specific conclusions).

It should (I posit) be feasible to make it so that score-functions that are designed with the intention of fooling us about specific things earn significantly less points (if they pass the requirements at all).

If we allow score-functions to be constructed freely, then this makes it easy for AIs to output score-functions that (practically speaking) allow for pretty much any internally consistent set of conclusions. Indeed, the AI could simply have the score-function be itself, and choose to only give a high score to argument-step-trees with conclusions it likes.

So the question becomes: How do we restrict the “space” of possible score-functions that can be suggested, and add requirements that the score-functions must pass, so that the only way to get a high score (for AIs that suggest score-functions) is to make score-functions that (in some sense) separate out “good” human evaluations?

Here are examples of some of the types of possible requirements for score-functions:

Low maximum source code length (no space for score-function to be intelligent or biased in a detailed way)

With a sufficiently low maximum source code length, there is not enough space for the score-function itself to be intelligent, or for hardcoding bias for many specific conclusions.

Work would have to be done elsewhere (e.g. predictions of human output, which are obtained from system-calls to external functions).

Other optimization-targets relating to simplicity

Requirements relating to maximum use of computational resources, how “general”[3] the code must be, etc.

Low expressivity (humans define contours of what the score-function should do)

As humans we may not be able to make a score-function that would force good output, but we may be able to outline what such a function might look like, e.g. using “function-builders” (think regular expressions, but easier to work with for humans, and more expressive).

This way, no humans look at the functions proposed by the AIs, and humans themselves don’t have to write functions, but AI-written source code must stay within human-defined contours.

Enforced modularity (nowhere should score-function know what network argues for/against)

Individual argument-steps should (insofar as possible) be evaluated independently from the network as a whole. And macro-score for the network should be calculated without granular info about nodes. No part of the function should know what the network argues for/against. This could be enforced by the human-defined “contours” of the score-functions and/or by giving points based on heuristics that relate to modularity[4].

Limited number of different meta-questions the score-function can reference

Systems could predict what humans are likely to answer to questions about argument-steps (“Does the argument-step in Box A fit the description given in Box B?”, etc). And score-functions could be allowed to label argument-steps as “not allowed” based on answer-predictions for meta-questions such as these.

We could limit the amount of different questions that are allowed. For example, the score-function could have 40 questions to choose from when evaluating any given argument-step, but no more.

I posit that if the AI only is allowed the minimum number of questions[5], then probably it will need to restrict itself to general questions that separate out “bad” argumentation/assumptions (that are likely to lead humans astray).

Why? Because it needs to use general questions to reduce wiggle-room. Questions that are aimed at making specific conclusions turn out wrong (or at preventing certain specific correct conclusions) will probably not be efficient in helping with the general, across-the-board reduction in wiggle-room. Hence, these specific questions will not replace the general ones (that are needed to avoid wiggle-room). And since they don’t replace other questions that are needed, having to include them increases the minimal number of questions that are needed[6].
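
A crude sketch of what the “limited meta-questions” idea might look like in code. The questions, thresholds and predictor interface are all hypothetical stand-ins:

```python
from typing import Callable, List

# Hypothetical external predictor: predicted probability that a given reviewer
# answers "yes" to a meta-question about a given argument-step.
PredictYes = Callable[[str, str], float]   # (meta_question, argument_step) -> probability

# A small, fixed menu of allowed meta-questions (at most N templates in total).
ALLOWED_META_QUESTIONS: List[str] = [
    "Does the conclusion of the argument-step in Box A clearly follow from its stated assumptions?",
    "Does the argument-step in Box A rely on vague or ambiguous terms?",
]

def step_is_allowed(argument_step: str, predict_yes: PredictYes) -> bool:
    """Label an argument-step as allowed only if the predicted reviewer answers
    look good for every meta-question on the menu (a deliberately crude rule)."""
    follows = predict_yes(ALLOWED_META_QUESTIONS[0], argument_step)
    is_vague = predict_yes(ALLOWED_META_QUESTIONS[1], argument_step)
    return follows > 0.9 and is_vague < 0.1
```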

Having humans pre-define lists of argument-steps that help define minimum standard ("I know it when I see it")

We can specify argument-steps that seem top quality to us, and have it so that the score-function only can allow argument-steps that seem at least as rigorous/clear/etc as those ones (according to predictions for what humans would answer when asked to compare different argument-steps)[7].

Having humans pre-define lists of meta-questions that help define minimum standard

Whether a meta-question is allowed by a score-function could depend on how “suspicious” that meta-question seems. How plausible does it seem that this question prevents humans from being led astray in a general way (as opposed to somehow being targeted at affecting specific conclusions)?

We could pre-define meta-questions that seem good and non-suspicious to us, and only allow meta-questions that don’t seem more suspicious than those[8]. Or more strictly, we could choose to only allow score-functions to reference meta-questions from the pre-defined list.

“Abstracting” away specifics of argument-network content before score-functions sees it (to make it even harder for any part of the score-function to "know what's being discussed")

Functions (external to the score-function) can process network content, in ways such that score-functions still can evaluate argumentation robustness, but where no AI can tell what the nodes argue for/against (based on output from functions that "abstract" the content).

An “abstraction” can, in this context, be defined as a function that takes some data-structure as input, and returns some other data-structure (typically with less info) as output.

Scoring “abstractions” would be a process in itself. We would want “abstractions” that remove as much info as possible, and are as “general” as possible, but don’t make it impossible to get anything done at all.
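
As a toy example of such an “abstraction” (the specific fields that are kept here are just illustrative of the general idea):

```python
import re
from dataclasses import dataclass
from typing import List

@dataclass
class AbstractedStep:
    n_premises: int
    n_sentences: int
    avg_sentence_length: float
    # deliberately no field that reveals what the step argues for or against

def abstract_step(premise_ids: List[str], step_text: str) -> AbstractedStep:
    """Keep only structural info about an argument-step, dropping the content
    that would let a score-function guess the conclusion being argued for."""
    sentences = [s for s in re.split(r"[.!?]", step_text) if s.strip()]
    avg_len = sum(len(s.split()) for s in sentences) / max(len(sentences), 1)
    return AbstractedStep(n_premises=len(premise_ids),
                          n_sentences=len(sentences),
                          avg_sentence_length=avg_len)
```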

Not accepting score-functions if there are simplifications that change the conclusion for only some low fraction of conclusions

The way I imagine things, most tasks done by AIs are done in zero-sum competitions with other AIs[9].

When AIs compete to make high-scoring score-functions, it may be possible for one AI to show that a score-function from the other AI could be simplified, in ways such that only a few conclusions (in the set of statements where high-scoring argument-step-networks are possible) are changed (proportionally speaking)[10].

The description above is imprecise/crude. How we might make this kind of thing work (without the need for human review, etc) would be a discussion by itself.

The list above is not at all intended to be comprehensive. Hopefully it gives an impression of the kinds of restrictions I have in mind.

It can, sure, but how can a human (...) tell that it has stated them accurately?

1: Wiggle-room of score-functions can be explored[11]. If a score-function is revealed to have wiggle-room, then the regularities weren’t stated “accurately”[12].

How strong that indication is depends on our initial assumptions regarding P(high-scoring arguments can be trusted | score-function has low wiggle-room).

2: This leaves several questions, one of which is: If we specified requirements/desiderata that seemed to us that they should result in a very high P(high-scoring arguments can be trusted | score-function has low wiggle-room), how do we check if we were right? 

Answer: The job of AIs that propose score-functions is to propose score-functions that pass our requirements and get a high score. They may be tasked to find examples of score-functions (that both pass the requirements + get a high score) that are contradictory in terms of the conclusions that it’s possible to argue for with high-scoring argument-networks.

What we do here is, in a sense, to explore wiggle-room “one level higher up”. 

Suppose that A is some set/space of possible score-functions, and we think that for all of the score-functions in A, P(high-scoring arguments can be trusted | score-function has low wiggle-room) is very high. If so, then for any claim P1 (that we are interested in) we could task AIs to:

  • Construct score-function SF1 (SF1 must be in A + have low wiggle-room)
  • Construct score-function SF2 (SF2 must also be within A + have low wiggle-room)
  • Construct an argument-step-network that concludes P1, and is assigned a high score by SF1
  • Construct an argument-step-network that concludes ¬P1, and is assigned a high score by SF2

If any AI is able to do this, then this indicates that our assumptions were wrong[13].
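
In code, this “one level higher up” check might be sketched roughly as follows (again with conclusions as strings and "not X" as the negation of "X"; everything here is a toy stand-in for searches that capable AIs would actually perform):

```python
from itertools import combinations
from typing import Callable, Iterable, List, Set, Tuple

Network = Tuple[str, object]               # (conclusion, rest of the argument-step-network)
ScoreFunction = Callable[[Network], float]

def allowed_conclusions(fn: ScoreFunction,
                        candidates: Iterable[Network],
                        threshold: float) -> Set[str]:
    return {c for (c, net) in candidates if fn((c, net)) >= threshold}

def low_wiggle_room(conclusions: Set[str]) -> bool:
    # A score-function has low internal wiggle-room if it never allows both a
    # claim and its negation.
    return not any(("not " + c) in conclusions for c in conclusions)

def higher_level_wiggle_room(space_A: List[ScoreFunction],
                             candidates: List[Network],
                             threshold: float) -> bool:
    """Is there a claim P1 such that one low-wiggle-room score-function in A gives
    a high score to a network concluding P1, while another low-wiggle-room
    score-function in A gives a high score to a network concluding "not P1"?"""
    per_fn = [allowed_conclusions(fn, candidates, threshold) for fn in space_A]
    low_wiggle_sets = [s for s in per_fn if low_wiggle_room(s)]
    for s1, s2 in combinations(low_wiggle_sets, 2):
        if any(("not " + c) in s2 for c in s1) or any(("not " + c) in s1 for c in s2):
            return True
    return False
```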

If we presume that some restriction/restrictions makes it disproportionately harder for score-functions to have low wiggle-room for the “wrong” reasons, then that can (sort of) be tested. Or at least we may be able to get very strong indications for whether this is the case.

This process of exploring wiggle-room “one level higher up” can be taken further. Maybe that intuitively seems like "bells and whistles". But to me it seems like an important part of the answer. And I'm explaining it less well than I'd want to in order to make things shorter and less convoluted[14].

3: AIs could be tasked to try making high-scoring argument-step-networks with conclusions/predictions that can be proven wrong in a safe/cheap way.

For example: Constructing functions from within the argument-network, and making wrong conclusions/predictions relating to output from those functions.

If the score-function is not "watertight", this kind of thing would offer lots of opportunities to achieve unambiguous contradictions.

Keep in mind how “interconnected” truth-claims are. And keep in mind that predictions of human evaluations would relate to individual argument-steps (that are evaluated in a fairly “modular” way, independently of the network as a whole, and often/typically with no way to guess what the conclusion of the network is, or what questions the conclusion answers).

Imagine "proof-like" reasoning relating to the behavior of code, where each argument-step relating to the code is focused on one little thing. The kinds of regularities needed to ensure that the scoring of such argumentation is correct - well, it would need to be quite "watertight" (in order to not be exposed by AIs that are "incentivized" to do so).

And while being so "watertight" it needs to (1) allow for certain fairly specific errors and (2) prevent certain specific correct conclusions that correspond inversely to those errors.

Kind of like making a bowl that is very watertight in general, but that always leaks water in certain very specific conditions (even though those conditions cannot be determined locally by the different parts of the bowl that are to let the water through). And having the design of this bowl be simpler than (not just the same level of complexity as) the simplest way to make a bowl that doesn't leak water in these highly specific conditions (but otherwise behaves the same)[15].

The analogy above is maybe not perfect, but not so bad either IMO (if we accept certain presumptions, e.g. that the AIs that explore wiggle-room do so competently and without bias!).

We can also test how good score-functions are in terms of e.g. predictions of experimental results (where the predictions are the conclusions of high-scoring networks). Not all experiments are safe to test, but we can ask questions relating to ones that are (and ask in regards to already-existing experimental data that the AIs don't have access to).

The AIs that make argument-step-networks would be scored based on their ability to make high-scoring argument-step-networks (and would not be penalized if it turns out that the conclusions are wrong in ways that we quickly/safely can check). AIs that make score-functions would be penalized for this if/when accuracy against such predictions has been made to be one of the score-function requirements (for the request in question).
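
A rough sketch of that kind of check (purely illustrative; here a conclusion is assumed, for simplicity, to be encoded as "experiment_id: predicted outcome"):

```python
from typing import Callable, Dict, Iterable, Tuple

Network = Tuple[str, object]                 # ("experiment_id: predicted outcome", payload)
ScoreFunction = Callable[[Network], float]

def experiment_prediction_penalty(score_fn: ScoreFunction,
                                  high_scoring_networks: Iterable[Network],
                                  withheld_results: Dict[str, str],
                                  threshold: float) -> int:
    """Count high-scoring networks whose experiment-predictions disagree with
    withheld experimental data. The maker of the score-function would lose
    points for these; the maker of the argument-step-networks would not."""
    penalty = 0
    for conclusion, net in high_scoring_networks:
        if ":" not in conclusion or score_fn((conclusion, net)) < threshold:
            continue
        experiment_id, predicted = (part.strip() for part in conclusion.split(":", 1))
        actual = withheld_results.get(experiment_id)
        if actual is not None and actual != predicted:
            penalty += 1
    return penalty
```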

To put it somewhat simplistically (things absorb less easily with more caveats and nuance):

In order to fool us about specific things, a score-function would need to make it infeasible to lead us astray about anything[16] except some very specific things (that all are outside of the realm of things we can test safely/cheaply/quickly). It has to do that within the score-function restrictions that are designed with the purpose of making that hard.

And even though it is presumably harder to fail in these very specific ways (while getting things right most of the time), it has to be (1) feasible to do that and (2) infeasible to not do that.

That is to say: The human-defined score-function-restrictions[17] have to somehow make it infeasible to lead us astray in regards to most conclusions, but at the same time make it infeasible to not lead us astray in regards to certain specific conclusions. If not, the score-function-restrictions would have wiggle-room, and various AIs would be leaving points on the table by not pointing this out[18].

So the human-defined restrictions mustn’t just not work. All/most of them must fail in the same very specific and improbable way.

And if we have humans define things at a "higher level" (which I'd recommend), then this mustn't just be the case for a specific set of restrictions we have defined for score-functions (but very strong patterns observed within human-defined "spaces" of possible ways that score-function restrictions can be combined).

Some postulations/assumptions that I don’t discuss here, but could get into if requested (I’m pretty much a jukebox when it comes to this):

  • Whether it is realistic to obtain systems that earnestly try to explore wiggle-room (and never pass on an opportunity to expose wiggle-room when requested to look for examples of wiggle-room). And ideas/principles for how we might safely obtain/verify such systems.
  • Whether it is realistic to obtain systems that predict human answers/evaluations (with the necessary scope, accuracy, precision, and non-bias). And ideas/principles for how we might safely obtain/verify such systems.
  • (there are other ones also, but these ones stand out to me)

One dilemma:

We want to test/verify the accuracy of systems that predict human evaluations / answers to questions. But often we want these systems to predict human evaluations of AGI-generated content. So if we want to test the accuracy of these systems against the full range of outputs that we use, that would (seemingly) require us to expose real humans to AGI-generated content.

I would say that maybe around half of my thinking, if not more, relates to this kind of dilemma ("How might we evaluate the accuracy of functions outside of the range of inputs where we are willing+able to test output, without having humans try to understand the source code, or even look at it?").

I have also thought considerably about how we might obtain functions that handle certain types of requests, without ever intentionally leaving points on the table for specific requests (e.g. “deliberately” failing to notice specific cases where wiggle room can be exposed).

The situation I often imagine is having obtained superintelligent AGI-systems that can be imagined as trying to seem aligned/helpful/capable (while not necessarily being aligned/helpful)[19]. And then I try to ponder: Given that hypothetical situation, what steps might we take from there (if we want to obtain robustly aligned superintelligences while minimizing risk along the way)?

Needs simplification more than clarification.

Thanks for the feedback (and thanks again for engaging). You may well be right about that. For sure I have plenty of room for improvement where communication is concerned.

That being said, a suspicion of mine is that it can be easy to significantly underestimate the inferential distance that I have to cover (even if usually correct heuristics make that seem unlikely). I am already leaving out concepts that I myself think of as “core” concepts, so as to reduce length.

  1. ^

    Or at least that's one way to think of it.

  2. ^

    By “reach” I mean: What is the range of questions where, for at least one answer to that question, there is at least one answer that can be argued for by a high-scoring argument-step-network?

  3. ^

    A maximally “non-general” function would have 1 if-statement per possible input.

    Possible heuristics:

    - “% chance that a random change to the code changes output for a low but non-zero fraction of inputs”

    - “Whether there are possible simplifications to the code that would affect a low but non-zero fraction of inputs”

  4. ^

    One example of such a heuristic:

    “% of source code that consists of subroutines where no AI is able to make accurate guesses (based on the subroutine input) for what the argument-network argues for/against.”

    Values for heuristics could be determined by running zero-sum AI vs AI games (including games where the AIs make bets on what the outcomes of other AI vs AI games would be).

  5. ^

    The minimum amount necessary for achieving low wiggle-room (while still having it be possible to construct high-scoring argument-step-networks with useful conclusions).

  6. ^

    Similar reasoning to this also applies to maximal source code length. It’s probably hard to minimize source code as much as possible while trying to make it so that conclusions turn out wrong for certain specific questions (but otherwise turn out correct). So AIs that try to do that would presumably lose points to competing AIs that don’t (presuming the AIs are equally competent, etc).

  7. ^

    As alluded to earlier, score-functions should not have the expressivity or space that is necessary for predicting human answers from within themselves. And they would therefore have to make system-calls to external systems that make such predictions. And there are restrictions in regards to the system-calls that are allowed.

  8. ^

    Based on predictions of what reviewers will answer to meta-questions where they are asked to compare other meta-questions 🙃

  9. ^

    Or maybe some other process that approximates such a process (meaning that we do gradient descent, or something like that, as if we had AIs compete against each other in zero-sum games).

  10. ^

    If so, the AI that made the “non-general” score-function would lose points (and each point lost by one AI is earned by the other).

  11. ^

    By other AI-systems, that are selected for based on their ability to find examples of wiggle-room.

  12. ^

    As they for at least one statement would allow for both P and ¬P - and at least one of those has to be bad/wrong.

  13. ^

    Or maybe it was wrong to claim that SF1 and SF2 have low internal wiggle-room. But some AIs would be “incentivized” to find counter-examples (that prove claims of low wiggle-room wrong). And if counter-examples are found, then the AI that posited low wiggle-room will be penalized (with the penalty being proportional to the level of confidence that was expressed).

  14. ^

    Earlier I talked about score-function restrictions, and how we would define score-function restrictions that score-functions would need to adhere to.

    The simplest thing to imagine is that we define one set of restrictions that score-functions must adhere to (hence it makes sense to start explaining things that way). But what I really envision is that we “explore landscapes” of higher-level wiggle-room relating to “spaces” of possible ways score-functions can be restricted.

    The patterns we see could, in combination with our initial assumptions regarding P(high-scoring arguments can be trusted | score-function has low wiggle-room), provide strong evidence.

    If our assumptions regarding P(high-scoring arguments can be trusted | score-function has low wiggle-room) typically are correct (and at least not way off), then this should be possible to see from the "patterns" that are revealed.

    I could try to explain this better and more concretely if requested. 

  15. ^

    If they are the same level of complexity, then there is higher-level wiggle-room (e.g. relative to the score-function requirements / score-function score-function).

  16. ^

    Well, not literally anything, but an incomprehensibly large amount of possible statements.

  17. ^

    Or score-function score-functions.

  18. ^

    Presuming that we have succeeded in getting at least one of the AIs in question to act as if it is trying to maximize points for every individual request that it gets.

  19. ^

    While also imagining that these AGI-systems have been successfully contained, are extremely limited in terms of their communication to operators, etc.

Comment by Tor Økland Barstad (tor-okland-barstad) on Alignment with argument-networks and assessment-predictions · 2023-03-15T20:37:38.679Z · LW · GW

One concept I rely upon is wiggle-room (including higher-level wiggle-room). Here are some more abstract musings relating to these concepts:

Desideratum

A function that determines whether some output is approved or not (that output may itself be a function).

Score-function

A function that assigns score to some output (that output may itself be a function).

Some different ways of talking about (roughly) the same thing

Here are some different concepts where each often can be described or thought of in terms of the other:

  • Restrictions / requirements / desideratum (can often be defined in terms of a function that returns true or false)
  • Sets (e.g. the possible data-structures that satisfy some desideratum)
  • “Space” (can be defined in terms of possible non-empty outputs from some function - which themselves can be functions, or any other data-structure)
  • Score-functions (the possible data-structures above some score threshold define a set)
  • Range (e.g. a range of possible inputs)

Function-builder

Think regular expressions, but more expressive and user-friendly.

We can require of AIs: "Only propose functions that can be made with this builder". That way, we restrict their expressivity.

When we as humans specify desideratum, this is one tool (among several!) in the tool-box.
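
A minimal sketch of what a “function-builder” could amount to (the template, slot values and example rule are all hypothetical):

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class BuilderSlot:
    name: str
    allowed_values: List[object]   # humans decide what may go into each "hole"

@dataclass
class FunctionTemplate:
    """A human-written template with "holes"; AIs may only pick values for the holes."""
    slots: List[BuilderSlot]
    build: Callable[[Dict[str, object]], Callable]

def is_valid_choice(template: FunctionTemplate, choices: Dict[str, object]) -> bool:
    return all(choices.get(slot.name) in slot.allowed_values for slot in template.slots)

# Hypothetical example: a threshold rule where the cutoff is the only thing the
# AI gets to choose.
def make_threshold_rule(choices: Dict[str, object]) -> Callable[[float], bool]:
    cutoff = choices["cutoff"]
    return lambda predicted_agreement: predicted_agreement >= cutoff

template = FunctionTemplate(
    slots=[BuilderSlot("cutoff", allowed_values=[0.8, 0.9, 0.95])],
    build=make_threshold_rule,
)
rule = template.build({"cutoff": 0.9})   # the AI's choice, restricted to the allowed values
```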

Higher-level desideratum or score-function

Not fundamentally different from other desideratum or score-functions. But the output that is evaluated is itself a desideratum or score-function.

At every level there can be many requirements for the level below.

A typical requirement at every level is low wiggle-room.

Example of higher-level desideratum / score-functions

Humans/operators define a score-function   ← level 4

for desideratum                                                  ← level 3

for desideratum                                                  ← level 2

for desideratum                                                  ← level 1

for functions that generate

the output we care about.

Wiggle-room relative to desideratum

Among outputs that would be approved by the desideratum, do any of them contradict each other in any way?

For example: Are there possible functions that give contradicting outputs (for at least 1 input), such that both functions would be approved by the desideratum?

Wiggle-room relative to score-function

Among outputs that would receive a high score by the score-function in question (e.g. a score no less than 80% of that of any other possible output), do any of them contradict each other in any way?

2nd-level wiggle-room relative to desiderata

We start with a desiderata-desideratum or score-function-desideratum (aka 2nd-level desideratum).

Set A: Any desideratum that is approved by the desiderata-desideratum.

Set B: Any output approved by ≥1 of the desiderata in A.

Are there ≥1 contradictions among outputs in B?
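
Stated as code, the 2nd-level notion might look roughly like this (a sketch; the "contradicts" check is left abstract, and all names are just illustrative):

```python
from typing import Callable, Iterable, List

Desideratum = Callable[[object], bool]   # approves an output (which may itself be a desideratum/score-function)

def second_level_wiggle_room(desideratum_desideratum: Desideratum,
                             candidate_desiderata: Iterable[Desideratum],
                             candidate_outputs: Iterable[object],
                             contradicts: Callable[[object, object], bool]) -> bool:
    """Set A: desiderata approved by the 2nd-level desideratum.
    Set B: outputs approved by at least one desideratum in A.
    Returns True if there is at least one contradiction among outputs in B."""
    A: List[Desideratum] = [d for d in candidate_desiderata if desideratum_desideratum(d)]
    B: List[object] = [o for o in candidate_outputs if any(d(o) for d in A)]
    return any(contradicts(x, y) for i, x in enumerate(B) for y in B[i + 1:])
```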

P(desideratum forces good outputs | desideratum has low wiggle-room)

If a desideratum forces good/correct outputs, then it has low wiggle-room. But the reverse is not necessarily true.

But for some desiderata we may think: "If wiggle-room is low, that’s probably because it’s hard to satisfy the desideratum without also producing good output."

Spaces/sets of desideratum where we think P(desideratum forces good outputs | desideratum has low wiggle-room) is high

Among spaces/sets of low-wiggle-room desideratum where we suspect "low wiggle-room → good output" (as defined by higher-level desideratum), do outputs converge?

Properties of desideratum/score-function that we suspect affect P(desideratum forces good outputs | desideratum has low wiggle-room)

There are desideratum-properties that we suspect (with varying confidence) to correlate with "low wiggle-room → good output".

To test our suspicions / learn more we can:

  • Define spaces of possible desideratum.
  • Explore patterns relating to higher-level wiggle-room in these spaces.

Comment by Tor Økland Barstad (tor-okland-barstad) on Why Not Just Outsource Alignment Research To An AI? · 2023-03-15T00:29:14.790Z · LW · GW

At a quick skim, I don't see how that proposal addresses the problem at all. (...) I don't even see a built-in way to figure out whether the humans are correctly answering (or correctly assessing their own ability to answer). 


Here are additional attempts to summarize. These ones are even shorter than the screenshot I showed earlier.

More clear now?

Comment by Tor Økland Barstad (tor-okland-barstad) on Alignment with argument-networks and assessment-predictions · 2023-03-15T00:03:13.323Z · LW · GW

I'm trying to find better ways of explaining these concepts succinctly (this is a work in progress). Below are some attempts at tweet-length summaries.

280 character limit

We'd have separate systems that (among other things):

  1. Predict human evaluations of individual "steps" in AI-generated "proof-like" arguments.
  2. Make functions that separate out "good" human evaluations.

I'll explain why #2 doesn't rely on us already having obtained honest systems.

Resembles Debate, but:

  • Higher alignment-tax (probably)
  • More "proof-like" argumentation
  • Argumentation can be more extensive
  • There would be more mechanisms for trying to robustly separate out "good" human evaluations (and testing if we succeeded)

Think Factored Cognition, but:

  • The work that's factored is evaluating AGI-generated "proofs"
  • Score-functions weigh human judgments, restrict AGI expressivity, etc
  • AIs explore if score-functions that satisfy human-defined desiderata allow for contradictions (in aggregate)

560 character limit

A superintelligence knows when it's easy/hard for other superintelligences to fool humans.

Imagine human magicians setting rules for other human magicians ("no cards allowed", etc).

A superintelligence can specify regularities for when humans are hard to fool ("humans with these specific properties are hard to fool with arguments that have these specific properties", etc).

If we leverage these regularities (+ systems that predict human evaluations), it should not be possible to produce high-scoring "proofs-like" arguments with contradictory conclusions.

AIs can compete to make score-functions that evaluate the reliability of "proof-like" arguments.

Score-functions can make system-calls to external systems that predict human answers to questions (whether they agree with any given argument-step, etc).

Other AIs compete to expose any given score-function as having wiggle-room (generating arguments with contradictory conclusions that both get a high score).

Human-defined restrictions/requirements for score-functions increase P(high-scoring arguments can be trusted | score-function has low wiggle-room).

"A superintelligence could manipulate humans" is a leaky abstraction.

It depends on info about reviewers, topic of discussion, restrictions argumentation must adhere to, etc.

Different sub-systems (that we iteratively optimize):

  • Predict human evaluations
  • Generate "proof-like" argumentation
  • Make score-functions for scoring "proof-like" argumentation (based on predictions of human evaluations of the various steps + regularities for when human evaluations tend to be reliable in aggregate)
  • Search for high-scoring arguments with contradictory conclusions

Comment by Tor Økland Barstad (tor-okland-barstad) on Why Not Just Outsource Alignment Research To An AI? · 2023-03-13T01:25:11.637Z · LW · GW

My own presumption regarding sentience and intelligence is that it's possible to have one without the other (I don't think they are unrelated, but I think it's possible for systems to be extremely capable but still not sentient).

I think it can be easy to underestimate how different other possible minds may be from ourselves (and other animals). We have evolved a survival instinct, and evolved an instinct to not want to be dominated. But I don't think any intelligent mind would need to have those instincts.

To me it seems that thinking machines don't need feelings in order to be able to think (similarly to how it's possible for minds to be able to hear but not see, and vice versa). Some things relating to intelligence are of such a kind that you can't have one without the other, but I don't think that is the case for the kinds of feelings/instincts/inclinations you mention.

That being said, I do believe in instrumental convergence.

Below are some posts you may or may not find interesting :)

Comment by Tor Økland Barstad (tor-okland-barstad) on Why Not Just Outsource Alignment Research To An AI? · 2023-03-11T19:28:17.060Z · LW · GW

I've never downvoted any of your comments, but I'll give some thoughts.

I think the risk relating to manipulation of human reviewers depends a lot on context/specifics. Like, for sure, there are lots of bad ways we could go about getting help from AIs with alignment. But "getting help from AIs with alignment" is fairly vague - a huge space of possible strategies could fit that description. There could be good ones in there even if most of them are bad.

I do find it concerning that there isn't a more proper description from OpenAI and others in regards to how they'd deal with the challenges/risks/limitations relating to these kinds of strategies. At best they're not prioritizing the task of explaining themselves. I do suspect them of not thinking through things very carefully (at least not to the degree they should), and I hope this will improve sooner rather than later.

Among positive attitudes towards AI-assisted alignment, some can be classified as "relax, it will be fine, we can just get the AI to solve alignment for us". While others can be classified as "it seems prudent to explore strategies among this class of strategies, but we should not put all of our eggs in that basket (but work on different alignment-related stuff in parallel)". I endorse the latter but not the former.


"Please Mr. Fox, how should we proceed to keep you out of the henhouse?"
 

I think this works well as a warning against a certain type of failure mode. But some approaches (for getting help with alignment-related work from AIs) may avoid or at least greatly alleviate the risk you're referring to.

What we "incentivize" for (e.g. select for with gradient descent) may differ between AI-systems. E.g., you could imagine some AIs being "incentivized" to propose solutions, and other AIs being "incentivized" to point out problems with solutions (e.g. somehow disprove claims that other AIs somehow posit).

The degree to which human evaluations are needed to evaluate output may vary depending on the strategies/techniques that are pursued. There could be schemes where this is needed to a much lesser extent than some people maybe imagine.

Some properties of outputs that AIs can posit are of such a kind that 1 counter-example is enough to unambiguously disprove what is being posited. And it's possible to give the same requests to lots of different AI-systems.

In my own ideas, one concept that is relied upon (among various others) is wiggle-room exploration:

Put simplistically: The basic idea here (well, parts of it) would be to explore whether AIs could convince us of contradictory claims (given the restrictions in question).

Such techniques would rely on:

  • Techniques for splitting up demonstrations/argumentations/"proofs" in ways such that humans can evaluate individual pieces independently of the demonstration/argumentation/"proof" as a whole.
  • Systems for predicting how humans would review various pieces of content.
  • Techniques for exploring restrictions for the type of argumentation humans can be presented with, and how those restrictions/requirements affect how easy it is to construct high-scoring argumentation/demonstrations that argue in favor of contradictory claims.
  • Techniques for finding/verifying+leveraging regularities for when human judgments/evaluations are reliable and when they aren't (depending on info about the person, info about the piece of content being evaluated, info about the state that the human is in, etc).
  • (Various other stuff also.)

Comment by Tor Økland Barstad (tor-okland-barstad) on Why Not Just Outsource Alignment Research To An AI? · 2023-03-11T08:41:22.032Z · LW · GW

I don't even see a built-in way to figure out whether the humans are correctly answering (or correctly assessing their own ability to answer). 

Here is a screenshot from the post summary:

This lacks a lot of detail (it is, after all, from the summary). But do you think you are able to grok the core mechanism that's outlined?

Comment by Tor Økland Barstad (tor-okland-barstad) on Why Not Just Outsource Alignment Research To An AI? · 2023-03-11T00:08:21.434Z · LW · GW

Thanks for engaging! 🙂
As reward, here is a wall of text.

If the humans lack the expertise to accurately answer subquestions or assess arguments (or even realize that they don't know), then the proposal is hosed

You speak in such generalities:

  • "the humans" (which humans?)
  • "accurately answer subquestions" (which subquestions?)
  • "accurately assess arguments" (which arguments/argument-steps?)

But that may make sense based on whatever it is you imagine me to have in mind. 

I don't even see a built-in way to figure out whether the humans are correctly answering (or correctly assessing their own ability to answer). 

One of the main mechanisms (not the only one) is exploration of wiggle-room (whether it's feasible to construct high-scoring argument-step-networks that argue in favor of contradictory claims).

Some AGIs would be "trained" to construct high-scoring argument-step-networks. If they are able to construct high-scoring argument-step-networks that favor contradictory claims, this indicates that wiggle-room is high.

"A superintelligence could fool (even smart) humans" is a leaky abstraction. It depends on the restrictions/context in question.

It would be the job of the score-function to enforce restrictions for the kinds of argument-steps that are allowed, which assessment-predictions should be accounted for (and how much), which structural requirements to enforce on argument-networks, etc.

Some AGIs would be "trained" to construct score-functions. These score-functions would themselves be scored, and one of the main criteria when evaluating a score-function would be to see if it allows for wiggle-room (if there are possible argument-networks that argue in favor of contradictory conclusions and that both would have been given a high score by the score-function).

Score-functions would need to be in accordance with restrictions/desideratum defined (directly or indirectly) by humans. These restrictions/desideratum would be defined so as to increase P(score-function forces good output | score-function has low wiggle-room).

One such restriction is low maximum source code length. With a sufficiently low maximum source code length, there is:

  • not enough space for the score-function itself to be intelligent
  • not enough space for hardcoding bias for/against lots of specific conclusions

Work must then be done in other ways, and we can (try to) "force" the score-functions to obtain low wiggle-room in the ways we want them to obtain low wiggle-room (since no other way is possible):

  • Describing and leveraging patterns for when human judgments are and aren't reliable
  • Defining rules in terms of predictions to human answers about node content (these predictions of humans would be made by external systems, which score-functions could reference as system-calls).

Suppose we define space A to be a "space" of possible score-functions, such that for all score-functions in that "space" we think P(score-function forces good output | score-function has low wiggle-room) probably is high. And suppose we define space B to be any score-function in A with low wiggle-room. Then we can "train" AGIs to help us explore:

"Are there argument-networks that argue in favor of contradictory conclusions, such that each of those argument-networks would have been given a high score by at least one of the score-functions in B?"

(or correctly assessing their own ability to answer). 

Well, we would have systems that predict the judgment of specific humans (based on info about those humans)[1]. AGIs that are "trained" to do so could search for argument-steps that make up argument-networks arguing in favor of contradictory claims (with each argument-step being one where the human in question is predicted to confidently agree).

Btw, I agree with Eliezer when he says:

"Human operators are fallible, breakable, and manipulable. Human raters make systematic errors - regular, compactly describable, predictable errors."

But similar things can be said about our tendency to get things right. We are also, after all, capable of getting things right. We make "regular, compactly describable, predictable" non-errors.

It's possible to give us questions where the odds (of us getting things right) are in our favor. And it's possible to come up with (functions that enforce) restrictions such that only such questions are allowed.

I don't expect people to correctly assess their own ability to evaluate correctly. But I expect there to be ways to separate out "good/reliable" human judgments (based on info about the argument-step, info about the human, info about how confident the human is predicted to be, etc).

And even if these mechanisms for separating out "good/reliable" human judgments aren't perfect, that does not necessarily/automatically prevent these techniques from working.

Nor do I see any way to check that the system is asking the right questions.

Not sure what kinds of questions you have in mind (there could be several). For all the interpretations I can think of for what you might mean, I have an answer. But covering all of them could be long-winded/confusing.

(Though the main problems with this proposal are addressed in the rant on problem factorization, rather than here.)

Among my own reasons for uncertainty, the kinds of problems you point to there are indeed among the top ones[2].

It's absolutely possible that I'm underestimating these difficulties (or that I'm overestimating them). But I'm not blue-eyed about problem factorization in general the way you maybe would suspect me to be (among humans today, etc)[3].

Btw, I reference the rant on problem factorization under the sub-header Feasibility of splitting arguments into human-digestible “pieces”:

Some quick points:

  • There is a huge difference between an AGI searching for ways to demonstrate things to humans, and humans splitting up work between themselves. Among the huge space of possible ways to demonstrate something to be the case, superintelligent AGIs can search for the tiny fraction where it's possible to split each piece into something that (some) humans would be able to evaluate in a single sitting. It's not a given that even superintelligent AGIs always will be able to do this, but notice the huge difference between AIs factorizing for humans and humans factorizing for humans.
  • There is a huge difference between evaluating work/proofs in a way that is factorized and constructing proofs/work in a way that is factorized. Both are challenging (in many situations/contexts prohibitively so), but there is a big difference between them.
  • There is a huge difference between factorizing among "normal" humans and factorizing among the humans who are most capable in regards to the stuff in question (by "normal" here I don't mean IQ of 100, but rather something akin to "average employee at Google").
  • There is a huge difference between whether something is efficient, and whether it's possible. Factorizing work is typically very inefficient, but in relation to the kind of schemes I'm interested in it may be ok to have efficiency scaled down by orders of magnitude (sometimes in ways that would be unheard of in real life among humans today[4]).
  • How much time humans have to evaluate individual "pieces" makes a huge difference. It takes time to orient oneself, load mental constructs into memory, be introduced to concepts and other mental constructs that may be relevant, etc. What I envision is not "5 minutes", but rather something like "one sitting" (not even that would need to be an unbreakable rule - several sittings may be ok).

I don't expect this comment to convince you that the approach I have in mind is worthwhile. And maybe it is misguided somehow. But I don't really explain myself properly here (there are main points/concepts I leave out). And there are many objections that I anticipate but don't address.

If you have additional feedback/objections I'd be happy to receive it. Even low-quality/low-effort feedback can be helpful, as it helps me learn where my communication is lacking. So I much prefer loud misunderstandings over quiet dismissal 🙂

  1. ^

    The question of how to safely obtain and verify the accuracy of such systems is a discussion by itself.

  2. ^

    This was also the case prior to reading that article. I learned about the Ought experiment from there, but insofar as reading about the Ought experiment changed my perspective it was only a very slight update.

    I view the Ought experiment as similarly interesting/relevant to e.g. anecdotal stories from my own life when working on group projects in school.

  3. ^

    I work on an app/website with a big user base in several countries, as a dev-team of one. I never tried to outsource "core" parts of the coding to freelancers. And I suspect I have a higher threshold than most for bothering to use third-party libraries (when I do, I often find that they have problems or are badly documented).

  4. ^

    I presume/suspect efficiency losses of orders of magnitude per person due to problem-factorization are widespread among humans today already (a sometimes necessary evil). But the schemes I have in mind involve forms of evaluation/work that would be way too tedious if most of it was done by real humans.