It seems to make sense that if hiring an additional employee provides marginal shareholder value, that the company will hire additional employees. So, when the company stops hiring employees, it seems reasonable that this is because the marginal benefit of hiring an additional employee is not positive. However, I don't see why this should suggest that the company is likely to hire an employee that provides a marginal value of 0 or negative.
"Number of employees" is not a continuous variable. When hiring an additional employee, how this changes what the marginal benefit of an additional employee can be large enough to change it from positive to negative.
Of course, when making a hiring decision, the actual marginal benefit isn't known, but something one has a belief about how likely the hire is to provide each different amount of value. I suppose then one can just consider the marginal expected benefit or whatever wherever I said "marginal benefit". Though I guess there's also something to be said there about appetite-for-risk or whatever.
I guess there's the possibility that:
1) the marginal expected benefit of hiring a certain potential new employee is strictly positive
2) it turns out that the actual marginal benefit of employing that person is negative
3) it turns out to be difficult for the company to determine/notice that they would be better off without that employee
and that this could result in the company accumulating employees/positions it would be better off not having?
Not if the point of the argument is to establish that a superintelligence is compatible with achieving the best possible outcome.
Here is a parody of the issue, which is somewhat unfair and leaves out almost all of your argument, but which I hope makes clear the issue I have in mind:
"Proof that a superintelligence can lead to the best possible outcome: Suppose by some method we achieved the best possible outcome. Then, there's no properties we would want a superintelligence to have beyond that, so let's call however we achieved the best possible outcome, 'a superintelligence'. Then, it is possible to have a superintelligence produce the best possible outcome, QED."
In order for an argument to be compelling for the conclusion "It is possible for a superintelligence to lead to good outcomes.", you need to use a meaning of "a superintelligence" in the argument such that the statement "It is possible for a superintelligence to lead to good outcomes", when interpreted with that meaning of "a superintelligence", has the meaning you want that sentence to have. If I argue "it is possible for a superintelligence, by which I mean a computer with a clock speed faster than N, to lead to good outcomes", then, even if I convincingly argue that a computer with a clock speed faster than N can lead to good outcomes, that shouldn't convince people that a superintelligence, in the sense that they have in mind (presumably not defined as "a computer with a clock speed faster than N"), is compatible with good outcomes.
Now, in your argument you say that a superintelligence would presumably be some computational process. True enough! If you then showed that some predicate is true of every computational process, you would then be justified in concluding that that predicate is (presumably) true of every possible superintelligence. But instead, you seem to have argued that a predicate is true of some computational process, and then concluded that it is therefore true of some possible superintelligence. This does not follow.
Yes, I knew the cardinalities in question were finite. The point applies regardless though. For any set X, there is no injection from 2^X to X. In the finite case, this is 2^n > n for all natural numbers n.
If there are N possible states, then the number of functions from possible states to {0,1} is 2^N , which is more than N, so there is some function from the set of possible states to {0,1} which is not implemented by any state.
If your argument is, "if it is possible for humans to produce some (verbal or mechanical) output, then it is possible for a program/machine to produce that output", then, that's true I suppose?
I don't see why you specified "finite depth boolean circuit".
While it does seem like the number of states for a given region of space is bounded, I'm not sure how relevant this is. Not all possible functions from states to {0,1} (or to some larger discrete set) are implementable as some possible state, for cardinality reasons.
I guess maybe that's why you mentioned the thing along the lines of "assume that some amount of wiggle room that is tolerated" ?
One thing you say is that the set of superintelligences is a subset of the set of finite-depth boolean circuits. Later, you say that a lookup table is implementable as a finite-depth boolean circuit, and say that some such lookup table is the aligned superintelligence. But, just because it can be expressed as a finite-depth boolean circuit, it does not follow that it is in the set of possible superintelligences. How are you concluding that such a lookup table constitutes a superintelligence? It seems
Now, I don't think that "aligned superintelligence" is logically impossible, or anything like that, and so I expect that there mathematically-exists a possible aligned-superintelligence (if it isn't logically impossible, then by model existence theorem, there exists a model in which one exists... I guess that doesn't establish that we live in such a model, but whatever).
But I don't find this argument a compelling proof(-sketch).
Yes. I believe that is consistent with what I said.
"not((necessarily, for each thing) : has [x] -> those [x] are such that P_1([x]))"
is equivalent to, " (it is possible that something) has [x], but those [x] are not such that P_1([x])"
not((necessarily, for each thing) : has [x] such that P_2([x]) -> those [x] are such that P_1([x]))
is equivalent to "(it is possible that something) has [x], such that P_2([x]), but those [x] are not such that P_1([x])" .
The latter implies the former, as (A and B and C) implies (A and C), and so the latter is stronger, not weaker, than the former.
Right?
Doesn't "(has preferences, and those preferences are transitive) does not imply (completeness)" imply "(has preferences) does not imply (completeness)" ? Surely if "having preferences" implied completeness, then "having transitive preferences" would also imply completeness?
"Political category" seems, a bit strong? Like, sure, the literal meaning of "processed" is not what people are trying to get at. But, clearly, "those processing steps that are done today in the food production process which were not done N years ago" is a thing we can talk about. (by "processing step" I do not include things like "cleaning the equipment", just steps which are intended to modify the ingredients in some particular way. So, things like, hydrogenation. This also shall not be construed as indicating that I think all steps that were done N years ago were better than steps done today.)
For example, it is not clear to me if once I consider a program that outputs 0101 I will simply ignore other programs that output that same thing plus one bit (e.g. 01010).
No, the thing about prefixes is about what strings encode a program, not about their outputs.
The purpose of this is mostly just to define a prior over possible programs, in a way that conveniently ensures that the total probability assigned over all programs is at most 1. Seeing as it still works for different choices of language, it probably doesn't need to use exactly this way of defining the probabilities, and I think any reasonable distribution over programs will do (at least, after enough observations).
But, while I think another distribution over programs should work, this thing with the prefix-free language is the standard way of doing it, and there are reasons it is nice.
The analogy for a normal programming language would be if no python script was a prefix of any other python script (which isn't true of python scripts, but could be if they were required to end with some "end of program" string)
There will be many different programs which produce the exact same output when run, and will all be considered when doing Solomonoff induction.
The programs in A have 5 bits of Kolmogorov complexity each. The programs in B have 6 bits. The program C has 4
This may be pedantic of me, but I wouldn't call the lengths of the programs, the Kolmogorov complexity of the program. The lengths of the programs are (upper bounds on) the Kolmogorov complexity of the outputs of the programs. The Kolmogorov complexity of a program g, would be the length of the shortest program which outputs the program g, not the length of g.
When you say that program C has 4 bits, is that just a value you picked, or are you obtaining that from somewhere?
Also, for a prefix-free programming language, you can't have 2^5 valid programs of length 5, and 2^6 programs of length 6, because if all possible binary strings of length 5 were valid programs, then no string of length 6 would be a valid program.
This is probably getting away from the core points though
(You could have the programming language be such that, e.g. 00XXXXX outputs the bits XXXXX, and 01XXXXXX outputs the bits XXXXXX, and other programs start with a 1, and any other program might want to encode is somehow encoded using some scheme)
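Here's a small Python sketch of that made-up encoding (just the two literal-string cases, leaving out the programs that start with a 1), to make the prefix-free property and the "probabilities sum to at most 1" point concrete:

```python
# A sketch of the toy prefix-free encoding described above (my own made-up scheme,
# just for illustration): programs starting with "00" are followed by exactly 5
# literal bits to print, programs starting with "01" by exactly 6 literal bits.

def parse(program: str):
    """Return the output of a program in the toy language, or None if invalid."""
    if program.startswith("00") and len(program) == 2 + 5:
        return program[2:]          # prints the 5 literal bits
    if program.startswith("01") and len(program) == 2 + 6:
        return program[2:]          # prints the 6 literal bits
    return None                     # not a program in this fragment of the language

def is_prefix_free(programs):
    """Check that no program in the list is a proper prefix of another."""
    return not any(p != q and q.startswith(p) for p in programs for q in programs)

programs = ["00" + f"{i:05b}" for i in range(2**5)] + ["01" + f"{i:06b}" for i in range(2**6)]
assert is_prefix_free(programs)

# Kraft-style check: the total prior mass 2^-(length) assigned to these programs
# stays below 1, leaving room for the programs that start with "1".
print(sum(2.0 ** -len(p) for p in programs))   # 32*2^-7 + 64*2^-8 = 0.5
```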
the priors for each models will be 2^-5 for model A, 2^-6 for model B and 2^-4 for model C, according to their Kolmogorov complexity?
yeah, the (non-normalized) prior for each will be 2^(-(length of a program which directly encodes a 5 bit string to output)) for the programs which directly encode some 5 bit string and output it, 2^(-(length of a program which directly encodes a 6 bit string to output)) for the programs which directly encode some 6 bit string and output it, and (say) 2^(-4) for program C.
And those likelihoods you gave are all correct for those models.
So, then, the posteriors (prior to normalization)
would be 2^(-(length of a program which directly encodes a 5 bit string to output)) (let's say this is 2^(-7) ) for the program that essentially is print("HTHHT"),
2^(-(length of a program which directly encodes a 6 bit string to output)) (let's say this is 2^(-8) ) for the programs that essentially are print("HTHHTH") and print("HTHHTT") respectively
2^(-4) * 2^(-5) = 2^(-9) for model C.
If we want to restrict to these 4 programs, then, adding these up, we get 2^(-7) + 2^(-8) + 2^(-8) + 2^(-9) = 2^(-6) + 2^(-9) = 9 * 2^(-9), and dividing that, we get
(4/9) chance for the program that hardcodes HTHHT (say, 0010110)
(2/9) chance for the program that hardcodes HTHHTH (say, 01101101)
(2/9) chance for the program that hardcodes HTHHTT (say, 01101100)
(1/9) chance for the program that produces a random 5 bit string. (say, 1000)
So, in this situation, where we've restricted to these programs, the posterior probability distribution for "what is the next bit" would be
(4/9)+(1/9)=(5/9) chance that "there is no next bit" (this case might usually be disregarded/discarded, idk.)
(2/9) chance that the next bit is H
(2/9) chance that the next bit is T
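To make the arithmetic above easy to re-check, here is a small Python sketch of the same toy calculation, restricted (as above) to just these four programs, with the same made-up program lengths:

```python
# Toy Solomonoff-style update restricted to the four programs discussed above.
# Priors are 2^-(program length); likelihoods are P(observe "HTHHT" as a prefix | program).
programs = {
    "hardcode HTHHT":  {"prior": 2**-7, "likelihood": 1.0},    # output is exactly HTHHT
    "hardcode HTHHTH": {"prior": 2**-8, "likelihood": 1.0},    # HTHHT is a prefix of its output
    "hardcode HTHHTT": {"prior": 2**-8, "likelihood": 1.0},
    "random 5 bits":   {"prior": 2**-4, "likelihood": 2**-5},  # chance of emitting HTHHT
}

unnormalized = {name: p["prior"] * p["likelihood"] for name, p in programs.items()}
total = sum(unnormalized.values())
posterior = {name: w / total for name, w in unnormalized.items()}
for name, prob in posterior.items():
    print(f"{name}: {prob:.4f}")   # 4/9, 2/9, 2/9, 1/9

# Posterior over "what is the next bit", given these four programs:
p_next_H = posterior["hardcode HTHHTH"]
p_next_T = posterior["hardcode HTHHTT"]
p_no_next = posterior["hardcode HTHHT"] + posterior["random 5 bits"]
print(p_no_next, p_next_H, p_next_T)   # 5/9, 2/9, 2/9
```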
Thanks! The specific thing I was thinking about most recently was indeed specifically about context length, and I appreciate the answer tailored to that, as it basically fully addresses my concerns in this specific case.
However, I also did mean to ask the question more generally. I kinda hoped that the answers might also be helpful to others who had similar questions (as well as if I had another idea meeting the same criteria in the future), but maybe thinking other people with the same question would find the question+answers here, was not super realistic, idk.
Here is my understanding:
we assume a programming language where a program is a finite sequence of bits, and such that no program is a prefix of another program. So, for example, if 01010010 is a program, then 0101 is not a program.
Then, the (not-normalized) prior probability for a program is 2^(-(length of the program in bits)).
Why that probability?
If you take any infinite sequence of bits, then, because no program is a prefix of any other program, at most one program will be a prefix of that sequence of bits.
If you randomly (with uniform distribution) select an infinite sequence of bits, the probability that the sequence of bits has a particular program as a prefix is then 2^(-(length of the program)) (because there's a factor of (1/2) for each of the bits of the program, and if the first (length of the program) bits match the program, then it doesn't matter what the rest of the bits that come after are).
(Ah, I suppose you don't strictly need to talk about infinite sequences of bits, and you could just talk about randomly picking the value of the next bit, stopping if it ever results in a valid program in the programming language..., not sure which makes it easier to think about.)
If you want this to be an actual prior, you can normalize this by dividing by (the sum over all programs, of 2^(-(length of the program))).
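A quick simulation of that random-bit-stream picture, using, say, the 7-bit program 0010110 mentioned elsewhere in this thread as the one that hardcodes HTHHT (since only the first (length of the program) bits matter, we only need to sample that many):

```python
import random

def random_bits(n):
    return "".join(random.choice("01") for _ in range(n))

program = "0010110"   # the toy "print HTHHT" program, length 7
trials = 200_000
hits = sum(random_bits(len(program)) == program for _ in range(trials))
print(hits / trials, 2.0 ** -len(program))   # both should be close to 1/128
```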
The usual way of defining Solomonoff induction, I believe has the programs being deterministic, but I've read that allowing them to use randomness has equivalent results, and may make some things conceptually easier.
So, I'll make some educated guesses about how to incorporate the programs having random behavior.
Let G be the random variable for "which program gets selected", and g be used to refer to any potential particular value for that variable. (I avoided using p for program because I want to use it for probability. The choice of the letter g was arbitrary.)
Let O be the random variable for the output observed, and let o be used for any particular value for that variable.
And, given a program g, the idea of P(O=o|G=g) makes sense, and P(G=g) is proportional to 2^(-(length of g)) (missing a normalization constant, but this will be the same across all g),
And, P(O=o) will also be a normalization constant that is the same across all g.
And so, if you can compute values P(O=o|G=g) (the program g may take too long to run in practice) we can compute values proportional to P(G=g|O=o) .
Does that help any?
(apologies if this should be a "comment" rather than an "answer". Hoping it suffices.)
Well, I was kinda thinking of the reference distribution as being, say, a distribution of human behaviors in a certain context (as filtered through a particular user interface), though, I guess that way of doing it would only make sense within limited contexts, not general contexts where whether the agent is physically a human or something else, would matter. And in this sort of situation, well, the action of "modify yourself to no-longer be a quantilizer" would not be in the human distribution, because the actions to do that are not applicable to humans (as humans are, presumably, not quantilizers, and the types of self-modification actions that would be available are not the same). Though, "create a successor agent" could still be in the human distribution.
Of course, one doesn't have practical access to "the true probability distribution of human behaviors in context M", so I guess I was imagining a trained approximation to this distribution.
Hm, well, suppose that the distribution over human-like behaviors includes both making an agent which is a quantilizer and making one which isn't, both of equal probability. Hm. I don't see why a general quantilizer in this case would pick the quantilizer over the plain optimizer, as the utility...
Hm...
I get the idea that the "quantilizers correspond to optimizing an infra-function of form [...]" thing is maybe dealing with a distribution over a single act?
Or.. if we have a utility function over histories until the end of the episode, then, if one has a model of how the environment will be and how one is likely to act in all future steps, given each of one's potential actions in the current step, one gets an expected utility conditioned on each of the potential actions in the current step, and this works as a utility function over actions for the current step,
and if one acts as a quantilizer over that, each step.. does that give the same behavior as an agent optimizing an infra-function defined using the condition with the norm described in the post, in terms of the utility function over histories for an entire episode, and reference distributions for the whole episode?
argh, seems difficult...
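In case it helps to have something concrete to poke at, here's a minimal sketch of the kind of per-step quantilizer I have in mind (the action names, utility numbers, and reference probabilities below are all made up; note how, in this toy example, the plain optimizer ends up with most of the post-quantilization mass):

```python
import random

def quantilize(actions, reference_probs, utility, q):
    """Sample from the top-q fraction (by utility) of the reference distribution.

    actions: possible actions for the current step
    reference_probs: reference distribution over those actions (sums to 1)
    utility: maps an action to its (expected) utility for the current step
    q: the quantile, e.g. 0.25 for a "top 25%" quantilizer
    """
    # Sort actions from best to worst according to the utility function.
    order = sorted(range(len(actions)), key=lambda i: utility(actions[i]), reverse=True)
    # Keep reference probability mass until q of it has been accumulated.
    kept, mass = [], 0.0
    for i in order:
        if mass >= q:
            break
        weight = min(reference_probs[i], q - mass)
        kept.append((actions[i], weight))
        mass += weight
    # Renormalize the kept mass and sample from it.
    total = sum(w for _, w in kept)
    r = random.uniform(0, total)
    for action, w in kept:
        r -= w
        if r <= 0:
            return action
    return kept[-1][0]

actions = ["safe-ish plan", "build plain optimizer", "build another quantilizer"]
ref = [0.6, 0.2, 0.2]
util = {"safe-ish plan": 1.0, "build plain optimizer": 5.0, "build another quantilizer": 4.9}.get
print(quantilize(actions, ref, util, q=0.25))
```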
For the "Crappy Optimizer Theorem", I don't understand why condition 4, that if , then , isn't just a tautology[1]. Surely if , then no-matter what is being used,
as , then letting , then , and so .
I guess if the 4 conditions are seen as conditions on a function (where they are written for ), then it no-longer is automatic, and it is just when specifying that for some , that condition 4 becomes automatic?
______________
[start of section spitballing stuff based on the crappy optimizer theorem]
Spitball 1:
What if instead of having s(f) be a single element of X, we had s(f) be a distribution over X? Would we still get the results of the crappy optimizer theorem?
If s(f) is now a distribution over X, then, I suppose instead of writing Q(s)(f)=f(s(f)) one should write Q(s)(f) = s(f)(f) (i.e. the expected value of f under the distribution s(f)), and, in this case, the first 2 and 4th conditions seem just as reasonable. The third condition... seems like it should also be satisfied?
Spitball 2:
While I would expect that the 4 conditions might not be exactly satisfied by, e.g. gradient descent, I would kind of expect basically any reasonable deterministic optimization process to at least "almost" satisfy them? (like, maybe gradient-descent-in-practice would fail condition 1 due to floating point errors, but not too badly in reasonable cases).
Do you think that a modification of this theorem for functions Q(s) which only approximately satisfy conditions 1-3, would be reasonably achievable?
______________
[1] I might be stretching the meaning of "tautology" here. I mean something provable in our usual background mathematics, and which therefore adding it as an additional hypothesis to a theorem, doesn't let us show anything that we couldn't show without it being an explicit hypothesis.
I thought CDT was considered not reflectively-consistent because it fails Newcomb's problem?
(Well, not if you define reflective stability as meaning preservation of anti-Goodhart features, but, CDT doesn't have an anti-Goodhart feature (compared to some base thing) to preserve, so I assume you meant something a little broader?)
Like, isn't it true that a CDT agent who anticipates being in Newcomb-like scenarios would, given the opportunity to do so, modify itself to be not a CDT agent? (Well, assuming that the Newcomb-like scenarios are of the form "at some point in the future, you will be measured, and based on this measurement, your future response will be predicted, and based on this the boxes will be filled")
My understanding of reflective stability was "the agent would not want to modify its method of reasoning". (E.g., a person with an addiction is not reflectively stable, because they want the thing (and pursue the thing), but would rather not want (or pursue) the thing.)
The idea being that, any ideal way of reasoning, should be reflectively stable.
And, I thought that what was being described in the part of this article about recovering quantilizers, was not saying "here's how you can use this framework to make quantalizers better", so much as "quantilizers fit within this framework, and can be described within it, where the infrafunction that produces quantilizer-behavior is this one: [the (convex) set of utility functions which differ (in absolute value) from the given one, by, in expectation under the reference policy, at most epsilon]"
So, I think the idea is that, a quantilizer for a given utility function and reference distribution is, in effect, optimizing for an infrafunction that is/corresponds-to the set of utility functions satisfying the bound in question,
and, therefore, any quantilizer, in a sense, is as if it "has this bound" (or, "believes this bound")
And that therefore, any quantilizer should -
- wait.. that doesn't seem right..? I was going to say that any quantilizer should therefore be reflectively stable, but that seems like it must be wrong? What if the reference distribution includes always taking actions to modify oneself in a way that would result in not being a quantilizer? uhhhhhh
Ah, hm, it seems to me like the way I was imagining the reference distribution, and the context in which you were considering it, are rather different. I was thinking of it as being an accurate distribution of behaviors of some known-to-be-acceptably-safe agent, whereas it seems like you were considering it as having a much larger support, being much more spread out in what behaviors it has as comparably likely to other behaviors, with things being more ruled-out rather than ruled-in ?
Whoops, yes, that should have said , thanks for the catch! I'll edit to make that fix.
Also, yes, what things between and should be sent to, is a difficulty..
A thought I had which, on inspection doesn't work, is that (things between and ) could be sent to , but that doesn't work, because might be terminal, but (thing between and ) isn't terminal. It seems like the only thing that would always work would be for them to be sent to something that has an arrow (in B) to (such as f(a), as you say, but, again as you mention, it might not be viable to determine f(a) from the intermediary state).
I suppose if were a partial function, and one such that all states not in its domain have a path to a state which is in its domain, then that could resolve that?
I think equivalently to that, if you modified the abstraction to get, defined as, and
so that B' has a state for each state of B, along with a second copy of it with no in-going arrows and a single arrow going to the normal version,
uh, I think that would also handle it ok? But this would involve modifying the abstraction, which isn't nice. At least the abstraction embeds in the modified version of it though.
Though, if this is equivalent to allowing f to be partial, with the condition that anything not in its domain have arrows leading to things that are in its domain, then I guess it might help to justify a definition allowing to be partial, provided it satisfies that condition.
Are these actually equivalent?
Suppose is partial and satisfies that condition.
Then, define to agree with f on the domain of f, and for other , pick an in the domain of f such that , and define
In a deterministic case, should pick to be the first state in the path starting with to be in the domain of . (In the non-deterministic case, this feels like something that doesn't work very well...)
Then, for any non-terminal , either it is in the domain of f, in which case we have the existence of such that a and where , or it isn't, in which case we have that there exists such that a and where , and so f' satisfies the required condition.
On the other hand, if we have an satisfying the required condition, we can co-restrict it to B, giving a partial function , where, if isn't in the domain of f, then, assuming a is non-terminal, we get some s.t. a and where , and, as the only things in B' which have any arrows going in are elements of B, we get that f'(a'') is in B, and therefore that a'' is in the domain of f.
But what if a is terminal? Well, if we require that non-terminal implies non-terminal, then this won't be an issue because pre(b) is always non-terminal, and so anything terminal is in the domain.
Therefore the co-restriction f of f', does yield a partial function satisfying the proposed condition to require for partial maps.
So, this gives a pair of maps, one sending functions satisfying appropriate conditions to partial function satisfying other conditions, and one sending the other way around, and where, if the computations are deterministic, these two are inverses of each-other. (if not assuming deterministic, then I guess potentially only a one-sided inverse?)
So, these two things are equivalent.
uh... this seems a bit like an adjunction, but, not quite. hm.
A thought on the "but what if multiple steps in the actual-algorithm correspond to a single step in an abstracted form of the algorithm?" thing :
This reminds me a bit of, in the topic of "Abstract Rewriting Systems", the thing that the → vs →* distinction handles. (the asterisk just indicating taking the transitive reflexive closure)
Suppose we have two abstract rewriting systems (A, →_A) and (B, →_B).
(To make it match more closely what you are describing, we can suppose that every node has at most one outgoing arrow, to make it fit with how you have timesteps as functions, rather than being non-deterministic. This probably makes them less interesting as ARSs, but that's not a problem)
Then, we could say that a map f : A → B is a homomorphism (I guess) if,
for all a in A such that a has an outgoing arrow, there exists a' such that a →_A a' and f(a) →*_B f(a').
These should compose appropriately [err, see bit at the end for caveat], and form the morphisms of a category (where the objects are ARSs).
I would think that this should handle any straightforward simulation of one Turing machine by another.
As for whether it can handle complicated disguise-y ones, uh, idk? Well, if it like, actually simulates the other Turing machine, in a way where states of the simulated machine have corresponding states in the simulating machine, which are traversed in order, then I guess yeah. If the universal Turing machine instead does something pathological like, "search for another Turing machine along with a proof that the two have the same output behavior, and then simulate that other one", then I wouldn't think it would be captured by this, but also, I don't think it should be?
This setup should also handle the circuits example fine, and, as a bonus(?), can even handle like, different evaluation orders of the circuit nodes, if you allow multiple outgoing arrows.
And, this setup should, I think, handle anything that the "one step corresponds to one step" version handles?
It seems to me like this set-up, should be able to apply to basically anything of the form "this program implements this rough algorithm (and possibly does other stuff at the same time)"? Though to handle probabilities and such I guess it would have to be amended.
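To make the condition concrete, here's a small Python sketch (deterministic case only, using my reconstruction of the condition above, and with a crude step bound standing in for the reflexive transitive closure):

```python
def is_homomorphism(step_A, step_B, f, max_steps=100):
    """Check (crudely, with a step bound) the condition above, for deterministic
    systems: for every non-terminal a, f(a) ->*_B f(a'), where a' = step_A(a).

    step_A, step_B: dicts mapping each non-terminal state to its unique successor.
    f: dict mapping states of A to states of B.
    """
    for a, a_next in step_A.items():          # every non-terminal a of A
        target = f[a_next]
        b = f[a]
        for _ in range(max_steps + 1):
            if b == target:                   # reached f(a') in zero or more B-steps
                break
            if b not in step_B:               # hit a terminal state of B first
                return False
            b = step_B[b]
        else:
            return False                      # didn't reach the target within the bound
    return True

# Toy example: A counts down 3 -> 2 -> 1 -> 0; B simulates each A-step with two steps.
step_A = {3: 2, 2: 1, 1: 0}
step_B = {"3a": "3b", "3b": "2a", "2a": "2b", "2b": "1a", "1a": "1b", "1b": "0a"}
f = {3: "3a", 2: "2a", 1: "1a", 0: "0a"}
print(is_homomorphism(step_A, step_B, f))     # True: one A-step corresponds to two B-steps
```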
I feel like I'm being overconfident/presumptuous about this, so like,
sorry if I'm going off on something that's clearly not the right type of thing for what you're looking for?
__________
Checking that it composes:
suppose we have homomorphisms f : A → B and g : B → C,
then for any which has an outgoing arrow and where f(a) has an outgoing arrow, where
so either b' = f(a) or there is some sequence where , and so therefore,
Ah, hm, maybe we need an additional requirement that the maps be surjective?
or, hm, as we have the assumption that has an outward arrow,
then we get that there is an s.t. and s.t.
ok, so, I guess the extra assumption that we need isn't quite "the maps are surjective", so much as, "the maps f are s.t. if a is a non-terminal state, then f(a) is a non-terminal state", where "non-terminal state" means one with an outgoing arrow. This seems like a reasonable condition to assume.
In the line that ends with "even if God would not allow complete extinction.", my impulse is to include " (or other forms of permanent doom)" before the period, but I suspect that this is due to my tendency to include excessive details/notes/etc. and probably best not to actually include in that sentence.
(Like, for example, if there were no more adult humans, only billions of babies grown in artificial wombs (in a way staggered in time) and then kept in a state of chemically induced euphoria until the age of 1, and then killed, that technically wouldn't be human extinction, but, that scenario would still count as doom.)
Regarding the part about "it is secular scientific-materialists who are doing the research which is a threat to my values" part: I think it is good that it discusses this! (and I hadn't thought about including it)
But, I'm personally somewhat skeptical that CEV really works as a solution to this problem? Or at least, in the simpler ways of it being described.
Like, I imagine there being a lot of path-dependence in how a culture's values would "progress" over time, and I see little reason why a sequence of changes of the form "opinion/values changing in response to an argument that seems to make sense" would be that unlikely to produce values that the initial values would deem horrifying? (or, which would seem horrifying to those in an alternate possible future that just happened to take a difference branch in how their values evolved)
[EDIT: at this point, I start going off on a tangent which is a fair bit less relevant to the question of improving Stampy's response, so, you might want to skip reading it, idk]
My preferred solution is closer to, "we avoid applying large amounts of optimization pressure to most topics, instead applying it only to topics where there is near-unanimous agreement on what kinds of outcomes are better (such as, "humanity doesn't get wiped out by a big space rock", "it is better for people to not have terrible diseases", etc.), while avoiding these optimizations having much effect on other areas where there is much disagreement as to what-is-good".
Though, it does seem plausible to me, as a somewhat scary idea, that the thing I just described is perhaps not exactly coherent?
(that being said, even though I have my doubts about CEV, at least in the form described in the simpler ways it is described, I do think it would of course be better than doom.
Also, it is quite possible that I'm just misunderstanding the idea of CEV in a way that causes my concerns, and maybe it was always meant to exclude the kinds of things I describe being concerned about?)
I want to personally confirm a lot of what you've said here. As a Christian, I'm not entirely freaked out about AI risk because I don't believe that God will allow it to be completely the end of the world (unless it is part of the planned end before the world is remade? But that seems unlikely to me.), but that's no reason that it can't still go very very badly (seeing as, well, the Holocaust happened).
In addition, the thing that seems to me most likely to be the way that God doesn't allow AI doom, is for people working on AI safety to succeed. One shouldn't rely on miracles and all that (unless [...]), so, basically I think we should plan/work as if it is up to humanity to prevent AI doom, only that I'm a bit less scared of the possibility of failure, but I would hope only in a way that results in better action (compared to panic) rather than it promoting inaction.
(And, a likely alternative, if we don't succeed, I think of as likely being something like,
really-bad-stuff happens, but then maybe an EMP (or many EMPs worldwide?) gets activated, solving that problem, but also causing large-scale damage to power-grids, frying lots of equipment, and causing many shortages of many things necessary for the economy, which also causes many people to die. idk.)
I don't understand why this comment has negative "agreement karma". What do people mean by disagreeing with it? Do they mean to answer the question with "no"?
First, I want to summarize what I understand to be what your example is an example of:
"A triple consisting of
1) A predicate P
2) the task of generating any single input x for which P(x) is true
3) the task of, given any x (and given only x, not given any extra witness information), evaluating whether P(x) is true
"
For such triples, it is clear, as your example shows, that the second task (the 3rd entry) can be much harder than the first task (the 2nd entry).
_______
On the other hand, if instead one had the task of producing an exhaustive list of all x such that P(x), this, I think, cannot be easier than verifying whether such a list is correct (provided that one can easily evaluate whether x=y for whatever type x and y come from), as one can simply generate the list, and check if it is the same list.
Another question that comes to mind is: Are there predicates P such that true instances can be generated easily, but verifying those kinds of instances is much harder?
It seems that the answer to this is also "yes": Consider P to be "is this the result of applying this cryptographic hash function to (e.g.) a prime number?". It is fairly easy to generate large prime numbers, and then apply the hash function to it. It is quite difficult to determine whether something is the hash of a prime number (... maybe assume that the hash function produces more bits for longer inputs in order to avoid having pigeonhole principle stuff that might result in it being highly likely that all of the possible outputs are the hash of some prime number. Or just put a bound on how big the prime numbers can be in order for P to be true of the hash.)
(Also, for the task of "does this machine halt": given a particular verifying process that only gets the specification of the machine and not e.g. a log of it running, it should probably(?) be reasonably easy to produce machines that halt but which that particular verifying process will not confirm quickly.
So, for any easy-way-to-verify there is an easy-way-to-generate which produces ones that the easy-way-to-verify cannot verify, so that seems to be another reason why "yes", though, there may be some subtleties here?)
As you know, there's a straightforward way, given any boolean circuit, to turn it into a version which is a tree, by just taking all the parts which have two wires coming out from a gate, and making duplicates of everything that leads into that gate.
I imagine that it would also be feasible to compute the size of this expanded-out version without having to actually expand out the whole thing?
Searching through normal boolean circuits, but using a cost which is based on the size if it were split into trees, sounds to me like it would give you the memoization-and-such speedup, and be kind of equivalent theoretically to using the actual trees?
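For what I mean by computing the expanded-out size without expanding anything: a small sketch, assuming the circuit is given as a DAG of gates (the representation here is just made up for the example):

```python
from functools import lru_cache

# Circuit as a DAG: each gate maps to its input wires (primary inputs have no entry).
# Made-up example: g3 reuses g1's output both directly and via g2, so the tree
# version would duplicate g1.
circuit = {
    "g1": ("x", "y"),
    "g2": ("g1", "z"),
    "g3": ("g1", "g2"),
}

@lru_cache(maxsize=None)
def tree_size(node: str) -> int:
    """Number of gate nodes in the fully expanded tree rooted at `node`."""
    if node not in circuit:          # primary input: contributes no gates
        return 0
    return 1 + sum(tree_size(child) for child in circuit[node])

print(tree_size("g3"))   # 4: g3, g2, and two copies of g1 in the expanded tree
```

The memoization (lru_cache) is what makes this linear in the size of the DAG rather than the size of the expanded tree.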
A modification of it comes to mind. What if you allow some gates which have multiple outputs, but don't allow gates which duplicate their input? Or, specifically, what if you allow something like toffoli gates as well as the ability to just discard an output? The two parts of the output would be sorta independent due to the gate being invertible (... though possibly there could be problems with that due to original input being used more than once?)
I don't know whether this still has the can-do-greedy-search properties you want, but it seems like it might in a sense prevent using some computational result multiple times (unless having multiple copies of initial inputs available somehow breaks that).
It seems like the 5th sentence has its ending cut off? "it tries to parcel credit and blame for a decision up to the input neurons, even when credit and blame" , seems like it should continue [do/are x] for some x.
When you say "which yields a solution of the form ", are you saying that yields that, or are you saying that yields that? Because, for the former, that seems wrong? Specifically, the former should yield only things of the form .
But, if the latter, then, I would think that the solutions would be more solutions than that?
Like, what about ? (where, say, and
. so
which, for , and , is positive, and so g should also be a solution to , yes?
Or, were you just giving a subset of the solutions, of the differential inequality, which were all the ones you needed to make the point?
Separately (btw, what's the convention if one has unrelated comments? to put them in one comment or separate comments?) :
As for how the network could detect something about its current weights: if it is a recurrent network, it seems like some form of this should be learnable.
Suppose you set up the architecture for a recurrent neural network, and pick some arbitrary linear combination (but where the coefficients aren't especially big) of the weights in the network, and then for the loss function, take the main part of the network's output as doing some normal task, and have another part of the output be judged for how well it approximates the current value of the linear combination of all the network's weights, with the initially chosen coefficients. It seems plausible to me that this would do well. Of course, one way it might fail to work properly is if this linear combination ends up becoming roughly constant, or just if some weights stop changing during the training, leading to this second output not getting the signal needed to learn that part. Maybe if you used dropout, and took into account which weights were dropped out when evaluating the weighted sum of the weights (counting them as zero) in the loss, it could fix that issue. (dropout being done the same in each recurrence, not separately.)
Why "a linear combination of the weights" instead of the particular weights? Well, I imagine getting all the individual weights would be not particularly feasible unless the architecture was specifically designed to do that (like in the architectures for those quine-ing neural nets), but, if we are looking at a mix between "how the net should be", and "a net that is doing gradient hacking", then it seems like "some particular weighted sum of the weights", might be all that would be needed. Or perhaps two weighted sums with different coefficients if having the and separate.
On the other hand, if the network isn't recurrent, or if all the recurrence is done in a vector-quantization-y way, I would imagine it would be more difficult for the network to estimate a given linear combination of its current weights. (Though, if the linear combination of weights is only of weights before a certain layer, then layers after that layer might be able to approximate it without any recurrence, but, the layers that do and use this approximation could be modified without this being detected, and therefore, I'd think it couldn't protect these later parts from being changed by gradient descent, or at least, it would be harder.)
(I'm imagining, if people are trying to make this sort of thing work, that these networks are using just RELU, not like a smoothed version of RELU, so that the values of the weights are reflected more piecewise-linearly in the activations of one layer, given activations in the previous layer, which I think would facilitate estimating other linear combinations of the weights between the previous layers, and as such also for estimating this for a linear combination of weights in any layer.)
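To make the "auxiliary output regresses onto a fixed linear combination of the current weights" idea concrete, here is a rough PyTorch sketch of the training setup I have in mind (the architecture, the toy main task, and all hyperparameters are made up; I haven't checked how well this actually learns, and it leaves out the dropout idea):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class TinyRNN(nn.Module):
    def __init__(self, d_in=4, d_hidden=16):
        super().__init__()
        self.rnn = nn.RNN(d_in, d_hidden, batch_first=True)
        self.main_head = nn.Linear(d_hidden, 1)   # the "normal task" output
        self.aux_head = nn.Linear(d_hidden, 1)    # tries to report the weight-combination

    def forward(self, x):
        h, _ = self.rnn(x)
        last = h[:, -1, :]
        return self.main_head(last), self.aux_head(last)

model = TinyRNN()
params = list(model.parameters())
# Arbitrary fixed, not-too-large coefficients for the linear combination of all weights.
coeffs = [torch.randn_like(p) * 0.01 for p in params]

opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(1000):
    x = torch.randn(32, 5, 4)                    # random input sequences
    y = x.sum(dim=(1, 2)).unsqueeze(1)           # toy "main task": predict the sum
    main_out, aux_out = model(x)
    # Current value of the chosen linear combination of the weights
    # (a constant w.r.t. this step's gradient, hence the detach).
    target = sum((c * p.detach()).sum() for c, p in zip(coeffs, params))
    loss = nn.functional.mse_loss(main_out, y) \
         + nn.functional.mse_loss(aux_out, target * torch.ones_like(aux_out))
    opt.zero_grad()
    loss.backward()
    opt.step()

print(float(aux_out.mean()), float(target))  # how close the net got to reporting it
```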
As another "why not just" which I'm sure there's a reason for:
in the original circuits thread, they made a number of parameterized families of synthetic images which certain nodes in the network responded strongly to in a way that varied smoothly with the orientation parameter, and where these nodes detected e.g. boundaries between high-frequency and low-frequency regions at different orientations.
If you were given another such network of generally the same kind of architecture, and gave that network the same images, then if it also had analogous nodes, I'd expect those nodes to have much more similar responses to those images than any other nodes in the network would. I would expect that cosine similarity of the "how strongly does this node respond to this image" vectors would be able to pick out the node(s) in question fairly well? Perhaps I'm wrong about that.
And, of course, this idea seems only directly applicable to feed-forward convolution networks that take an image as the input, and so, not so applicable when trying to like, understand how an agent works, probably.
(well, maybe it would work in things that aren't just a convolutions-and-pooling-and-dilation-etc , but seems like it would be hard to make the analogous synthetic inputs which exemplify the sort of thing that the node responds to, for inputs other than images. Especially if the inputs are from a particularly discrete space, like sentences or something. )
But, this makes me a bit unclear about why the "NP-HARD" lights start blinking.
Of course, "find isomorphic structure", sure.
But, if we have a set of situations which exemplify when a given node does and does not fire (rather, when it activates more and when it activates less) in one network, searching another network for a node that does/doesn't activate in those same situations, hardly seems NP-hard. Just check all the nodes for whether they do or don't light up. And then, if you also have similar characterizations for what causes activation in the nodes that came before the given node in the first network, apply the same process with those on the nodes in the second network that come before the nodes that matched the closest.
I suppose if you want to give an overall score for each combination of "this sub-network of nodes in the new network corresponds to this sub-network of nodes in the old-and-understood network", and find the sub-network that gives the best score, then, sure, there could be exponentially many sub-networks to consider. But, if each well-understood node in the old network generally has basically only one plausibly corresponding node in the new network, then this seems like it might not really be an issue in practice?
But, I don't have any real experience with this kind of thing, and I could be totally off.
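To illustrate the kind of "just check all the nodes" matching I mean, here's a small numpy sketch (entirely synthetic data, just to show the shape of the computation, which is linear in the number of node pairs rather than anything combinatorial):

```python
import numpy as np

rng = np.random.default_rng(0)

def match_nodes(acts_old: np.ndarray, acts_new: np.ndarray):
    """Greedy matching of 'understood' nodes to nodes of a new network.

    acts_old: (n_old_nodes, n_probe_images) activations of the understood nodes
    acts_new: (n_new_nodes, n_probe_images) activations of the new network's nodes
    Returns, for each old node, the index of the most cosine-similar new node.
    """
    def normalize(a):
        return a / (np.linalg.norm(a, axis=1, keepdims=True) + 1e-9)
    sims = normalize(acts_old) @ normalize(acts_new).T   # (n_old, n_new) cosine similarities
    return sims.argmax(axis=1), sims.max(axis=1)

# Made-up demo: the "new" network contains noisy copies of the old nodes, hidden among others.
old = rng.normal(size=(5, 200))                          # 5 understood nodes, 200 probe images
perm = rng.permutation(40)
new = rng.normal(size=(40, 200))
new[perm[:5]] = old + 0.1 * rng.normal(size=old.shape)   # noisy copies at positions perm[:5]
matches, scores = match_nodes(old, new)
print(matches, perm[:5])                                 # the matches should recover perm[:5]
print(np.round(scores, 2))
```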
I was surprised by how the fine-tuning was done for the verbalized confidence.
My initial expectation was that it would make the loss be based on like, some scoring rule based on the probability expressed and the right answer.
Though, come to think of it, I guess seeing as it would be assigning logits values to different expressions of probabilities, it would have to... what, take the weighted average of the scores it would get if it gave the different probabilities? And, I suppose that if many training steps were done on the same question/answer pairs, then the confidences might just get pushed towards 0% or 100%?
Ah, but for the indirect logit it was trained using the "is it right or wrong" with the cross-entropy loss thing. Ok, cool.
For m such that m is a mesa-optimizer, let be the space it optimizes over, and be its utility function .
I know you said "which we need not notate", but I am going to say that for and , that , and is the space of actions (or possibly, and is the space of actions available in the situation )
(Though maybe you just meant that we need not notate separately from s, the map from X to A which s defines. In which case, I agree, and as such I'm writing instead of saying that something belongs to the function space . )
For to have its optimization over have any relevance, there has to be some connection between the chosen (chosen by m) , and .
So, the process by which m produces m(x) when given x, should involve the selected .
Moreover, the selection of the ought to depend on x in some way, as otherwise the choice of is constant each time, and can be regarded as just a constant value in how m functions.
So, it seems that what I said was should instead be either , or (in the latter case I suppose one might say )
Call the process that produces the action using the choice of by the name
(or more generally, ) .
is allowed to also use randomness in addition to and . I'm not assuming that it is a deterministic function. Though come to think of it, I'm not sure why it would need to be non-deterministic? Oh well, regardless.
Presumably whatever is being used to select , depends primarily (though not necessarily exclusively) on what s(x) is for various values of x, or at least on something which indicates things about that, as f is supposed to be for selecting systems which take good actions?
Supposing that for the mesa-optimizer that the inner optimization procedure (which I don't have a symbol for) and the inner optimization goal (i.e. ) are separate enough, one could ask "what if we had m, except with replaced with , and looked at how the outputs of and differ, where and are respectively are selected (by m's optimizer) by optimizing for the goals , and respectively?".
Supposing that we can isolate the part of how f(s) depends on s which is based on what is or tends to be for different values of , then there would be a "how would differ if m used instead of ?".
If in place of would result in things which, according to how works, would be better, then it seems like it would make sense to say that isn't fully aligned with ?
Of course, what I just described makes a number of assumptions which are questionable:
- It assumes that there is a well-defined optimization procedure that m uses which is cleanly separable from the goal which it optimizes for
- It assumes that how f depends on s can be cleanly separated into a part which depends on (the map in which is induced by ) and (the rest of the dependency on )
The first of these is also connected to another potential flaw with what I said, which is, it seems to describe the alignment of the combination of (the optimizer m uses) along with , with , rather than just the alignment of with .
So, alternatively, one might say something about like, disregarding how the searching behaves and how it selects things that score well at the goal , and just compare how and tend to compare when and are generic things which score well under and respectively, rather than using the specific procedure that uses to find something which scores well under , and this should also, I think, address the issue of possibly not having a cleanly separable "how it optimizes for it" method that works for generic "what it optimizes for".
The second issue, I suspect to not really be a big problem? If we are designing the outer-optimizer, then presumably we understand how it is evaluating things, and understand how that uses the choices of for different .
I may have substantially misunderstood your point?
Or, was your point that the original thing didn't lay these things out plainly, and that it should have?
Ok, reading more carefully, I see you wrote
I can certainly imagine that it may be possible to add in details on a case-by-case basis or at least to restrict to a specific explicit class of base objectives and then explicitly define how to compare mesa-objectives to them.
and the other things right before and after that part, and so I guess something like "it wasn't stated precisely enough for the cases it is meant to apply to / was presented as applying as a concept more generally than made sense as it was defined" was the point, and I had sorta missed that initially.
(I have no expertise in these matters; unless shown otherwise, assume that in this comment I don't know what I'm talking about.)
Is this something that the infra-bayesianism idea could address? So, would an infra-bayesian version of AIXI be able to handle worlds that include halting oracles, even though they aren't exactly in its hypothesis class?
Do I understand correctly that in general the elements of A, B, C, are achievable probability distributions over the set of n possible outcomes? (But that in the examples given with the deterministic environments, these are all standard basis vectors / one-hot vectors / deterministic distributions ?)
And, in the case where these outcomes are deterministic, and A and B are disjoint, and A is much larger than B, then given a utility function on the possible outcomes in A or B, a random permutation of this utility function will, with high probability, have the optimal (or a weakly optimal) outcome be in A?
(Specifically, if I haven't messed up, if asymptotically (as |B| goes to infinity) then the probability of there being something in A which is weakly better than anything in B goes to 1 , and if then the probability goes to at least , I think?
Coming from )
While I'd readily believe it, I don't really understand why this extends to the case where the elements of A and B aren't deterministic outcomes but distributions over outcomes. Maybe I need to review some of the prior posts.
Like, what if every element of A was a probability distribution over 3 different observation-histories (each with probability 1/3), and every element of B was a probability distribution over 2 different observation-histories (each with probability 1/2)? (e.g. if one changes pixel 1 at time 1, then in addition to the state of the pixel grid, one observes at random either an orange light or a purple light, while if one instead changes pixel 2 at time 1, then in addition to the pixel grid state, one observes at random either a red, green, or blue light) Then no permutation of the set of observation-histories would convert any element of A into an element of B, nor vice versa.
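(To sanity-check the earlier, deterministic-outcomes claim numerically — this doesn't address the distributions-over-observation-histories case just described — here's a quick Monte Carlo sketch:)

```python
import random

# Assign a random permutation of utilities to the outcomes in A and B, and see
# how often the best outcome lands in A; for |A| >> |B| this should be close to 1.
def prob_best_in_A(size_A, size_B, trials=100_000):
    hits = 0
    for _ in range(trials):
        utilities = random.sample(range(size_A + size_B), size_A + size_B)
        best = max(range(size_A + size_B), key=lambda i: utilities[i])
        hits += best < size_A                     # outcomes 0..size_A-1 belong to A
    return hits / trials

print(prob_best_in_A(1000, 10))   # close to 1000/1010, i.e. about 0.99
```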
My understanding:
One could create a program which hard-codes the point about which the price oscillates (as well as some amount by which the price always eventually moves away from that point in each direction), and have it buy once when below, and then wait until the price is above to sell, and then wait until the price is below to buy, etc.
The programs receive as input the prices which the market maker is offering.
It doesn't need to predict ahead of time how long until the next peak or trough, it only needs to correctly assume that it does oscillate sufficiently, and respond when it does.
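Something like the following sketch (the interface and names are made up, just to show the shape of such a program):

```python
def make_oscillation_trader(threshold, margin):
    """Trader that hard-codes the point the price oscillates around (a guess at
    the kind of program described above; the interface here is made up).

    Each step it sees the market maker's current price and returns +1 (buy one
    share), -1 (sell one share), or 0 (hold)."""
    state = {"holding": False}
    def trade(price):
        if not state["holding"] and price < threshold - margin:
            state["holding"] = True
            return +1
        if state["holding"] and price > threshold + margin:
            state["holding"] = False
            return -1
        return 0
    return trade

# Toy run against a price series that oscillates around 0.5 by at least 0.2:
trader = make_oscillation_trader(threshold=0.5, margin=0.1)
prices = [0.5, 0.25, 0.4, 0.8, 0.3, 0.75]
print([trader(p) for p in prices])   # [0, 1, 0, -1, 1, -1] -- buys low, sells high
```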
The part about Chimera functions was surprising, and I look forward to seeing where that will go, and to more of this in general.
In section 2.1 , Proposition 2 should presumably say that is a partial order on rather than on .
In the section about Non-Dogmatism , I believe something was switched around. It says that if the logical inductor assigns prices converging to $1 to a proposition that cannot be proven, that the trader can buy shares in that proposition at prices of $ and thereby gain infinite potential upside. I believe this should say that if the logical inductor assigns prices converging to $0 to a proposition that can't be dis-proven, instead of prices converging to $1 for a proposition that can't be proven .
(I think that if the price was converging to $1 for a proposition that cannot be proven, the trader would sell shares at prices $ , for potential gain of $1 each time, and potential losses of , so, to have this be $ , this should be .)
There's also a little formatting error with the LaTeX in section 4.1
Nice summary/guide! It made the idea behind the construction of the algorithm much more clear to me.
(I had a decent understanding of the criterion, but I hadn't really understood big picture of the algorithm. I think I had previously been tripped up by the details around the continuity and such, and not following these led to me not getting the big picture of it.)
You said that you thought that this could be done in a categorical way. I attempted something which appears to describe the same thing when applied to the category FinSet , but I'm not sure it's the sort of thing you meant by when you suggested that the combinatorial part could potentially be done in a categorical way instead, and I'm not sure that it is fully categorical.
Let S be an object.
For i from 1 to k, let B_i be an object (which is not anything isomorphic to the product of itself with itself, or at least is not the terminal object).
Let f : B_1 × ... × B_k → S be an isomorphism.
Then, say that (B_1, ..., B_k, f) is a representation of a factorization of S.
If (B_1, ..., B_k, f) and (B'_1, ..., B'_k, f') are each a representation of a factorization of S, then say that they represent the same factorization of S iff there exist isomorphisms g_i : B_i → B'_i such that, letting g : B_1 × ... × B_k → B'_1 × ... × B'_k be the isomorphism obtained from the g_i with the usual product map, the composition of it with f' is equal to f, that is, f = f' ∘ g.
Then say that a factorization is the class of representations of the same factorization. (being a representation of the same factorization is an equivalence relation).
For FinSet , the factorizations defined this way correspond to the factorizations as originally defined.
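Concretely, for FinSet, here is a small Python sketch of (what I understand to be) the originally-defined, partition-style notion, which the isomorphism-style definition above is meant to recover: a tuple of partitions of S is a factorization iff the map sending each element to the tuple of parts containing it is a bijection onto the product of the partitions.

```python
# Sketch (FinSet only, and assuming I'm remembering the original definition right).
def is_factorization(S, partitions):
    def part_of(s, partition):
        return next(p for p in partition if s in p)
    # Image of the map s -> (part containing s in B_1, ..., part containing s in B_k).
    images = {tuple(part_of(s, B) for B in partitions) for s in S}
    size_of_product = 1
    for B in partitions:
        size_of_product *= len(B)
    return len(images) == len(S) == size_of_product

S = {0, 1, 2, 3}
B1 = (frozenset({0, 1}), frozenset({2, 3}))   # "first bit"
B2 = (frozenset({0, 2}), frozenset({1, 3}))   # "second bit"
B3 = (frozenset({0, 3}), frozenset({1, 2}))
print(is_factorization(S, (B1, B2)))   # True
print(is_factorization(S, (B1, B3)))   # True
print(is_factorization(S, (B2, B3)))   # True
print(is_factorization(S, (B1, B1)))   # False: the product map isn't injective
```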
However, I've no idea whether this definition remains interesting if applied to other categories.
For example, if it were to be applied to the closed disk in a category of topological spaces and continuous functions, it seems that most of the isomorphisms from [0,1] * [0,1] to the disk would be distinct factorizations, even though there would still be many which are identified, and I don't really see talking about the different factorizations of the closed disk as saying much of note. I guess the factorizations using [0,1] and [0,1] correspond to different cosets of the group of automorphisms of the closed disk by a particular subgroup, but I'm pretty sure it isn't a normal subgroup, so no luck there.
If instead we try the category of vector spaces and linear maps over a particular field, then I guess it looks more potentially interesting. I guess things over sets having good analogies over vector spaces is a common occurrence. But here still, the subgroups of the automorphism groups given largely by the products of the automorphism groups of the things in the product, seems like they still usually fail to be a normal subgroup, I think. But regardless, it still looks like there's some ok properties to them, something kinda Grassmannian-ish ? idk. Better properties than in the topological spaces case anyway.
I've now computed the volumes within the [-a,a]^3 cube for and, or, and the constant 1 function. I was surprised by the results.
(I hadn't considered that the ratios between the volumes will not depend on the size of the cube)
If we select x,y,z uniformly at random within this cube, the probability of getting the and gate is 1/48, the probability of getting the or gate is 2/48, and the probability of getting the constant 1 function is 13/48 (more than 1/4).
This I found quite surprising, because of the constant 1 function requiring 4 half planes to express the conditions for it.
So, now I'm guessing that the ones that required fewer half spaces to specify, are the ones where the individual constraints are already implying other constraints, and so actually will tend to have a smaller volume.
On the other hand, I still haven't computed any of them for if projecting onto the sphere, and so this measure kind of gives extra weight to the things in the directions near the corners of the cube, compared to the measure that would be if using the sphere.
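Here's a quick Monte Carlo sketch (my own, just to make these fractions easy to re-check numerically):

```python
import random

# Monte Carlo estimate of the cube-volume fractions mentioned above: sample the
# weights x, y and the bias z uniformly in the cube and record which 2-input gate
# the perceptron  output = 1 if a*x + b*y + z > 0  implements.
def gate_of(x, y, z):
    return tuple(int(a * x + b * y + z > 0) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)])

AND, OR, CONST1 = (0, 0, 0, 1), (0, 1, 1, 1), (1, 1, 1, 1)
counts = {AND: 0, OR: 0, CONST1: 0}
trials = 1_000_000
for _ in range(trials):
    g = gate_of(*(random.uniform(-1, 1) for _ in range(3)))
    if g in counts:
        counts[g] += 1

for name, key in [("and", AND), ("or", OR), ("const 1", CONST1)]:
    print(name, counts[key] / trials * 48)   # should come out near 1, 2, and 13
```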
For the volumes, I suppose that because scaling all of these parameters by the same positive constant doesn't change the function computed, it would make sense to compute the (one dimension lower) volumes of the corresponding regions of the unit sphere, and this would handle the issues with these regions having unbounded size.
(this would still work with more parameters, it would just be a higher dimensional sphere)
Er, would that give the same thing as the limit if we took the parameters within a cube?
Anyway, at least in this case, if we use the "projected onto the sphere" case, we could evaluate the areas by splitting the regions (which would be polygons of some kind, with edges being arcs of great circles) into triangles, and then using the formulas for the areas of triangles on a sphere. Actually, they might already be triangles, I'm not sure.
Would this work in higher dimensions? I don't know of formulas for computing the measure of a n-simplex (with flat facets or whatever the right terminology is) within an n-sphere, but I suspect that they shouldn't be too bad?
I'm not sure which is the more sensible thing to measure, the volumes of the intersection of the half spaces (intersected with a large cube centered at the origin and aligned with the coordinate axes), or the volume (one dimension lower) of that intersected-with/projected-onto the unit sphere.
Well, I guess if we assume that the coefficients are identically and independently distributed with a Gaussian distribution, then that would be a fairly natural choice, and should result in things being symmetric about rotations in the origin, which would seem to point to the choice of projecting it all to the (hyper-)sphere.
Well, I suppose in either case (whether on the sphere or in a cube), even before trying to apply some formulas about the area of a triangle on a sphere, there's always the "just take the integral" option.
(in the cube option, this would I think be more straightforwards. Just have to do a triple integral (more in higher dimensions) of 1 with linear inequalities for the bounds. No real issues should show up.)
I'll attempt it with the conditions for "and" for the "on the sphere" case, to check the feasibility.
If we have x+y+z>0, x+z<0, y+z<0, then we necessarily also have z<0 , x>0, y>0 , in particular x<-z , y<-z . If we have x,y,z on the unit sphere, then we have x^2+y^2+z^2=1 . So, for each value of z (which must be strictly between -1 and 0) we have x^2 + y^2 = 1 - z^2 , and because we have x>0 and y>0 , for a given z, for each value of x there is exactly one value of y, and visa versa.
So, y = sqrt(1 - z^2 - x^2) , and so we have x + sqrt(1 - z^2 - x^2) > -z , ...
This is somewhat more difficult to calculate than I had hoped.
Still confident that it can be done, but I shouldn't finish this calculation right now due to responsibilities.
It looks like, at least in this case with 3 parameters, it would probably be easier to use the formulas for the area of triangles on a sphere, but I wouldn't be surprised if doing it that way becomes harder when generalizing to higher dimensions.
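One way to sidestep this integral, at least numerically, is the Gaussian observation above: since i.i.d. standard Gaussian coefficients are rotationally symmetric and the inequalities are scale-invariant, the probability that a Gaussian sample satisfies them equals the fraction of the sphere's area taken up by the region. Here is a rough numpy sketch of that, together with a cross-check using Girard's theorem (treating the "and" region as a spherical triangle whose angles are pi minus the angles between the inward unit normals of the three half-spaces); both of these are my own additions, so treat them as a sketch rather than a worked-out answer:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 2_000_000

# Standard Gaussian samples are rotationally symmetric, so for scale-invariant
# conditions the hit fraction equals the fraction of the unit sphere's area.
x, y, z = rng.standard_normal(size=(3, N))
and_region = (x + y + z > 0) & (x + z < 0) & (y + z < 0)
print("Monte Carlo estimate of the spherical fraction:", and_region.mean())

# Cross-check via Girard's theorem: the region is a spherical triangle whose
# angles are pi minus the angles between the inward unit normals of the three
# half-spaces, and its area (solid angle) is the angle sum minus pi.
normals = np.array([[1.0, 1.0, 1.0],     # x+y+z > 0
                    [-1.0, 0.0, -1.0],   # x+z < 0, rewritten as (-x-z) > 0
                    [0.0, -1.0, -1.0]])  # y+z < 0, rewritten as (-y-z) > 0
normals /= np.linalg.norm(normals, axis=1, keepdims=True)
pairs = [(0, 1), (0, 2), (1, 2)]
angles = [np.pi - np.arccos(np.clip(normals[i] @ normals[j], -1, 1)) for i, j in pairs]
solid_angle = sum(angles) - np.pi
print("Girard's theorem, fraction of sphere:", solid_angle / (4 * np.pi))
```

If I haven't slipped up, the two numbers should agree, and the Gaussian-sampling part generalizes directly to more parameters, where formulas for spherical simplices would get more complicated.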
It looks like Chris Mingard's reply has nice results which say much of what I think one would want from this direction? Well, it is less "enumerate them specifically" and more "for functions which have a given proportion of outputs being 1", but still. (Also, I haven't read it, just looked at it briefly.)
I don't know what particular description language you would want to use for this. I feel like this is such a small case that small differences in choice of description language might overwhelm any difference in complexity that these would have within the given description language?
nitpick : the appendix says possible configurations of the whole grid, while it should say possible configurations. (Similarly for what it says about the number of possible configurations in the region that can be specified.)
This comment I'm writing is mostly because this prompted me to attempt to see how feasible it would be to computationally enumerate the conditions on the weights of small networks (like the 2 input, 2 hidden layer, 1 output one) for implementing each of the possible functions. So, I looked at the second smallest case by hand, and enumerated conditions on the weights for a 2 input, 1 output, no-hidden-layer perceptron to implement each of the 2 input gates, and wanted to talk about it. This did not result in any insights, so if that doesn't sound interesting, maybe skip reading the rest of this comment. I am willing to delete this comment if anyone would prefer I do that.
Of the 16 2-input-1-output gates, 2 of them, xor and xnor, can't be done with a perceptron with no hidden layer (as is well known). For 8 of them, the conditions on the 2 weights and the bias for the function to be implemented can be expressed as an intersection of 3 half spaces. The remaining 6 can of course be expressed with an intersection of 4, the maximum number that could be required: for each specific input and output, the condition on the weights and bias in order to have that input give that output is specified by a half space, so specifying the half space for each input is always enough.
The ones that require 4 are: the constant 0 function, the constant 1 function, return the first input, return the second input, return the negation of the first input, and return the negation of the second input.
These seem, surprisingly, among the simplest possible behaviors. They are the ones which disregard at least one input. It seems a little surprising to me that these would be the ones that require an intersection of 4 half spaces.
I haven't computed the proportions of the space taken up by each region, so maybe the ones that require 4 half spaces aren't particularly smaller. And I suppose with this few inputs, it may be hard to say that any of these functions are really substantially simpler than any of the rest of them. Or it may be that the tendency for simpler functions to occupy more space only shows up when we actually have hidden layers and/or many more nodes.
Here is a table (x and y are the weights from a and b to the output, and z is the bias on the output). Each row lists the outputs for the inputs (a,b) = 00, 01, 10, 11, the gate this corresponds to, and the conditions on the weights for that function to be computed:
0000 (i.e. the constant 0) z<0, x+y+z<0, x+z<0, y+z<0
0001 (i.e. the and gate) x+y+z>0, x+z<0, y+z<0
0010 (i.e. a and not b) z<0, x+y+z<0, x+z>0
0011 (i.e. if input a) z<0, x+y+z>0, x+z>0, y+z<0
0100 (i.e. b and not a) z<0, x+y+z<0, y+z>0
0101 (i.e. if input b) z<0, x+y+z>0, x+z<0, y+z>0
0110 (i.e. xor) impossible
0111 (i.e. or) z<0, x+z>0, y+z>0
1000 (i.e. nor) z>0, x+z<0, y+z<0
1001 (i.e. xnor) impossible
1010 (i.e. not b) z>0, x+y+z<0, x+z>0, y+z<0
1011 (i.e. b->a ) z>0, x+y+z>0, x+z<0
1100 (i.e. not a) z>0, x+y+z<0, x+z<0, y+z>0
1101 (i.e. a->b ) z>0, x+y+z>0, y+z<0
1110 (i.e. nand ) x+y+z<0, x+z>0, y+z>0
1111 (i.e. the constant 1 function) z>0, x+z>0, y+z>0, x+y+z>0
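As a sanity check on this table (and a crude answer to my earlier question about proportions), here is a small brute-force sketch, again my own: sample random weights and bias, read off the truth table the perceptron computes, and tally. The codes for xor (0110) and xnor (1001) should never appear, and the tally gives Monte Carlo estimates of the fraction of parameter space each gate occupies.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)
N = 1_000_000
x, y, z = rng.uniform(-1.0, 1.0, size=(3, N))   # weights from a and b, and the bias

# Output bit for each input (a, b), taken in the order 00, 01, 10, 11.
bits = [(a * x + b * y + z > 0).astype(int) for a in (0, 1) for b in (0, 1)]
# Pack each sample's truth table into an integer 0..15 (first bit is the 00 output).
codes = bits[0] * 8 + bits[1] * 4 + bits[2] * 2 + bits[3]

counts = Counter(codes.tolist())
for code in range(16):
    print(format(code, "04b"), f"{counts.get(code, 0) / N:.4f}")
print("xor (0110) count:", counts.get(6, 0), "  xnor (1001) count:", counts.get(9, 0))
```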
The link in the rss feed entry for this at https://agentfoundations.org/rss goes to https://www.alignmentforum.org/events/vvPYYTscRXFBvdkXe/ai-safety-beginners-meetup which is a broken link (though, easily fixed by replacing "events" with "posts" in the url) .
[edit: it appears that it is no longer in the rss feed? It showed up in my rss feed reader.]
I think this has also happened with other "event" type posts in the rss feed before, but I may be remembering wrong.
I suspect this is some bug in how the rss feed is generated, but possibly it is a known bug which just hasn't been deemed important enough to fix yet.
I assume that when the event is updated, the additional information will include how to join the meetup?
I am interested in attending.
The agent/thinker are limited in the time or computational resources available to them, while the predictor is unlimited.
My understanding is that this is generally the situation which is meant. Well, not necessarily unlimited, just with enough resources to predict the behavior of the agent.
I don't see why you call this situation uninteresting.
That something can be modeled using some Turing machine, doesn't imply that it can be any Turing machine.
If I have some simple physical system, such that I can predict how it will behave, well, it can be modeled by a Turing machine, but me being able to predict it doesn't imply that I've solved the halting problem.
A realistic conception of agents in an environment doesn't involve all agents having unlimited compute at every time-step. An agent cannot prevent the universe from continuing simply by getting stuck in a loop and never producing its output for its next action.
Ah, thank you, I see where I misunderstood now. And upon re-reading, I see that it was because I was much too careless in reading the post, to the point that I should apologize. Sorry.
I was thinking that the agents were no longer being trained, already being optimal players, and so I didn't think the judge would need to take into account how their choice would influence future answers. This reading clearly doesn't match what you wrote, at least past the very first part.
If the debaters are still being trained, or the judge can be convinced that the debaters are still being trained, then I can definitely see the case for a debater arguing "This information is more useful, and because we are still being trained, it is to your benefit to choose the more useful information, so that we will provide the more useful information in the future".
I guess that suggests that the environments in which the judge confidently believes (and can't be convinced otherwise) that the debaters are, or aren't, still being trained are substantially different. So, if training produces the policy that is optimal in the environment in which it is trained, then after training was done it would likely still do the "ignoring the question" thing, even if that is no longer optimal once training is over (when the judge knows that the debaters aren't being trained).
I am unsure as to what the judge's incentive is to select the answer that was more useful, given that they still have access to both answers? Is it just that the judge will want to be the kind of agent the debaters would expect to select the useful answer, so that the debaters will provide useful answers, and therefore will choose the useful answer?
If that's the reason, I don't think you would need a committed deontologist to get them to choose a correct answer over a useful answer, you could instead just pick someone who doesn't think very hard about certain things / that doesn't see their choice of actions as being a choice of what kind of agent to be / someone who doesn't realize why one-boxing makes sense.
(Actually, this seems to me kind of similar to a variant of transparent Newcomb's problem, with the difference being that the million dollar box isn't even present if it is expected that they would two-box if it were present, and the thousand dollar box has only a trivial reward in it instead of a thousand dollars. One-boxing in this would be choosing the very-useful-but-not-an-answer answer, while two-boxing would be picking the answer that seems correct, and also using whatever useful info is in both answers.)
I suspect I'm just misunderstanding something.
This reminds me of the "Converse Lawvere Problem" at https://www.alignmentforum.org/posts/5bd75cc58225bf06703753b9/the-ubiquitous-converse-lawvere-problem a little bit, except that the different functions in the codomain have domain which also has other parts to it aside from the main space .
As in, it looks like here, we have a space of values , which includes things such as "likes to eat meat" or "values industriousness" or whatever, where this part can just be handled as some generic nice space , as one part of a product, and as the other part of the product has functions from to .
That is, it seems like this would be like, .
Which isn't quite the same thing as is described in the converse Lawvere problem posts, but it seems similar to me? (for one thing, the converse Lawvere problem wasn't looking for homeomorphisms from X to the space of functions from X to functions to [0,1] , just a surjective continuous function).
Of course, it is only like that if we are supposing that the space we are considering, , has to have all combinations of "other parts of values" with "opinions on the relative merit of different possible values". Of course if we just want some space of possible values, and where each value has an opinion of each value, then that's just a continuous function from a product of the space with itself, which isn't any problem.
I guess this is maybe more what you meant? Or at least, something that you determined was sufficient to begin with when looking at the topic? (and I guess most more complicated versions would be a special case of it?)
Oh, if you require that the "opinion on other values" decomposes nicely in ways that make sense (like, if it depends separately on the desirability of the base-level values, and the values about values, and the values about values about values, etc., and just has a score for each which is then combined in some way, rather than evaluating specifically the combinations of those), then maybe that would make the space nicer than the first thing I described (which I don't know whether it exists), in a way that might make it more likely to exist.
Actually, yeah, I'm confident that it would exist that way.
Let
And let
And then let ,
and for define
which seems like it would be well-defined to me. Though whether it can capture all that you want to capture about how values can be is another question, and quite possibly it can't.
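A sketch of the kind of levelwise construction I have in mind (the specific spaces and the 2^{-(n+1)} weights here are just one guess at how to fill in the details, and I'm ignoring the point-set-topology questions about which topologies make the evaluation maps continuous):

$$V_0 = A, \qquad V_{n+1} = V_n \times C(V_n, [0,1]), \qquad V = A \times \prod_{n \ge 0} C(V_n, [0,1]),$$

and, for $v = (a, f_0, f_1, \dots)$ and $w = (b, g_0, g_1, \dots)$ in $V$,

$$\mathrm{opinion}(v, w) = \sum_{n \ge 0} 2^{-(n+1)} f_n\!\left(w|_n\right), \qquad \text{where } w|_0 = b \text{ and } w|_{n+1} = (w|_n, g_n) \in V_{n+1}.$$

Each $V_n$ is defined before it is used, and the weights sum to 1, so the combined score lands in $[0,1]$; whether this matches the decomposition you have in mind is of course another question.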
Thanks! (The way you phrased the conclusion is also much clearer/cleaner than how I phrased it)
I am trying to check that I am understanding this correctly by applying it, though probably not in a very meaningful way:
Am I right in reasoning that, for , that iff ( (C can ensure S), and (every element of S is a result of a combination of a possible configuration of the environment of C with a possible configuration of the agent for C, such that the agent configuration is one that ensures S regardless of the environment configuration)) ?
So, if S = {a,b,c,d} , then
would have , but, say
would have , because , while S can be ensured, there isn't, for every outcome in S, an option which ensures S and which is compatible with that outcome ?
There are a few places where I believe you mean to write a but instead have instead. For example, in the line above the "Applicability" heading.
I like this.
As an example, I think the game "both players win if they choose the same option, and lose if they pick different options" has "the two players pick different options, and lose" as one of its feasible outcomes, and it is not on the Pareto frontier, because if they had picked the same thing, they would both have won, and that would be a Pareto improvement.
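As a toy check of that example (my own sketch, with payoff 1 standing in for "win" and 0 for "lose"), here is the feasible set and its Pareto frontier for that coordination game:

```python
from itertools import product

# Coordination game: both win (payoff 1) if they pick the same option, both lose (0) otherwise.
options = ["A", "B"]

def payoffs(choice1, choice2):
    return (1, 1) if choice1 == choice2 else (0, 0)

feasible = {payoffs(c1, c2) for c1, c2 in product(options, options)}

def pareto_frontier(outcomes):
    # An outcome is on the frontier if no other outcome is at least as good for
    # both players and different (i.e. strictly better for at least one).
    return {o for o in outcomes
            if not any(all(b >= a for a, b in zip(o, other)) and other != o
                       for other in outcomes)}

print("feasible outcomes:", feasible)                   # {(1, 1), (0, 0)}
print("Pareto frontier:  ", pareto_frontier(feasible))  # {(1, 1)}
```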
What came to mind for me before reading the spoiler-ed options, was a variation on #2, with the difference being that, instead of trying to extract P's hypothesis about B, we instead modify T to get a T' which has P replaced with a P' which is a paperclip minimizer instead of maximizer, and then run both, and only use the output when the two agree, or if they give probabilities, use the average, or whatever.
Perhaps this could have an advantage over #2 if it is easier to negate what P is optimizing for than to extract P's model of B. (edit: though, of course, if extracting the model from P is feasible, that would be better than the scheme I described)
On the other hand, maybe this could still be dangerous, if P and P' have shared instrumental goals with regards to your predictions for B?
Though, if P has a good model of you, A, then presumably, if you were to do this, both P and P' would expect you to do this, and so I don't know what it would make sense for them to do?
It seems like they would both expect that, while they may be able to influence you, insofar as that influence would affect the expected number of paperclips, it would be canceled out by the other's influence (assuming that the ability to influence the number of paperclips via changing your prediction of B is symmetric, which, I guess, it might not be...).
I suppose this would be a reason why P would want its thought processes to be inscrutable to those simulating it, so that the simulators are unable to construct P' .
__
As a variation on #4, if P is running on a computer in a physics simulation in T, then almost certainly a direct emulation of that computer running P would run faster than T does, and therefore whatever model of B that P has, can be computed faster than T can be. What if, upon discovering this fact about T, we restrict the search among Turing machines to only include machines that run faster than T?
This would include emulations of P, and would therefore include emulations of P's model of B (which would probably be even faster than emulating P?), but I imagine that a description of an emulation of P without the physics simulation and such would have a longer description than a description of just P's model of B. But maybe it wouldn't.