(I don't think I can explain why here, though I am working on a longer explanation of what framings I like and why.)
Cheers, that would be very useful.
(I do think ontological shifts continue to be relevant to my description of the problem, but I've never been convinced that we should be particularly worried about ontological shifts, except inasmuch as they are one type of possible inner alignment / robustness failure.)
I feel that the whole AI alignment problem can be seen as problems with ontological shifts: https://www.lesswrong.com/posts/k54rgSg7GcjtXnMHX/model-splintering-moving-from-one-imperfect-model-to-another-1
Thanks ^_^
I like that way of seeing it.
Express, express away ^_^
Enjoyed writing it, too.
Because a reputation for following through on brinksmanship threats means that people won't enter into deals with you at all; extortion works because, to some extent, people have to "deal" with you even if they don't want to.
This is why I saw a Walmart-monopsony (monopolistic buyer) as closer to extortion, since not trading with them is not an option.
Kiitos!
Thanks!
Thanks!
I think of it this way: investigating a supplier to check they are reasonable costs Walmart $1. The minimum price any supplier will offer is $10. After investigating, one supplier offers $10.5. Walmart refuses, knowing the supplier will not go lower, and publicises the exchange.
The reason this is extortion, at least in the sense of this post, is that Walmart takes a cost (it will cost them at least $11 to investigate and hire another supplier) in order to build a reputation.
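To make the payoff structure explicit, here's a toy calculation in code (the $1/$10/$10.5 figures are from the example above; the number of future suppliers is an assumption added purely for illustration):

```python
# Toy payoff sketch for the Walmart example above.
# All numbers except investigate_cost, min_price and supplier_offer are
# illustrative assumptions, not anything from the original discussion.

investigate_cost = 1.0    # cost of vetting one supplier
min_price = 10.0          # lowest price any supplier will accept
supplier_offer = 10.5     # the offer Walmart turns down

# One-shot view: refusing is strictly worse today.
accept_now = supplier_offer                       # pay 10.5
refuse_and_rehire = investigate_cost + min_price  # at least 11 to vet and hire another

print(accept_now, refuse_and_rehire)  # 10.5 vs >= 11.0: refusal loses money now

# Reputation view: if the publicised refusal pushes future suppliers to
# offer the minimum price, the up-front loss can pay for itself.
future_suppliers = 20                          # assumed number of future deals
saving_per_deal = supplier_offer - min_price   # 0.5 saved per future deal

reputation_value = future_suppliers * saving_per_deal
print(reputation_value > (refuse_and_rehire - accept_now))  # True: 10.0 > 0.5
```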
The connection to AI alignment is combining the different utilities of different entities without extortion ruining the combination, and dealing with threats and acausal trade.
I think the distinction is, from the point of view of the extortioner: "would it be in my interests to try and extort someone, even if I know for a fact that they cannot be extorted and would force me to act on my threat, to my own detriment in that situation?"
If the answer is yes, then it's extortion (in the meaning of this post). Trying to extort the un-extortable, then acting on the threat, makes sense as a warning to others.
That's a misspelling that's entirely my fault, and has now been corrected.
(1) You say that releasing nude photos is in the blackmail category. But who's the audience?
The other people of whom you have nude photos, who are now incentivised to pay up rather than kick up a fuss.
(2) For n=1, m large: Is an example of brinkmanship here a monopolistic buyer who will only choose suppliers giving cut-rate prices?
Interesting example that I hadn't really considered. I'd say it fits more under extortion than brinksmanship, though. A small supplier has to sell, or they won't stay in business. If there's a single buyer, "I won't buy from you" is the same as "I will ruin you". Abstracting away the property rights (Walmart is definitely legally allowed to do this), this seems very much like extortion.
"within the limits of their intelligence" can mean anything, excuse any error, bias, and failure. Thus, they are not rational, and (form one perspective) very very far from it.
Some people (me included) value a certain level of non-manipulation. I'm trying to cash out that instinct. And it's also needed for some ideas like corrigibility. Manipulation also combines poorly with value learning, see eg our paper here https://arxiv.org/abs/2004.13654
I do agree that saving the world is a clearly positive case of that ^_^
I have an article on "Anthropic decision theory", with the video version here.
Basically, it's not that the presumptuous philosopher is more likely to be right in a given universe, it's that there are far more presumptuous philosophers in the large universe. So if we count "how many presumptuous philosophers are correct", we get a different answer to "in how many universes is the presumptuous philosopher correct". These things only come apart in anthropic situations.
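As a toy illustration of the difference between the two counts (the population figures are made up for this sketch):

```python
# Toy model: two equally likely universes, one small and one large.
# The population numbers are illustrative assumptions.
p_small, p_large = 0.5, 0.5
philosophers_small = 1
philosophers_large = 1_000_000

# The presumptuous philosopher always bets "I'm in the large universe".

# Per-universe scoring: in what fraction of universes is the bet correct?
per_universe_correct = p_large  # 0.5 -- no better than chance

# Per-philosopher scoring: what fraction of (expected) philosophers bet correctly?
expected_philosophers = p_small * philosophers_small + p_large * philosophers_large
expected_correct = p_large * philosophers_large
per_philosopher_correct = expected_correct / expected_philosophers

print(per_universe_correct)     # 0.5
print(per_philosopher_correct)  # ~0.999999 -- almost all philosophers are right
```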
Stuart, by " is complex" are you referring to...
I mean that defining can be done in many different ways, and hence has a lot of contingent structure. In contrast, in , the $\rho$ is a complex distribution on , conditional on ; hence itself is trivial and just encodes "apply to and in the obvious way".
This is a link to "An Increasingly Manipulative Newsfeed" about potential social media manipulation incentives (eg Facebook).
I'm putting the link here because I keep losing the original post (since it wasn't published by me, but I co-wrote it).
A boundedly-rational agent is assumed to be mostly rational, failing to be fully rational because of a failure to figure things out in enough detail.
Humans are occasionally rational, often biased, often inconsistent, sometimes consciously act against their best interests, often follow heuristics without thinking, sometimes do think things through. This doesn't seem to correspond to what is normally understood as "boundedly-rational".
that paper was about fitting observations of humans to a mathematical model of "boundedly-rational agent pursuing a utility function"
It was "any sort of agent pursuing a reward function".
We don't need a special module to get an everyday definition of doorknobs, and likewise I don't think we need a special module to get an everyday definition of human motivation.
I disagree. Doorknobs exist in the world (even if the category is loosely defined, and has lots of edge cases), whereas goals/motivations are interpretations that we put upon agents. The main result of the Occam's razor paper is that the goals of an agent are not something that you can know without putting your own interpretation on them - even if you know every physical fact about the universe. And two very different interpretations can be equally valid, with no way of distinguishing between them.
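A minimal sketch of that underdetermination result, with a made-up two-action environment: a "rational" planner paired with reward R and an "anti-rational" planner paired with -R produce exactly the same behaviour, so behaviour alone can't pick between them.

```python
# Toy illustration (all details assumed for this sketch): the same observed
# policy can be decomposed as "rational planner + reward R" or as
# "anti-rational planner + reward -R".

actions = ["left", "right"]
reward = {"left": 0.0, "right": 1.0}

def rational(reward):
    # picks the action with the highest reward
    return max(actions, key=lambda a: reward[a])

def anti_rational(reward):
    # picks the action with the lowest reward
    return min(actions, key=lambda a: reward[a])

neg_reward = {a: -r for a, r in reward.items()}

policy_1 = rational(reward)           # "right": agent wants reward and gets it
policy_2 = anti_rational(neg_reward)  # "right": agent hates -reward and fails at that

print(policy_1 == policy_2)  # True -- identical behaviour, opposite "goals"
```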
(I like the anthropomorphising/dehumanising symmetry, but I'm focusing on the aspects of dehumanising that cause you to make errors of interpretation. For example, out-groups are perceived as being coherent, acting in concert without disagreements, and often being explicitly evil. This is an error, not just a reduction in social emotions)
For instance throughout history people have been able to model and interact with traders from neighbouring or distant civilizations, even though they might think very differently.
Humans think very very similarly to each other, compared with random minds from the space of possible minds. For example, we recognise anger, aggression, fear, and so on, and share a lot of cultural universals https://en.wikipedia.org/wiki/Cultural_universal
There haven’t been as many big accomplishments.
I think we should look at the demand side, not the supply side. We are producing lots of technological innovations, but there aren't so many major problems left for them to solve. The flush toilet was revolutionary; a super-flush ecological toilet with integrated sensors that can transform into a table... is much more advanced from the supply side, but barely more from the demand side: it doesn't fulfil many more needs than the standard flush toilet.
Cool, good summary.
Humans have a theory of mind that makes certain types of modularization easier. That doesn't mean that the same modularization is simple for an agent that doesn't share that theory of mind.
Then again, it might be. This is worth digging into empirically. See my post on the optimistic and pessimistic scenarios; in the optimistic scenario, preferences, human theory of mind, and all the other elements are easy to deduce (there's an informal equivalence result: if one of those is easy to deduce, all the others are).
So we need to figure out if we're in the optimistic or the pessimistic scenario.
My understanding of the OP was that there is a robot [...]
That understanding is correct.
Then my question was: what if none of the variables, functions, etc. corresponds to "preferences"? What if "preferences" is a way that we try to interpret the robot, but not a natural subsystem or abstraction or function or anything else that would be useful for the robot's programmer?
I agree that preferences is a way we try to interpret the robot (and how we humans try to interpret each other). The programmer themselves could label the variables; but it's also possible that another labelling would be clearer or more useful for our purposes. It might be a "natural" abstraction, once we've put some effort into defining what preferences "naturally" are.
but "white box" is any source code that produces the same input-output behavior
What that section is saying is that there are multiple white boxes that produce the same black box behaviour (hence we cannot read the white box simply from the black box).
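As a toy illustration (the thermostat functions are invented for this sketch): two different white boxes with identical black-box behaviour, where only one of them has anything in its source explicitly labelled as a preference.

```python
# Two different "white boxes" with identical "black box" (input-output) behaviour.

def thermostat_v1(temp: float) -> str:
    # explicit "preference" variable in the source
    preferred_temp = 20.0
    return "heat" if temp < preferred_temp else "idle"

def thermostat_v2(temp: float) -> str:
    # same behaviour, but nothing in the source labelled as a preference
    return "idle" if temp >= 20.0 else "heat"

print(all(thermostat_v1(t) == thermostat_v2(t) for t in [15.0, 20.0, 25.0]))  # True
```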
modularization is super helpful for simplifying things.
The best modularization for simplification will not likely correspond to the best modularization for distinguishing preferences from other parts of the agent's algorithm (that's the "Occam's razor" result).
but the function f is not part of the algorithm, it's only implemented by us onlookers. Right?
Then isn't that just a model at another level, a (labelled) model in the heads of the onlookers?
Thanks! Useful insights in your post, to mull over.
An imminent incoming post on this very issue ^_^
Yes, things like honour and anger serve important signalling and game-theoretic functions. But they also come to be valued intrinsically (the same way people like sex, rather than just wanting to spread their genes), and strongly valued. This makes it hard to agree that "oh, your sacred core value is only in the service of this hidden objective, so we can focus on that instead".
Cool, neat summary.
Sorry, had a terrible few days, and missed your message. How about Friday, 12pm UK time?
Stuart, I'm writing a review of all the work done on corrigibility. Would you mind if I asked you some questions on your contributions?
No prob. Email or Zoom/Hangouts/Skype?
Very good. A lot of potential there, I feel.
The information to distinguish between these interpretations is not within the request to travel west.
Yes, but I'd argue that most moral preferences are similarly underdefined when the various interpretations behind them come apart (eg purity).
There are computer programs that can print their own code: https://en.wikipedia.org/wiki/Quine_(computing)
There are also programs which can print their own code and add something to it. Isn't that a way in which the program fully knows itself?
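For concreteness, here's a minimal Python sketch of both (not taken from the linked page): a quine, and a variant that prints its own source plus an extra line.

```python
# A quine: running this file prints its own source code exactly.
s = 's = %r\nprint(s %% s)'
print(s % s)
```

```python
# A variant that prints its own source code, then one extra line.
s = 's = %r\nprint(s %% s)\nprint("extra line")'
print(s % s)
print("extra line")
```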
Thanks! It's cool to see his approach.
Wiles proved the presence of a very rigid structure - not the absence - and the presence of this structure implied FLT via the work of other mathematicians.
If you say that "Wiles proved the Taniyama–Shimura conjecture" (for semistable elliptic curves), then I agree: he's proved a very important structural result in mathematics.
If you say he proved Fermat's last theorem, then I'd say he's proved an important (but heuristically probable) lack of structure in mathematics.
So yeah, he proved the existence of structure in one area, and (hence) the absence of structure in another area.
And "to prove Fermat's last theorem, you have to go via proving the Taniyama–Shimura conjecture", is, to my mind, strong evidence for "proving lack of structure is hard".
You can see this as sampling times sorta-independently, or as sampling times with less independence (ie most sums are sampled twice).
Either view works, and as you said, it doesn't change the outcome.
Yes, I got that result too. The problem is that the prime number theorem isn't a very good approximation for small numbers. So we'd need a slightly more sophisticated model that has more low numbers.
I suspect that moving from "sampling with replacement" to "sampling without replacement" might be enough for low numbers, though.
Note that the probabilistic argument fails for n=3 for Fermat's last theorem; call this (3,2) (power=3, number of summands is 2).
So we know (3,2) is impossible; Euler's conjecture is the equivalent of saying that (n+1,n) is also impossible for all n. However, the probabilistic argument fails for (n+1,n) the same way as it fails for (3,2). So we'd expect Euler's conjecture to fail, on probabilistic grounds.
In fact, the surprising thing on probabilistic grounds is that Fermat's last theorem is true for n=3.
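For readers who want the heuristic spelled out (this is a sketch of the standard density argument, not anything from the thread): treat "being a perfect $k$-th power" as a pseudo-random event of probability about $N^{1/k-1}$ for a number of size $N$. Then the expected number of solutions to $x_1^k + \dots + x_m^k = z^k$ with $x_i \le X$ is roughly
\[
\sum_{x_1,\dots,x_m \le X} \left(x_1^k + \dots + x_m^k\right)^{\frac{1}{k}-1} \;\sim\; X^m \cdot X^{k\left(\frac{1}{k}-1\right)} \;=\; X^{\,m+1-k}.
\]
For Fermat ($m=2$, $k=n$) this is $X^{3-n}$: it decays for $n > 3$, but $n = 3$ is the borderline case where the count diverges (logarithmically) over scales, so the heuristic "expects" solutions - that's the $(3,2)$ case above. For Euler's conjecture ($m=n$, $k=n+1$) the exponent $m+1-k$ is again $0$, the same borderline, so the heuristic likewise expects it to fail.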
Good, cheers!
Another key reason for time-inconsistent preferences: bounded rationality.
Why do the absolute values cancel?
Because , so you can remove the absolute values.
Cheers, interesting read.
I also think the pedestrian example illustrates why we need more semantic structure: "pedestrian alive" -> "pedestrian dead" is bad, but "pigeon on road" -> "pigeon in flight" is fine.
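As a toy sketch of what "more semantic structure" might look like (the categories and weights below are invented for illustration, not a proposal from the linked post):

```python
# Toy sketch: an impact penalty that only "counts" state changes after they
# are mapped through a semantic layer. Categories and weights are assumptions.

SEMANTIC_WEIGHT = {
    "human_harmed": 1_000_000.0,  # "pedestrian alive" -> "pedestrian dead"
    "animal_moved": 0.0,          # "pigeon on road" -> "pigeon in flight"
    "object_moved": 0.1,
}

def impact_penalty(transitions):
    """Sum of the weights of the semantic categories of each state change."""
    return sum(SEMANTIC_WEIGHT[category] for category in transitions)

print(impact_penalty(["animal_moved"]))  # 0.0 -- scaring a pigeon is fine
print(impact_penalty(["human_harmed"]))  # 1000000.0 -- hitting a pedestrian is not
```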
Nope! Part of my own research has made me more optimistic about the possibilities of understanding and creating intelligence.