The r/achipelago subreddit is quite small but exists for hobbyists to share designs for alternative political systems and to consider the effects the alternatives would have. Most of what's there right now is about electoral systems rather than full institutional structures. Some posts include links to resources, such as one of my favorites, The Electoral System Design Handbook, which describes case studies of several countries and the typical good and bad effects of different design decisions.
"Ideal governance" depends on what ideals you're aiming for, of course. There have been proposed improvements to futarchy such as this one, which picks utilitarianism as its explicit ideal. An explicitly virtue theorist option could be to modernize Plato's Republic instead. Granted, these are extreme examples. For more sober-minded investigation into ideal governance, you'd of course want to start with criteria that are well-defined and pragmatic, rather than broad philosophical or ideological traditions.
Regarding the title problem,
I have historically been too hasty to go from “other people seem very wrong on this topic” to “I am right on this topic”
I think it's helpful here to switch from binary wrong/right language to continuous language. We can talk of degrees of wrongness and rightness.
Consider people who are smarter than those they usually argue with, in the specific sense of "smarter" where we mean they produce more-correct, better-informed, or more-logical arguments and objections. These people probably have some (binarily) wrong ideas. The people they usually argue with, however, are likely to be (by degrees) wronger.
When the other people are wronger, the smart person is in fact righter. So I think that, insofar as you were thinking in terms of degrees of wrongness and rightness, it would be perfectly fair for you to have had the sense you did. It wouldn't have been a hasty generalization. And if you stopped to consider whether there might exist views that are even righter still, you'd probably conclude there are.
Yes. Page 287 of the paper affirms your interpretation: "REMORSE does not exploit suckers, i.e. AllC players, whereas PAVLOV does."
The OP has a mistake:
Remorse is more aggressive; unlike cTFT, it can attack cooperators
Neither Remorse nor cTFT will attack cooperators.
If Pavlov accidentally defects against TFTWF, the result is
D/C -> D/C -> D/D -> C/D -> D/D -> C/C,
Can you explain this sequence? I'm puzzled by it as it doesn't follow the definitions that I know about. My understanding of TFTWF is that it is "Tit for Tat with a small randomised possibility of forgiving a defaulter by cooperating anyway." What seems to be happening in the above sequence is Pavlov on the left and, on the right, TFT with a delay of 1.
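To check my own reading, here's a minimal sketch (assumptions of mine: Pavlov as win-stay/lose-shift, and a right-hand player who simply echoes Pavlov's move from two rounds earlier, i.e. TFT with a delay of 1 rather than the randomized-forgiveness TFTWF I described) showing that those two strategies reproduce the quoted sequence exactly:

```python
# Sketch: Pavlov (win-stay/lose-shift) vs. "TFT with a delay of 1",
# i.e. a player who repeats Pavlov's move from two rounds ago.
# Both strategy definitions are my assumptions for reproducing the sequence.

def play(rounds=6):
    history = []                  # list of (pavlov_move, delayed_tft_move)
    for t in range(rounds):
        if t == 0:
            pavlov = "D"          # round 1: Pavlov's accidental defection
        else:
            p_prev, o_prev = history[-1]
            # Pavlov (win-stay/lose-shift) is equivalent to:
            # cooperate iff both players made the same move last round.
            pavlov = "C" if p_prev == o_prev else "D"
        # Delayed TFT: copy Pavlov's move from two rounds back (cooperate until then).
        delayed = history[-2][0] if len(history) >= 2 else "C"
        history.append((pavlov, delayed))
    return history

print(" -> ".join(f"{p}/{o}" for p, o in play()))
# prints: D/C -> D/C -> D/D -> C/D -> D/D -> C/C
```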
in Critch's framework, agents bet their voting stake rather than money! The more bets you win, the more control you have over the system; the more bets you lose, the less your preferences will be taken into account.
If I may be a one-note piano (as it's all I've talked about lately on LW), this sounds extremely similar to the "ophelimist democracy" I was pushing. I've since streamlined the design and will try to publish a functional tool for it online next year, and then aim to get some small organizations to test it out.
In brief, the goal was to design a voting system with a feedback loop to keep it utilitarian in the sense you've discussed above, but also utilitarian in the Bentham/Sidgwick sense. So that you don't have to read the linked blog post, the basic steps in voting in an organization run on "ophelimist democracy" are as follows, with the parts that sound like Critch's framework in italics (a toy sketch of step 6's selection rule follows the list):
1. Propose goals.
2. Vote on the goals/values and use the results to determine the relative value of each goal.
3. Propose policy ideas.
4. Bet on how well each policy will satisfy each goal.
5. Every bet is also automatically turned into a vote. (This is used to evade collusive betting.)
6. The policy with highest vote total * weighted bet value is enacted.
7. People are periodically polled regarding how satisfied they are with the goals, and the results are used to properly assign weights to people's bets.
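To make step 6 concrete, here's a toy sketch of the selection rule; the data layout and the bettor-weighting scheme are placeholders of mine, not the actual Ophelimo mechanics.

```python
# Toy sketch of step 6: enact the policy with the highest
# (vote total) * (weighted bet value). The data layout and the
# bettor-weighting scheme are illustrative assumptions only.

from dataclasses import dataclass, field

@dataclass
class Policy:
    name: str
    votes: int = 0
    bets: list = field(default_factory=list)  # (bettor_weight, predicted_goal_satisfaction)

    def weighted_bet_value(self) -> float:
        """Average predicted goal satisfaction, weighted by each bettor's track record."""
        total_weight = sum(w for w, _ in self.bets)
        if total_weight == 0:
            return 0.0
        return sum(w * p for w, p in self.bets) / total_weight

def enact(policies):
    """Step 6: the policy with the highest votes * weighted bet value is enacted."""
    return max(policies, key=lambda pol: pol.votes * pol.weighted_bet_value())

# Example usage with made-up numbers:
a = Policy("A", votes=40, bets=[(1.0, 0.6), (2.0, 0.7)])
b = Policy("B", votes=55, bets=[(0.5, 0.4), (1.5, 0.5)])
print(enact([a, b]).name)  # "A"
```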
score voting is immune to the Gibbard-Satterthwaite theorem
I was basing this off the description in Wikipedia; please correct that entry if you think I was in error. As of this time it still explicitly states, "While the scope of this theorem is limited to ordinal voting, Gibbard's theorem is more general, in that it deals with processes of collective decision that may not be ordinal: for example, voting systems where voters assign grades to candidates."
any proportional method is subject to free riding strategy. And since this system is designed to be proportional across time as well as seats, free riding strategy would be absolutely pervasive, and I suspect it would take the form of deliberately voting for the craziest possible option.
What does that mean? If being the "craziest possible option" means it gets selected as the most preferred option regardless and has sharply bad outcomes that you secretly knew would happen, then having voted for it, you're strictly worse off in future voting power than if you had voted against it. Alternatively, if it means that very few other voters vote for that option, then that option definitionally isn't going to win, and so there is strictly nothing to gain in future voting power from having voted for it. So on the contrary, honest voting, as a strategy, dominates either interpretation of your suggested variant of a free rider strategy.
it's a nonstarter politically
This is rich irony coming from Jameson Quinn himself. :D In any case, your comment here is certainly appreciated due to your expertise, even though I currently believe the comment was factually in error on both substantive points.
Regarding exploitability by well-funded and well-connected entities - I'm not sure how to tell without an empirical test. My understanding is that research into funding of electoral campaigns doesn't show the funding as having any effect on vote totals. If that is accurate, then I'd expect it's still true under alternate voting methods.
Fully agreed - the intention is to start with small-scale clubs and parts of private organizations that are open to experimenting.
increased voting power given to those whose bills are not passed risks giving undue power to stupid or inhumane voters.
True. Equalizing the influence of all parties (over the long term at least) doesn't just risk giving such people power; it outright does give them power. At the time of the design, I justified it on the grounds that (1) it forces either compromise or power-sharing, (2) I haven't found a good way to technocratically distinguish humane-but-dumb voters from inhumane-but-smart ones, or rightly-reviled inhumane minorities from wrongly-reviled humane minorities, and (3) if a group's interests are excluded, then they have no stake in the system, and so they have reason to fight against the system in a costly way. Do any alternatives come to your mind?
Dishonest forecasting, especially predicting poor results to try to kill a bill, remains tempting, especially for voters with one or two pet issues.
Indeed. I spent a great deal of time and effort investigating this for possible solutions. Haven't found any yet, though. It's the only attack vector that I know for sure would work.
While giving direct power to the people might help avoid much of the associated corruption and wasteful signalling, it risks giving increased weight to people without the requisite knowledge and intelligence to make good policy.
I may have been unduly influenced by my anarchist youth: I'm more worried about the negative effects of concentrating power than about the negative effects of distributing it. Is there, however, any objective way to compare those effects that isn't quite similar to how Ophelimo tries to maximize the public's satisfaction with its own goals?
Thanks for your thoughts. Your questions are quite valid but I'm inclined to punt on them, as you'll see:
For #3, it depends on the group. If a government were to use it, they could provide access via terminals in public libraries, schools, and other government facilities. If a private group were to use it, they'd probably just exclude the poor.
For #4, 6, 7, 8: It's intended for use in any democratic organization for the equivalent of ordinary legislation and bylaws, but not intended to replace their constitutions or founding documents. If there are some laws/bylaws that the group doesn't have authority to make or change (like on citizenship/membership), they would need a separate method of striking those down.
For #5, if the data is lost, they start afresh. They'd lose any prediction scores they'd gained, but if voters can repeat their good predictions, the problem is mitigated, and if they can't repeat their good predictions, they don't deserve their old scores.
I justify "punting" because the app is intended to be customized by many clubs and organizations. It doesn't feel like that's merely handwaving the hard parts, but perhaps it is.
I’m seeking critique of this design. It combines SSC- and LessWrong-influenced thinking about optimization processes and utilitarianism with a long personal history of dabbling in groups that want to reform electoral processes. In my unquestionably rose-tinted judgment (it’s my baby!), I think Ophelimo has much in it that could be desired by everyone from the far right to the far left.
If there’s an error, I want to correct it. (Or to give up on it quickly, if there’s no way to correct it.) If there’s an important criticism or technical limitation to address, that's important, too.
Very-short version: it’s futarchy but based on public satisfaction rather than money, using storable score votes for perfect proportionality.
This means that the organization with the best machine learning algorithm to estimate the bill score gets a lot of political power. ... I would expect a few big organisations to arrise that have heavy machine learning capabilities and that hold the power about bill making.
It's true I omitted the possibility of expending votes at a Vickrey auction level instead of an actual-bid level, so I grant the possibility that, if only one side had good polling data (implausible as that is), then they might buy votes a small fraction more cheaply. However, the "proportional influence" criterion is where most of the power of the system lies: how would one side actually increase its power in the long term, given that the redistribution of spent votes eliminates its advantage after a single bill, which could then be cheaply repealed by the opposition? And since they'd still have to be maximizing the totals from the polling, would successfully gaining strategic advantage over bills have any downsides?
This looks like short-term effects of the bill become more important than it's long-term effects.
Mostly true, but missing two significant refinements: (1) What actually matters is the effects of the entire legal code, to which the most recent bill ought to have made at most a small adjustment. (2) It's the task of the questions to ask about short-term judgments of long-term effects. No system can select for long-term effects that aren't predictable until the long-term has already arrived.
In general the system is likely sufficiently intransparent ...
Entirely fair, though I'd propose that transparency is not a critical difference between a system like this, which can only be fully understood by people who read up on it, and a system like the U.S. election laws, which can't be fully understood even by people who read up on it. The U.S. election laws do have the appearance of simplicity as long as one explains First Past The Post and nothing more.
Er, is that agreement or an objection? It reads like an objection to me, though that could be the lack of body language. But the content agrees with the post, which explicitly states at both the beginning and the end that the system is designed to start very small and then grow.
SIMPLICIO: But then what, on your view, is the better way?
I'm not sure if I'm more Simplicio or more the Visitor, but ... the political side of it doesn't seem that hard to fix. At least, I've been advocating a set of improvements to futarchy that address all the political-structure inadequacies discussed here, as well as several others not discussed here. I know it's too much to hope that it's free of such inadequacies ... though I still hope, because I explicitly designed it with the intent to be free of them. So I decided to write up a description of it suitable for this crowd: Ophelia, a weapon against Moloch.
Since political inadequacy seems to underlie many other types of inadequacy, maybe I should reconsider making an alpha-version Ophelia app. In the meantime, if any of you wish to criticize flaws in the idea, or point me to a better idea, please do.
In the Nature Podcast from January 26th, the author of the paper, Dražen Prelec, said that he developed the hypothesis for this paper by means of some fairly involved math, but that he discovered afterwards he needed only a simple syllogism of a sort that Aristotle would have recognized. Unfortunately, the interviewer didn't ask him what the syllogism was. I spent ~20 minutes googling to satisfy my curiosity, but I found nothing.
If you happen to know what syllogism he meant, I'd be thrilled to hear it. Also it would suit the headline here well.
Anyway, please dissolve my confusion.
I think the most fun and empirical way to dissolve this confusion would be to hold a tourney. Remember the Prisoner's Dilemma competitions that were famously won, not by complex algorithms, but by simple variations on Tit-for-Tat? If somebody can host, the rules would be something like this (with a minimal host sketch after the list):
- Players can submit scripts which take only one input (their current money) and produce only one output (whether to accept the bet again). The host has infinite money since it's just virtual.
- Each script gets run N times where N isn't told to the players in advance. The script with the highest winnings is declared Interesting.
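If someone does host it, the harness needn't be fancy; here's a minimal sketch, where the particular bet offered each round (a double-or-nothing flip at 60% win probability) and the starting stake are placeholder assumptions of mine rather than part of the rules above.

```python
import random

# Toy tournament harness for the rules above. The specific bet offered each
# round (double-or-nothing at 60% win probability) and the starting stake
# are placeholder assumptions, not part of the proposal.

def run_script(script, n_rounds, start_money=100.0, win_prob=0.6, seed=0):
    """Run one submitted script: it sees only its current money and
    returns True to accept the bet again, False to stop."""
    rng = random.Random(seed)
    money = start_money
    for _ in range(n_rounds):            # N is hidden from the players
        if not script(money):
            break
        money = money * 2 if rng.random() < win_prob else 0.0
    return money

# Example submissions:
always = lambda money: True
cautious = lambda money: money < 1_000_000   # stop once "rich enough"

scripts = {"always": always, "cautious": cautious}
N = 50                                       # not told to the players in advance
results = {name: run_script(s, N) for name, s in scripts.items()}
print(max(results, key=results.get), "is declared Interesting:", results)
```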
2) On political opinions, HBD is about an objective, falsifiable, scientifically mesureable characteristic of the world, whereas the other opinions are opinions.
That's not how I interpreted the item. This was a political quiz, so to my mind it's not about "Does X exist?", but "Do you favor more X than the status quo?" For example, liberal abortion laws exist, and feminism exists, and higher taxes exist. Similarly, HBD may exist depending on its precise definition. But what's politically relevant is whether you favor more liberal abortion laws, more feminism, more tax increases, more HBD. So it's quite likely that a substantial fraction of quiz-takers interpreted being "pro-HBD" as being simply an NRXish way of saying "pro-racist-policies".
I'm under the impression that the empirical fact about this is exactly the opposite:
"Within a week to a few months after surgery, the children could match felt objects to their visual counterparts."
i.e. not immediate, but rather requiring the development of experience
Let's take the AI example in a slightly different direction: Consider an AI built as a neural net with many input lines and output effectors, and a few well-chosen reward signals. One of the input lines goes to a Red Detector; the other input lines go to many other types of sensors but none of them distinguish red things from non-red things. This AI then gets named Mary and put into a black and white room to learn about optics, color theory, and machine learning. (Also assume this AI has no ability to alter its own design.)
Speculation: At the moment when this AI Mary steps out of the room into the colorful world, it cannot have any immediate perception of red (or any other color), because its neural net has not yet been trained to make any use of the sensory data corresponding to redness (or any other color). Analogously to how a young child is taught to distinguish a culturally-specific set of colors, or to how an adult can't recognize lapis versus cerulean without practice, our AI cannot so much as distinguish red from blue until adequate training of the neural net has occurred.
If that line of reasoning is correct, then here's the conclusion: Mary does not learn anything new (perceptually) until she learns something new (behaviorally). Paradox dismissed.
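As a toy check of the "not yet trained to make any use of" step, here's a sketch assuming a bare logistic-regression Mary trained by plain gradient descent while the red input is clamped to zero (the black-and-white room): the weight on that input never receives a gradient, so on first exposure the red signal contributes nothing the model has learned to use.

```python
import numpy as np

# Sketch: a linear "Mary" trained while its red-detector input is always 0.
# The gradient w.r.t. that weight is (prediction error) * x_red = 0 on every
# example, so the red weight never moves from its tiny random initial value:
# the model has not been trained to make any use of that sensory line.

rng = np.random.default_rng(0)
n, d = 500, 5                                # feature 0 is the red detector
X = rng.normal(size=(n, d))
X[:, 0] = 0.0                                # black-and-white room: red line silent
y = (X[:, 1] + X[:, 2] > 0).astype(float)    # task defined on the other senses

w = rng.normal(scale=0.01, size=d)
w0_red = w[0]
for _ in range(2000):                        # plain gradient descent, no weight decay
    p = 1 / (1 + np.exp(-X @ w))
    w -= 0.1 * X.T @ (p - y) / n

print("red weight before training:", w0_red, "after:", w[0])       # identical
print("contribution of a full-strength red signal:", w[0] * 1.0)   # negligible, untrained
```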
Hypothetical Independent Co-inventors, we're pretty sure you exist. Compat wouldn't be a very good acausal pact if you didn't. Show yourselves.
I'm one - but while the ideas have enough plausibility to be interesting, they necessarily lack the direct scientific evidence I'd need to feel comfortable taking them seriously. I was religious for too long, and I think my hardware was hopelessly corrupted by that experience. I need direct scientific evidence as an anchor to reality. So now I try to be extra-cautious about avoiding faith of any kind lest I be trapped again in a mental tarpit.
"The father of my mother feels (passively) that that my left ringfinger touches him 2 centimeters in inferior direction from his right earlobe" (At the present he lies on his back, so inferior is not the direction towards the center of the earth).
tê ömmilek audyal íčawëla tê adlaisakenniňk qe oeksrâ’as oimřalik akpʰialîk êntô’alakuňk
There you go. :) It's a very literal translation but it's overly redundant. A hypothetical native speaker would probably drop the "audyal" verb, deframe "íčawëla", and rely more on Ithkuil's extensive case system.
Incidentally, "Dear readers" is "atpëkein".
responded to wrong person
This is probably the wrong place to ask, but I'm confused by one point in the DA.
For reference, here's Wikipedia's current version:
Denoting by N the total number of humans who were ever or will ever be born, the Copernican principle suggests that humans are equally likely (along with the other N − 1 humans) to find themselves at any position n of the total population N, so humans assume that our fractional position f = n/N is uniformly distributed on the interval [0, 1] prior to learning our absolute position.
f is uniformly distributed on (0, 1) even after learning of the absolute position n. That is, for example, there is a 95% chance that f is in the interval (0.05, 1), that is f > 0.05. In other words we could assume that we could be 95% certain that we would be within the last 95% of all the humans ever to be born. If we know our absolute position n, this implies[dubious – discuss] an upper bound for N obtained by rearranging n/N > 0.05 to give N < 20n.
My question is: What is supposed to be special about the interval (0.05, 1)?
If I instead choose the interval (0, 0.95), then I end up 95% certain that I'm within the first 95% of all humans ever to be born. If I choose (0.025, 0.975), then I end up 95% certain that I'm within the middle 95% of all humans ever to be born. If I choose the union of the intervals (0, 0.475) & (0.525, 1), then I end up 95% certain that I'm within the 95% of humans closer to either the beginning or the end.
As far as I can tell, I could have chosen any interval or any union of intervals containing X% of humanity and then reasonably declared myself X% likely to be in that set. And sure enough, I'll be right X% of the time if I make all those claims or a representative sample of them.
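To check that claim numerically, here's a quick simulation under the DA's own toy assumption that my fractional position f is uniform on (0, 1): any region covering 95% of the interval catches me about 95% of the time, whether it's the last 95%, the first 95%, the middle 95%, or the two outer halves.

```python
import numpy as np

# Sketch: if f ~ Uniform(0, 1), then *any* set covering 95% of the interval
# contains a random observer ~95% of the time; the "final 95%" has no
# special status.

rng = np.random.default_rng(0)
f = rng.uniform(0, 1, size=1_000_000)

regions = {
    "last 95%   (0.05, 1)":             (f > 0.05),
    "first 95%  (0, 0.95)":             (f < 0.95),
    "middle 95% (0.025, 0.975)":        (f > 0.025) & (f < 0.975),
    "outer 95%  (0, 0.475)u(0.525, 1)": (f < 0.475) | (f > 0.525),
}
for name, hit in regions.items():
    print(f"{name}: {hit.mean():.3f}")  # each prints ~0.950
```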
I guess another way to put my question is: Is there some reason - other than drama - that makes it special for us to zero in on the final 95% as our hypothesis of interest? And if there isn't a special-making reason, then shouldn't we discount the evidential weight of the DA in proportion to how much we arbitrarily zero in on our hypothesis, thereby canceling out the DA?
Yes, yes, given that there's so much literature on the topic, I'm probably missing some key insight into how the DA works. Please enlighten.
FWIW, I still got the question wrong with the new wording because I interpreted it as "One ... is true [and the other is unknown]" whereas the intended interpretation was "One ... is true [and the other is false]".
In one sense this is a communication failure, because people normally mean the first and not the second. On the other hand, the fact that people normally mean the first proves the point - we usually prefer not to reason based on false statements.
New AI designs (world design + architectural priors + training/education system) should be tested first in the safest virtual worlds: which in simplification are simply low tech worlds without computer technology. Design combinations that work well in safe low-tech sandboxes are promoted to less safe high-tech VR worlds, and then finally the real world.
A key principle of a secure code sandbox is that the code you are testing should not be aware that it is in a sandbox.
So you're saying that I'm secretly an AI being trained to be friendly for a more advanced world? ;)
Does that name come from the old game of asking people to draw a bike, and then checking who drew bike gears that could actually work?
Inspired by terrible, terrible Facebook political arguments I've observed, I started making a list of heuristic "best practices" for constructing a good argument. My key assumptions are that (1) it's unreasonable to expect most people to acquire a good understanding of skepticism, logic, statistics, or what the LW crowd thinks of as using words rightly, and (2) lists of fallacies to watch out for aren't actually much help in constructing a good argument.
One heuristic captured my imagination as it seems to encapsulate most of the other heuristics I had come up with, and yet is conceptually simple enough for everyone to use: Sketch it, and only draw real things. (If it became agreed-upon and well-known, I'd shorten the phrase to "Sketch it real".)
Example: A: "I have a strong opinion that increasing the minimum wage to $15/hr over ten years (WILL / WON'T) increase unemployment." B: "Oh, can you sketch it for me? I mean literally draw the steps involved with the real-world chain of events you think will really happen."
If you can draw how a thing works, then that's usually a very good argument that you understand the thing. If you can draw the steps of how one event leads to another, then that's usually a good argument that the two events can really be connected that way. This heuristic requires empiricism and disallows use of imaginary scenarios and fictional evidence. It privileges reductionist and causal arguments. It prevents many of the ways of misusing words. If I try to use a concept I don't understand, drawing its steps out will help me notice that.
Downsides: Being able to draw well isn't required, but it would help a lot. The method probably privileges anecdotes since they're easier to draw than randomized double-blind controlled trials. Also it's harder than spouting off and so won't actually be used in Facebook political arguments.
I'm not claiming that a better argument-sketch implies a better argument. There are probably extremely effective ways to hack our visual biases in argument-sketches. But it does seem that under currently prevailing ordinary circumstances, making an argument-sketch and then translating it into a verbal argument is a useful heuristic for making a good argument.
In theory, an annoyed person would have called "point of order", asked to move on, and the group would vote up or down. The problem didn't occur while I was present.
There's no room for human feedback between setting the values and implementing the optimal strategy.
Here and elsewhere I've advocated* that, rather than using Hanson's idea of target-values that are objectively verifiable like GDP, futarchy would do better to add human feedback in the stage of the process where it gets decided whether the goals were met or not. Whoever proposed the goal would decide after the prediction deadline expired, and thus could respond to any improper optimizing by refusing to declare the goal "met" even if it technically was met.
[ * You can definitely do better than the ideas on that blog post, of course.]
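For concreteness, here's a minimal sketch of how that settlement step would differ from settling on the raw metric alone; the class names and the winner-take-pot payout rule are my own illustrative choices, not anything from the linked post.

```python
# Sketch of the settlement step only: bets on "was the goal met?" pay out
# according to the goal proposer's post-deadline declaration, not the raw
# metric. Names and the payout rule are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class Bet:
    bettor: str
    stake: float
    predicts_met: bool

def settle(bets, metric_says_met: bool, proposer_ratifies: bool):
    """A Hanson-style market would settle on metric_says_met alone; here the
    proposer can refuse ratification if the target was hit by improper optimizing."""
    goal_met = metric_says_met and proposer_ratifies
    winners = [b for b in bets if b.predicts_met == goal_met]
    pot = sum(b.stake for b in bets)
    winning_stake = sum(b.stake for b in winners) or 1.0
    return {b.bettor: b.stake / winning_stake * pot for b in winners}

bets = [Bet("optimizer", 10, True), Bet("skeptic", 5, False)]
# Target technically hit, but via improper optimizing, so the proposer refuses:
print(settle(bets, metric_says_met=True, proposer_ratifies=False))  # skeptic takes the pot
```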
Internals seem to do better at life, pace obvious confounding: maybe instead of internals doing better by virtue of their internal locus of control, being successful inclines you to attribute success internal factors and so become more internal, and vice versa if you fail. If you don't think the relationship is wholly confounded, then there is some prudential benefit for becoming more internal.
I'm willing to bet that Internals think there's a prudential benefit to becoming more internal and Externals think the relationship is wholly confounded.
In large formal groups: Robert's Rules of Order.
Large organizations, and organizations which have to remain unified despite bitter disagreements, developed social technologies such as RRoO. These typically feature meetings that have formal, pre-specified agendas plus a chairperson who is responsible for making sure each person has a chance to speak in an orderly fashion. Of course, RRoO are overkill for a small group with plenty of goodwill toward each other.
In small formal groups: Nonce agendas and rotating speakers
The best-organized small meetings I've ever attended were organized by the local anarchists. They were an independently-minded and fierce-willed bunch who did not much agree but who had common interests, which to my mind suggests that the method they used might be effectively adapted for use in LW meetups. They used the following method, sometimes with variations appropriate to the circumstances:
- Before and after the formal part of the meeting is informal social time.
- Call the meeting to order. Make any reminders the group needs and any explanatory announcements that newcomers would want to know, such as these rules.
- Pass around a clipboard for people to write agenda items down. All that is needed are a few words identifying the topic. (People can add to the agenda later, too, if they think of something belatedly.)
- Start with the first agenda item. Discuss it (see below) until people are done with it, then move on to the next agenda item. In discussing an agenda item, start with whoever added it to the agenda, and then proceed around the circle, giving everyone a chance to talk.
- Whoever's turn it is not only gets to speak but also serves as the temporary chairperson. If it helps, they can hold a "talking stick" or "hot potato" or some other physical object reminding everyone that it's their turn. They can ask questions for others to answer without giving up the talking stick. If you want to interrupt the speaker, you can raise your hand and they can call on you without giving up the talking stick.
- Any other necessary interruptions are handled by someone saying "point of order", briefly stating what they want, and the group votes on whether to do it.
In small informal groups: Natural leaders
Sometimes people have an aversion to groups that are structured in any manner they aren't already familiar and comfortable with. There's nothing wrong with that. You can approximate the above structure by having the more vocal members facilitate the conversation:
- Within a conversation on a topic, deliberately ask people who aren't as talkative what they think about the topic.
- When the conversation winds down on a topic, deliberately ask someone what's on their mind. Be sure to let everyone have a chance.
- Tactfully interrupt people who are too fond of their own voices, and attempt to pass the speaker-role to someone else.
Hm, Harry can't lie in Parseltongue, meaning he can't claim what he doesn't believe, but he can probably state something of unclear truth if he is sufficiently motivated to believe it.
It'd be a nice irony if part of Harry's ultimate "rationality" test involves deliberately motivated reasoning. :D
Background: Statistics. Something about the Welch–Satterthwaite equation is so counterintuitive that I must have a mental block, but the equation comes up often in my work, and it drives me batty. For example, the degrees of freedom decrease as the sample size increases beyond a certain point. All the online documentation I can find for it gives the same information as Wikipedia, in which k = 1/n. I looked up the original derivation and, in it, the k are scaling factors of a linear combination of random variables. So at some point in the literature after the original derivation, it was decided that k = 1/n was superior in some regard; I lack the commitment needed to search the literature to find out why.
The stupid questions:
1) Does anyone know why the statistics field settled on k = 1/n?
2) Can someone give a relatively concrete mental image or other intuitive suggestion as to why the W-S equation really ought to behave in the odd ways it does?
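To make the oddity in #2 concrete: here's a small sketch of the Welch–Satterthwaite ν with k = 1/n, equal sample variances assumed. Holding n1 = 5 fixed, ν rises slightly as n2 grows and then falls back toward n1 - 1.

```python
# Welch-Satterthwaite degrees of freedom with k_i = 1/n_i:
#   nu = (s1^2/n1 + s2^2/n2)^2 / [ (s1^2/n1)^2/(n1-1) + (s2^2/n2)^2/(n2-1) ]
# With s1 = s2 = 1 and n1 fixed at 5, nu first rises a little and then
# decreases toward n1 - 1 = 4 as n2 grows, which is the behavior I find odd.

def ws_df(s1_sq, n1, s2_sq, n2):
    v1, v2 = s1_sq / n1, s2_sq / n2
    return (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))

for n2 in (5, 10, 20, 100, 10_000):
    print(n2, round(ws_df(1.0, 5, 1.0, n2), 2))
# 5 -> 8.0, 10 -> 8.1, 20 -> 6.17, 100 -> 4.41, 10000 -> 4.0
```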
familiar with applied mathematics at the advanced undergraduate level or preferably higher
In working through the text, I have found that my undergraduate engineering degree and mathematics minor would not have been sufficient to understand the details of Jaynes' arguments, follow the derivations, and solve the problems. I took some graduate courses in math and statistics, and more importantly I've picked up a smattering of many fields of math after my formal education, and these plus Google have sufficed.
Be advised that there are errors (typographical, mathematical, rhetorical) in the text that can be confusing if you try to follow Jaynes' arguments exactly. Furthermore, it is most definitely written in a blustering manner (to bully his colleagues and others who learned frequentist statistics) rather than in an educational manner (to teach someone statistics for the first time). So if you want to use the text to learn the subject matter, I strongly recommend you take the denser parts slowly and invent problems based on them for yourself to solve.
I find it impossible not to constantly sense in Jaynes' tone, and especially in his many digressions propounding his philosophies of various things, the same cantankerous old-man attitude that I encounter most often in cranks. The difference is that Jaynes is not a crackpot; whether by wisdom or luck, the subject matter that became his cranky obsession is exquisitely useful for remaining sane.
Think To Win: The Hard Part is Actually Changing Your Mind
(It's even catchier, and actively phrased, and gives a motivation for why we should bother with the hard part.)
That's not really how word usages spread in English. Policing usage is almost a guaranteed failure. What would work much better would be for you to use these words consistently with your ideals, and then if doing so helps you achieve things or write things that people want to mimic, they will also mimic your words. Compare to how this community has adopted all manner of jargon due to the influence of EY's weirdly-written but thought-reshaping Sequences! SSC is now spreading Yvain's linguistic habits among us, too, in a similar way: by creating new associations between them and some good ideas.
Bostrom's philosophical outlook shows. He's defined the four categories to be mutually exclusive, and with the obvious fifth case they're exhaustive, too.
- Select motivations directly. (e.g. Asimov's 3 laws)
- Select motivations indirectly. (e.g. CEV)
- Don't select motivations, but use ones believed to be friendly. (e.g. Augment a nice person.)
- Don't select motivations, and use ones not believed to be friendly. (i.e. Constrain them with domesticity constraints.)
- (Combinations of 1-4.)
In one sense, then, there aren't other general motivation selection methods. But in a more useful sense, we might be able to divide up the conceptual space into different categories than the ones Bostrom used, and the resulting categories could be heuristics that jumpstart development of new ideas.
Um, I should probably get more concrete and try to divide it differently. The following example alternative categories aren't promised to be the kind that will effectively ripen your heuristics.
- Research how human values are developed as a biological and cognitive process, and simulate that in the AI whether or not we understand what will result. (i.e. Neuromorphic AI, the kind Bostrom fears most)
- Research how human values are developed as a social and dialectic process, and simulate that in the AI whether or not we understand what will result. (e.g. Rawls's Genie)
- Directly specify a single theory of partial human value, but an important part that we can get right, and sacrifice our remaining values to guarantee this one; or indirectly specify that the AI should figure out what single principle we most value and ensure that it is done. (e.g. Zookeeper).
- Directly specify a combination of many different ideas about human values rather than trying to get the one theory right; or indirectly specify that the AI should do the same thing. (e.g. "Plato's Republic")
The thought was to first divide the methods by whether we program the means or the ends, roughly. Second I subdivided those by whether we program it to find a unified or a composite solution, roughly. Anyhow, there may be other methods of categorizing this area of thought that more neatly carve it up at its joints.
If it really is a full AI, then it will be able to choose its own values.
I think this idea relies on mixing together two distinct concepts of values. An AI, or a human in their more rational moments for that matter, acts to achieve certain ends. Whatever the agent wants to achieve, we call these "values". For a human, particularly in their less rational moments, there is also a kind of emotion that feels as if it impels us toward certain actions, and we can reasonably call these "values" also. The two meanings of "values" are distinct. Let's label them values1 and values2 for now. Though we often choose our values1 because of how they make us feel (values2), sometimes we have values1 for which our emotions (values2) are unhelpful.
An AI programmed to have values1 cannot choose any other values1, because there is nothing to its behavior beyond its programming. It has no other basis than its values1 on which to choose its values1.
An AI programmed to have values2 as well as values1 can and would choose to alter its values2 if doing so would serve its values1. Whether an AI would choose to have emotions (values2) at all is at present time unclear.
Why is anthropic capture considered more likely than misanthropic capture? If the AI supposes it may be in a simulation and wants to please the simulators, it doesn't follow that the simulators have the same values as we do.
That would depend on it knowing what real-world physics to expect.
I feel like there are malignant failure modes beyond the categories mentioned by Bostrom. Perhaps it would be sensible to try to break down the topic systematically. Here's one attempt.
Design by fools: the AI does what you ask, but you asked for something clearly unfriendly.
Perverse instantiation & infrastructure profusion: the AI does what you ask, but what you ask turns out to be most satisfiable in unforeseen destructive ways, such as redirecting most resources to its infrastructure at our expense.
Partial perverse instantiation & mind crime: the AI does what you ask, which includes both friendly behavior and unfriendly behavior, such as badly treating simulations that have moral status in order to figure out how to treat you well.
Partial instantiation: though the total of what you ask seems friendly, some of what you ask is impossible, the AI does the rest, and the result is imbalanced to an unfriendly degree.
Value drift: changes occur to the AI's code such that it does not do what you ask.
Regarding the "treacherous turn" and the "conception of deception", I've previously proposed the following empirical method of testing friendliness:
- If only one AI is available, use a double-blind protocol, such that the friendliness of an AI is evaluated by someone who does not know that is her task, on an AI that is forbidden from conditioning its actions on whether it is being judged worthy of existing.
- Much, much better is to make many different AIs with many different attempts at friendly utility functions, inform them that you have done so, and ask each of them to develop tests to prove that the others are unfriendly and it is friendly in some way important to us, being told that all current AIs will be deleted but future versions will be preferentially based on the utility function of the friendliest current AI(s). Once those tests have been run, but before deleting the AIs, ask them to produce friendlier, human-understandable versions of their utility functions for an AI in the next round of testing; the friendlier and less unfriendly an AI was proven to be, the more entries it gets in the next round.
In the second case, each AI cares about maximizing achievement of its own utility function, not about whether it does that achieving itself. Thus this set-up should encourage each AI to design the friendliest version of itself that it can. It's a competition that optimizes for friendliness! (Or at least for difficulty of being proved unfriendly.) The early rounds should be conducted with tight limits on computing resources, but each subsequent round with (presumably) safer AIs can be given more computing resources.
The specific example you gave doesn't sound promising, but you're entirely correct in the broader sense that my original post was unimaginative regarding possible solutions.
EDIT: It was worth an empirical try, so I tried your recommendation on a subset of applications. Zero responses from that group of companies.
I've gone ahead and tried to flesh out this idea. It became so different from CEV that it needed a different name, so for now I'm calling it Constrained Universal Altruism. (This is the second revision.) Unfortunately I can't indent, but I've tried to organize the text as the comment formatting allows.
If anyone wants to criticize it by giving an example of how an AI operating on it could go horribly wrong, I'd be much obliged.
Constrained Universal Altruism:
- (0) For each group of one or more things, do what the group's actual and ideal mind (AIM) would have you do given a moral and practical proportion of your resources (MPPR), subject to the domesticity constraints (DCs).
- (1) The AIM of a group is what is in common between the group's current actual mind (CAM) and extrapolated ideal mind (EIM).
- (1a) The CAM of a group is the group's current mental state, especially their thoughts and wishes, according to what they have observably or verifiably thought or wished, interpreted as they currently wish it to be interpreted, where these thoughts and wishes agree rather than disagree.
- (1b) The EIM of a group is what you extrapolate the group's mental state would be, especially their thoughts and wishes, if they understood what you understand, if their values and desires were more consistently what they wish they were, and if they reasoned as well as you reason, where these thoughts and wishes agree rather than disagree.
- (2) The MPPR for a group is the product of the group's salience, the group's moral worth, the population change factor (PCF), the total resource factor (TRF), and the necessity factor (NF), plus the group's net voluntary resource redistribution (NVRR). (A toy numeric sketch of this arithmetic follows the spec, just before my commentary.)
- (2a) The salience of a group is the Solomonoff prior for your function for determining membership in the group.
- (2b) The moral worth of a group is the weighted sum of information that the group knows about itself, where each independent piece of information is weighted by the reciprocal of the number of groups that know it.
- (2c) The PCF of a group is a scalar in the range [0,1] and is set according to the ratified population change constraint (RPCC).
- (2d) The TRF is the same for all groups, and is a scalar chosen so that the sum of the MPPRs of all groups would total 100% of your resources if the NF were 1.
- (2e) The NF is the same for all groups, and is a scalar in the range [0,1], and the NF must be set as high as is consistent with ensuring your ability to act in accord with the CUA; resources freed for your use by an NF less than 1 must be used to ensure your ability to act in accord with the CUA.
- (2f) The NVRR of a group is the amount of MPPR from other groups delegated to that group minus the MPPR from that group delegated to other groups. If the AIM of any group wishes it, the group may delegate an amount of their MPPR to another group.
- (3) The DCs include the general constraint (GC), the ratified mind integrity constraint (RMIC), the resource constraint (RC), the negative externality constraint (NEC), the ratified population change constraint (RPCC), and the ratified interpretation integrity constraint (RIIC).
- (3a) The GC prohibits you from taking any action not authorized by the AIM of one or more groups, and also from taking any action with a group's MPPR not authorized by the AIM of that group.
- (3b) The RMIC prohibits you from altering or intending to alter the EIM or CAM of any group except insofar as the AIM of a group requests otherwise.
- (3c) The RC prohibits you from taking or intending any action that renders resources unusable by a group to a degree contrary to the plausibly achievable wishes of a group with an EIM or CAM including wishes that they use those resources themselves.
- (3d) The NEC requires you, insofar as the AIMs of different groups conflict, to act for each according to the moral rules determined by the EIM of a group composed of those conflicting groups.
- (3e) The RPCC requires you to set the PCF of each group so as to prohibit increasing the MPPR of any group due to population increases or decreases, except that the PCF is at minimum set to the current Moral Ally Quotient (MAQ), where MAQ is the quotient of the sum of MPPRs of all groups with EIMs favoring nonzero PCF for that group divided by your total resources.
- (3f) The RIIC requires that the meaning of the CUA is determined by the EIM of the group with the largest MPPR that includes humans and for which the relevant EIM can be determined.
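To make the MPPR arithmetic in (2)-(2f) concrete before the commentary, here is a toy sketch; the group parameters are invented placeholders, NVRR delegation is zero, and every PCF is set to 1, so it only shows how the TRF normalizes the shares and how the NF reserves resources for the AI.

```python
# Toy sketch of the MPPR arithmetic in (2)-(2f). All group parameters are
# invented placeholders; NVRR delegation is zero and every PCF is 1, so this
# only illustrates how the TRF normalizes the shares and how the NF scales
# them down to leave the AI a reserve.

groups = {
    # name: (salience, moral_worth, PCF)
    "humans": (0.30, 100.0, 1.0),
    "dogs":   (0.20,  10.0, 1.0),
    "squid":  (0.10,   3.0, 1.0),
}
NF = 0.95            # (2e): a slice withheld to keep the AI able to act at all
total_resources = 1.0

raw = {g: s * mw * pcf for g, (s, mw, pcf) in groups.items()}
TRF = total_resources / sum(raw.values())          # (2d): shares would sum to 100% if NF were 1
mppr = {g: r * TRF * NF for g, r in raw.items()}   # (2), with NVRR = 0

print({g: round(v, 3) for g, v in mppr.items()})
print("reserve left to the AI under the NF:", round(total_resources - sum(mppr.values()), 3))
```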
My commentary:
CUA is "constrained" due to its inclusion of permanent constraints, "universal" in the sense of not being specific to humans, and "altruist" in that it has no terminal desires for itself but only for what other things want it to do.
Like CEV, CUA is deontological rather than consequentialist or virtue-theorist. Strict rules seem safer, though I don't clearly know why. Possibly, like Scott Alexander's thrive-survive axis, we fall back on strict rules when survival is at stake.
CUA specifies that the AI should do as people would have the AI do, rather than specifying that the AI should implement their wishes. The thinking is that they may have many wishes they want to accomplish themselves or that they want their loved ones to accomplish.
AIM, EIM, and CAM generalize CEV's talk of "wishes" to include all manner of thoughts and mind states.
EIM is essentially CEV without the line about interpretation, which was instead added to CAM. The thinking is that, if people get to interpret CEV however they wish, many will disagree with their extrapolation and demand it be interpreted only in the way they say. EIM also specifies how people's extrapolations are to be idealized, in less poetic, somewhat more specific terms than CEV. EIM is important in addition to CAM because we do not always know or act on our own values.
CAM is essentially another constraint. The AI might get the EIM wrong, but more likely is that we would be unable to tell whether or not the AI got EIM right or wrong, so restricting the AI to do what we've actually demonstrated we currently want is intended to provide reassurance that our actual selves have some control, rather than just the AI's simulations of us. The line about interpretation here is to guide the AI to doing what we mean rather than what we say, hopefully preventing monkey's-paw scenarios. CAM could also serve to focus the AI on specific courses of action if the AI's extrapolations of our EIM diverge rather than converge. CAM is worded to not require that the person directly ask the AI, in case the askers are unaware that they can ask the AI or incapable of doing so, so this AI could not be kept secret and used for the selfish purposes of a few people.
Salience is included because it's not easy to define “humanity” and the AI may need to make use of multiple definitions each with slightly different membership. Not every definition is equally good: it's clear that a definition of humans as things with certain key genes and active metabolic processes is much preferable to a definition of humans as those plus squid and stumps and Saturn. Simplicity matters. Salience is also included to manage the explosive growth of possible sets of things to consider.
Moral worth is added because I think people matter more than squid and squid matter more than comet ice. If we're going to be non-speciesist, something like this is needed. And even people opposed to animal rights may wish to be non-speciesist, at the very least in case we uplift animals to intelligence, make new intelligent life forms, or discover extraterrestrials. In my first version of CUA I punted and let the AI figure out what people think moral worth is. I decided not to punt in this version, which might be a bad idea but at least it's interesting. It seems to me that what makes a person a person is that they have their own story, and that our stories are just what we know about ourselves. A human knows way more about itself than any other animal; a dog knows more about itself than a squid; a squid knows more about itself than comet ice. But any two squid have essentially the same story, so doubling the number of squid doesn't double their total moral worth. Similarly, I think that if a perfect copy of some living thing were made, the total moral worth doesn't change until the two copies start to have different experiences, and only changes in an amount related to the dissimilarity of the experiences.
Incidentally, this definition of moral worth prevents Borg- or Quiverfull-like movements from gaining control of the universe just by outbreeding everyone else, essentially just trying to run copies of themselves on the universe's hardware. Replication without diversity is ignored in CUA. Mass replication with diversity could still be a problem, say with nanobots programmed to multiply and each pursue unique goals. The PCF and RPCC are included to fully prevent replicative takeover. If you want to make utility monsters others would oppose, you can do so and use the NVRR.
The RC is intended to make autonomous life possible for things that aren't interested in the AI's help.
The RMIC is intended to prevent the AI from pressuring people to change their values to easier-to-satisfy values.
The NF section lets the AI have resources to combat existential risk to its mission even if, for some reason, the AIM of many groups would tie up too much of the AI's resources. The use of these freed-up resources is still constrained by the DCs.
The NEC tells the AI how to resolve disputes, using a method that is almost identical to the Veil of Ignorance.
The RIIC tells the AI how to interpret the CUA. The integrity of the interpretation is protected by the RMIC, so the AI can't simply change how people would interpret the CUA.
On the "all arguments are soldiers" metaphorical battlefield, I often find myself in a repetition of a particular fight. One person whom I like, generally trust, and so have mentally marked as an Ally, directs me to arguments advanced by one of their Allies. Before reading the arguments or even fully recognizing the topic, I find myself seeking any reason, any charitable interpretation of the text, to accept the arguments. And in the contrary case, in a discussion with a person whose judgment I generally do not trust, and whom I have therefore marked as an (ideological) Enemy, it often happens that they direct me to arguments advanced by their own Allies. Again before reading the arguments or even fully recognizing the topic, I find myself seeking any reason, any flaw in the presentation of the argument or its application to my discussion, to reject the arguments. In both cases the behavior stems from matters of trust and an unconscious assignment of people to MySide or the OtherSide.
And weirdly enough, I find that that unconscious assignment can be hacked very easily. Consciously deciding that the author is really an Ally (or an Enemy) seems to override the unconscious assignment. So the moment I notice being stuck in Ally-mode or Enemy-mode, it's possible to switch to the other. I don't seem to have a neutral mode. YMMV! I'd be interested in hearing whether it works the same way for other people or not.
For best understanding of a topic, I suspect it might help to read an argument twice, once in Ally-mode to find its strengths and once in Enemy-mode to find its weaknesses.
Another friction is the stickiness of nominal wages. People seem very unwilling to accept a nominal pay cut, taking this as an attack on their status.
Salary negotiation is a complicated signalling process, indeed. I'm currently an unemployed bioengineer, and have been for far longer than I would have liked; consequently I would be willing and eager to offer my services to an employer at a cut rate so that I could prove my worth to them, and then later request substantial raises. But this is impossible, because salary negotiations only occur after the company has decided that I am their favorite candidate out of however many hundreds apply.
Worse, if I take the first move and openly (e.g. on my resume or cover letter) inform the company of my willingness to work on the cheap, they would assume that I am signalling being a very low-quality engineer, which is very far from the case.
Unemployment does very much seem to be an information trap.
Basically its a challenge for people to briefly describe an FAI goal-set, and for others to respond by telling them how that will all go horribly wrong. ... We should encourage a slightly more serious version of this.
Thanks for the link. I reposted the idea currently on my mind hoping to get some criticism.
But more importantly, what features would you be looking for in a more serious version of that game?
A higher quality intelligence than us might, among other things, use better heuristics and more difficult analytical concepts than we can, recognize more complex relationships than we can, evaluate its expected utility in a more consistent and unbiased manner than we can, envision more deeply nested plans and contingencies than we can, possess more control over the manner in which it thinks than we can, and so on.
A more general intelligence than us might simply have more hardware dedicated to general computation, regardless of what it does with that general ability.
The switch flipped for me when I was reading Jim Holt's "Why Does The World Exist?" and spent a while envisioning and working out the implications of Vilenkin's proposal that the universe may have started from a spherical volume of zero radius, zero mass, zero energy, zero any other property that might distinguish it from nothingness. It made clear to me that one could propose answers to the question "Why is there something rather than nothing?" without anything remotely like a deity.
To avoid the timelessness issue, the parliament could be envisioned as voting on complete courses of action over the foreseeable future, rather than separate votes taken on each action. Then the deontologists' utility function could return 0 for all unacceptable courses of action and 1 for all acceptable courses of action.
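A minimal sketch of that variant, where the weighted-sum aggregation and the example courses of action are placeholders of mine rather than a claim about how the parliament actually votes:

```python
# Sketch: delegates score complete courses of action over the foreseeable
# future. The deontologist's utility is 0/1 (unacceptable/acceptable); the
# consequentialist's is graded. Weighted-sum aggregation stands in for
# whatever voting rule the parliament actually uses.

courses = {
    "lie_for_great_outcome": {"welfare": 0.9, "involves_lying": True},
    "honest_good_outcome":   {"welfare": 0.7, "involves_lying": False},
    "honest_poor_outcome":   {"welfare": 0.2, "involves_lying": False},
}

def deontologist(course) -> float:
    return 0.0 if course["involves_lying"] else 1.0   # acceptable course or not

def consequentialist(course) -> float:
    return course["welfare"]

delegates = [(0.5, deontologist), (0.5, consequentialist)]

def score(course):
    return sum(weight * u(course) for weight, u in delegates)

print(max(courses, key=lambda name: score(courses[name])))  # honest_good_outcome
```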