I think your own message is also too extreme to be rational. So it seems to me that you are fighting fire with fire. Yes, Remmelt uses some extreme expressions, but you definitely use extreme expressions here too, while making even weaker arguments.
Could we find a golden middle road, a common ground, please? With more reflective thinking and with less focus on right and wrong? (Regardless of the dismissive-judgemental title of this forum :P)
I agree that Remmelt can improve the message. And I believe he will do that.
I may not agree that we are going to die with 99% probability. At the same time, I find that his current directions are definitely worth exploring.
I also definitely respect Paul. But mentioning his name here is mostly irrelevant for my reasoning or for taking your arguments seriously, simply because I usually do not take authorities too seriously before I understand their reasoning on a particular question. And understanding a person's reasoning may occasionally mean that I disagree with particular points as well. In my experience, even the most respected people are still people, which means they often think in messy ways and are good only on average, not per instance of a thought line (which may mean they are poor thinkers 99% of the time, while having really valuable thoughts 1% of the time). I do not know the distribution for Paul, but I would definitely not be disappointed if he makes mistakes sometimes.
I think this part of Remmelt's response sums it up nicely: "When accusing someone of crankery (which is a big deal) it is important not to fall into making vague hand-wavey statements yourself. You are making vague hand-wavey (and also inaccurate) statements above. Insinuating that something is “science-babble” doesn’t do anything. Calling an essay formatted as shorter lines a “poem” doesn’t do anything."
In my interpretation, black-and-white thinking is not "crankery". It is, unfortunately, a normal and essential step in the development of cognition about a particular problem. There is research about that in the fields of developmental and cognitive psychology. Hopefully that applies to your own black-and-white thinking as well. Note that, unfortunately, this development is topic-specific, not universal.
In contrast, "crankery" is too strong a word for describing black-and-white thinking, because it is a very judgemental word, a complete dismissal, and essentially an expression of unwillingness to understand, an insult, not just a disagreement about the degree of the claims. Is labelling someone's thoughts as "crankery" also a form of crankery of its own then? Paradoxical, isn't it?
The following is meant as a question to find out, not a statement of belief.
Nobody seems to have mentioned the possibility that initially they did not intend to fire Sam, but just to warn him or to give him a choice to restrain himself. Yet possibly he himself escalated it to a firing, or chose being fired over complying with the restraint. He might have done that precisely in order to bring about all the consequences that have now taken place, giving him more power.
For example, people in positions of power may escalate disagreements, because that is territory they are more experienced with than their opponents are.
The paper is now published with open access here:
https://link.springer.com/article/10.1007/s10458-022-09586-2
I propose that blacklists are less useful when they are about proxy measures, and much more useful when they are about ultimate objectives. Some ultimate objectives can also be represented in the form of blacklists. For example, listing the many ways to kill a person is less useful, but stating that death or violence is to be avoided is more useful.
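To illustrate the contrast, here is a toy Python sketch. The predicates and names below are made up for illustration only, not a proposal for how such checks should actually be implemented:

```python
# A toy contrast between a proxy-level blacklist (specific actions)
# and an outcome-level blacklist (ultimate objectives). The predicates
# here are placeholders, not a proposal for how to implement them.

ACTION_BLACKLIST = {"administer_poison", "push_off_cliff"}  # proxies: always incomplete

def violates_outcome_blacklist(predicted_outcome):
    # Ultimate objectives: states of the world that are to be avoided.
    return (predicted_outcome.get("human_dead", False)
            or predicted_outcome.get("violence_used", False))

plan = {"action": "drain_oxygen", "predicted_outcome": {"human_dead": True}}
print(plan["action"] in ACTION_BLACKLIST)                     # False: the proxy list misses it
print(violates_outcome_blacklist(plan["predicted_outcome"]))  # True: the outcome list catches it
```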
I imagine that the objectives which fulfil the human needs for Power (control over AI), Self-Direction (autonomy, freedom from too much influence by AI), and maybe others, would also partially work towards ensuring that the AI does not start moving towards wireheading. Wireheading would surely be in contradiction with these objectives.
If we consider wireheading as a process, not a black-and-white event, then there are steps along the way. These steps could potentially be detected, or even foreseen, before the process settles into a new equilibrium.
A question. Is it relevant for your current problem formulation that you also want to ensure that authorised people still have reasonable access to the diamond? In other words, is it important here that the system still needs to yield to actions or input from certain humans, be interruptible and corrigible? Or, in ML terms, does it have to avoid both false negatives and false positives when detecting or avoiding intrusion scenarios?
I imagine that an algorithmically more trivial way to make the system both "honest" and "secured" is to make it so heavily secured that almost certainly nobody can access the diamond.
While the Modem website is down, you can access our workshop paper here:
You can apply the nonlinear transformation either to the rewards or to the Q-values. The aggregation can occur only after the transformation. When the transformation is applied to the Q-values, the aggregation takes place quite late in the process - as Ben said, during action selection.
Both the approach of transforming the rewards and the approach of transforming the Q-values are valid, but they have different philosophical interpretations and also lead to different experimental outcomes in the agent's behaviour. I think both approaches need more research.
For example, I would say that transforming the rewards instead of the Q-values is more risk-averse as well as more "fair" towards individual timesteps, since the negative outcomes are not averaged out across time before being exponentiated. But it also results in slower learning by the agent.
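To make the difference concrete, here is a minimal Python sketch, assuming a risk-averse exponential transform and made-up Q-values (the transform and the numbers are illustrative only, not the exact ones from our paper):

```python
import numpy as np

def transform(x, k=1.0):
    # Concave (risk-averse) utility; a placeholder for whichever
    # nonlinear transformation the agent designer chooses.
    return -np.exp(-k * x)

# Toy per-objective Q-values for 3 actions and 2 objectives.
q_values = np.array([[1.0, -0.5],
                     [0.4,  0.4],
                     [2.0, -2.0]])

# Transforming the Q-values: transform each objective's Q-value,
# then aggregate only during action selection.
aggregated = transform(q_values).sum(axis=1)
best_action = int(np.argmax(aggregated))
print(best_action, aggregated)

# Transforming the rewards would instead apply `transform` to each
# objective's reward at every timestep, before the Q-learning update,
# so that negative outcomes are not averaged out across time first.
```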
Finally, there is a third approach, which uses lexicographical ordering between objectives or sets of objectives. Vamplew has done work in this direction. This approach is truly multi-objective in the sense that there is no aggregation at all; instead, the vectors must be compared during RL action selection without aggregation. The downside is that ordering many objectives (or sets of objectives) lexicographically becomes unwieldy.
I imagine that the lexicographical approach and our continuous nonlinear transformation approach are complementary. There could be, for example, two main sets of objectives: one set for alignment objectives, the other set for performance objectives. Inside a set, nonlinear transformation and then aggregation would be applied; between the sets, lexicographical ordering would be applied. In other words, there would be a hierarchy of objectives. With only two sets, the lexicographical ordering does not become unwieldy.
This would be somewhat analogous to constraint programming, though more flexible: the safety objectives would act as a constraint on the performance objectives. Constraints are almost absurdly missing from classical naive RL, even though they are essential, widely known, and technically well developed in practical applications of constraint programming. In the hybrid approach proposed above, the difference from classical constraint programming is that among the safety objectives there would still be flexibility and the ability to trade off (in a risk-averse way).
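A minimal sketch of this hybrid selection rule, reusing the transform from the previous sketch (the tolerance parameter and the toy numbers are my own illustrative assumptions):

```python
import numpy as np

def transform(x, k=1.0):
    return -np.exp(-k * x)  # same risk-averse transform as in the previous sketch

def select_action(q_align, q_perf, tolerance=1e-3):
    """Lexicographic selection over two sets of objectives.

    q_align: (n_actions, n_alignment_objectives) Q-values
    q_perf:  (n_actions, n_performance_objectives) Q-values
    """
    # Aggregate inside each set, after the nonlinear transformation.
    align_score = transform(q_align).sum(axis=1)
    perf_score = transform(q_perf).sum(axis=1)

    # First tier: keep only actions whose alignment score is
    # (near-)maximal; this acts like a soft constraint.
    admissible = align_score >= align_score.max() - tolerance

    # Second tier: among admissible actions, pick the best performer.
    perf_masked = np.where(admissible, perf_score, -np.inf)
    return int(np.argmax(perf_masked))

q_align = np.array([[0.0], [-1.0], [0.0]])
q_perf = np.array([[0.2], [5.0], [1.0]])
print(select_action(q_align, q_perf))  # -> 2: action 1 performs best but is not admissible
```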
Finally, when we say "multi-objective", it does not refer just to the technical details of the computation. It also stresses the need to research, and to make more explicit, the inherent presence and even structure of multiple objectives inside any abstract top-level objective: to encode knowledge in a way that constrains incorrect solutions but not correct ones. It also acknowledges the potential existence of even more complex, nonlinear interactions between these multiple objectives. We have not focused on such nonlinear interactions yet, but they may become relevant in the future.
I totally agree that in a reasonable agent the objectives or target values / set-points do change, as is also exemplified by biological systems.
While the Modem website is down, you can access our workshop paper here: https://drive.google.com/file/d/1qufjPkpsIbHiQ0rGmHCnPymGUKD7prah/view?usp=sharing
Yes, maybe the minimum cost is 3 even without floor or ceiling? But the question is then how to find concrete solutions that can be proven with realistic effort. I interpret the challenge as a request for the submission of concrete solutions, not just theoretical ones. Anyway, my finding is below; maybe it can be improved further. And could there be any way to emulate floor or ceiling using the functions permitted in the initial problem formulation?
By the way, for me the >! works reliably when entered right at the beginning of the message. After a newline it does not work reliably.
ceil(3!! * sqrt(sqrt(5! / 2 + 2)))
If you would allow the ceiling function, then I could give you a solution with a score of 60 for Puzzle 1. Ceiling or floor functions are cool because they add even more branches to the search, and they enable involving irrational-number computations too. :P Though you might want to restrict the number of ceiling or floor functions permitted per solution.
By the way, could you share a hint about how to enter spoilers here?
Submitting my post for early feedback in order to improve it further:
Abstract.
Utility-maximising agents have been the Gordian knot of AI safety. Here a concrete VNM-rational formula is proposed for satisficing agents, which can be contrasted with the hitherto over-discussed and overly general approach of naive maximisation strategies. For example, the 100-paperclip scenario is easily solved by the proposed framework, since infinitely rechecking whether exactly 100 paperclips were indeed produced yields diminishing returns. The formula provides a framework for specifying how we want agents to simultaneously fulfil, or at least trade off between, the many different common-sense considerations, possibly enabling them to even surpass the relative safety of humans. A comparison with the formula introduced in the “Low Impact Artificial Intelligences” paper by S. Armstrong and B. Levinstein is included.
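To show the diminishing-returns intuition behind the 100-paperclip example, here is a toy Python sketch. It is not the formula from the post, just an illustration with made-up error rates and costs:

```python
# Illustrative only: repeated verification of "exactly 100 paperclips"
# has diminishing marginal value, while each recheck carries a fixed
# cost, so a satisficer stops checking after a few rounds.

def confidence(n_checks, error_rate=0.05):
    # Probability that the count is correct after n independent checks
    # (toy independence assumption).
    return 1.0 - error_rate ** max(n_checks, 1)

def utility(n_checks, value_of_goal=100.0, cost_per_check=0.2):
    return value_of_goal * confidence(n_checks) - cost_per_check * n_checks

best_n = max(range(1, 50), key=utility)
print(best_n, utility(best_n))  # the optimum is a small number of checks, not infinity
```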
It looks like there is so much information on this page that trying to edit the question kills the browser.
An additional idea: in addition to supporting the configuration of the default behaviours, perhaps the agent should interactively ask for confirmation of a shutdown instead of acting deterministically?
I have a question about the shutdown button scenario.
Vika has already mentioned that interruptibility is ambivalent and that information about the desirability of enabling interruptions needs to be provided externally.
I think the same observation applies to corrigibility - the agent should accept goal changes only from certain external agents, and even then only in certain situations, and not accept them in other cases. If I break the vase intentionally (to create a kaleidoscope), it should keep this new state as the new desired state. But if I or a child breaks the vase accidentally, the agent should restore it to the original state. Even more: if I were about to break the vase by accident, the agent may try to interfere using slightly more force than in the case of a child, who would be smaller and more fragile.
How to achieve this using the proposed AUP framework?
In other words, the question can be formulated as follows: let's keep all the symbols used in the gridworld the same, and the agent's code the same as well. Let's only change the meaning of the symbols. So each symbol in the environment should be assigned some additional value or meaning. Without that they are just symbols dancing around according to their own default rules of the game. The default rules might be a useful starting point, but they need to be supplemented with additional information for practical applications.
For example, in the case of the shutdown button scenario, the assigned meaning of the symbols would be something like Vika suggested: let's assume that instead of a shutdown button there is a water bucket accidentally falling on the agent's head, and the button available to the agent disables the bucket.
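As a purely hypothetical illustration of such an annotation layer (none of the names or fields below come from the AUP framework; they are my own assumptions):

```python
# A hypothetical annotation layer over gridworld symbols: the agent's
# code and the symbols stay the same, only the assigned meaning changes.

SYMBOL_MEANINGS = {
    "V": {  # vase
        "restore_if_changed_by": {"accident", "child"},
        "accept_change_by": {"owner_intentional"},
        "max_interference_force": {"child": "gentle", "adult": "moderate"},
    },
    "B": {  # shutdown button / water bucket in Vika's reframing
        "interruption_desirable": True,
    },
}

def should_restore(symbol, cause):
    meaning = SYMBOL_MEANINGS.get(symbol, {})
    return cause in meaning.get("restore_if_changed_by", set())

print(should_restore("V", "accident"))           # True: restore the vase
print(should_restore("V", "owner_intentional"))  # False: keep the new state
```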
You might be interested in Prospect Theory:
https://en.wikipedia.org/wiki/Prospect_theory
Hello!
Here are my submissions for this time. They are all strategy related.
The first one is a project for popularising AI safety topics. The text is not technical in its content, but the project itself is still technological.
https://medium.com/threelaws/proposal-for-executable-and-interactive-simulations-of-ai-safety-failure-scenarios-7acab7015be4
As a bonus, I would add a couple of non-technical ideas about possible economic or social partial solutions for slowing down the AI race (which would leave more time for solving AI alignment):
https://medium.com/threelaws/making-the-tax-burden-of-robot-usage-equal-to-the-tax-burden-of-human-labour-c8e97df751a1
https://medium.com/threelaws/starting-a-human-self-sufficiency-movement-the-handicap-principle-eb3a14f7f5b3
The latter text is not totally new - it is a distilled and edited version of one of my older texts, which was originally several times longer and had a narrower goal than the new one.
Regards:
Roland
For people who become interested in the topic of side effects and whitelists, I would add links to a couple of additional articles from my own past work on related subjects - for developing the ideas further, for discussion, or for cooperation:
The principles are based mainly on the idea of competence-based whitelisting and preserving reversibility (keeping the future options open) as the primary goal of AI, while all task-based goals are secondary.
https://medium.com/threelaws/implementing-a-framework-of-safe-robot-planning-43636efe7dd8
More technical details / a possible implementation of the above.
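As a minimal sketch of the "reversibility first, task goals second" idea, assuming some estimate of each action's reversibility is available (the estimator itself is the hard part and is not shown; the names and numbers are illustrative):

```python
def select_action(actions, reversibility, task_value, threshold=0.9):
    # Primary goal: keep future options open.
    admissible = [a for a in actions if reversibility[a] >= threshold]
    if not admissible:
        return None  # refuse to act rather than close off options
    # Secondary goal: among admissible actions, pursue the task.
    return max(admissible, key=lambda a: task_value[a])

actions = ["wait", "move_box", "break_wall"]
reversibility = {"wait": 1.0, "move_box": 0.95, "break_wall": 0.1}
task_value = {"wait": 0.0, "move_box": 0.7, "break_wall": 0.9}
print(select_action(actions, reversibility, task_value))  # move_box
```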
This is intended as a comment, not as a prize submission, since I first published these texts 10 years ago.
A question: can one post multiple initial applications, each less than a page long? Is there a limit for the total volume?
Hey! I believe we were in the same IRC channel at that time, and I also read your story back then. I still remember some of it. What is the backstory? :)
Hello! Thanks for the prize announcement :)
Hope these observations and clarifying questions are of some help:
https://medium.com/threelaws/a-reply-to-aligned-iterated-distillation-and-amplification-problem-points-c8a3e1e31a30
Summary of potential problems spotted regarding the use of AlphaGoZero:
- Complete visibility vs Incomplete visibility.
- Almost complete experience (self-play) vs Once-only problems. Limits of attention.
- Exploitation (a game match) vs Exploration (the real world).
- Having one goal vs Having many conjunctive goals. Also, having utility maximisation goals vs Having target goals.
- Who is affected by the adverse consequences (in a game vs in the real world)? The problems of adversarial situations, and also of cost externalisation.
- The related question of different timeframes.
Summary of clarifying questions:
- Could you build a toy simulation? So we could spot assumptions and side-effects.
- In which ways does it improve the existing social order? Will we still stay in mediocristan? Long feedback delay.
- What is the scope of application of the idea (global and central, vs local and diverse)?
- Need concrete implementation examples. Any realistically imaginable practical implementation of it might not be so fine anymore, each time for different reasons.
Hello!
I have significantly elaborated and extended my article on self-deception over the last couple of months (before that it was about two pages long).
"Self-deception: Fundamental limits to computation due to fundamental limits to attention-like processes"
https://medium.com/threelaws/definition-of-self-deception-in-the-context-of-robot-safety-721061449f7
I included some examples for the taxonomy, positioned this topic in relation to other similar topics, and compared the applicability of this article to the applicability of other known AI problems.
Additionally, I described or referenced a few ideas for potential partial solutions to the problem (some of the descriptions are new, some I have published before).
One of the motivations for the post is that when we are building an AI that is dangerous in a certain manner, we should at least realise that we are doing that.
I will probably continue updating the post. The history of the post and its state as of 31 March can be seen in the linked Google Doc’s history view (the link is at the top of the article).
When it comes to feedback on postings, I have noticed that people are more likely to get feedback when they ask for it.
I am always very interested in feedback, regardless of whether it is given on my past, current, or future postings. So if possible, please send any feedback you have. It would be of great help!
I will send the same message to your e-mail too.
Thank you and regards:
Roland
Why should one option exclude the other?
Having blinders on would not be so good either.
I propose that with proper labelling both options can be implemented, so that people can decide for themselves what to pay attention to and what to develop further.
Besides potential solutions that are oriented towards being robust to scale, I would like to emphasise that there are also failure modes that are robust to scale - that is, problems which do not go away when the resources are scaled up:
Fundamental limits to computation due to fundamental limits to attention-like processes:
https://medium.com/threelaws/definition-of-self-deception-in-the-context-of-robot-safety-721061449f7
Hello Scott! You might be interested in my proposals for AI goal structures that are designed to be robust to scale:
Using homeostasis-based goal structures:
https://medium.com/threelaws/making-ai-less-dangerous-2742e29797bd
and
Permissions-then-goals based AI user “interfaces” + legal accountability:
https://medium.com/threelaws/first-law-of-robotics-and-a-possible-definition-of-robot-safety-419bc41a1ffe
Hello! My newest proposal:
https://medium.com/threelaws/making-ai-less-dangerous-2742e29797bd
I would like to propose a certain kind of AI goal structure as an alternative to utility-maximisation-based goal structures. The proposed framework would make AI significantly safer, though it would not guarantee total safety. It can be used at the strong-AI level and also much below it, so it scales well. The main idea is to replace utility maximisation with the concept of homeostasis.
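As a toy sketch of the difference, assuming a simple quadratic penalty around a set-point (the linked post may use a different functional form; the example quantities are made up):

```python
import numpy as np

def homeostatic_utility(values, setpoints, weights=None):
    values = np.asarray(values, dtype=float)
    setpoints = np.asarray(setpoints, dtype=float)
    weights = np.ones_like(values) if weights is None else np.asarray(weights)
    # Deviation in *either* direction is penalised, so "more" is not
    # automatically "better" and there is no incentive to maximise.
    return -np.sum(weights * (values - setpoints) ** 2)

# Example: paperclips produced and energy reserve, each with a target level.
print(homeostatic_utility([100, 50], [100, 60]))   # -100.0: slightly under one target
print(homeostatic_utility([1000, 60], [100, 60]))  # -810000.0: overshooting hurts too
```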