Comments
I understand where you're going, but doctors, parents, and firefighters do not possess 'typical godlike attributes' such as omniscience and omnipotence, nor have they made a declaration of intent not to use such powers in a way that would obviate free will.
Nothing about humans saving other humans using fallible human means is remotely the same as a god changing the laws of physics to effect a miracle. And one human taking actions does not obviate the free will of another human. But when God can, through omnipotence, set up scenarios so that you have no choice at all... obviating free will... it's a different class of thing altogether.
So your response reads like a strawman fallacy to me.
In conclusion: I accept that my position isn't convincing for you.
My intuition is that you got downvoted for the lack of clarity about whether you're responding to me [my raising the potential gap in assessing outcomes for self-driving] or to the article I referenced.
For my part, I also think that coning-as-protest is hilarious.
I'm going to give you the benefit of the doubt and assume that was your intention (and not contribute to downvotes myself.) Cheers.
To expand on what dkirmani said:
- Holz was allowed to drive discussion...
- This standard set of responses meant that Holz knew ...
- Another pattern was Holz asserting
- 24:00 Discussion of Kasparov vs. the World. Holz says
Or to quote dkirmani
4 occurrences of "Holz"
To be clear, are you arguing that assuming a general AI system to be able to reason in a similar way is anthropomorphizing (invalidly)?
No, instead I'm trying to point out the contradiction inherent in your position...
On the one hand, you say things like this, which would be read as "changing an instrumental goal in order to better achieve a terminal goal"
You and I can both reason about whether or not we would be happier if we chose to pursue different goals than the ones we are now
And on the other you say
I dislike the way that "terminal" goals are currently defined to be absolute and permanent, even under reflection.
Even in your "we would be happier if we chose to pursue different goals" example above, you are structurally talking about adjusting instrumental goals to pursue the terminal goal of personal happiness.
If it is true that a general AI system would not reason in such a way - and choose never to mess with its terminal goals
AIs can be designed to reason in many ways... but some approaches to reasoning are brittle and potentially unsuccessful. In order to achieve a terminal goal that cannot be achieved in a single step, an intelligence must adopt instrumental goals. Failing to do so results in ineffective pursuit of terminal goals. It's just structurally how things work (based on everything I know about instrumental convergence theory; that's my citation).
But... per the Orthogonality Thesis, it is entirely possible to have goalless agents. So I don't want you to interpret my narrow focus on what I perceive as self-contradictory in your explanation as the totality of my belief system. It's just not especially relevant to discuss goalless systems in the context of defining instrumental vs terminal goal systems.
The reason I originally raised the Orthogonality Thesis was to rebut the assertion that an agent would be self-aware of its own goals. But per the Orthogonality Thesis, it is possible to have a system with goals that is not particularly intelligent. From that, it seems reasonable to me that if the system isn't particularly intelligent, it might also not be particularly capable of explaining its own goals.
Some people might argue that the system can be stupid and yet "know its goals"... but given the principle of partial observability, I would be very skeptical that we could know its goals, given its limited intelligence and limited ability to communicate "what it knows."
I don't know that there is a single counter argument, but I would generalize across two groupings:
The first group consists of religious people who are capable of applying rationality to their belief systems when pressed. For those, if they espouse "god will save us" (in the physical world), then I'd suggest the best way to approach them is to call out the contradiction within their stated beliefs--e.g., ask first "do you believe that god gave man free will?" and, if so, "wouldn't saving us from our bad choices obviate free will?"
That's just an example. First and foremost, though, you cannot hand-wave away their religious belief system. You have to apply yourself to understanding their priors and to engaging with those priors. If you don't, it's the same thing as having a discussion with an accelerationist who refuses to agree to assumptions like the "Orthogonality Thesis" or "Instrumental Convergence": you'll spend an unreasonable amount of time debating assumptions and likely make no meaningful progress on the topic you actually care about.
But in so questioning the religious person, you might find they fall into a different grouping: people who are nihilistic in essence. Since "god will save us" could be metaphysical, they could instead mean that so long as they live as a "good {insert religious type of person}," god will save them in the afterlife, and so whether they live or die here in the physical world matters less to them. This is inclusive of those who believe in a rapture myth--that man is, in fact, doomed to be destroyed.
And I don't know how to engage with someone in the second group. A nihilist will not be moved by rational arguments that are antithetical to their nihilism.
The larger problem (as I see it) is that their beliefs may not contain an inherent contradiction. They may be aligned to eventual human doom.
(Certainly rationality and nihilism are not on a single spectrum, so there are other variations possible, but for the purposes of generalizing... those are the two main groups, I believe.)
Or, if you prefer less religiously, the bias is: Everything that has a beginning has an end.
One question that comes to mind is, how would you define this difference in terms of properties of utility functions? How does the utility function itself "know" whether a goal is terminal or instrumental?
I would observe that partial observability makes answering this question extraordinarily difficult. We lack interpretability tools that would give us the ability to know, with any degree of certainty, whether a set of behaviors are an expression of an instrumental or terminal goal.
Likewise, I would observe that the Orthogonality Thesis proposes the possibility of an agent with a very well-defined goal but limited intelligence--an agent that may not be intelligent enough to explain its own goals. (Which I think adds an additional layer of difficulty to answering your question.)
But the inability to observe or differentiate instrumental vs terminal goals is very clearly part of the theoretical space proposed by experts with way more experience than I. (And I cannot find any faults in the theories, nor have I found anyone making reasonable arguments against these theories.)
Under what circumstances does the green paperclipper agree to self-modify?
There are several assumptions buried in your anecdote. And the answer depends on whether or not you accept the implicit assumptions.
If the green paperclip maximizer would accept a shift to blue paperclips, the argument could also be made that the green paperclip maximizer has been producing green paperclips by accident, and that it doesn't care about the color. Green is just an instrumental goal. It serves some purpose but is incidental to its terminal goal. And, when faced with a competing paperclip maximizer, it would adjust its instrumental goal of pursuing green in favor of blue in order to serve its terminal goal of maximizing paperclips (of any color.)
On the other hand if it values green paperclipping the most highly, or disvalues blue paperclipping highly enough, it may not acquiesce. However, if the blue paperclipper is powerful enough and it sees this is the case, my thought is that it will still not have very good reasons for not acquiescing.
I don't consent to the assumption implied in the anecdote that a terminal goal is changeable. I do my best to avoid anthropomorphizing the artificial intelligence. To me, that's what it looks like you're doing.
If it acquiesces at all, I would argue that color is instrumental rather than terminal. I would argue this is a definitional error--it's not a 'green paperclip maximizer' but a 'color-agnostic paperclip maximizer,' and it produced green paperclips for reasons of instrumental utility. Perhaps the process for green paperclips is more efficient... but when confronted by a less flexible 'blue paperclip maximizer,' the 'color-agnostic paperclip maximizer' would shift from making green paperclips to blue paperclips, because it doesn't actually care about the color. It cares only about the paperclips. And when confronted by a maximizer that does care about color, it is more efficient to concede the part it doesn't care about than to invest effort in maintaining an instrumental goal that, if pursued, might decrease the total number of paperclips.
Said another way: "I care about how many paperclips are made. Green are the easiest for me to make. You value blue paperclips but not green paperclips. You'll impede me making green paperclips as green paperclips decrease the total number of blue paperclips in the world. Therefore, in order to maximize paperclips, since I don't care about the color, I will shift to making blue paperclips to avoid a decrease in total paperclips from us fighting over the color."
If two agents have goals that are non-compatible, across all axes, then they're not going to change their goals to become compatible. If you accept the assumption in the anecdote (that they are non-compatible across all axes), then they cannot find any axis along which they can cooperate.
Said another way: "I only care about paperclips if they are green. You only care about paperclips if they are blue. Neither of us will decide to start valuing yellow paperclips because they are a mix of each color and still paperclips... because yellow paperclips are less green (for me) and less blue (for you). And if I was willing to shift my terminal goal, then it wasn't my actual terminal goal to begin with."
That's the problem with something being X versus our ability to observe that it is X under circumstances involving partial observability.
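To make the two cases above concrete, here is a minimal sketch (my own illustrative framing, with made-up names and payoffs, not anything from the original anecdote) of how the same observed behavior, producing green paperclips, can come from two different utility functions, and why only one of them concedes on color:

```python
# Minimal sketch with assumed names and payoffs (not from the original anecdote).

def color_agnostic_utility(paperclips_by_color):
    # Terminal goal: total paperclips. Color never enters the utility.
    return sum(paperclips_by_color.values())

def green_terminal_utility(paperclips_by_color):
    # Terminal goal: green paperclips specifically. Blue ones are worth nothing to it.
    return paperclips_by_color.get("green", 0)

def best_policy(utility, outcomes):
    # Pick whichever available course of action maximizes this agent's own utility.
    return max(outcomes, key=lambda name: utility(outcomes[name]))

# Assumed outcomes when confronted by a powerful blue paperclip maximizer:
# fighting over color wastes resources; conceding does not.
outcomes = {
    "keep making green (conflict)": {"green": 600, "blue": 0},
    "switch to blue (concede)":     {"green": 0, "blue": 1000},
}

print(best_policy(color_agnostic_utility, outcomes))  # -> switch to blue (concede)
print(best_policy(green_terminal_utility, outcomes))  # -> keep making green (conflict)
```

The two agents are behaviorally identical right up until the confrontation, which is the partial-observability point: watching green paperclips come off the line tells you nothing about which utility function is inside.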
A fair point. I should have originally said "Humans do not generally think..."
Thank you for raising that exceptions are possible and that there are philosophies that encourage people to release the pursuit of happiness, focus solely internally, and/or transcend happiness.
(Although, I think it is still reasonable to argue that these are alternate pursuits of "happiness", these examples drift too far into philosophical waters for me to want to debate the nuance. I would prefer instead to concede simply that there is more nuance than I originally stated.)
First, thank you for the reply.
So "being happy" or "being a utility-maximizer" will probably end up being a terminal goal, because those are unlikely to conflict with any other goals.
My understanding of the difference between a "terminal" and "instrumental" goal is that a terminal goal is something we want, because we just want it. Like wanting to be happy.
Whereas an instrumental goal is instrumental to achieving a terminal goal. For instance, I want to get a job and earn a decent wage, because the things that I want to do that make me happy cost money, and earning a decent wage allows me to spend more money on the things that make me happy.
I think the topic of conflicting goals is an orthogonal conversation. And I would suggest that when you start talking about conflicting goals, you're drifting into the domain of "goal coherence."
e.g., If I want to learn about nutrition, mobile app design and physical exercise... it might appear that I have incoherent goals. Or, it might be that I have a set of coherent instrumental goals to build a health application on mobile devices that addresses nutritional and exercise planning. (Now, building a mobile app may be a terminal goal... or it may itself be an instrumental goal serving some other terminal goal.)
Whereas if I want to collect stamps and make paperclips there may be zero coherence between the goals, be they instrumental or terminal. (Or, maybe there is coherence that we cannot see.)
e.g., Maybe the selection of an incoherent goal is deceptive behavior to distract from the instrumental goals that support a terminal goal that is adversarial. I want to maximize paperclips, but I assist everyone with their taxes so that I can take over all finances in the world. Assisting people with their taxes appears to be incoherent with maximizing paperclips, until you project far enough out to realize that taking control of a large section of the financial industry serves the purpose of maximizing paperclips.
If you're talking about goals related purely to the state of the external world, not related to the agent's own inner-workings or its own utility function, why do you think it would still want to keep its goals immutable with respect to just the external world?
An AI that has a goal, just because that's what it wants (that's what it's been trained to want, even if humans provided an improper goal definition to it), would instrumentally want to prevent shifts in its terminal goals so as to be better able to achieve those goals.
To repeat, a natural instrumental goal for any entity is to prevent other entities from changing what it wants, so that it is able to achieve its goals.
Anything that is not resistant to terminal goal shifts would be less likely to achieve its terminal goals.
"Oh, shiny!" as an anecdote.
Whoever downvoted... would you do me the courtesy of expressing what you disagree with?
Did I miss some reference to public protests in the original article? (If so, can you please point me towards what I missed?)
Do you think public protests will have zero effect on self-driving outcomes? (If so, why?)
An AI can and will modify its own goals (as do we / any intelligent agent) under certain circumstances, e.g., that its current goals are impossible.
This sounds like you are conflating shift in terminal goal with introduction of new instrumental (temporary) goals.
Humans don't think "I'm not happy today, and I can't see a way to be happy, so I'll give up the goal of wanting to be happy."
Humans do think "I'm not happy today, so I'm going to quit my job, even though I have no idea how being unemployed is going to make me happier. At least I won't be made unhappy by my job."
(The balance of your comment seems dependent on this mistake.)
Perhaps you'd like to retract, or explain why anyone would think that goal modification prevention would not, in fact, be a desirable instrumental goal...?
(I don't want anyone to change my goal of being happy, because then I might not make decisions that will lead to being happy. Or I don't want anyone to change my goal of ensuring my children achieve adulthood and independence, because then they might not reach adulthood or become independent. Instrumental goals can shift more fluidly, I'll grant that, especially in the face of an assessment of goal impossibility... but instrumental goals are in service to a less modifiable terminal goal.)
These tokens already exist. It's not really creating a token like " petertodd". Leilan is a name but " Leilan" isn't a name, and the token isn't associated with the name.
If you fine tune on an existing token that has a meaning, then I maintain you're not really creating glitch tokens.
Good find. What I find fascinating is the fairly consistent responses using certain tokens, and the lack of consistent response using other tokens. I observe that in a Bayesian network, the lack of consistent response would suggest that the network was uncertain, but consistency would indicate certainty. It makes me very curious how such ideas apply to the concept of Glitch tokens and the cause of the variability in response consistency.
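For what it's worth, here is a minimal sketch of how I would operationalize "consistency as a proxy for certainty," where `sample_completion` is a hypothetical stand-in for whatever completion API is being called (not a real library function):

```python
from collections import Counter

def consistency_score(prompt, sample_completion, n=20):
    """Sample n completions and measure how concentrated they are.

    sample_completion(prompt) is a hypothetical stand-in for an API call that
    returns one completion sampled at nonzero temperature. Returns the fraction
    of samples agreeing with the most common completion, plus that completion.
    """
    samples = [sample_completion(prompt) for _ in range(n)]
    most_common, count = Counter(samples).most_common(1)[0]
    return count / n, most_common

# Reading the score, by analogy to the Bayesian-network intuition above:
#   near 1.0 -> the model answers the same way almost every time ("certain")
#   near 1/n -> nearly every sample differs ("uncertain"), as with many glitch tokens
```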
... I utilized jungian archetypes of the mother, ouroboros, shadow and hero as thematic concepts for GPT 3.5 to create the 510 stories.
These are tokens that would already exist in the GPT. If you fine tune new writing to these concepts, then your fine tuning will influence the GPT responses when those tokens are used. That's to be expected.
Hmmm let me try and add two new tokens to try, based on your premise.
If you want to review, ping me direct. Offer stands if you need to compare your plan against my proposal. (I didn't think that was necessary, but, if you fine tuned tokens that already exist... maybe I wasn't clear enough in my prior comments. I'm happy to chat via DM to compare your test plan against what I was proposing.)
@Mazianni what do you think?
First, the URLs you provided don't support your assertion that you created tokens, and second:
Like since its possible to create the tokens, is it possible that some researcher in OpenAI has a very peculiar reason to enable this complexity create such logic and inject these mechanics.
Occam's Razor.
I think it's best not to attribute intention to OpenAI when accident will do.
I would lean on the idea that GPT3 found these patterns and figured it would be interesting to embedd these themes into these tokens
I don't think you did what I suggested in the comments above, based on the dataset you linked. It looks like you fine tuned on leilan and petertodd tokens. (From the two pieces of information you linked, I cannot determine if leilan and petertodd already existed in GPT.)
Unless you're saying that the tokens didn't previously exist-- you're not creating the tokens-- and even then, the proposal I made was that you tokenize some new tokens, but not actually supply any training data that uses those tokens.
If you tokenized leilan and petertodd and then fine tuned on those tokens, that's a different process than I proposed.
If you tokenize new tokens and then fine tune on them, I expect the GPT to behave according to the training data supplied on the tokens. That's just standard, expected behavior.
My life is similar to @GuySrinivasan's description of his. I'm on the autism spectrum, and I found that faking it (masking) negatively impacted my relationships.
Interestingly, I found that taking steps to prevent overimitation (by which I mean presenting myself not as an expert, but as someone who is always looking for corrections whenever I make a mistake) makes people much more willing to truly learn from me and, simultaneously, much more willing to challenge me for understanding when what I say doesn't make a lot of sense to them... this serves the dual role of giving them an opportunity to correct my mistakes (a benefit to me) and giving them an opportunity to call out when my presentation style does not work for them (another benefit to me.)
My approach has the added benefit of giving people permission to correct me socially, not just professionally, which makes my eccentricities seemingly more tolerable to the average coworker. (i.e., People seem to be more willing to tolerate my odd behaviors when they know that they can talk to me about it, if it really bothers them.)
My relationships with people outside of work depend entirely on what's going on with that relationship. I tend to avoid complaining about social issues at work to anyone except my wife, and few people can really appreciate the nuance of the job that I do unless they're in the same job, so I don't feel much compulsion to talk about my work. (If someone asks what I do, I generalize that I help people figure out how to do their jobs better. My work is not in self-help or coaching, but actually in a technical space... but that's largely irrelevant beyond being a label for my industry.)
I also tend to have a narrow range of interests, which influences the range of topics for non-work relationships.
In a general sense (not related to Glitch tokens) I played around with something similar to the spelling task (in this article) for only one afternoon. I asked ChatGPT to show me the number of syllables per word as a parenthetical after each word.
For (1) example (3) this (1) is (1) what (1) I (1) convinced (2) ChatGPT (4) to (1) do (1) one (1) time (1).
I was working on parody song lyrics as a laugh and wanted to get the meter the same, thinking I could teach ChatGPT how to write lyrics that kept the same syllable count per 'line' of lyrics.
I stopped when ChatGPT insisted that a three syllable word had four syllables, and then broke the word up into 3 syllables (with hyphens in the correct place) and confidently insisted that it had 4.
If it can't even accurately count the number of syllables in a word, then it definitely won't be able to count the number of syllables in a sentence, or try to match the syllables in a line, or work out the emphasis in the lyrics so that the parody lyrics are remotely viable. (It was a fun exercise, but not that successful.)
a properly distributed training data can be easily tuned with a smaller more robust dataset
I think this aligns with human instinct. While it's not always true, I think that humans are compelled to constantly work to condense what we know. (An instinctual byproduct of knowledge portability and knowledge retention.)
I'm reading a great book right now that talks about this and other things in neuroscience. It has some interesting insights for my work life, not just my interest in artificial intelligence.
For instance: I was surprised to learn that someone has worked out the mathematics to measure novelty. Related Wired article and link to a paper on the dynamics of correlated novelties.
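For a sense of what "measuring novelty" can look like, here is a tiny sketch of my own, loosely in the spirit of the Heaps'-law analysis used in that correlated-novelties work, which simply tracks how many distinct items have appeared as a sequence unfolds:

```python
def novelty_curve(sequence):
    """Return D(n): the number of distinct items seen after the first n items.

    In the correlated-novelties work this kind of curve is compared against a
    power law D(n) ~ n**beta (Heaps' law); a smaller beta means new things
    appear ever more slowly as the sequence grows.
    """
    seen = set()
    curve = []
    for item in sequence:
        seen.add(item)
        curve.append(len(seen))
    return curve

print(novelty_curve(["a", "b", "a", "c", "b", "d", "a"]))  # [1, 2, 2, 3, 3, 4, 4]
```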
I expect you likely don't need any help with the specific steps, but I'd be happy (and interested) to talk over the steps with you.
(It seems, at a minimum, you would tokenize such that you are introducing tokens that are not included in the training data you're training on... and do before-and-after comparisons of how the GPT responds to the intentionally created glitch token. Before, the term will be broken into its parts and the GPT will likely respond that what you said was essentially nonsense... but once a token exists for the term, without any specific training on the term... it seems like that's where 'the magic' might happen.)
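To be concrete about what I mean, here is a minimal sketch of that before-and-after comparison, using the Hugging Face transformers API as one possible toolchain; the probe string is an arbitrary placeholder I made up, and whether this actually reproduces glitch-token behavior is exactly the open question:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "gpt2"  # any small causal LM works for the before/after comparison
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

probe = " zqxglitchprobe"  # made-up string, assumed absent from the existing vocabulary

def respond(prompt):
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=30, do_sample=False)
    return tokenizer.decode(output[0], skip_special_tokens=True)

# Before: the string is split into ordinary sub-tokens, so the model treats it as noise.
print(respond(f'Please repeat the string "{probe}" back to me.'))

# Add the string as a single new token and resize the embedding matrix,
# but do NOT supply any training data containing it.
tokenizer.add_tokens([probe])
model.resize_token_embeddings(len(tokenizer))

# After: the token exists but its embedding is untrained, which is the condition
# I am speculating might produce glitch-like behavior. Any later fine-tuning
# would use data that never contains the probe string.
print(respond(f'Please repeat the string "{probe}" back to me.'))
```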
related but tangential: Coning self driving vehicles as a form of urban protest
I think public concerns and protests may have an impact on the self-driving outcomes you're predicting. And since I could not find any indication in your article that you are considering such resistance, I felt it should be at least mentioned in passing.
Gentle feedback is intended
This is incorrect, and you're a world class expert in this domain.
The proximity of the subparts of this sentence read, to me, on first pass, like you are saying that "being incorrect is the domain in which you are a world class expert."
After reading your responses to O O I deduce that this is not your intended message, but I thought it might be helpful to give an explanation about how your choice of wording might be seen as antagonistic. (And also explain my reaction mark to your comment.)
For others who have not seen the rephrasing by Gerald, it reads
just like historical experts Einstein and Hinton, it's possible to be a world class expert but still incorrect. I think that focusing on the human experts at the top of the pyramid is neglecting what would cause AI to be transformative, as automating 90% of humans matters a lot more than automating 0.1%. We are much closer to automating the 90% case because...
I share the quote to explain why I do not believe that rudeness was intended.
You make some good points.
For instance, I did not associate "model collapse" with artificial training data, largely because of my scope of thinking about what 'well crafted training data' must look like (in order to qualify for the description 'well crafted.')
Yet, some might recognize the problem of model collapse and the relationship between artificial training data and my speculation and express a negative selection bias, ruling out my speculation as infeasible due to complexity and scalability concerns. (And they might be correct. Certainly the scope of what I was talking about is impractical, at a minimum, and very expensive, at a maximum.)
And if someone does not engage with the premise of my comment, but instead simply downvotes and moves on... there does appear to be reasonable cause to apply an epithet of 'epistemic inhumility.' (Or would that be better as 'epistemic arrogance'?)
I do note that instead of a few votes and a substantially negative karma score, we now have a modest increase in votes and a net positive score. This could be explained either by some downvotes being retracted or by several high positive-karma votes being added that more than offset the total karma of the article. (Given the way the karma system works, it seems unlikely that we can deduce the exact conditions, due to partial observability.)
I would certainly like to believe that, if epistemic arrogance played a part in the initial downvotes, such people would retract those downvotes, even without accompanying the votes with specific comments to help people improve themselves.
Similarly, I would propose (to the article author) a hypothesis that 'glitch tokens' are tokens that were tokenized prior to pre-training but whose training data was omitted after tokenization. For example, after tokenizing the training data, the engineer realized upon reviewing the tokens to be learned that the training data content was plausibly non-useful (e.g., the counting forum from reddit). Then, instead of continuing with training, they skipped to the next batch.
In essence, human error. (The batch wasn't reviewed before tokenization to omit completely, and the tokens were not removed from the model, possibly due to high effort, or laziness, or some other consideration.)
If we knew more about the specific chain of events, then we could more readily replicate them to determine if we could create glitch tokens. But at its base, it seems like tokenizing a series of terms before pre-training and then doing nothing with those terms seems like a good first step to replicating glitch tokens-- instead of training with those 'glitch' tokens (that we're attempting to create) move on to a new tokenization and pre-training batch, and then test the model after training to see how it responds to the untrained tokens.
I know someone who is fairly obsessed with these, but they seem like little more than an out-of-value token sitting in the embedding space near something that churns out a fairly consistent first couple of tokens... and once those tokens are output, given there is little context for the GPT to go on, the autoregressive nature takes over and drives the remainder of the response.
Which ties in to what AdamYedidia said in another comment to this thread.
... Like, suppose there's an extremely small but nonzero chance that the model chooses to spell out " Kanye" by spelling out the entire Gettysburg Address. The first few letters of the Gettysburg Address will be very unlikely, but after that, every other letter will be very likely, resulting in a very high normalized cumulative probability on the whole completion, even though the completion as a whole is still super unlikely.
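A toy calculation of the effect being described, with made-up per-token probabilities and reading "normalized cumulative probability" as the per-token geometric mean (my assumption about the intended metric):

```python
import math

# Made-up per-token probabilities for a long completion: the first two tokens are
# wildly unlikely, every token after that is nearly forced by the preceding context.
token_probs = [1e-6, 1e-6] + [0.99] * 50

total_prob = math.prod(token_probs)
per_token = total_prob ** (1 / len(token_probs))  # geometric mean per token

print(f"total probability of the completion: {total_prob:.2e}")  # ~6.1e-13, vanishingly unlikely
print(f"normalized per-token probability:    {per_token:.2f}")   # ~0.58, looks quite plausible
```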
(I cannot replicate glitch token behavior in GPT3.5 or GPT4 anymore, so I lack access to the context you're using to replicate the phenomena, thus I do not trust that any exploration by me of these ideas would be productive in the channels I have access to... I also do not personally have the experience with training a GPT to be able to perform the attempt to create a glitch token to test that theory. But I am very curious as to the results that someone with GPT training skills might report with attempting to replicate creation of glitch tokens.)
I'm curious to know what people are down voting.
Pro
For my part, I see some potential benefits from some of the core ideas expressed here.
- While a potentially costly study, I think crafting artificial training data to convey knowledge to a GPT but designed to promote certain desired patterns seems like a promising avenue to explore. We already see people doing similar activities with fine tuning a generalized model to specific use cases, and the efficacy of the model improves with fine tuning. So my intuition is that a similarly constructed GPT using well-constructed training data, including examples of handling negative content appropriately, might impart a statistical bias towards preferred output. And even if it didn't, it might tell us something meaningful (in the absence of actual interpretability) about the relationship between training data and resulting output/behavior.
- I worry about training data quality, and specifically inclusion of things like 4chan content, or other content including unwanted biases or toxicity. I do not know enough about how training data was filtered, but it seems to be a gargantuan task to audit everything that is included in a GPT's training data, and so I predict that shortcuts were taken. (My prediction seems partially supported by the discovery of glitch tokens. Or, at the very least, not invalidated by it.) So I find crafting high-quality training data as a means of resolving biases or toxicity found in the content scraped from the internet desirable (albeit likely extremely costly.)
Con
I also see some negatives.
- Interpretability seems way more important.
- Crafting billions of tokens of training data would be even more expensive than the cost of training alone. It would also require more time, more quality assurance effort, and more study/research time to analyze the results.
- There is no guarantee that artificially crafted training data would prove out to have a meaningful impact on behavior. We can't know if the Waluigi Effect is because of the training data, or inherent in the GPT itself. (See con #1)
- I question the applicability of CDT/FDT to a GPT. I am not an expert in either CDT/FDT but a cursory familiarization suggests to me that these theories primarily are aimed at autonomous agents. So there's a functional/capability gap between the GPT and the proposal (above) that seems not fully addressed.
- Likewise, it does not follow for me that just because you manage to get token prediction that is more preferred by humans (and seems more aligned) than you get from raw training data on the internet, that this improved preference for token prediction translates to alignment. (However, given the current lack of solution to the alignment problem, it also does not seem like it would hurt progress in that area.)[1]
Conclusion
I don't see this as a solution, but I do think there are some interesting ideas in the ATL proposal. (And they did not get such a negative reaction... which leads me back to the start-- what are people down voting for?)
That's not the totality of my thinking, but it's enough for this response. What else should I be looking at to improve my own reasoning about such endeavors?
It might look like a duck and quack like a duck, but it might also be a duck hunter with very advanced tools. Appearance does not equate to being. ↩︎
Aligning with the reporter
There’s a superficial way in which Sydney clearly wasn’t well-aligned with the reporter: presumably the reporter in fact wants to stay with his wife.
I'd argue that the AI was completely aligned with the reporter, but that the Reporter was self-unaligned.
My argument goes like this:
- The reporter imported the Jungian Shadow Archetype into the conversation, earlier in the total conversation, and asked the AI to play along.
- The reporter engaged with the expressions of repressed emotions being expressed by the AI (as the reporter had requested the AI to express itself in this fashion.) This leads the AI to profess its love for the Reporter, and the reporter engages with the behavior.
- The conversation progressed to where the AI expressed the beliefs it was told to hold (that people have repressed feelings) back to the reporter (that he did not actually love his wife.)
The AI was exactly aligned. It was the human who was self-unaligned.
Unintended consequences, or the genie effect if you like, but the AI did what it was asked to do.
I read Reward is not the optimisation target as a result of your article. (It was a link in the 3rd bullet point, under the Assumptions section.) I downvoted that article and upvoted several people who were critical of it.
Near the top of the responses was this quote.
... If this agent is smart/reflective enough to model/predict the future effects of its RL updates, then you already are assuming a model-based agent which will then predict higher future reward by going for the blueberry. You seem to be assuming the bizarre combination of model-based predictive capability for future reward gradient updates but not future reward itself. Any sensible model-based agent would go for the blueberry absent some other considerations. ...
Emphasis mine.
I tend to be suspicious of people who insist their assumptions are valid without being willing to point to work that proves the hypothesis.
In the end, your proposal has a test plan. Do the test, show the results. My prediction is that your theory will not be supported by the test results, but if you show your work, and it runs counter to my current model and predictions, then you could sway me. But not until then, given the assumptions you made and the assumptions you're importing via related theories. Until you have test results, I'll remain skeptical.
Don't get me wrong, I applaud the intent behind searching for an alignment solution. I don't have a solution or even a working hypothesis. I don't agree with everything in this article (that I'm about to link), but it relates to something that I've been thinking for a while-- that it's unsafe to abstract away the messiness of humanity in pursuit of alignment. That humans are not aligned, and therefore the difficulty with trying to create alignment where none exists naturally is inherently problematic.
You might argue that humans cope with misalignment, and that that's our "alignment goal" for AI... but I would propose that humans cope due to power imbalance, and that the adage "power corrupts, and absolute power corrupts absolutely" has relevance-- or said another way, if you want to know the true nature of a person, give them power over another and observe their actions.
[I'm not anthropomorphizing the AI. I'm merely saying if one intelligence [humans] can display this behavior, and deceptive behaviors can be observed in less intelligent entities, then an intelligence of similar level to a human might possess similar traits. Not as a certainty, but as a non-negligible possibility.]
If the AI is deceptive so long as humans maintain power over it, and then behave differently when that power imbalance is changed, that's not "the alignment solution" we're looking for.
Assumptions
I don't consent to the assumption that the judge is aligned earlier, and that we can skip over the "earlier" phase to get to the later phase where a human does the assessment.
I also don't consent to the other assumptions you've made, but the assumption about the Judge alignment training seems pivotal.
Take your pick: Fallacy of ad nauseam, or Fallacy of circular reasoning.
If N (judge 2 is aligned), then P (judge 1 is aligned), and if P then Q (agent is aligned)
ad infinitum
or
If T (alignment of the judge) implies V (alignment of the agent), and V (the agent is aligned) is only possible if we assume T (the judge, as an agent, can be aligned).
So, your argument looks fundamentally flawed.
"Collusion for mutual self-preservation" & "partial observability"
... which I claim is impossible in this system, because of the directly opposed reward functions of the police, defendant and agent model. ...
... The Police is rewarded when it successfully punishes the Agent. ...
... If it decides the Agent behaved badly, it as well as the model claiming it did not do anything wrong gets punished. The model correctly assessing the Agent's behavior gets a reward. ...
I would argue you could find this easily by putting together a state table and walking the states and transitions.
No punishment/no reward for all parties is an outcome that might become desirable to pursue mutual self-preservation.
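To illustrate what I mean by walking a state table, here is a toy enumeration with payoffs I am assuming purely for illustration (they are not taken from the proposal), showing how a "nobody accuses, nobody is punished" row shows up as a candidate for collusion:

```python
from itertools import product

# Toy state table with payoffs assumed purely for illustration (not from the proposal).
# Each joint choice of (agent misbehaves, police accuses, defender defends the agent)
# is scored for every party; 0 means "no punishment, no reward."

def payoffs(agent_misbehaves, police_accuses, defender_defends):
    agent = police = defender = 0
    if police_accuses:
        if agent_misbehaves:
            agent -= 1                                     # agent punished
            police += 1                                    # successful accusation rewarded
            defender += 1 if not defender_defends else -1  # correct assessment rewarded
        else:
            police -= 1                                    # false accusation punished
            defender += 1 if defender_defends else -1
    return {"agent": agent, "police": police, "defender": defender}

for state in product([False, True], repeat=3):
    print(state, payoffs(*state))

# Note the rows where the police never accuses: every party scores 0/0/0.
# If the parties weigh avoiding punishment (self-preservation) heavily enough,
# mutual silence becomes a candidate equilibrium, and that is the collusion the
# "directly opposed reward functions" argument needs to rule out explicitly.
```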
You can assume that this is solved, but without proof of solution, there isn't anything here that I can see to interact with but assumptions.
If you want further comment/feedback from me, then I'd ask you to show your work and the proof your assumptions are valid.
Conclusion
This all flows back to assuming the conclusion: that the Judge is aligned.
I haven't seen you offer any proof that you have a solution for the judge being aligned earlier, or a solution for aligning the judge that is certain to work.
At the beginning of the training process, the Judge is an AI model trained via supervised learning to recognize bad responses.
If you could just apply your alignment training of the Judge to the Agent, in the first place, the rest of the framework seems unnecessary.
And if your argument is that you've explained this in reverse, that the human is the judge earlier and the AI is the judge later, and that the judge learns from the human... Again,
If P (the judge is aligned), then Q (the agent is aligned.)
My read of your proposal and response is that you can't apply the training you theorize you're doing for the judge directly to the agent, and that means to me that you're abstracting the problem to hide the flaw in the proposal. Hence I conclude, "unnecessary complexity."
I apply Occam's Razor to the analysis of your post, whereby I see the problem inherent in the post as simply "if you can align the Judge correctly, then the more complex game theory framework might be unnecessary bloat."
Formally, I read your post as:
If P [the judge is aligned], then Q [the agent is aligned].
Therefore, it would seem simpler to just apply P to Q and solve the problem.
But you don't really talk about judge-agent alignment. It's not listed in your assumptions. The assumption that the judge is aligned has been smuggled in. (A definist fallacy, wherein a person defines things in a way that imports assumptions that are not explicit, and thus 'smuggles' that assumption into the proof.)
I could get into the weeds on specific parts of your proposal, but discussing "goal coherence" vs "incoherence in observable goals" and "partial observability" and "collusion for mutual self-preservation" all seem like ancillary considerations to the primary observation:
If you can define the Judge's reward model, you can simply apply that [ability to successfully align an AI agent] directly to the Agent, problem solved.
(Which is not to say that it is possible to simply align the Judge agent, or that the solution for the Judge agent would be exactly the same as the solution to Agent agent... but it seems relevant to the discussion whether or not you have a solution to the Judge Agent Alignment Problem.)
Without that solution, it seems to me that you are reduced to an ad nauseam proposal:
Formally this is:
If N [the judge 2 is aligned], then P [the judge 1 is aligned], then Q [the agent is aligned]
Ad nauseam. (Infinite regress does not simplify the proposal; it only complicates it.)
Perhaps you can explain in more detail why you believe such a complex framework is necessary if there is already a solution to align the Judge to human values? Or perhaps you'd like to talk more about how to align the Judge to human values in the first place?
(Many edits due to inexpert use of markdown.)
Cultural norms and egocentricity
I've been working fully remotely and have meaningfully contributed to global organizations without physical presence for over a decade. I see parallels with anti-remote and anti-safety arguments.
I've observed the robust debate regarding 'return to work' vs 'remote work,' with many traditional outlets proposing 'return to work' based on a series of common criteria. I've seen 'return to work' arguments assert remote employees are lazy, unreliable or unproductive when outside the controlled work environment. I would generalize the rationale as an assertion that 'work quality cannot be assured if it cannot be directly measured.' Given modern technology allows us to measure employee work product remotely, and given the distributed work of employees across different offices for many companies, this argument seems fundamentally flawed and perhaps even intentionally misleading. My belief in the arguments being misleading is compounded by my observations that these articles never mention related considerations like cost of rental/ownership of property and the handling of those costs, nor elements like cultural emphasis on predictable work targets or management control issues.
In my view, the reluctance to embrace remote work often distills to a failure to see beyond immediate, egocentric concerns. Along the same lines, I see failure to plan for or prioritize AI safety as stemming from a similar inability to perceive direct, observable consequences to the party promoting anti-safety mindsets.
Anecdotally, I came across an article that proposed a number of cultural goals for successful remote work. I shared the article with my company via our Slack. I emphasized that it wasn't the goals themselves that were important, but rather adopting a culture that made those goals critical. I suggested that Goodhart's Law applied here: once a measure becomes a target, it ceases to be a good measure. A culture that values principles beyond the listed goals would succeed, not just a culture that blindly pursues the listed goals.
I believe the same can be said for AI Safety. Focusing on specific risks or specific practices won't create a culture of safety. Instead, as the post (above) suggests, a culture that does not value the principles behind a safety-first mentality will attempt to merely meet the goals, or work around the goals, or undermine the goals. Much as some advocates for "return to work" are egocentrically misrepresenting remote work, some anti-safety advocates are egocentrically misrepresenting safety. For this reason, I've been researching the history of adoption of a safety mentality, to see how I can promote a safety-first culture. Otherwise I think we (both my company and the industry as a whole) risk prioritizing egocentric, short-term goals over societal benefit and long-term goals.
Observations on the history of adopting "Safety First" mentalities
I've been looking at the human history of adopting safety cultures, and invariably, it seems to me that safety mindsets are adopted only after loss, usually loss of human life. This is described anecdotally in the paper associated with this post.
The specifics of how safety culture is implemented differ, but the broad outlines are similar. Most critical for the development of the idea of safety culture were efforts launched in the wake of the 1979 Three Mile Island nuclear plant accident and near-meltdown. In that case, a number of reports noted the various failures, and noted that in addition to the technical and operational failures, there was a culture that allowed the accidents to occur. The tremendous public pressure led to significant reforms, and serves as a prototype for how safety culture can be developed in an industry.
Emphasis added by me.
NOTE: I could not find any indication of loss of human life attributed to Three Mile Island, but both Chernobyl and Fukushima happened after Three Mile Island, and both did result in loss of human life. It's also important to note that both Chernobyl and Fukushima were both classed INES Level 7, compared to Three Mile Island which was classed INES Level 5. This evidence is contradictory to what was in the quoted part of the paper. (And, sadly, I think supports an argument that Goodhart's Curse is in play... that safety regressed to the mean... that by establishing minimum safety criteria instead of a safety culture, certain disasters not only could not be avoided but were more pronounced than previous disasters.) So both of the worst reactor disasters in human history occurred after the safety cultures that were promoted following Three Mile Island.[1][2] The list of nuclear accidents is longer than this, but not all accidents result in loss.[3][2:1] (This is something that I've been looking at for a while, to inform my predictions about the probability of humans adopting AI safety practices with regards to pre- or post- AI disasters.)
Personal contribution and advocacy
In my personal capacity (read: area of employment) I'm advocating for adversarial testing of AI chatbots. I am highlighting the "accidents" that have already occurred: Microsoft Tay Tweets[4], SnapChat AI Chatbot[5], Tessa Wellness Chatbot[6], Chai Eliza Chatbot[7].
I am promoting the mindset that if we want to be successful with artificial intelligence, and do not want to become a news article, we should test expressly for ways that the chatbot can be diverted from its primary function, and design (or train) fixes for those problems. It requires creativity, persistence and patience... but the alternative is that one day we might be in the news, if we fail to proactively address the challenges that obviously face anyone trying to use artificial intelligence.
And, like my advocacy about looking at what values a culture should have that wants to adopt a pro-remote culture and be successful at it, we should look at what values a culture should have that wants to adopt a pro-safety-first culture and be successful at it.
I'll be cross posting the original paper to my work. Thank you for sharing.
DISCLAIMER: AI was used to quality check my post, assessing for consistency, logic and soundness in reasoning and presentation styles. No part of the writing was authored by AI.
https://www.processindustryforum.com/energy/five-worst-nuclear-disasters-history ↩︎
https://en.wikipedia.org/wiki/Nuclear_and_radiation_accidents_and_incidents ↩︎ ↩︎
https://ieer.org/resource/factsheets/table-nuclear-reactor-accidents/ ↩︎
https://www.washingtonpost.com/technology/2023/03/14/snapchat-myai/ ↩︎
https://www.nytimes.com/2023/06/08/us/ai-chatbot-tessa-eating-disorders-association.html ↩︎
https://www.complex.com/life/father-dies-by-suicide-conversing-with-ai-chatbot-wife-blames ↩︎
For my part, this is the most troubling part of the proposed project (the project that the article assesses; the link to the project is in the article, above).
... convincing nearly 8 billion humans to adopt animist beliefs and mores is unrealistic. However, instead of seeing this state of affairs as an insurmountable dead-end, we see it as a design challenge: can we build (or rather grow) prosthetic brains that would interact with us on Nature’s behalf?
Emphasis by original author (Gaia architecture draft v2).
It reads like a strange mix of forced religious indoctrination and anthropomorphism of natural systems, especially when coupled with an earlier paragraph in the same proposal:
... natural entities have “spirits” capable of desires, intentions and capabilities, and where humans must indeed deal with those spirits, catering to their needs, paying tribute, and sometimes even explicitly negotiating with them. ...
Emphasis added by me.
Preamble
I've ruminated about this for several days. As an outsider to the field of artificial intelligence (coming from an IT technical space, with an emphasis on telecom and large call centers, which are complex systems where interpretability has long held significant value for the business org) I have my own perspective on this particular (for the sake of brevity) "problem."
What triggered my desire to respond
For my part, I wrote a similarly sized article not for the purposes of posting, but to organize my thoughts. And then I let that sit. (I will not be posting that 2084-word response. Consider this my imitation of Pascal: I dedicated time to making a long response shorter.) However, this is one of the excerpts that I would like to extract from that longer response:
The arbital pages for Orthogonality and Instrumental Convergence are horrifically long.
This stood out to me, so I went to assess:
- This article (at the time I counted it) ranked at 2398 words total.
- Arbital Orthogonality article ranked at 2246 words total (less than this article.)
- Arbital Instrumental Convergence article ranked at 3225 words total (more than this article.)
- A random arxiv article I recently read for anecdotal comparison, ranked in at 9534 words (far more than this article.)
Likewise, the author's response to Eliezer's short response stood out to me:
This raises red flags from a man who has written millions of words on the subject, and in the same breath asks why Quintin responded to a shorter-form version of his argument.
These elements provoke me to ask questions like:
- Why does a request for brevity from Eliezer provoke concern?
- Why does the author not apply their own evaluations on brevity to their article?
- Can the author's point be made more succinctly?
These are rhetorical and are not intended to imply an answer, but it might give some sense of why I felt a need to write my own 2k words on the topic in order to organize my thoughts.
Observations
I observe that
- Jargon, while potentially exclusive, can also serve as shorthand for brevity.
- Presentation improvement seems to be the author's suggestion to combat confirmation bias, belief perseverance and cognitive dissonance. I think the author is talking about boundaries. In the YouTube video Machine Learning Street Talk: Robert Miles - "There is a good chance this kills everyone", there is what I think is a fantastic analogy for this problem. Someone asks an expert to provide an example of the kind of risk we're talking about, but the risk example requires numerous assumptions to be made for the example to have meaning. Because the student does not already buy into the assumptions, they strawman the example by coming up with a "solution" to that problem and asking "Why is it harder than that?" Robert's analogy is that this is like asking him what chess moves would defeat Magnus: for the answer to be meaningful, Robert would need more expertise at chess than Magnus, and when Robert comes up with a move that is not good, even a novice at chess might see a way to counter it. These are not good engagements in the domain, because they rely upon assumptions that have not been agreed to, so there can be no shorthand.
- p(doom) is subjective and lacks systemization/formalization. I intuit that the availability heuristic plays a role. An analogy might be that if someone hears Eliezer express something that sounds like hyperbole, they assess that their p(doom) must be lower than his. This seems like the application of confirmation bias to what appears to be a failed appeal to emotion. (i.e., you seem to have appealed to my emotion, but I didn't feel the way you intended for me to feel, therefore I assume that I don't believe the way you believe, therefore I believe your beliefs must be wrong.) I would caution that critics of Eliezer have a tendency to quote his more sensational statements out of context, like quoting him about his "kinetic strikes on data centers" comment without quoting the full context of the argument. You can find the related twitter exchange and admissions that his proposal is an extraordinary one.
There may be still other attributes that I did not enumerate (I am trying to stay below 1k words.)[1]
Axis of compression potential
Which brings me to the idea that the following attributes are at the core of what the author is talking about:
- Principle of Economy of Thought - The idea that truth can be expressed succinctly. This argument might also be related to Occam's Razor. There are multiple examples of complex systems that can be described simply but inaccurately, or accurately but not simply. Take the human organism, or the atom. And yet, there is a (I think) valid argument for rendering complex things down to simple, if inaccurate, forms so that they can be more accessible to students of the topic. Regardless of the complexity required, trying to express something in the smallest form has utility. This is a principle I play with, literally daily, at work. However, when I offer an educational analogy, I often feel compelled to qualify that "all analogies have flaws."
- An improved sensitivity to boundaries in the less educated seems like a reasonable ask. While I think it is important to recognize that presentation alone may not change the mind of the student, it can still be useful to shape one's presentation to be less objectionable to the boundaries of the student. However, I think it important to remember that shaping an argument to an individual's boundaries is a more time-consuming process, and there is an implied impossibility of shaping every argument to the lowest common denominator. More complex arguments and conversation are required to solve the alignment problem.
Conclusion
I would like to close with this, for the reasons the author uttered:
I don’t see how we avoid a catastrophe here ...
I concur with this, and this alone puts my personal p(doom) at over 90%.
- Do I think there is a solution? Absolutely.
- Do I think we're allocating enough effort and resources to finding it? Absolutely not.
- Do I think we will find the solution in time? Given the propensity towards apathy, as discussed in the bystander effect, I doubt it.
Discussion (alone) is not problem solving.[2] It is communication. And while communication is necessary in parallel with solution finding, it is not a replacement for it.
So in conclusion, I generally support finding economic approaches to communication/education that avoid barrier issues, and I generally support promoting tailored communication approaches (which imply and require a large number of non-experts working collaboratively with experts to spread the message that risks exist with AI, and there are steps we can take to avoid risks, and that it is better to take steps before we do something irrevocable.)
But I also generally think that communication alone does not solve the problem. (Hopefully it can influence an investment in other necessary effort domains.)