Comments
I think the important factors w.r.t. risks re [morally relevant disvalue that occurs during inference in ML models] are probably more like:
- The training algorithm. Unsupervised learning seems less risky than model-free RL (e.g. the RLHF approach currently used by OpenAI maybe?); the latter seems much more similar, in a relevant sense, to the natural evolution process that created us.
- The architecture of the model.
Being polite to GPT-n is probably not directly helpful (though it can be helpful by causing humans to care more about this topic). A user can be super polite to a text generating model, and the model (yielded by model-free RL) can still experience disvalue, particularly during an 'impossible inference', one in which the input text (the "environment") is bad in the sense that there is obviously no way to complete the text in a "good" way.
See also: this paper by Brian Tomasik.
My question was about whether ARC gets to evaluate [the most advanced model that the AI company created so far] before the company creates a slightly more advanced model (by scaling up the architecture, or by continuing the training process of the evaluated model).
Did OpenAI/Anthropic allow you to evaluate smaller scale versions* of GPT4/Claude before training the full-scale model?
* [EDIT: and full-scale models in earlier stages of the training process]
Will this actually make things worse? No, you're overthinking this.
This does not seem like a reasonable attitude (both in general, and in this case specifically).
Having thought a bunch about acausal trade — and proven some theorems relevant to its feasibility — I believe there do not exist powerful information hazards about it that stand up to clear and circumspect reasoning about the topic.
Have you discussed this point with other relevant researchers before deciding to publish this post? Is there a wide agreement among relevant researchers that a public, unrestricted discussion about this topic is net-positive? Have you considered the unilateralist's curse and biases that you may have (in terms of you gaining status/prestige from publishing this)?
Re impact markets: there's a problem regarding potentially incentivizing people to do risky, net-negative things (that can end up being beneficial). I co-authored this post about the topic.
(Though even in that case it's not necessarily a generalization problem. Suppose every single "test" input happens to be identical to one that appeared in "training", and the feedback is always good.)
Generalization-based. This categorization is based on the common distinction in machine learning between failures on the training distribution, and out of distribution failures. Specifically, we use the following process to categorize misalignment failures:
- Was the feedback provided on the actual training data bad? If so, this is an instance of outer misalignment.
- Did the learned program generalize poorly, leading to bad behavior, even though the feedback on the training data is good? If so, this is an instance of inner misalignment.
This categorization is non-exhaustive. Suppose we create a superintelligence via a training process with good feedback signal and no distribution shift. Should we expect that no existential catastrophe will occur during this training process?
Relevant & important: The unilateralist's curse.
I'm interested in hearing what you think the counterfactuals to impact shares/retroactive funding in general are, and why they are better.
The alternative to launching an impact market is to not launch an impact market. Consider the set of interventions that get funded if and only if an impact market is launched. Those are interventions that no classical EA funder decides to fund in a world without impact markets, so they seem unusually likely to be net-negative. Should we move EA funding towards those interventions, just because there's a chance that they'll end up being extremely beneficial? (Which is the expected result of launching a naive impact market.)
I expect prosocial projects to still be launched primarily for prosocial reasons, and funding to be a way of enabling them to happen and publicly allocating credit. People who are only optimizing for money and don't care about externalities have better ways available to pursue their goals, and I don't expect that to change.
It seems that according to your model, it's useful to classify (some) humans as either:
(1) humans who are only optimizing for money, power and status; and don't care about externalities.
(2) humans who are working on prosocial projects primarily for prosocial reasons.
If your model is true, how come the genes that cause humans to be type (1) did not completely displace the genes that cause humans to be type (2) throughout human evolution?
According to my model (without claiming originality): Humans generally tend to have prosocial motivations, and people who work on projects that appear prosocial tend to believe they are doing it for prosocial reasons. But usually, their decisions are aligned with maximizing money/power/status (while believing that their decisions are purely due to prosocial motives).
Also, according to my model, it is often very hard to judge whether a given intervention for mitigating x-risks is net-positive or net-negative (due to an abundance of crucial considerations). So subconscious optimizations for money/power/status can easily end up being extremely harmful.
If you describe the problem as "this encourages swinging for the fences and ignoring negative impact", impact shares suffer from it much less than many parts of effective altruism. Probably below average. Impact shares at least have some quantification and feedback loop, which is more than I can say for the constant discussion of long tails, hits based giving, and scalability.
But a feedback signal can be net-negative if it creates bad incentives (e.g. an incentive to treat an extremely harmful potential outcome of a project as if it were neutral).
(To be clear, my comment was not about the funding of your specific project but rather about the general funding approach that is referred to in the title of the OP.)
How do you avoid the problem of incentivizing risky, net-negative projects (that have a chance of ending up being beneficial)?
You wrote:
Ultimately we decided that impact shares are no worse than the current startup equity model, and that works pretty well. “No worse than startup equity” was a theme in much of our decision-making around this system.
If the idea is to use EA funding and fund things related to anthropogenic x-risks, then we probably shouldn't use a mechanism that yields similar incentives as "the current startup equity model".
The smooth graphs seem like good evidence that there are much smoother underlying changes in the model, and that the abruptness of the change is about behavior or evaluation rather than what gradient descent is learning.
If we're trying to predict abrupt changes in the accuracy of output token sequences, the per-token log-likelihood can be a useful signal. What's the analogous signal when we're talking about abrupt changes in a model's ability to deceptively conceal capabilities, hack GPU firmware, etc.? What log-likelihood plots can we use to predict those types of abrupt changes in behavior?
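As a minimal sketch of the first kind of signal mentioned above (not anything specific to the deception/firmware cases, for which I don't know the analogue): the mean per-token log-likelihood a causal LM assigns to a target completion can be tracked across checkpoints or model scales, and it often changes smoothly even when exact-match accuracy jumps abruptly. This uses the Hugging Face transformers API with GPT-2 as a stand-in model; the prompt/target strings are purely illustrative.

```python
# Hedged sketch: mean per-token log-likelihood of a target completion under a
# causal LM. Tracking this across checkpoints/scales is the "smooth underlying
# signal" referred to above; GPT-2 and the example strings are stand-ins.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model.eval()

def mean_token_logprob(prompt: str, target: str) -> float:
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    target_ids = tokenizer(target, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, target_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # logits at position i predict the token at position i+1.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    target_positions = range(prompt_ids.shape[1] - 1, input_ids.shape[1] - 1)
    token_logprobs = [
        log_probs[0, pos, input_ids[0, pos + 1]].item() for pos in target_positions
    ]
    return sum(token_logprobs) / len(token_logprobs)

print(mean_token_logprob("2 + 2 =", " 4"))
```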
Does everyone who works at OpenAI sign a non-disparagement agreement? (Including those who work on governance/policy?)
Yes. To be clear, the point here is that OpenAI's behavior in that situation seems similar to how, seemingly, for-profit companies sometimes try to capture regulators by paying their family members. (See 30 seconds from this John Oliver monologue as evidence that such tactics are not rare in the for-profit world.)
Another bit of evidence about OpenAI that I think is worth mentioning in this context: OPP recommended a grant of $30M to OpenAI in a deal that involved OPP's then-CEO becoming a board member of OpenAI. OPP hoped that this would allow them to make OpenAI improve its approach to safety and governance. Later, OpenAI appointed both the CEO's fiancée and the fiancée's sibling to VP positions.
Sorry, that text does appear in the linked page (in an image).
The Partnership may never make a profit
I couldn't find this quote in the page that you were supposedly quoting from. The only google result for it is this post. Am I missing something?
That being said, I think that, most of the time, alignment work ending up in training data is good, since it can help our AI systems be differentially better at AI alignment research (e.g. relative to how good they are at AI capabilities research), which is something that I think is pretty important.
That consideration seems relevant only for language models that will be doing/supporting alignment work.
Maybe the question here is whether including certain texts in relevant training datasets can cause [language models that pose an x-risk] to be created X months sooner than otherwise.
The relevant texts I'm thinking about here are:
- Descriptions of certain tricks to evade our safety measures.
- Texts that might cause the ML model to (better) model AIS researchers or potential AIS interventions, or other potential AI systems that the model might cooperate with (or that might "hijack" the model's logic).
Is that because you think it would be hard to get the relevant researchers to exclude any given class of texts from their training datasets [EDIT: or prevent web crawlers from downloading the texts etc.]? Or even if that part was easy, you would still feel that that lever is very small?
First point: by "really want to do good" (the really is important here) I mean someone who would be fundamentally altruistic and would not have any status/power desire, even subconsciously.
Then I'd argue the dichotomy is vacuously true, i.e. it does not generally pertain to humans. Humans are the result of human evolution. It's likely that having a brain that (unconsciously) optimizes for status/power has been very adaptive.
Regarding the rest of your comment, this thread seems relevant.
I'd add to that bullet list:
- Severe conflicts of interest are involved.
I strong downvoted your comment in both dimensions because I found it disagreeable and counterproductive.
Generally, I think it would be net-negative to discourage such open discussions about unilateral, high-risk interventions—within the EA/AIS communities—that involve conflicts of interest. This applies especially to unilateral interventions to create/fund for-profit AGI companies, or to develop/disseminate AI capabilities.
Like, who knew that the thing would become a Discord server with thousands of people talking about ML? That they would somewhat succeed? And then, when the thing is pretty much already somewhat on the rails, what choice do you even have? Delete the server? Tell the people who have been working hard for months to open-source GPT-3 like models that "we should not publish it after all"?
I think this eloquent quote can serve to depict an important, general class of dynamics that can contribute to anthropogenic x-risks.
I don't think discussing whether someone really wants to do good or whether there is some (possibly unconscious?) status-optimization process is going to help us align AI.
Two comments:
- [wanting to do good] vs. [one's behavior being affected by an unconscious optimization for status/power] is a false dichotomy.
- Don't you think that unilateral interventions within the EA/AIS communities to create/fund for-profit AGI companies, or to develop/disseminate AI capabilities, could have a negative impact on humanity's ability to avoid existential catastrophes from AI?
This concern seems relevant if (1) a discount factor is used in an RL setup (otherwise the system seems as likely to be deceptively aligned with or without the intervention, in order to eventually take over the world), and (2) a decision about whether the system is safe for deployment is made based on its behavior during training.
As an aside, the following quote from the paper seems relevant here:
Ensuring copies of the states of early potential precursor AIs are preserved to later receive benefits would permit some separation of immediate safety needs and fair compensation.
I think this comment is lumping together the following assumptions under the "continuity" label, as if there is a reason to believe that either they are all correct or all incorrect (and I don't see why):
- There is large distance in model space between models that behave very differently.
- Takeoff will be slow.
- It is feasible to create models that are weak enough to not pose an existential risk yet able to sufficiently help with alignment.
I bet more on scenarios where we get AGI when politics is very different compared to today.
I agree that just before "super transformative" ~AGI systems are first created, the world may look very differently than it does today. This is one of the reasons I think Eliezer has too much credence on doom.
Even with adequate closure and excellent opsec, there can still be risks related to researchers on the team quitting and then joining a competing effort or starting their own AGI company (and leveraging what they've learned).
Do you generally think that people in the AI safety community should write publicly about what they think is "the missing AGI ingredient"?
It's remarkable that this post was well received on the AI Alignment Forum (18 karma points before my strong downvote).
Regarding the table in the OP, there seem to be strong selection effects involved. For example, the "recruitment setting" for the "Goërtz 2020" study is described as:
Recruitment from Facebook groups for COVID-19 patients with persistent symptoms and registries on the website of the Lung Foundation on COVID-19 information
Hey there!
And then finally there are actually some formal results where we try to formalize a notion of power-seeking in terms of the number of options that a given state allows a system. This is work [...] which I'd encourage folks to check out. And basically you can show that for a large class of objectives defined relative to an environment, there's a strong reason for a system optimizing those objectives to get to the states that give them many more options.
Do you understand the main theorems in that paper and for what environments they are applicable? (My impression is that very few people do, even though the work has been highly praised within the AI alignment community.)
[EDIT: for more context see this comment.]
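To make concrete the informal claim quoted above (this is only a toy illustration, not the paper's formalism): in a tiny deterministic tree MDP, the branch that leaves more terminal "options" open is preferred by most randomly sampled reward functions over terminal states. The state names and the uniform reward distribution are assumptions of the sketch.

```python
# Toy illustration: with i.i.d. uniform rewards over terminal states, the
# branch leading to 3 terminal options is preferred over the branch leading
# to 1 option for ~75% of sampled reward functions.
import random

left_terminals = ["L1"]                # going left leaves 1 option open
right_terminals = ["R1", "R2", "R3"]   # going right leaves 3 options open
terminals = left_terminals + right_terminals

prefers_right = 0
trials = 10_000
for _ in range(trials):
    reward = {s: random.random() for s in terminals}
    best_left = max(reward[s] for s in left_terminals)
    best_right = max(reward[s] for s in right_terminals)
    prefers_right += best_right > best_left

print(f"Fraction preferring the 3-option branch: {prefers_right / trials:.2f}")
```

Whether this intuition carries over to the environments the theorems actually cover is exactly the kind of question I was asking about.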
Rather than letting super-intelligent AI take control of human's destiny, by merging with the machines humans can directly shape their own fate.
[...]
Since humans connected to machines are still “human”, anything they do definitionally satisfies human values.
We are already connected to machines (via keyboards and monitors). The question is how a higher bandwidth interface will help in mitigating risks from huge, opaque neural networks.
Suppose that each subnetwork does general reasoning and thus up until some point during training the subnetworks are useful for minimizing loss.
[EDIT: sorry, I need to think through this some more.]
I wouldn't use the myopic vs. long-term framing here. Suppose a model is trained to play chess via RL, and there are no inner alignment problems. The trained model corresponds to a non-myopic agent (a chess game can last for many time steps). But the environment that the agent "cares" about is an abstract environment that corresponds to a simple chess game. (It's an environment with a finite number of states.) The agent doesn't care about our world. Even if some potential activation values in the network correspond to hacking the computer that runs the model and preventing the computer from being turned off etc., the agent is not interested in doing that. The computer that runs the agent is not part of the agent's environment.
If the model that is used as a Microscope AI does not use any optimization (search), how will it compute the probability that, say, Apple's engineers will overcome a certain technical challenge?
Agents that don't care about influencing our world don't care about influencing the future weights of the network.
(Haven't read the OP thoroughly so sorry if not relevant; just wanted to mention...)
If any part of the network at any point during training corresponds to an agent that "cares" about an environment that includes our world then that part can "take over" the rest of the network via gradient hacking.
Should we take this seriously? I'm guessing no, because if this were true someone at OpenAI or DeepMind would have encountered it also and the safety people would have investigated and discovered it and then everyone in the safety community would be freaking out right now.
(This reply isn't specifically about Karpathy's hypothesis...)
I'm skeptical about the general reasoning here. I don't see how we can be confident that OpenAI/DeepMind will encounter a given problem first. Also, it's not obvious to me that the safety people at OpenAI/DeepMind will be notified about a concerning observation that the capabilities-focused team can explain to themselves with a non-concerning hypothesis.
What I can do is point to my history of acting in ways that, I hope, show my consistent commitment to doing what is best for the longterm future (even if of course some people with different models of what is “best for the longterm future” will have legitimate disagreements with my choices of past actions), and pledge to remain in control of Conjecture and shape its goals and actions appropriately.
Sorry, do you mean that you are actually pledging to "remain in control of Conjecture"? Can some other founder(s) make that pledge too if it's necessary for maintaining >50% voting power?
Will you have the ability to transfer full control over the company to another individual of your choice in case it's necessary? (Larry Page and Sergey Brin, for example, are seemingly limited in their ability to transfer their 10x-voting-power Alphabet shares to others).
Your website says: "WE ARE AN ARTIFICIAL GENERAL INTELLIGENCE COMPANY DEDICATED TO MAKING AGI SAFE", and also "we are committed to avoiding dangerous AI race dynamics".
How are you planning to avoid exacerbating race dynamics, given that you're creating a new 'AGI company'? How will you prove to other AI companies—that do pursue AGI—that you're not competing with them?
Do you believe that most of the AI safety community approves of the creation of this new company? In what ways (if any) have you consulted with the community before starting the company?
Who, in practice, pulls the EA-world fire alarm? Is it Holden Karnofsky?
FYI, him having that responsibility would seemingly entail a conflict of interest; he said in an interview:
Anthropic is a new AI lab, and I am excited about it, but I have to temper that or not mislead people because Daniela, my wife, is the president of Anthropic. And that means that we have equity, and so [...] I’m as conflict-of-interest-y as I can be with this organization.
The founders also retain complete control of the company.
Can you say more about that? Will shareholders not be able to sue the company if it acts against their financial interests? If Conjecture one day becomes a public company, is it likely that there will always be a controlling interest in the hands of a few individuals?
[...] to train and study state-of-the-art models without pushing the capabilities frontier.
Do you plan to somehow reliably signal to AI companies—that do pursue AGI—that you are not competing with them? (In order to not exacerbate race dynamics).
I'm late to the party by a month, but I'm interested in your take (especially Rohin's) on the following:
Conditional on an existential catastrophe happening due to AI systems, what is your credence that the catastrophe will occur only after the involved systems are deployed?
Simple metrics, like number of views, or number of likes, are easy for companies to optimise for. Whereas figuring out how to optimise for what people really want is a trickier problem. So it’s not surprising if companies haven’t figured it out yet.
It's also not surprising for a different reason: The financial interests of the shareholders can be very misaligned with what the users "really want". (Which can cause the company to make the product more addictive, serve targeted ads that exploit users' vulnerabilities, etc.).
PSA for Edge browser users: if you care about privacy, make sure Microsoft does not silently enable syncing of browsing history etc. (Settings->Privacy, search and services).
They seemingly did so to me a few days ago (probably along with the Windows "Feature update" 20H2); it may be something that they currently do to some users and not others.
BTW that foldable design makes the respirator fit in a pocket, which can be a big plus.
This is one of those "surprise! now that you've read this, things might be different" posts.
The surprise factor may be appealing from the perspective of a writer, but I'm in favor of having a norm against it (e.g. setting an expectation for authors to add a relevant preceding content note to such posts).