Comments
Aha, thanks, that makes sense.
One way this could happen is searching for jailbreaks in the space of paraphrases and synonyms of a benign prompt.
Why would this produce fake/unlikely jailbreaks? If the paraphrases and such are natural, then isn't the nearness to a real(istic) prompt enough to suggest that the jailbreak found is also realistic? Of course you can adversarially generate super unrealistic things, but does that necessarily happen with paraphrasing-type attacks?
You may recall certain news items last February around Gemini and diversity that wiped many billions off of Google's market cap.
There's a clear financial incentive to make sure that models say things within expected limits.
There's also this: https://www.wired.com/story/air-canada-chatbot-refund-policy/
Really cool project! And the write-up is very clear.
In the section about options for reducing the hit to helpfulness, I was surprised you didn't mention scaling the vector you're adding or subtracting -- did you try different weights? I would expect that you can tune the strength of the intervention by weighting the difference in means vector up or down.
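Here's a minimal sketch of the kind of scaling I have in mind, assuming a PyTorch-style forward hook; the `alpha`, layer choice, and hook wiring are illustrative, not your actual setup:

```python
import torch

def make_steering_hook(steering_vec: torch.Tensor, alpha: float):
    """Return a forward hook that adds alpha * steering_vec to a layer's output."""
    def hook(module, inputs, output):
        # alpha scales the intervention strength; alpha=0 is no intervention
        return output + alpha * steering_vec
    return hook

# handle = model.layers[layer].register_forward_hook(make_steering_hook(v, alpha=0.5))
# ... run generation, then handle.remove(); sweep alpha to trade off effect size vs. helpfulness.
```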
The usual reason is compounding. If you have an asset that is growing over time, paying taxes from it means not only do you have less of it now, but the amount you pulled out now won't compound indefinitely into the future. You want to compound growth for as long as possible on as much capital as possible. If you could diversify without paying capital gains you would, but since the choice is something like, get gains on $100 in this one stock, or get gains on $70 in this diversified basket of stocks, you might stay with the concentrated position even if you would prefer to be diversified.
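To make that concrete with made-up numbers (a 10% annual return over 20 years, ignoring the tax you'd eventually owe on either position):

```python
r, years = 0.10, 20                      # assumed return and horizon, purely illustrative
concentrated = 100 * (1 + r) ** years    # keep the single stock: ~$673
diversified = 70 * (1 + r) ** years      # pay the tax now, compound the remainder: ~$471
```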
This reminds me of a Brin short story which I think exactly discusses what you're talking about: https://www.davidbrin.com/tankfarm.htm
Cool concept. I'm a bit puzzled by one thing though -- presumably every time you use a tether, it slows down and drops to a lower orbit. How do you handle that? Is the idea that it's so much more massive than the rockets it's boosting that its slowdown is negligible? Or do we have to go spin it back up every so often?
"If you are playing with a player who thinks that "all reds" is a strong hand, it can take you many, many hands to figure out that they're overestimating their hands instead of just getting anomalously lucky with their hidden cards while everyone else folds!"
As you guessed, this is wrong. If someone is playing a lot of hands, your first hypothesis is that they are too loose and making mistakes. At that point, each additional hand they play is evidence in favor of fishiness, and you can quickly become confident that they are bad.
Mistakes in the other direction are much harder to detect. If someone folds for 30 minutes, they plausibly just had a bad run of cards. We've all been there. They do have some discipline, but because folding is so common, each additional fold only adds a small bit of Bayesian evidence that the person is a rock.
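To put rough numbers on the asymmetry (the frequencies are made up, just for illustration):

```python
from math import log

# Assumed fractions of hands voluntarily played:
typical, fish, rock = 0.25, 0.60, 0.15

# Log-likelihood ratio per observation, in nats
per_played_hand = log(fish / typical)        # ~0.88: each hand a loose player plays is strong evidence
per_fold = log((1 - rock) / (1 - typical))   # ~0.13: each fold from a tight player is weak evidence

print(per_played_hand / per_fold)            # ~7x more evidence per observation of loose play
```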
I wonder if there's a way to give the black box recommender a different objective function. CTR is bad for the obvious clickbait reasons, but signals for user interaction are still valuable if you can find the right signal to use.
I would propose that returning to the site some time in the future is a better signal of quality than CTR, assuming the future is far enough away. You could try a week, a month, and a quarter.
This is maybe a good time to use reinforcement learning, since the signal is far away from the decision you need to make. When someone interacts with an article, reward the things they interacted with n weeks ago. Combined with karma, I bet that would be a better signal than CTR.
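A crude version of that delayed signal, with an assumed horizon and a binary reward (both just placeholders):

```python
import datetime as dt

def delayed_reward(shown_at: dt.datetime, later_visits: list, horizon_weeks: int = 4) -> float:
    """Credit a recommendation if the user came back within the horizon.

    The horizon length and the 0/1 reward are illustrative choices; you could
    also weight by engagement on the return visit or combine with karma.
    """
    horizon = shown_at + dt.timedelta(weeks=horizon_weeks)
    return 1.0 if any(shown_at < visit <= horizon for visit in later_visits) else 0.0
```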
Children are evidently next word completers.
I would be very unhappy if a non-disparagement agreement were sprung on me when I left the company. And I would be very reluctant to sign one entering any company.
Luckily we don't have those at Google DeepMind.
I work at DeepMind and have been influenced by METR. :)
If you want a far future fictional treatment of this kind of situation, I recommend Surface Detail by Iain Banks.
I think your model is a bit simplistic. METR has absolutely influenced the behavior of the big labs, including DeepMind. Even if all impact goes through the big labs, you could have more influence outside of the lab than as one of many employees within. Being the head of a regulatory agency that oversees the labs sets policy in a much more direct way than a mid level exec within the company can.
I went back to finish college as an adult, and my main surprise was how much fun it was. It probably depends on what classes you have left, but I took every AI class offered and learned a ton that is still relevant to my work today, 20 years later. Even the general classes were fun -- it turns out it's easy to be an excellent student if you're used to working a full work week, and being a good student is way more pleasant and less stressful than being a bad one, or at least it was for me.
I'm not sure what you should do necessarily, but given that you're thinking about this less as useful for anyone in particular and more for other reasons, fun might be a good goal.
As it happened I think the credential ended up useful too, but it was a top school so more valuable than many.
This is very well written and compelling. Thanks for posting it!
This is a great post. I knew that at the top end of the income distribution in the US people have more kids, but didn't understand how robust the relationship seems to be.
I think the standard evbio explanation here would ride on status -- people at the top of the tribe can afford to expend more resources for kids, and also have more access to opportunities to have kids. That would predict that we wouldn't see a radical change as everyone got more rich -- the curve would slide right and the top end of the distribution would have more kids but not necessarily everyone else.
But the GDP per capita graphs, I think, are evidence against that view. It looks like the curve is a lot flatter when fertility is rising than when it's dropping, but if this holds into the future I really don't worry too much. We're on the cusp of all getting a lot richer, or else AI will kill us all anyway.
Heh, that's why I put "strong" in there!
One big one is that the first big spreading event happened at a wet market where people and animals are in close proximity. You could check densely peopled places within some proximity of the lab to figure out how surprising it is that it happened in a wet market, but certainly animal spillover is much more likely where there are animals.
Edit: also it's honestly kind of a bad sign that you aren't aware of evidence that tends against your favored explanation, since that mostly happens during motivated reasoning.
We're here to test the so-called tower of babel theory. What if, due to some bizarre happenstance, humanity had thousands of languages that change all the time instead of a single universal language like all known intelligent species?
You should ignore the EY style "no future" takes when thinking about your future. This is because if the world is about to end, nothing you do will matter much. But if the world isn't about to end, what you do might matter quite a bit -- so you should focus on the latter.
One quick question to ask yourself is: are you more likely to have an impact on technology, or on policy? Either one is useful. (If neither seems great, then consider earning to give, or just find a way to add value in society in other ways.)
Once you figure that out, the next step is almost certainly building relevant skills, knowledge, and networks. Connect with senior folks with relevant roles, ask and otherwise try to figure out what skills and such are useful, try to get some experience by working or volunteering with great people or organizations.
Do that for a while and I bet some gaps and opportunities will become pretty clear. 😀
I agree that it's bad to raise a child in an environment of extreme anxiety. Don't do that.
Also try to avoid being very doomy and anxious in general, it's not a healthy state to be in. (Easier said than done, I realize.)
I think you should have a kid if you would have wanted one without recent AI progress. Timelines are still very uncertain, and strong AGI could still be decades away. Parenthood is strongly value-creating and extremely rewarding (if hard at times), and that's true in many, many worlds.
In fact it's hard to find probable worlds where having kids is a really bad idea, IMO. If we solve alignment and end up in AI utopia, having kids is great! If we don't solve alignment and EY is right about what happens in a fast takeoff world, it doesn't really matter if you have kids or not.
In that sense, it's basically a freeroll, though of course there are intermediate outcomes. I don't immediately see any strong argument in favor of not having kids if you would otherwise want them.
The thing you're missing is called instruction tuning. You gather a series of prompt/response pairs and fine tune the model over that data. Do it right and you have a chatty model.
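A bare-bones sketch of the idea, assuming a Hugging Face causal LM; the base model, prompt template, and single-example loop are placeholders, not anyone's actual recipe:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")            # placeholder base model
model = AutoModelForCausalLM.from_pretrained("gpt2")
opt = torch.optim.AdamW(model.parameters(), lr=1e-5)

pairs = [("What is the capital of France?", "The capital of France is Paris.")]

model.train()
for prompt, response in pairs:
    text = f"User: {prompt}\nAssistant: {response}{tok.eos_token}"
    batch = tok(text, return_tensors="pt")
    loss = model(**batch, labels=batch["input_ids"]).loss  # next-token loss over the formatted pair
    loss.backward()
    opt.step()
    opt.zero_grad()
```

In practice you'd mask the prompt tokens out of the loss and use far more (and more varied) pairs, but that's the basic shape of it.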
Thanks, Zvi, these roundups are always interesting.
I have one small suggestion, which is that you limit yourself to one Patrick link per post. He's an interesting guy but his area is quite niche, and if people want his fun stories about banking systems they can just follow him. I suspect that people who care about those things already follow him, and people who don't aren't that interested to read four items from him here.
I feel like a lot of the issues in this post are that the published RSPs are not very detailed and most of the work to flesh them out is not done. E.g. the comparison to other risk policies highlights lack of detail in various ways.
I think it takes a lot of time and work to build out something with lots of analysis and detail, potentially years of work to really do it right. And yes, much of that work hasn't happened yet.
But I would rather see labs post the work they are doing as they do it, so people can give feedback and input. If labs do so, the frameworks will necessarily be much less detailed than they would be if we waited until they were complete.
So it seems to me that we are in a messy process that's still very early days. Feedback about what is missing and what a good final product would look like is super valuable, thank you for your work doing that. I hope the policy folks pay close attention.
But I think your view that RSPs are the wrong direction is misguided, or at least I don't find your reasons to be persuasive -- there's much more work to be done before they're good and useful, but that doesn't mean they're not valuable. Honestly I can't think of anything much better that could have been reasonably done given the limited time and resources we all have.
I think your comments on the name are well taken. I think your ideas about disclaimers and such are basically impossible for a modern corporation, unfortunately. I think your suggestion about pushing for risk management in policy is the clear next step, and it's only enabled by the existence of an RSP in the first place.
Thanks for the detailed and thoughtful effortpost about RSPs!
I agree with all of this. It's what I meant by "it's up to all of us."
It will be a signal of how things are going if in a year we still have only vague policies, or if there has been real progress in operationalizing the safety levels, detection, what the right reactions are, etc.
I think there are two paths, roughly, that RSPs could send us down.
- RSPs are a good starting point. Over time we make them more concrete, build out the technical infrastructure to measure risk, and enshrine them in regulation or binding agreements between AI companies. They reduce risk substantially, and provide a mechanism whereby we can institute a global pause if necessary, which seems otherwise infeasible right now.
- RSPs are a type of safety-washing. They provide the illusion of a plan, but as written they are so vague as to be meaningless. They let companies claim they take safety seriously but don't meaningfully reduce risk, and in fact may increase it by letting companies skate by without doing real work, rather than forcing companies to act responsibly by just not developing a dangerous uncontrollable technology.
If you think that Anthropic and other labs that adopt these are fundamentally well meaning and trying to do the right thing, you'll assume that we are by default heading down path #1. If you are more cynical about how companies are acting, then #2 may seem more plausible.
My feeling is that Anthropic et al are clearly trying to do the right thing, and that it's on us to do the work to ensure that we stay on the good path here, by working to deliver the concrete pieces we need, and to keep the pressure on AI labs to take these ideas seriously. And to ask regulators to also take concrete steps to make RSPs have teeth and enforce the right outcomes.
But I also suspect that people on the more cynical side aren't going to be persuaded by a post like this. If you think that companies are pretending to care about safety but really are just racing to make $$, there's probably not much to say at this point other than, let's see what happens next.
New York City Mayor Eric Adams has been using ElevenLabs AI to create recordings of him in languages he does not speak and using them for robocalls. This seems pretty not great.
Can you say more about why you think this is problematic? Recording his own voice for a robocall is totally fine, so the claim here is that AI involvement makes it bad?
Yes he should disclose somewhere that he's doing this, but deepfakes with the happy participation of the person whose voice is being faked seems like the best possible scenario.
FWIW as an executive working on safety at Google, I basically never consider my normal working activities in light of what they would do to Google's stock price.
The exception is around public communication. There I'm very careful because it's asymmetrical -- I could potentially cause a PR disaster that would affect the stock, but I don't see how I could give a talk that's so good that it helps it.
Maybe a plug pulling situation would be different, but I also think it's basically impossible for it to be a unilateral situation, and if we're in such a moment, I hardly think any damage would be contained to Google's stock price, versus say the market as a whole.
How much do you think that your decisions affect Google's stock price? Yes maybe more AI means a higher price, but on the margin how much will you be pushing that relative to a replacement AI person? And mostly the stock price fluctuates on stuff like how well the ads business is doing, macro factors, and I guess occasionally whether we gave a bad demo.
It feels to me like the incentive is just so diffuse that I wouldn't worry about it much.
Your idea of just donating extra gains also seems fine.
That's not correct, or at least not how my Google stock grants work. The price is locked in at grant time, not vest time. In practice what that means is that you get x shares every month, which counts as income when multiplied by the current stock price.
And then you can sell them or whatever, including having a policy that automatically sells them as soon as they vest.
The star ratings are an improvement; I had also felt that "breakthrough" was overselling many of the items last week.
However, stars are very generic and don't capture the concept of a breakthrough very well. You could consider a lightbulb.
I also asked ChatGPT to create an emoji of an AI breakthrough, and after some iteration it came up with this: https://photos.app.goo.gl/sW2TnqDEM5FzBLdPA
Use it if you like it!
Thanks for putting together this roundup, I learn things from it every time.
I agree with this.
Consider a hypothetical: there are two drugs we could use to execute prisoners sentenced to death. One of them causes excruciating pain, the other does not, but costs more.
Would we feel that we would rather use the torture drug, then? After all, the dude is dead, so he doesn't care either way.
I have a pretty strong intuition that those drugs are not similar. Same thing with the anesthesia example.
HT Michael Thiessen, who expects this to result in people figuring out how to extract the (distilled) model weights. Is that inevitable?
Not speaking for Google here.
I think it's inevitable, or at least it's impossible to stop someone willing to put in the effort. The weights are going to be loaded into the phone's memory, and a jailbroken phone should let you have access to the raw memory.
But it's a lot of effort and I'm not sure what the benefit would be to anyone. My guess is that if this happens it will be by a security researcher or some enterprising grad student, not by anyone actually motivated to use the weights for anything in particular.
I could see the illustrations via RSS, but don't see them here (Chrome on mobile).
I assume you've seen these, but if not, there are some relevant papers here: https://scholar.google.com/scholar?q=deepmind+reinforcement+learning+cooling+data+center&hl=en&as_sdt=0&as_vis=1&oi=scholart
The main place we differ is that we are on opposite sides of the ‘will Tether de-peg?’ market. No matter what they did in the past, I now see a 5% safe return as creating such a good business that no one will doubt ability to pay. Sometimes they really do get away with it, ya know?
This seems sensible, but I remember thinking something very similar about Full Tilt, and then they turned out to be doing a bunch of shady shit that was very not in their best interest. I think there's a significant chance that fraudsters gonna fraud even when they really shouldn't, and Tether in particular has such a ridiculous background that it just seems very possible that they will take unnecessary risks, lend money when they shouldn't, etc, just because people do what they've been doing all too often.
Pradyumna: You, a reasonable person: the city should encourage carpooling to reduce congestion
Bengaluru’s Transport Department (a very stable genius): Taxi drivers complained and so we will ban carpooling
It's not really that Bangalore banned carpooling; they required licenses for ridesharing apps. Maybe that's a de facto ban of those apps, but that's a far cry from banning carpooling in general.
Source: https://www.timesnownews.com/bengaluru/no-ban-on-carpooling-in-bengaluru-apps-to-obtain-approval-to-operate-it-legally-article-104103234
Partly this will be because in fact current ML systems are not analogous to future AGI in some ways - probably if you tell the AGI that A is B, it will also know that B is A.
One oddity of LLMs is that we don't have a good way to tell the model that A is B in a way that it can remember. Prompts are not persistent, and as this paper shows, fine tuning doesn't do a good job of getting a fact into the model without doing a bunch of paraphrasing. Pretraining presumably works in a similar way.
This is weird! And I think helps make sense of some of the problems we see with current language models.
45->55% is a 22% relative gain, while 90->100% is only an 11% gain.
On the other hand, 45->55% is a reduction in error by 18%, while 90->100% is a 100% reduction in errors.
Which framing is best depends on the use case. Preferring one naively over the other is definitely an error. :)
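Spelled out:

```python
def relative_gain(before, after):
    return (after - before) / before

def error_reduction(before, after):
    return ((1 - before) - (1 - after)) / (1 - before)

print(relative_gain(0.45, 0.55), error_reduction(0.45, 0.55))  # ~0.22, ~0.18
print(relative_gain(0.90, 1.00), error_reduction(0.90, 1.00))  # ~0.11, 1.00
```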
I think the argument against LeCun is simple: while it may be true that AIs won't necessarily have a dominance instinct the way that people do, they could try to dominate for other reasons: namely that such dominance is an instrumental goal towards whatever its objective is. And in fact that is a significant risk, and can't be discounted by pointing out that they may not have a natural instinct towards dominance.
I just think that to an economist, models and survey results are different things, and he's not asking for the latter.
I think that Tyler is thinking more of an economic-type model that looks at the incentives of various actors and uses that to understand what might go wrong and why. I predict that he would look at this model and say that "misaligned AI can cause catastrophes" is the hand-wavy bit he would like to see an actual model of.
I'm not an economist (is IANAE a known initialism yet?), but such a model would probably include actors like the AI labs, the AIs, and potentially regulators or hackers/thieves; you'd try to understand and model their incentives and behaviors, and see what comes out of that. It's less about subjective probabilities from experts and more about trying to understand the forces acting on the players and how they respond to them.
So... when can we get the optimal guide, if this isn't it? :)
In general, to solve an NP-complete problem like 3-SAT, you have to spend either compute or storage.
Suppose you solve one 3-SAT problem. If you don't write down the solution and steps along the way, then you have no way to get the benefit of the work for the next problem. But if you do store the results of the intermediate steps, then you need to store data that's also polynomial in size.
In practice often you can do much better than that because the problems you're solving may share certain data or characteristics that lead to shortcuts, but in the general case you have to pay the cost every time you need to solve an NP complete problem.
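A toy illustration of the trade-off (brute force, so exponential compute per call; the cache trades storage for repeated compute; the clause encoding and caching scheme are just for illustration):

```python
from itertools import product

def solve_3sat(clauses, n_vars):
    """Brute force: try every assignment, so compute grows exponentially in n_vars.

    Each clause is a tuple of signed variable indices, e.g. (1, -2, 3)
    means (x1 OR NOT x2 OR x3).
    """
    for bits in product([False, True], repeat=n_vars):
        if all(any(bits[abs(lit) - 1] == (lit > 0) for lit in clause) for clause in clauses):
            return bits
    return None

solved = {}  # trade storage for compute: remember instances you've already solved
def solve_cached(clauses, n_vars):
    key = (tuple(sorted(clauses)), n_vars)
    if key not in solved:
        solved[key] = solve_3sat(clauses, n_vars)
    return solved[key]
```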
If one person estimates the odds at a billion to one, and the other at even, you should clearly bet the middle. You can easily construct bets that offer each of them a very good deal by their lights and guarantee you a win. This won't maximize your EV but seems pretty great if you agree with Nick.
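With made-up stakes: suppose A puts the odds at a billion to one against and B at even. Offer A a bet where A pays you $100 if it happens and you pay A $1 if it doesn't (roughly +$1 EV by A's lights), and offer B a bet where you pay B $50 if it happens and B pays you $40 if it doesn't (+$5 EV by B's lights):

```python
# Your net payoff as the middleman, under the two illustrative bets above
if_it_happens = 100 - 50     # collect from A, pay B: +$50
if_it_does_not = -1 + 40     # pay A, collect from B: +$39
# You profit either way, and neither A nor B thinks they got a bad deal.
```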
Anthropic reportedly got a $4B valuation on negligible revenue. Cohere is reportedly asking for a $6B valuation on maybe a few $M in revenue.
AI startups are getting pretty absurd valuations based on I'm not sure what, but I don't think it's ARR.
I'm not sure multiple of revenue is meaningful right now. Nobody is investing in OAI because of their current business. Also there are tons of investments at infinite multiples once you realize that many companies get investments with no revenue.