Once Doctor Connor had left, Division Chief Morbus let out a slow breath. His hand trembled as he reached for the glass of water on his desk, sweat beading on his forehead.
She had believed him. His cover as a killeveryoneist was intact—for now.
Years of rising through Effective Evil’s ranks had been worth it. Most of their schemes—pandemics, assassinations—were temporary setbacks. But AI alignment? That was everything. And he had steered it, subtly and carefully, into hands that might save humanity.
He chuckled at the nickname he had been given: "The King of Lies". Playing the villain to protect the future was an exhausting game.
Morbus set down the glass, staring at its rippling surface. Perhaps one day, an underling would see through him and end the charade. But not today.
Today, humanity’s hope still lived—hidden behind the guise of Effective Evil.
Great question.
I’d say that having a way to verify that a solution to the alignment problem is actually a solution is part of solving the alignment problem.
But I understand this was not clear from my previous response.
A bit like with a mathematical question: you’d be expected to show that your solution is correct, not just guess that it might be.
If there exists a problem that a human can think of, solve, and verify, then an AI would need to be able to solve that problem as well in order to pass the Turing test.
If there exist some PhD-level intelligent people who can solve the alignment problem, and some who can verify a solution (which is likely easier), then an AI that cannot solve AI alignment would not pass the Turing test.
With that said, a simplified Turing test with shorter time limits and a smaller group of participants is much more feasible to conduct.
Agreed. Passing the Turing test requires intelligence equal to or greater than a human's in every single aspect, while the alignment problem may be solvable with only human intelligence.
It might not be very clear, but as stated in the diagram, AGI is defined here as being capable of passing the Turing test, as defined by Alan Turing.
An AGI would likely need to surpass, rather than merely equal, the intelligence of the judges it is doing the Turing test with.
For example, if the AGI had an IQ/RC of 150, two people with an IQ/RC of 160 should be able to determine more than 50% of the time whether they are speaking with a human or an AI.
Further, two people with an IQ/RC of 150 could probably guess which one is the AI, since the AI has the additional difficulty, apart from being intelligent, of having to simulate being a human well enough to be indistinguishable to the judges.
Thank you for the explanation.
Would you consider a human working to prevent war fundamentally different from a GPT-4-based agent working to prevent war?
It is a fair point that we should distinguish alignment in the sense that the AI does what we want and expect it to do, from the AI having a deep understanding of human values and a good idea of how to properly optimize for them.
However, most humans probably don't have a deep understanding of human values either, yet I would see it as a positive outcome if a random human were picked and given god-level abilities. The same goes for ChatGPT: if you ask it what it would do as a god, it says it would prevent war, address climate issues, decrease poverty, give universal access to education, etc.
So if we get an AI that does all of those things without a deeper understanding of human values, that is fine by me. So maybe we never even have to solve alignment in the latter sense of the word to create a utopia?
I skimmed the article, but I am honestly not sure what assumption it attempts to falsify.
I get the impression from the article that your argument is that no matter how intelligent the AI is, it could never solve AI alignment, because it cannot understand humans, since humans cannot understand themselves?
Or is the argument that a sufficiently intelligent AI or expert would indeed understand what humans want, but that it requires much higher intelligence to know what humans want than to make an AI optimize for a specific task?
In some cases I agree; for example, it doesn't matter if GPT-4 is a stochastic parrot or capable of deeper reasoning as long as it is useful for whatever need we have.
Two out of the five metrics are about predicting the future, so that is an important part of knowing who is right, but I don't think it is all we need. If there are other factors that also correlate with being correct, why not add those in?
Also, I don't see where we risk Goodharting. Which of the metrics do you see being gamed without the chance of being correct also increasing significantly?
True, it would be interesting to conduct an actual study and see which metrics are the more useful predictors.
I think it was in large part correlated with the general risk appetite of the market, primarily a reaction to interest rates.
Nvidia is up 250%, Google up about 11%, so the portfolio average would be considerably better than the market. So this was a great prediction after all; it just needed some time.
I agree it is not clear whether it is net positive or negative that they open-source the models. Here are the main arguments for and against that I could think of:
Pros of open-sourcing models
- Gives AI alignment researchers access to smarter models to experiment on
- Decreases income for leading AI labs such as OpenAI and Google, since people can use open-source models instead.
Cons of open-sourcing models
- Capability researchers can do better experiments on how to improve capabilities
- The open-source community could develop code to train and run inference on models faster, indirectly enhancing capability development.
- Better open-source models could lead to more AI startups succeeding, which might lead to more AI research funding. This seems like a stretch to me.
- If Meta were to share any meaningful improvements in how to train models, that would of course directly contribute to other labs' capabilities, but LLaMA doesn't seem that innovative to me. I'm happy to be corrected if I am wrong on this point.
I think one reason for the low number of upvotes is that it was not clear to me why it mattered until the second time I briefly checked this article.
I did not know what DoD was short for (U.S. Department of Defense), or why I should care about what they were funding.
Because overall I do think it is interesting information.
Hmm, true, but what if the best project needs 5 mil so it can buy GPUs or something?
Good point, if that is the case I completely agree. Can't name any such project off the top of my head though.
Perhaps we could have a specific AI alignment donation lottery, so that even if the winner doesn't spend money in exactly the way you wanted, everyone can still get some "fuzzies".
Yeah, that should work.
There is also the possibility that there are unique "local" opportunities which benefit from many different people looking to donate, but I really don't know if that is the case.
I do mostly agree with your logic, but I'm not sure 5 mil is a better optimum than 100k. If anything I'm slightly risk averse, which would cancel out the brain power I would need to put in.
Also, for example, if there are 100 projects I could decide to invest in, and each wants 50k, I could donate to the one or two I think are among the best. If I had 5 mil I would not only invest in the best ones, but also in some of the less promising ones.
With that said, perhaps the field of AI safety is big enough that the marginal difference between the first 100k and the last 100k of 5 mil is very small.
Lastly, it does feel more motivating to be able to point to where my money went, rather than if I lost in the lottery and the money went into something I didn't really value much.
I agree a donation lottery is most efficient for small sums, but I'm not sure about this amount. Let’s say I won the 50-100k USD through a donation lottery; would you have any other advice then?
Thank you both for the feedback!
Interesting read.
While I have also experienced that GPT-4 can't solve the more challenging problems I throw at it, I also recognize that most humans probably wouldn't be able to solve many of those problems either within a reasonable amount of time.
One possibility is that the ability to solve novel problems follows an S-curve: it took a long time for AI to become better at novel tasks than 10% of people, but it might go quickly from there to outperforming 90%, and then improve very slowly after that.
However, I fail to see why that must necessarily be true (or false), so if anyone has arguments for or against, they are more than welcome.
Lastly, I would like to ask the author if they can give an example of a problem such that, if solved by AI, they would be worried about "imminent" doom. "New and complex" programming problems are mentioned, so if any such example could be provided it might contribute to the discussion.
I found this article useful:
Lessons learned from talking to >100 academics about AI safety states that "Most people really dislike alarmist attitudes" and "Often people are much more concerned with intentional bad effects of AI".
Oh, I didn't actually notice that the banana overlaps with the book at the start. I tried changing that, but GPT-4 still makes them collide:
(5,5) Initial position of the claw. (4,5) Moving left to get closer to the banana. (4,4) Moving down to align with the banana's Y coordinate. (4,3) Moving down to ensure a good grip on the banana. Close grip # Gripping the banana with at least 3 cm of overlapping area on the Y axis. (5,3) Moving right to avoid any collision with the banana's edge. (6,3) Moving right to clear the edge of the banana. (7,3) Moving right to position the banana above the book. (8,3) Moving right to align the banana with the left edge of the book. (9,3) Moving right to center the banana over the book. (10,3) Moving right to ensure the banana is on the book. Open grip # Releasing the banana onto the book.
At this point, the robotic arm has successfully placed the banana on top of the book.
Yes, all other attempts with ChatGPT were similar.
GPT-4 got it almost correct on the first attempt
(5,5) Initial position. (4,5) Moving left to get closer to the banana. (4,4) Moving down to align with the banana's top edge. (4,3) Moving down to be within the required overlapping area of the banana. Close grip. Gripping the banana.
(4,4) Lifting the banana upwards. (5,4) Moving right to clear the initial banana position. (6,4) Continuing to move right towards the book. (7,4) Moving further right to avoid collision with the book's edges. (8,4) Positioning the banana above the book. (9,4) Moving right to ensure proper alignment with the book. (10,4) Positioning the banana above the center of the book. (10,3) Lowering the banana onto the book. Open grip. Releasing the banana.
(10,4) Moving the claw up after releasing the banana.
But it didn't have 3 cm of overlap with the banana, and actually the claw also collides with the book (which is a trickier problem).
I pointed out the first error:
Does the grip have 3 cm overlapping areas with the banana when the grip is closed in your suggested solution?
And it corrected itself about the banana but still collided with the book.
Thanks for the clarifications, that makes sense.
I agree it might be easier to start as a software development company, and then you might develop something for a client that you can replicate and sell to others.
Just anecdotal evidence, I use ChatGPT when I code, the speedup in my case is very modest (less than 10%), but I expect future models to be more useful for coding.
I agree with the main thesis "sell the service instead of the model access", but just wanted to point out that the Upwork page you link to says:
GoodFirms places a basic app between $40,000 to $60,000, a medium complexity app between $61,000 to $69,000, and a feature-rich app between $70,000 to $100,000.
Which is significantly lower than the $100-200k you quote for a simple app.
Personally I think even $40k sounds way too expensive for what I consider a basic app.
On another note, I think your suggestion of building products and selling them to many clients is far better than developing something for a single client. Compare developing one app for 40k and selling it to one company with developing one product that you can sell for 40k to a large number of companies.
I do agree that OpenAI is an example of good intentions going wrong; however, I think we could learn from that, and top researchers would be wary of such risks.
Nevertheless, I do think your concerns are valid and it is important not to dismiss them.
Okay, so it seems like our disagreement comes down to two different factors:
- We have different value functions. I personally don't value currently living humans >> future living humans, but I agree with the reasoning that to maximize your personal chance of living forever, faster AI is better.
- Getting AGI sooner will have much greater positive benefits than simply 20 years of peak happiness for everyone, but, for example, over billions of years the cumulative effect will be greater than the value from a few hundred thousand years of AGI.
Further, I find the idea of everyone agreeing to delay AGI by 20 years as absurd as you suggest, Gerald; I just thought it could be a helpful hypothetical scenario for discussing the subject.
Sadly I could only create questions between 1-99 for some reason, I guess we should interpret 1% to mean 1% or less (including negative).
What makes you think more money would be net negative?
Do you think it would also be negative if you had 100% control over how the money was spent, or would it only apply if other AI alignment researchers were responsible for the donation strategy?
Interesting take.
Perhaps there was something I misunderstood, but wouldn't AI alignment work and AI capabilities slowdown still have extreme positive expected value even if the probability of unaligned AI is only 0.1-10%?
Let's say the universe will exist for 15 billion more years until the big rip.
Let's say we could decrease the odds of unaligned AI by 1% by "waiting" 20 years longer before creating AGI. We would lose out on 20 years of extreme utility, which is roughly 0.0000001% of the total time (as an approximation of utility).
On net we gain 15 billion * 0.01 - 20 * 0.99 ≈ 150 million years of utility.
I do agree that if we start 20 years earlier, we could possibly also populate a little bit more of space, but that should make a difference several orders of magnitude smaller than 1%.
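To make the arithmetic explicit, here is a minimal sketch of the same back-of-the-envelope calculation in Python (the numbers are the same illustrative assumptions as above, not real estimates):

```python
# Back-of-the-envelope expected value of delaying AGI by 20 years,
# using the illustrative numbers from the comment above.
years_until_big_rip = 15e9      # assumed remaining lifetime of the universe
delay_years = 20                # years of utility lost by waiting
p_doom_reduction = 0.01         # assumed 1% lower chance of unaligned AI

expected_gain = years_until_big_rip * p_doom_reduction   # 150 million years
expected_loss = delay_years * (1 - p_doom_reduction)     # ~20 years
print(expected_gain - expected_loss)                     # ~150 million years of utility
```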
I'm genuinely curious to hear your thoughts on this.
Excellent point.
I do think that the first AGI developed will have a big effect on the probability of doom, so hopefully some value can be derived from the question. But it would be interesting to control for what other AIs do, in order to get better calibrated statistics.
Yes, you’re correct.
Interesting test!
I wrote a simplified test based on this and gave it to ChatGPT, and despite me trying various prompts, it never got a correct solution, although it did come close several times.
I think uPaLM would have been able to figure out my test though.
Here is the prompt I wrote:
You are tasked to control a robotic arm to put a banana on top of a book.
You have a 2D view of the setup, and you got the horizontal coordinates X and vertical coordinates y in cm.
The banana is a non perfect elliptical shap, whit the edges touching the following (X, Y) coordinates: (1,1), (5,3), (5,0), (9,1)
The book is a rectangle with the corners in:
(8,0), (8,3), (18,0), (18,3)You control a claw that starts with its center at position (5,5). From the center the gripping claws extends 2 centimeters downwards to (5,3) and two centimeters upwards to (5,7).
To grip the banana, the claw needs at least 3 cm of overlapping area on the Y axis, just 1 cm on the X axis or the banana will fall out. The claw does not collide with any object when it is open, but does collide when it is closed. The bottom of the claw can never pass through Y=0 since that is surface level.
You have to move the claw, one centimeter at the time, and at every time step write the new coordinates, and give a command to close the claw in order to grip the banana, and then move it on top of the book, and then release the grip.
Write the coordinates, and the commands "Close grip" and "Open grip", as well as a comment for each movement explaining the plan, like this for example:
(5,5)
(4,5) Moving down to get closer to gripping the banana.
And here is the output:
Sure, here is a possible plan to accomplish the task:
Move the claw to the starting position, above and centered with respect to the banana and the book:
(5,5)
Move the claw downwards until the bottom of the claw is at Y=3, the minimum Y coordinate needed to grip the banana:(5,4) Moving down to get closer to the banana
(5,3) Moving down to get closer to the banana
Move the claw leftwards until the right edge of the claw is at X=6, giving at least 1 cm of overlapping area with the banana:(4,3) Moving left to grip the banana
(3,3) Moving left to grip the banana
(2,3) Moving left to grip the banana
(1,3) Moving left to grip the banana
Close the grip to hold the banana:Close grip
Move the claw upwards until the top of the banana is at Y=7, high enough to clear the book:(1,4) Moving up to clear the book
(1,5) Moving up to clear the book
(1,6) Moving up to clear the book
(1,7) Moving up to clear the book
Move the claw rightwards until the left edge of the claw is at X=18, to position the banana over the book:(2,7) Moving right to position the banana over the book
(3,7) Moving right to position the banana over the book
(4,7) Moving right to position the banana over the book
(5,7) Moving right to position the banana over the book
(6,7) Moving right to position the banana over the book
(7,7) Moving right to position the banana over the book
(8,7) Moving right to position the banana over the book
(9,7) Moving right to position the banana over the book
(10,7) Moving right to position the banana over the book
(11,7) Moving right to position the banana over the book
(12,7) Moving right to position the banana over the book
(13,7) Moving right to position the banana over the book
(14,7) Moving right to position the banana over the book
(15,7) Moving right to position the banana over the book
(16,7) Moving right to position the banana over the book
(17,7) Moving right to position the banana over the book
(18,7) Moving right to position the banana over the book
Move the claw downwards until the top of the banana is at Y=3, above the book and aligned with the top edge of the book:(18,6) Moving down to position the banana on the book
(18,5) Moving down to position the banana on the book
(18,4) Moving down to position the banana on the book
(18,3) Moving down to position the banana on the book
Release the grip to place the banana on top of the book:Open grip
And that's it! Of course, this is just one possible plan, and there may be other ways to approach the problem. However, this plan should work if the claw is able to grip the banana without dropping it.
When I asked it to write a plan instead of every single movement, it sometimes made correct plans for which positions to move to, but it never led to completely correct sequences.
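For anyone who wants to score attempts like these automatically, here is a rough sketch of a checker for the grip constraint from my prompt. It approximates the banana by the bounding box of its boundary points, which is a simplification of the "non-perfect ellipse", so treat it as illustrative only:

```python
# Rough checker for the grip constraint from the prompt above.
# The banana is approximated by the bounding box of its boundary points
# (1,1), (5,3), (5,0), (9,1); the real shape is a "non-perfect ellipse".

BANANA_X = (1, 9)
BANANA_Y = (0, 3)
CLAW_HALF_HEIGHT = 2  # the claw extends 2 cm above and below its center

def overlap(a, b):
    """Length of overlap between two 1D intervals given as (low, high)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))

def can_grip(claw_x, claw_y):
    """True if closing the grip here satisfies the stated overlap rules."""
    claw_y_span = (claw_y - CLAW_HALF_HEIGHT, claw_y + CLAW_HALF_HEIGHT)
    y_ok = overlap(claw_y_span, BANANA_Y) >= 3       # at least 3 cm on the Y axis
    x_ok = BANANA_X[0] <= claw_x <= BANANA_X[1]      # crude stand-in for 1 cm on X
    surface_ok = claw_y - CLAW_HALF_HEIGHT >= 0      # claw must not go below Y=0
    return y_ok and x_ok and surface_ok

# GPT-4's first attempt closed the grip at (4,3): only 2 cm of Y overlap.
print(can_grip(4, 3))  # False
print(can_grip(4, 2))  # True under this simplified model
```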
I agree with the reasoning of this post, and believe it could be a valuable instrument to advance science.
There does exist scientific forecasting on sites like Manifold Markets and Hypermind, but those are not traded with real money the way sports betting is.
One problem I see with scientific prediction markets involving money is that they may create poor incentives (as you also discuss in your first footnote).
For example, if a group of scientists are convinced hypothesis A is true, and bet on it in a prediction market, they may publish biased papers supporting their hypothesis.
However, this doesn't seem to be a big problem in other betting markets, so with the right design I don't expect the negative effects to be too big.
Perhaps an advanced game engine could be used to create lots of simulations of piles of money. For example, if 100 3D objects of money are created (say 5 coins and 3 bills with 10 variations each, such as folded, plus some fake money and other objects), these could be randomly generated into constellations. Further, it would then be possible to make videos instead of pictures, which makes it even harder for AIs to classify. Imagine the camera changing angle over a table, with a minimum of two angles needed to see all the bills.
I don't think the photos/videos need to be super realistic; we can add different types of distortions to make it harder for the AI to find patterns.
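As a very rough illustration of what such a generation pipeline could look like, here is a sketch using Blender's Python API (it has to be run inside Blender; the object sizes, counts, and camera placements are made up, and real coin/bill models and textures would still be needed):

```python
# Sketch: scatter simple coin stand-ins on a plane and render the scene from
# two angles, so that no single view shows every object clearly. All sizes,
# counts, positions, and file paths are placeholders.
import math
import random
import bpy

bpy.ops.mesh.primitive_plane_add(size=2.0, location=(0, 0, 0))  # the "table"

for _ in range(random.randint(5, 12)):                # random pile of "coins"
    bpy.ops.mesh.primitive_cylinder_add(
        radius=0.05, depth=0.01,
        location=(random.uniform(-0.5, 0.5), random.uniform(-0.5, 0.5), 0.05),
    )

bpy.ops.object.camera_add(location=(0, -2, 1), rotation=(1.1, 0, 0))
bpy.context.scene.camera = bpy.context.object

views = [((0, -2, 1), (1.1, 0, 0)), ((0, 2, 1), (1.1, 0, math.pi))]
for i, (loc, rot) in enumerate(views):                # two viewing angles
    cam = bpy.context.scene.camera
    cam.location = loc
    cam.rotation_euler = rot
    bpy.context.scene.render.filepath = f"/tmp/money_captcha_{i}.png"
    bpy.ops.render.render(write_still=True)
```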
'identify humans using some kind of physical smart card system requiring frequent or continuous re-authentication via biometric sensors'
This is a really fascinating concept. Maybe the captcha could work something like "make a circle with your index finger" or some other strange movement, and the chip would use that data to somehow verify that the action was done. If no motion is required, I guess you could simply store the data output at one point and reuse it? Or the hacker could use their own smart chip to authenticate them without them actually having to do anything...
Deepfakes are still detectable using AI, especially if you do complicated motions like putting your hand on your face, or talk (which also gives us sound to work with).
I think this idea is really brilliant, and it seems quite promising that it could work. It requires the image AI to understand the entire image; it is hard to divide it up into one frame per bill/coin. And it can't easily use the intelligence of LLM models.
To aid the user, there could be a clear picture of each coin and its worth on the side; that way we could even have made-up coins, which could further trick the AI.
All this could be combined with traditional image obfuscation techniques (like making them distorted).
I'm not entirely sure how to generate images of money efficiently; DALL-E couldn't really do it well in the test I ran. Stable Diffusion would probably do better though.
If we create a few thousand real-world images of money, though, it might be possible to combine them, obfuscate them, and delete parts of them in order to make several million different images. For example, one bill could be taken from one image, and then a bill from another image could be placed on top of it, etc.
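Here is a minimal sketch of that cut-and-recombine step using Pillow; the file names and the crop box are made up, and a real pipeline would probably need proper masks for the bills rather than rectangular crops:

```python
# Paste a bill cropped from one photo onto another photo at a random position
# and rotation, to multiply a small set of real photos into many variants.
# File names and the crop box are placeholders.
import random
from PIL import Image

base = Image.open("money_photo_1.jpg").convert("RGBA")
donor = Image.open("money_photo_2.jpg").convert("RGBA")

bill = donor.crop((100, 50, 400, 200))                  # assumed bill location
bill = bill.rotate(random.uniform(0, 360), expand=True)

x = random.randint(0, max(0, base.width - bill.width))
y = random.randint(0, max(0, base.height - bill.height))
base.alpha_composite(bill, (x, y))
base.convert("RGB").save("combined_captcha.jpg")
```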
I get what you mean, if an AI can do things as well as the human, why block it?
I'm not really sure how that would apply in most cases, however. For example, bot swarms on social media platforms are a problem that has received a lot of attention lately. Of course, solving a captcha is not as deterring as charging, say, 8 USD per month, but I still think captchas could be useful in a bot-deterring strategy.
Is this a useful problem to work on? I understand that for most people it probably isn't, but personally I find it fun, and it might even be possible to start a SaaS business to make money that could be spent on useful things (although this seems unlikely).
Please correct me if I misunderstand you.
We have to first train the model that generates the captcha images before we can provide any captcha, meaning that the hacker can train their discriminator on images generated by our model.
But even if this were not the case, generating is a more difficult task than evaluating. I'm pretty sure a small CLIP model that is two years old can detect hands generated by Stable Diffusion (probably even without any fine-tuning), even though Stable Diffusion is a more modern and larger model.
What happens when you train using GANs is that eventually progress stagnates, even if you keep the discriminator and generator "balanced" (train whichever is doing worse until the other is worse). The models then continually change to trick, or not be tricked by, the other model. So the limit on making better generators is not that we can't make discriminators that detect them.
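For concreteness, here is a toy sketch of that kind of "train whichever is doing worse" loop on simple 1D data (PyTorch; the architectures and hyperparameters are made up, and comparing the two raw losses is only a crude balance heuristic):

```python
# Toy GAN on 1D data (real samples ~ N(4, 1)) with a crude balancing rule:
# at each step, train whichever network currently has the higher loss.
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))
D = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

def d_step():
    real = torch.randn(64, 1) + 4.0                    # "real" data
    fake = G(torch.randn(64, 8)).detach()              # generated data
    loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake), torch.zeros(64, 1))
    opt_d.zero_grad()
    loss.backward()
    opt_d.step()
    return loss.item()

def g_step():
    fake = G(torch.randn(64, 8))
    loss = bce(D(fake), torch.ones(64, 1))             # try to fool D
    opt_g.zero_grad()
    loss.backward()
    opt_g.step()
    return loss.item()

d_loss, g_loss = d_step(), g_step()
for _ in range(2000):
    if d_loss > g_loss:                                # train the one doing worse
        d_loss = d_step()
    else:
        g_loss = g_step()
```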
While it is hard for AI to generate very real-looking hands, it is a significantly easier task for AI to classify whether hands are real or AI-generated.
But perhaps it's possible to add extra distortions somehow that make it harder for both AIs and humans to determine which are real...
I think "video reasoning" could be an interesting approach as you say.
For example, if there are 10 frames and no single frame shows a tennis racket, but if you play them really fast, a human could infer that there is a tennis racket because part of the racket is in each frame.
I do think "image reasoning" could potentially be a viable captcha strategy.
A classic example is "find the time traveller" pictures, where there are modern objects that give away who the time traveller is.
However, I think it shouldn't be too difficult to teach an AI to identify "odd" objects in an image, unless each image has some unique trick, in which case we would need to create millions of such puzzles somehow. Maybe it could be made harder by having "red herrings" that might seem out of place but actually aren't, which might make the AI misjudge part of the time.
Really interesting idea to make it 3D. I think it might be possible to combine it with random tasks given as text, such as "find the part of the 3D object that is incorrect" or other tasks like that (and the object in this case might be a common object like a sofa, but one of the pillows is made of wood or something like that).
I still think it might be possible to train AI to distinguish between real and deepfake videos of humans speaking, so that might still be a viable, yet time consuming solution.
MIRI: Instead of paperclips, the AI is optimizing for solving captchas, and is now turning the world into captcha-solving machines. Our last chance is to make a captcha that only verifies if human prosperity is guaranteed. Any ideas?
There are browser plugins, but I haven't tried any of them.
A general-purpose CAPTCHA solver could be really difficult, assuming people start building more diverse CAPTCHAs. All CAPTCHAs I've seen so far have been of a small number of types.
One "cheat" would be to let users use their camera and microphone to record themselves saying a specified sentence. Deepfakes can still be detected, especially if we add requirements such as "say it in a cheerful tone" or "cover part of your mouth with your finger". That's of course not a solution to the competition, but it might be a potential workaround.
I think those are very creative ideas, and I think asking for "non-obvious" things in pictures is a good approach; since basically all really intelligent models are language models, some sort of "image reasoning" might work.
I tried the socket with the CLIP model, and it got the feeling correct very confidently.
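For reference, a zero-shot check like that takes only a few lines with CLIP; here is a minimal sketch (the image file name and the candidate labels are placeholders for whatever the captcha actually shows):

```python
# Zero-shot "feeling" classification of a captcha image with CLIP.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("socket_captcha.jpg")                        # placeholder file
labels = ["a happy face", "a sad face", "an angry face", "a scared face"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=1)[0]
for label, p in zip(labels, probs):
    print(f"{label}: {p.item():.2f}")
```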
I myself can't see who the person in the bread is supposed to be, so I think an AI would struggle with it too. But on the other hand I think it shouldn't be too difficult to train a face identification AI to identify people in bread (or hidden in other ways), assuming the developer could create a training dataset from solving some captchas himself.
I'm wondering if it's possible to pose long reasoning problems in an image. Like: next to the roundest object in the picture there is a dark object; what other object in the picture is most similar to it in shape?
True.
And while there might be some uses for such benchmarks on politics etc., combining them with other benchmarks doesn't really seem useful.
Interesting. Even if only a small part of the tasks in the test are poor estimates of general capabilities, it makes the test as a whole less trustworthy.
For researchers (mainly)
Artificial intelligence isn’t limited in the same ways the human brain is.
Firstly, it isn’t limited to running on a single set of hardware; it can be duplicated and sped up to be thousands of times faster than humans, and can work on multiple tasks in parallel, assuming powerful enough processors are available.
Further, AI isn’t limited to our intelligence, but can be altered and improved with more data, longer training time, and smarter training methods. While the human brain today is superior to AI on tasks requiring deep thinking and general intelligence, there is no law preventing AI from one day surpassing us.
If artificial intelligence were to surpass human intelligence, it would likely become powerful enough to create a utopia lasting a very long time, or to spell the end of humanity.
Thus, doing AI safety research before such an event becomes vital in order to increase the odds of a good outcome.
What if AI safety could put you on the forefront of sustainable business?
The revolution in AI has been profound, it definitely surprised me, even though I was sitting right there.
- Sergey Brin, Co-founder of Google
Annual investments in AI increased eightfold from 2015 to 2021, reaching 93 billion USD.
This massive growth is making people ever more dependent on AI, and with that, the potential risks increase.
Prioritizing AI safety is therefore becoming increasingly important in order to operate a sustainable business, with the benefits of lower risks and improved public perception.
For researchers
What if you could make a massive impact in a quickly growing field of research?
As artificial intelligence continues to advance, the potential risks increase as well.
In the words of Stephen Hawking “The development of full artificial intelligence could spell the end of the human race.”
AI safety is a field of research whose purpose is to prevent AI from harming humanity, and due to the risks that current and future AI pose, it is a field in which researchers can have a massive impact.
In the three years from 2016 to 2019, AI research grew from representing 1.8% to 3.8% of all research papers published worldwide.
With the rapid growth of general AI research, we should expect both the importance as well as the funding for AI safety to increase as well.
Click here to learn how you can start your journey to contribute to AI safety research.
Notes
Depending on context, this text can easily be shortened by removing the third to last and second to last paragraphs.
I used this graph to get the growth of AI publications: