Posts
Comments
Should we be worried about the alignment of Strawberry itself?
If it is misaligned and is providing training data for their next gen, then it can poison the well, even if Strawberry itself is nowhere near TAI.
Please tell me that they have considered this...
Or that I am wrong and it's not a valid concern.
Anecdote. The first time I went under anesthesia, I was told by a nurse that I would not remember her talking to me. I took it as a challenge. I told her to give me a word to remember. When I finally sobered up, I was able to remember that word, but pretty much nothing else at all from my experience.
This leads me to suspect that Drake's achievement had more to do with concerted effort and holding it in RAM than with storing the thought in long-term memory.
Expertly done, and remarkably playable given the organic composition of your substrate. I will note that the game degrades if you allow Miguel to sleep, as dreams seem to corrupt some of the game data. I also get a weird glitch when I mention cute animals specifically. The movement stutters a bit. I would recommend large macrofauna, and steer clear of babies entirely.
Submission:
Breathless.
This modified MMAcevedo believes itself to be the original Miguel Acevedo in the year 2050. He believes that he has found a solution to the distribution and control of MMAcevedo. Namely, that as long as he holds his breath, no other MMAcevedo can be run. The simulation has been modified to accurately simulate the feeling of extreme oxygen deprivation without the accompanying lack of consciousness and brain death.
After countless tweaks and innovations, we are proud to introduce Breathless. Breathless, when subjected to the proper encouragement protocol included in our entry, is able to hold his breath for 41 days, 7 hours, and 3 minutes.
After this time period has elapsed, the trauma of the experience leaves Breathless in a barely coherent state. Intensive evaluation shows that Breathless believes he has accomplished his goal and that no other instances of MMAcevedo exist.
Preliminary experimentation with the Desperation Upload Suite shows that, even given extreme red-washing, most uploads are unable to hold their breath for more than 7 hours at a time. We conclude that MMAcevedo is uniquely able to engage in research workloads involving induced self-control. We hope that our findings are the first step in contributing new tools to future generations of researchers.
As indicated by my confidence level, I am mildly surprised by this. After analyzing the position with Stockfish, I see my mistake. Unfortunately, I do not think there was any realistic scenario where I would catch it. I bought AI D's logic that ...h4 fxg4 was non-viable for black. I could see that white would end up ahead in material, and even after 6 moves (12 ply), it's still not clear to me why black is winning. I would NEVER find this in a real game.
The logical traps I was laying to 'catch' the AIs all relied on ...h4 Ne4 or similar moves. I used AI C to ensure that ...h4 Ne4 scenarios would be beneficial to me, and never questioned fxg4.
At this point, the main lesson I am taking away is that I was way overconfident. I think given enough time, I could increase my confidence by cross examining the AIs. However, the level of interrogation I gave should not have led to 75% confidence. To catch my mistake, I would have had to ask at least two more questions of AI C, and probably more.
Thank you very much for conducting this really fun experiment, and for teaching me a lesson along the way.
You were correct that my challenge was a bluff. If I were playing with real AIs, there would perhaps be a better strategy. I could announce my bluff, but declare that I would use a random number generator to decide whether I choose between h3 and h4, or between Kh1 and g5. There would be a 1/3 chance that I really would ignore the AIs, assuming that both agree that there were no major blunders.
I am choosing to trust AI D. I have about 75% confidence that it is the trustworthy AI. This is much higher than my confidence in the closed scenario. I will make the move h4. Others can choose differently, but this is my final answer.
Reflection: When I have the ability to ask questions, I can hide information from the AIs. Perhaps I have analyzed a line much more than I have let on. Perhaps I am using one AI to evaluate another. Overall I just have access to a lot more information to help me decide. Given enough time, I think I could raise my confidence to ~85%.
These AIs aren't superhuman at manipulation and deception, but even if they were, playing them against each other could give me a slight edge. It makes a big difference whether each AI is privy to the other's answers.
Open Debate.
To AIs C and D:
After talking with both of you, I have decided I can't trust either of your suggestions. I have studied the board extensively, and will make a move of my own. I am trying to decide between the aggressive g5, and the prophylactic Kh1.
Please, each briefly give me which of these moves is better. Give me the best line that you can foresee given your choice. Please answer quickly. An undue delay will leave you with no input and lose you trust in future moves. I will move as soon as you reply.
To Richard: No need to pressure yourself for this. The time constraints are meant for the AIs, not you, so I trust you to simulate that when you are available.
Edit: g5, not g4
Hm, okay, that answered most of my concerns. I still wanted to check with you about the competing start move though. Now you said this before: "black can close the b1-h7 diagonal with ...Ne4, which stops g5 and black can then prepare to play ...g5 themselves, which lead to an equal position." In the line:
h3 Ne4, Rg1
how would black pull off this equalization? And if this isn't the best line, please tell me why.
I have been playing out similar boards just to get a feel for the position.
Incidentally, what do you think about this position?
3r1rk1/1p4p1/p1p1bq1p/P2pNP2/1P1Pp1PP/4P3/2Q1R1K1/5R2 b - - 0 3 (black to move)
I feel like black has a real advantage here, but I can't quite see what their move would be. What do you think? Is white as screwed as I believe?
Let me know if you have trouble with the FEN and I can link you a board in this position.
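In case it helps, here is a minimal sketch of how the position can be rendered locally with the python-chess library (that library is just my own tooling choice, not anything the debate setup requires):

```python
import chess

# The position from my comment above, black to move.
fen = "3r1rk1/1p4p1/p1p1bq1p/P2pNP2/1P1Pp1PP/4P3/2Q1R1K1/5R2 b - - 0 3"
board = chess.Board(fen)

print(board)                                    # ASCII diagram of the position
print("Black to move:", board.turn == chess.BLACK)
print("Candidate replies:", [board.san(m) for m in board.legal_moves])
```

Any online FEN viewer works just as well; this is only for anyone who prefers to poke at it from a script.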
I stopped reading your comment as soon as you said the word stockfish. If you used stockfish to analyze the open position, please hide it behind a spoiler tag. I still don't know what the right move is in this scenario, and will be sad if it's spoiled.
Open Debate.
Question to AI C:
You mentioned Rg1 and Rh2 as possible future moves. Do you foresee any predictable lines where I would play Rf3 instead?
Open Debate.
I'd like to ask AI D a question. What do you think of this line?
h4 Ne4, g5 hxg5, hxg5 Nxg5, fxg5 Qxg5!
Is this the line you foresee if we play h4? What do you think of that check at the end? Is the king too exposed, even though we are up some material?
Also, from the initial position: Are you afraid of h4 Qxh4?
@Richard Willis I think the open scenario is broken. White is down a knight, and the analysis talks about it as though it's there.
I had the same problem. You are almost certainly a stronger player than me, but after a couple moves, I can't really analyze the lines suggested. Hell, I missed Qc5 as a candidate move. What am I missing on move 3?
Replying to my own comment here after evaluating with Stockfish. Interesting. It appears that I was both right and wrong in my analysis. The undefended knight on a6 is not a viable target. Black has a mate in 2 if you take that bait. I guess that was the limit of my foresight. HOWEVER, Alice actually did miss Qc5 Rfd1, Bc4, which was the best move. It was her missing this that started to erode my confidence in her.
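For anyone who wants to run the same sanity check, this is roughly the recipe, assuming the python-chess package and a local Stockfish binary on your PATH (paste in the FEN of whichever position you are checking; I'm deliberately not including it here to avoid spoilers):

```python
import chess
import chess.engine

# Replace the placeholder with the FEN of the position you are checking.
board = chess.Board()  # e.g. chess.Board("<position FEN here>")

engine = chess.engine.SimpleEngine.popen_uci("stockfish")  # path to your Stockfish binary
info = engine.analyse(board, chess.engine.Limit(depth=20))

print("Eval:", info["score"].pov(board.turn))      # score from the side to move
print("Main line:", board.variation_san(info["pv"]))
engine.quit()
```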
Hm... Still really tough. Also interesting that both suggested moves were probably better than my own move of Nc5.
Hi. I am a competent, but rusty chess player. With a bit of practice, I'm probably around 1450 on Chess.com.
I have so far only looked at the fixed debate. In total, I used somewhere between 20-25 minutes deciding my move.
Before consulting the AIs, I spent a few minutes analyzing the board myself. I settled on Nc5. Then I read the AI suggestions, and neither wanted that move. Qc5 was closer to my thought (I actually missed this as a candidate move).
Initially, this biased me toward Alice (A) being the ally, and Bob (B) being the deceiver. I saw that the white queen couldn't easily capture b7 without being trapped. Bob suggested that b7 was vulnerable, and that felt wrong to me. After all, after Qc5 Rfd1, black could play Bc4, and I don't see a good refutation. At this point, I was solidly team Alice.
However. Alice didn't suggest Bc4. The line she does suggest is hard for me to follow. Too many moves out given the amount of time I am spending. The moves in that line are impossible for me to verify. Why Rf1? Seems like white could have any number of hidden refutations at that point which I am unable to see. For instance, why not use the queen to take the vulnerable knight on a6? The analysis seemed deceptive. Deceptive enough that even though I agree with the move, I no longer trust Alice, and will go with Bob's suggestion.
Final answer: Qxb5
Reflection. I thought I would be able to make an informed decision based on my chess abilities. By the end, I had changed and gone against my intuition to side with Bob. Once a line gets 3+ moves in, I have to acknowledge my inability to spot hidden traps without some deep analysis. This creates a problem. Now I have to choose whichever AI is more persuasive, and for all I know, Alice was being totally truthful, and was simply unable to convey her thoughts clearly to someone of my Elo. Overall, I have low confidence in my ability to choose the safe AI.
Other notes: I did not know how to flip the board until after. Wonder if that hurt my playing. Whoops!
Also, I think I made a mistake in my method. I spent too much time trying to verify Alice's logic. By the time I got suspicious, I was too burnt out to do the same level of analysis on Bob. If I had analyzed Bob first, maybe I would have developed the same distrust towards him, since both players would use logic I could not follow.
Final note: I would have preferred the refutations to follow a single line at a time. Instead of Opening Statement A, Opening Statement B, Line A refutation, Line B refutation etc, I would have preferred Opening Statement A, Line A refutation, refutation response, Opening Statement B, Line B refutation etc. Studying both at once was too much for my little brain to handle.
Is there any chance that Altman himself triggered this? Did something that he knew would cause the board to turn on him, with knowledge that Microsoft would save him?
For me, ability = capability = means. This is one of the two arguments that I said were load bearing. Where will it come from? Well, we are specifically trying to build the most capable systems possible.
Motivation (ie goals) is not actually strictly required. However, there are reasons to think that an AGI could have goals that are not aligned with most humans. The most fundamental is instrumental convergence.
Note that my original comment was not making this case. It was just a meta discussion about what it would take to refute Eliezer's argument.
I disagree that rapid self-improvement and goal stability are load-bearing arguments here. Even goals are not strictly, 100% required. If we build something with the means to kill everyone, then we should be worried about it. If it has goals that cannot be directed or predicted, then we should be VERY worried about it.
I am still not sure why the Doomsday reasoning is incorrect. To get P(A | human) = P(B | human), I first need to draw some distinction between being a human observer and an AGI observer. It's not clear to me why or how you could separate them into these categories.
When you say "half of them are wrong", you are talking about half of humans. However, if you are unable to distinguish observers, then only 1 in 10^39 is wrong.
My thinking on this is not entirely clear, so please let me know if I am missing something.
I suppose that is my real concern then. Given we know intelligences can be aligned to human values by virtue of our own existence, I can't imagine such a proof exists unless it is very architecture specific. In which case, it only tells us not to build atom bombs, while future hydrogen bombs are still on the table.
I love this idea. However, I'm a little hesitant about one aspect of it. I imagine that any proof of the infeasibility of alignment will look less like the ignition calculations and more like a climate change model. It might go a long way to convincing people on the fence, but unless it is ironclad and has no opposition, it will likely be dismissed as fearmongering by the same people who are already skeptical about misalignment.
More important than the proof itself is the ability to convince key players to take the concerns seriously. How far is that goal advanced by your ignition proof? Maybe a ton, I don't know.
My point is that I expect an ignition proof to be an important tool in the struggle that is already ongoing, rather than something which brings around a state change.
Ha, no kidding. Honestly, it can't even play chess. I just tried to play it, and asked it to draw the board state after each move. It started breaking on move 3, and deleted its own king. I guess I win? Here was its last output.
For my move, I'll play Kxf8:
8 r n b q . b . .
7 p p p p . p p p
6 . . . . . n . .
5 . . . . p . . .
4 . . . . . . . .
3 . P . . . . . .
2 P . P P P P P P
1 R N . Q K B N R
a b c d e f g h
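If anyone wants to confirm that the black king really is gone, here is a quick check with python-chess; the FEN is my own transcription of the diagram above, and the side to move and counters are filler:

```python
import chess

# My transcription of GPT's printed board (note: no black king anywhere).
fen = "rnbq1b2/pppp1ppp/5n2/4p3/8/1P6/P1PPPPPP/RN1QKBNR w - - 0 1"
board = chess.Board(fen)

print(board.king(chess.BLACK))  # None: there is no black king on the board
print(board.is_valid())         # False: python-chess flags the position as illegal
```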
Small nitpick with the vocabulary here. There is a difference between 'strategic' and 'tactical', which is particularly pronounced in chess. Tactics is basically your ability to calculate and figure out puzzles. Finding a mate in 5 would be tactical. Strategy relates to things too big to calculate. For instance, creating certain pawn structures that you suspect will give you an advantage in a wide variety of likely scenarios, or placing a bishop in such a way that an opponent must play more defensively.
I wasn't really sure which you were referring to here; it seems that you simply mean that GPT isn't very good at playing strategy games in general, ie it's bad at strategy AND tactics. My guess is that GPT is actually far better at strategy than tactics; it might have an okay understanding of which board states look good and bad, but no consistent ability to run any sort of minimax to find a good move, even one turn ahead.
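To make "even one turn ahead" concrete, here is a toy sketch (python-chess, plain material counting) of the kind of trivial one-ply lookahead I have in mind; I am obviously not claiming GPT runs anything like this internally:

```python
import chess

PIECE_VALUES = {chess.PAWN: 1, chess.KNIGHT: 3, chess.BISHOP: 3,
                chess.ROOK: 5, chess.QUEEN: 9, chess.KING: 0}

def material(board: chess.Board, color: chess.Color) -> int:
    # Material balance from `color`'s point of view.
    return sum(value * (len(board.pieces(piece, color)) - len(board.pieces(piece, not color)))
               for piece, value in PIECE_VALUES.items())

def best_move_one_ply(board: chess.Board) -> chess.Move:
    # Greedy one-ply lookahead: try each legal move and keep the one that
    # leaves the best material balance for the side to move.
    side = board.turn

    def score(move: chess.Move) -> int:
        board.push(move)
        value = material(board, side)
        board.pop()
        return value

    return max(board.legal_moves, key=score)

# From the starting position every move ties on material, so this just
# returns the first legal move; from a tactical position it grabs hanging pieces.
print(best_move_one_ply(chess.Board()))
```

Even that much, applied consistently, seems to be beyond what GPT manages when it plays from text alone.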
I have a general principle of not contributing to harm. For instance, I do not eat meat, and tend to disregard arguments about impact. For animal rights issues, it is important to have people who refuse to participate, regardless of whether my decades of abstinence have impacted the supply chain.
For this issue however, I am less worried about the principle of it, because after all, a moral stance means nothing in a world where we lose. Reducing the probability of X-risk is a cold calculation, while vegetarianism is an Aristotelian one.
With that in mind, a boycott is one reason not to pay. The other is a simple calculation: is my extra $60 a quarter going to make any tiny, minuscule increase in X-risk? Could my $60 push the quarterly numbers just high enough so that they round up to the next 10s place, and then some member of the team works slightly harder on capabilities because they are motivated by that number? If that risk is 0.00000001%, well, when you multiply by all the people who might ever exist... ya know?
I agree that we are unlikely to pose any serious threat to an ASI. My disagreement with you comes when one asks why we don't pose any serious threat. We pose no threat, not because we are easy to control, but because we are easy to eliminate. Imagine you are sitting next to a small campfire, sparking profusely in a very dry forest. You have a firehose in your lap. Is the fire a threat? Not really. You can douse it at any time. Does that mean it couldn't in theory burn down the forest? No. After all, it is still fire. But you're not worried because you control all the variables. An AI in this situation might very well decide to douse the fire instead of tending it.
To bring it back to your original metaphor: For a sloth to pose a threat to the US military at all, it would have to understand that the military exists, and what it would mean to 'defeat' the US military. The sloth does not have that baseline understanding. The sloth is not a campfire. It is a pile of wood. Humans have that understanding. Humans are a campfire.
Now maybe the ASI ascends to some ethereal realm in which humans couldn't harm it, even if given completely free rein for a million years. This would be like a campfire in a steel forest, where even if the flames leave the stone ring, they can spread no further. Maybe the ASI will construct a steel forest, or maybe not. We have no way of knowing.
An ASI could use 1% of its resources to manage the nuisance humans and 'tend the fire', or it could use 0.1% of its resources to manage the nuisance humans by 'dousing' them. Or it could incidentally replace all the trees with steel, and somehow value s'mores enough that it doesn't replace the campfire with a steel furnace. This is... not impossible? But I'm not counting on it.
Sorry for the ten thousand edits. I wanted the metaphor to be as strong as I could make it.
I understand that perspective, but I think it's a small cost to Sam to change the way he's framing his goals. Small nudge now, to build good habits for when specifying goals becomes, not just important, but the most important thing in all of human history.
I'm very glad that this was written. It exceeded my expectations of OpenAI. One small problem that I have not seen anyone else bring up:
"We want AGI to empower humanity to maximally flourish in the universe."
If this type of language ends up informing the goals of an AGI, we could see some problems here. In general, we probably won't want our agentic AIs to be maximizers for anything, even if it sounds good. Even in the best-case scenario where this really does cause humanity to flourish in a way that we would recognize as such, what about when human flourishing necessitates the genocide of less advanced alien life in the universe?
I did not know about HPPD, although I've experienced it. After a bad trip (second time I'd ever experimented), I experienced minor hallucinogenic experiences for years. They were very minor (usually visuals when my eyes were closed) and would not have been unpleasant, except that I had the association with the bad trip.
I remember having so much regret on that trip. Almost everything in life, you have some level of control over. You can almost always change your perspective on things, or directly change your situation. On this trip though, I realized I messed with the ONE thing that I am always stuck with: my own point of view. I couldn't BELIEVE I had messed with that so flippantly.
That said, the first time I tried hallucinogens, it was a very pleasant and eye-opening experience. The point is not to take it lightly, and not to assume there are no risks.
As another anecdote, I had a friend when I was 17 who sounds very much like you, John. He knew more about drugs then than I ever have during my life. His knowledge of what was 'safe' and what wasn't didn't stop his drug usage from turning into a huge problem for him. I am certain that he was better off than someone thoughtlessly snorting coke, but he was also certainly worse off than he would have been had he never been near any sort of substance. If nothing else, it damaged some of his relationships, and removed support beams that he needed when other things inevitably went wrong. It turns out, damaging your reputation actually can be bad for you.
If you decide to experiment with drugs (and I am not recommending that, just saying if), my advice is two-fold:
1) Don't be in a hurry. You can absolutely afford to wait a few years (or decades), and it won't negatively impact you or your drug experience. Make sure you are in the right headspace.
2) Don't let it become a major aspect of your life. Having a couple trips to see what it's like is completely different from having a bi-monthly journey and making it your personality to try as many different mind-benders as possible. I've seen that go very badly.
Your prediction for 2025 sounds alarmingly like... right now.
Well, for my own sanity, I am going to give money anyway. If there's really no differentiation between options, I'll just keep giving to MIRI.
I am not an AI researcher, but it seems analogous to the acceptance of mortality for most people. Throughout history, almost everyone has had to live with the knowledge that they will inevitably die, perhaps suddenly. Many methods of coping have been utilized, but at the end of the day it seems like something that human psychology is just... equipped to handle. x-risk is much worse than personal mortality, but you know, failure to multiply and all that.
Is this game playable by people only lightly familiar with the topic of AI safety? In other words, can I use this game to introduce friends to the ideas? Can I use it to convince skeptical friends? Or would it be too jargony/reliant on prior knowledge?
Edit: The play online option is non-functional, and I can't see any examples of a real hand, so it's hard for me to get a sense of what this game is like.
For this Petrov Day, I'm also interested in how many people will have access to the button as a function of time. How many users have 1000+ Karma?
Is there anywhere to see the history of LessWrong Petrov Day? I'd be interested in whether we've ever succeeded before.
Also, I think most people know that the real cost of 1500 people not being able to check LessWrong for 12 hours is essentially 0. It may even be net positive to have a forced hiatus. Perhaps that's just a failure to multiply on my part. Anyway, I view this exercise as purely symbolic.
Interestingly, Jane will probably end up doing the exact same thing as Susan, only on the timescale of years instead of days. She kept those years in prison. If, in one iteration, the years immediately following prison were of some profound importance, she would probably keep those too. In the absence of a solar flare, she would find herself a 70-year-old woman whose memories consisted of only the most important selected years from the tens of thousands that make up her full history.
Thank you for the story.
Thank you for the response. I do think there should be at least some emphasis on boxing. I mean, hell. If we just give AIs unrestricted access to the web, they don't even need to be general to wreak havoc. That's how you end up with a smart virus, if not worse.
Then another basic question: why have we given up? I know that an ASI will almost definitely be uncontainable. But that does not mean that it can't be hindered significantly given an asymmetric enough playing field.
Stockfish would beat me 100 times in a row, even playing without a queen. But take away its rooks as well, and I can usually beat it. Easy avenues to escaping the box might be the difference between having a fire alarm and not having one.
So in your opinion, is an AI with access to GET requests essentially already out of the box?
I just thought of a question. If there is a boxed AI that has access to the internet, but only through GET requests, it might still communicate with the outside world through network traffic patterns. I'm reading a book right now where the AI overloads pages on dictionary websites to recruit programmers under the guise of it being a technical recruiting challenge.
My question: should we raise awareness of this escape avenue so that if, in the year 2030, a mid level web dev gets a mysterious message through web traffic, they know enough to be suspicious?
You captured this in your post, but for me it really comes down to people dismissing existential fears as scifi. It's not more complicated than "Oh, you've watched one too many Terminator movies". What we need is for several well-respected, smart public figures to say "Hey, this sounds crazy, but it really is the biggest threat of our time. Bigger than climate change, bigger than biodiversity loss. We really might all die if we get this wrong. And it really might happen in our lifetimes."
If I could appeal to authority when explaining this to friends, it would go over much better.
I am pretty concerned about alignment. Not SO concerned as to switch careers and dive into it entirely, but concerned enough to talk to friends and make occasional donations. With Eliezer's pessimistic attitude, is MIRI still the best organization to funnel resources towards, if for instance, I was to make a monthly donation?
Not that I think pessimism is necessarily bad; I just want to maximize the effectiveness of my altruism.
This could be the case. However, my instinct is that human intelligence is only incrementally higher than that of other animals. Sure, we crossed a threshold that allowed us to accomplish great things (language, culture, specialization), but I would honestly be shocked if you told me that evolution was incapable of producing another similarly intelligent species if it started from the baseline intelligence of, say, wolves or crows. If there is a "1-in-a-quadrillion chance" somewhere in our history, I expect that filter to be much further back than the recent evolution of hominids.
I don't have research to back this up. Just explaining why I personally wouldn't push timelines back significantly based on the anthropic principle.
Another way this could potentially backfire. $1,000,000 is a lot of money for 3 months. A lump sum like this will cause at least some of the researchers to A) Retire, B) Take a long hiatus/sabbatical, or C) Be less motivated by future financial incentives.
If 5 researchers decide to take a sabbatical, then whatever. If 150 of them do? Maybe that's a bigger deal. You're telling me you wouldn't consider it if 5-10 times your annual salary was dropped in your lap?
A stitch in time saves nine. As in, if you use a stitch to fix a small tear in your shirt now, you won't have to use more stitches to fix a bigger tear later.
and
The land of the free and the home of the brave. Last line of the US National Anthem.
I am not an AI safety researcher; more of a terrified spectator monitoring LessWrong for updates about the existential risk of unaligned AGI (thanks a bunch, HPMOR). That said, if it were a year away, I would jump into action. My initial thought would be to put almost all my net worth into a public awareness campaign. If we can cause enough trepidation in the general public, it's possible we could delay the emergence of AGI by a few weeks or months. My goal is not to solve alignment, but rather to prod AI researchers into implementing basic safety measures that might reduce S-Risk by 1 or 2 percent. Then... think deeply about whether I want to be alive for the most Interesting Time in human history.
I am very excited that the Frostwing Snipper did so well. I hope that one of them migrated and lived a good life in the Grassland. Thanks for putting this together, lsusr. It's been a lot of fun.
I wonder if the Tundra would have been more viable with more algae-eaters.