Comments
The post makes clear that two very different models of the world will lead to very different action steps, and the "average" of those steps isn't what follows the average of probabilities. See how the previous sentence felt awkward and technical, compared to the story? Sure, it's much longer, but the point gets across better, that's the value. I have added this story to my collection of useful parables.
Re-reading it, the language remains technical, one needs to understand a bit more probability theory to get the latter parts. I would like to see a retelling of the story, same points, different style, to test if it speaks to a different audience.
I filled out the survey. Thank you so much for running this!
Oh, glad I scrolled to find this comment. Adding a request for France, which does have charity tax deductions... but needs an appropriate receipt.
Could you provide an example of a prediction the Γ Framework makes which highlights the divergence between it and the Standard Model? Especially in cases where the Standard Model falls short of describing reality well enough?
Best weekend of the year. Been there in 2017, 2018, 2019, 2023, will be delighted to attend again. Consistent source of excellent discussions, assorted activities, fun and snacks. Does indeed feel like home.
Welcome! One gateway for you might be the LW Concepts page about it!
Most of the posts discuss, of course, infohazard policy and properties of information that would be harmful to know, or think about. Directly sharing blatantly harmful information would be irresponsible.
My raw and mostly confused/snarky comments as I was going through the paper can be found here (third section).
Cleaner version: this is not a technical agenda. This is not something that would elicit interesting research questions from a technical alignment researcher. There are however interesting claims:
- what a safe system ought to be like; it proposes three scales describing its reliability;
- how far up the scales we should aim for at minimum;
- how low on the scales currently large deployed models are.
While it positions a variety of technical agendas (mainly those of the co-authors) on the scales, the paper does not advocate for a particular approach, only the broad direction of "here are the properties we would like to have". Uncharitably, it's a reformulation of the problem.
The scales can be useful to compare the agendas that belong to the "let's prove that the system adheres to this specification" family. The paper makes no claims about what the specification entails, nor about the failure modes of various (combinations of) levels.
I appreciate this paper as a gateway to the related agendas and relevant literature, but I'm not enthusiastic about it.
Typo: Mira Murati, not Mutari.
Here's a spreadsheet version you can copy. Fill your answers in the "answers" tab, make your screenshot from the "view" tab.
I plan to add more functionality to this (especially comparison mode, as I collect some answers found on the Internet). You can now compare between recorded answers! Including yours, if you have filled them!
I will attempt to collect existing answers, from X and LW/EA comments.
Survey complete! I enjoyed the new questions, this should bring about some pretty graphs. Thank you for coordinating this.
I am producing videos in French (new batch in progress) on my main channel, Suboptimal.
But I also have a side channel in English, Suboptimal Voice, where I do readings of rationalist-sphere content. Some may appear in the future, I received requests for dramatic readings.
The Mindcrime tag might be relevant here! More specific than both concepts you mentioned, though. Which posts discussing them were you alluding to? Might be an opportunity to create an extra tag.
(also, yes, this is an Open Thread, your comment is in the right place)
Strongly upvoted for the clear write-up, thank you for that, and engagement with a potentially neglected issue.
Following your post I'd distinguish two issues:
(a) Lack of data privacy enabling a powerful future agent to target/manipulate you personally, because your data is just there for the taking, stored in not-so-well-protected databases, cross-reference is easier at higher capability levels, singling you out and fine-tuning a behavioral model on you in particular isn't hard;
(b) Lack of data privacy enabling a powerful future agent to build that generic behavioral model of humans from the thousands/millions of well-documented examples from people who aren't particularly bothered by privacy, from the same databases as above, plus simply (semi-)public social media records.
From your deception examples we already have strong evidence that (b) is possible. LLM capabilities will get better, and it will get worse when [redacted plausible scenario because my infohazard policies are ringing].
If (b) comes to pass, I would argue that the marginal effort needed to prevent (a) would only be useful to prevent certain whole coordinated groups of people (who should already be infosec-aware) from being manipulated. Rephrased: there's already a ton of epistemic failures all over the place but maybe there can be pockets of sanity linked to critical assets.
I may be missing something as well. Also seconding the Seed webtoon recommendation.
Quick review of the review, this could indeed make a very good top-level post.
No need to apologize, I'm usually late as well!
I don't think there is a great answer to "What is the most comprehensive repository of resources on the work being done in AI Safety?"
There is no great answer, but I am compelled to list some of the few I know of (that I wanted to update my Resources post with):
- Vael Gates's transcripts, which attempt to cover multiple views but, by the nature of conversations, aren't very legible;
- The Stampy project to build a comprehensive AGI safety FAQ, and to go beyond questions only, they do need motivated people;
- Issa Rice's AI Watch, which is definitely stuck in a corner of the Internet, if I didn't work with Issa I would never have discovered it, lots of data about orgs, people and labs, not much context.
Other mapping resources involve not the work being done but arguments and scenarios, as an example there's Lukas Trötzmüller's excellent argument compilation, but that wouldn't exactly help someone get into the field faster.
Just in case you don't know about it there's the AI alignment field-building tag on LW, which mentions an initiative run by plex, who also coordinates Stampy.
I'd be interested in reviewing stuff, yes, time permitting!
Answers in order: there is none, there were, there are none yet.
(Context starts, feel free to skip, this is the first time I can share this story)
After posting this, I was contacted by Richard Mallah, who (if memory serves right) created the map, compiled the references and wrote most of the text in 2017, to help with the next iteration of the map. The goal was to build a Body of Knowledge for AI Safety, including AGI topics but also more current-capabilities ML Safety methods.
This was going to happen in conjunction with the contributions of many academic & industry stakeholders, under the umbrella of CLAIS (Consortium on the Landscape of AI Safety), mentioned here.
There were design documents for the interactivity of the resource, and I volunteered. Back in 2020 I had severely overestimated both my web development skills and my ability to work during a lockdown, never published a prototype interface, and for unrelated reasons the CLAIS project... wound down.
(End of context)
I do not remember Richard mentioning a review of the map contents, apart from the feedback he received back when he wrote them. The map has been a bit tucked in a corner of the Internet for a while now.
The plans to update/expand it failed as far as I can tell. There is no new version and I'm not aware of any new plans to create one. I stopped working on this in April 2021.
There is no current map with this level of interactivity and visualization, but there have been a number of initiatives trying to be more comprehensive and up-to-date!
I second this, and expansions of these ideas.
Thank you, that is clearer!
But let's suppose that the first team of people who build a superintelligence first decide not to turn the machine on and immediately surrender our future to it. Suppose they recognize the danger and decide not to press "run" until they have solved alignment.
The section ends here but... isn't there a paragraph missing? I was expecting the standard continuation along the lines of "Will the second team make the same decision, once they reach the same capability? Will the third, or the fourth?" and so on.
Thank you for this post, I find this distinction very useful and would like to see more of it. Has the talk been recorded, by any chance (or will you give it again)?
Thank you, that was my understanding. Looking forward to the second competition! And, good luck sorting out all the submissions for this one.
[Meta comment]
The deadline is past, should we keep the submissions coming or is it too late? Some of the best arguments I could find elsewhere are rather long, in the vein of the Superintelligence FAQ. I did not want to copy-paste chunks of it and the arguments stand better as part of a longer format.
Anyway, signalling that the lack of money incentive will not stop me from trying to generate more compelling arguments... but I'd rather do it in French instead of posting here (I'm currently working on some video scripts on AI alignment, there's not enough French content of that type).
(Policymakers) We have a good idea of what makes bridges safe, through physics, materials science and rigorous testing. We can anticipate the conditions they'll operate in.
The very point of powerful AI systems is to operate in complex environments better than we can anticipate. Computer science can offer no guarantees if we don't even know what to check. Safety measures aren't catching up quickly enough.
We are somehow tolerating the mistakes of current AI systems. Nothing's ready for the next scale-up.
(ML researchers) We still don't have a robust solution to specification gaming: powerful agents find ways to get high reward, but not in the way you'd want. Sure, you can tweak your objective, add rules, but this doesn't solve the core problem, that your agent doesn't seek what you want, only a rough operational translation.
What would a high-fidelity translation look like? How would you create a system that doesn't try to game you?
(Policymakers) There is outrage right now about AI systems amplifying discrimination and polarizing discourse. Consider that this was discovered after they were widely deployed. We still don't know how to make them fair. This isn't even much of a priority.
Those are the visible, current failures. Given current trajectories and lack of foresight of AI research, more severe failures will happen in more critical situations, without us knowing how to prevent them. With better priorities, this need not happen.
(Tech execs) "Don’t ask if artificial intelligence is good or fair, ask how it shifts power". As a corollary, if your AI system is powerful enough to bypass human intervention, it surely won't be fair, nor good.
(ML researchers) Most policies are unsafe in a large enough search space; have you designed yours well, or are you optimizing through a minefield?
(Policymakers) AI systems are very much unlike humans. AI research isn't trying to replicate the human brain; the goal is, however, to be better than humans at certain tasks. For the AI industry, better means cheaper, faster, more precise, more reliable. A plane flies faster than birds, we don't care if it needs more fuel. Some properties are important (here, speed), some aren't (here, consumption).
When developing current AI systems, we're focusing on speed and precision, and we don't care about unintended outcomes. This isn't an issue for most systems: a plane autopilot isn't taking actions a human pilot couldn't take; a human is always there.
However, this constant supervision is expensive and slow. We'd like our machines to be autonomous and quick. They perform well on the "important" things, so why not give them more power? Except, here, we're creating powerful, faster machines that will reliably do things we didn't have time to think about. We made them to be faster than us, so we won't have time to react to unintended consequences.
This complacency will lead us to unexpected outcomes. The more powerful the systems, the worse they may be.
(Tech execs) Tax optimization is indeed optimization under the constraints of the tax code. People aren't just stumbling on loopholes, they're actually seeking them, not for the thrill of it, but because money is a strong incentive.
Consider now AI systems, built to maximize a given indicator, seeking whatever strategy is best, following your rules. They will get very creative with them, not for the thrill of it, but because it wins.
Good faith rules and heuristics are no match for adverse optimization.
(ML researchers) Powerful agents are able to search through a wide range of actions. The more efficient the search, the better the actions, the higher the rewards. So we are building agents that are searching in bigger and bigger spaces.
For a classic pathfinding algorithm, some paths are suboptimal, but all of them are safe, because they follow the map. For a self-driving car, some paths are suboptimal, but some are unsafe. There is no guarantee that the optimal path is safe, because we really don't know how to tell what is safe or not, yet.
A more efficient search isn't a safer search!
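The gap between "optimal" and "safe" can be shown on a toy road map. This is only a sketch: the graph, the costs and the `is_safe` flag on each edge are made up for illustration, and the flag stands in for exactly the knowledge the argument above says we do not have for real systems. The search itself is a plain Dijkstra.

```python
import heapq

def shortest_path(graph, start, goal, allow_unsafe=True):
    """Dijkstra over graph[node] = [(neighbor, cost, is_safe), ...]."""
    frontier = [(0, start, [start])]
    visited = set()
    while frontier:
        cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, step_cost, is_safe in graph.get(node, []):
            # Unsafe edges are only skipped when we are able to label them.
            if is_safe or allow_unsafe:
                heapq.heappush(frontier, (cost + step_cost, neighbor, path + [neighbor]))
    return None

# Hypothetical map: the shortcut is cheaper but crosses an unsafe segment.
ROADS = {
    "home": [("shortcut", 1, False), ("main_road", 3, True)],
    "shortcut": [("office", 1, True)],
    "main_road": [("office", 3, True)],
}

best = shortest_path(ROADS, "home", "office")
best_safe = shortest_path(ROADS, "home", "office", allow_unsafe=False)
print(best)       # (2, ['home', 'shortcut', 'office'])
print(best_safe)  # (6, ['home', 'main_road', 'office'])
```

The unconstrained search is the more "efficient" one and it picks the unsafe route. The safe route is only found because the unsafe edge was labelled in advance.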
(Policymakers) The goals and rules we're putting into machines are law to them. What we're doing right now is making them really good at following the letter of this law, but not the spirit.
Whatever we really mean by those rules, is lost on the machine. Our ethics don't translate well. Therein lies the danger: competent, obedient, blind, just following the rules.
Thank you for curating this, I had missed this one and it does provide a useful model of trying to point to particular concepts.
Hi! Thank you for this project, I'll attempt to fill the survey.
My apologies if you already encountered the following extra sources I think are relevant to this post:
- the Modeling Transformative AI Risk (MTAIR) project (an attempt to map out the relationships between key hypotheses and cruxes involved in debates about catastrophic risks from advanced AI);
- Turchin & Denkenberger's Classification of global catastrophic risks connected with artificial intelligence (lists and categorizes a wide range of catastrophic scenarios, from narrow or general AI, near-term or long-term, misuse or accidents, and many other factors, with references);
- Sotala's Disjunctive Scenarios of Catastrophic AI Risk.
Hi! Thank you for this outline. I would like some extra details on the following points:
- "They will find bugs! Maybe stack virtual boxes with hard limits" - Why is bug-finding an issue, here? Is your scheme aimed at producing agents that will not want to escape, or agents that we'd have to contain?
- "Communicate in a manner legible to us" - How would you incentivize this kind of legibility, instead of letting communication shift to whatever efficient code is most useful for agents to coordinate and get more XP?
- "Have secret human avatars steal, lie and aggress to keep the agents on their toes" - What is the purpose of this part? How is this producing aligned agents from definitely adversarial behavior from humans?
Congratulations on your launch!
Like Michaël Trazzi in the other post, I'm interested in the kind of products you'll develop, but more specifically in how the for-profit part interacts with both the conceptual research part and the incubator part. Are you expecting the latter two to yield new products as they make progress? Do these activities have different enough near-term goals that they mostly just coexist within Conjecture?
(also, looking forward to the pluralism sequence, this sounds great)
Thank you for this, I resonate with this a lot. I have written an essay about this process, a while ago: Always go full autocomplete. One of its conclusions:
It cannot be trained by expecting perfection from the start. It's trained by going full autocomplete, and reflecting on the result, not by dreaming up what the result could be. Now I wrote all that, I have evidence that it works.
The compression idea evokes Kaj Sotala's summary/analysis of the AI-Foom Debate (which I found quite useful at the time). I support the idea, especially given it has taken a while for the participants to settle on things cruxy enough to discuss and so on. Though I would also be interested in "look, these two disagree on that, but look at all the very fundamental things about AI alignment they agree on".
I finished reading all the conversations a few hours ago. I have no follow-up questions (except maybe "now what?"), I'm still updating from all those words.
One excerpt in particular, from the latest post, jumped out at me (from Eliezer Yudkowsky, emphasis mine):
This is not aimed particularly at you, but I hope the reader may understand something of why Eliezer Yudkowsky goes about sounding so gloomy all the time about other people's prospects for noticing what will kill them, by themselves, without Eliezer constantly hovering over their shoulder every minute prompting them with almost all of the answer.
The past years of reading about alignment have left me with an intense initial distrust of any alignment research agenda. Maybe it's ordinary paranoia, maybe something more. I've not come up with any new ideas myself, and I'm not particularly confident in my ability to find flaws in someone else's proposal (what if I'm not smart enough to understand them properly? What if I make things even more confused and waste everyone's time?)
After thousands and thousands of lengthy conversations where it takes everyone ages to understand where threat models disagree, why some avenue of research is promising or not, and what is behind words (there was a whimper in my mind when the meaning/usage of corrigibility was discussed, as if this whole time experts had been talking past each other)...
... after all that, I get this strong urge to create something like Arbital to explain everything. Or maybe something simpler like Stampy. I don't know if it would help much, the confusion is just very frustrating. When I'm facilitating discussions, trying to bring more people into the field, I insist on how not-settled many posts are, the kind of failure modes you have to watch out for.
Also this gives me an extra push to try harder, publish more things, ask more questions, because I'm getting more desperate to make progress. So, thank you for publishing this sequence.
So, assuming an unaligned agent here.
If your agent isn't aware that its compute cycles are limited (i.e. the compute constraint is part of the math problem), then you have three cases: (1a) the agent doesn't hit the limit with its standard search, you're in luck; (1b) the problem is difficult enough that the agent runs its standard search but fails to find a solution in the allocated cycles, so it always fails, but safely; (1c) you tweak the agent to be more compute-efficient, which is very costly and might not work. In practice, if you're in case 1b and it apparently fails safely, you have an incentive to just increase the limit.
If your agent is indeed aware of the constraint, then it has an incentive to remove it, or increase the limit by other means. Three cases here again: (2a) identical to 1a, you're in luck; (2b) the limit is low enough that strategic action to remove the constraint is impossible, the agent fails "safely"; (2c) the agent finds a way to remove the constraint, and you're in very unsafe territory.
Two observations from there: first, ideally you'd want your agent to operate safely even if given unbounded cycles, that's the Omni Test. Second, there's indeed an alignment concept for agents that just try to solve the problem without long-term planning, that's Myopia (and defining it formally is... hard).
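As a rough illustration of cases 1a and 1b only, here is a search whose step budget is enforced from outside and is invisible to the search itself. Everything in it is hypothetical (the toy state space, the budget values, the function name). It says nothing about the second family of cases, where the agent knows about the limit and can act on it.

```python
from collections import deque

def bounded_search(start, is_goal, neighbors, max_steps):
    """Breadth-first search that gives up after expanding max_steps states."""
    frontier = deque([start])
    seen = {start}
    steps = 0
    while frontier:
        if steps >= max_steps:
            return None  # budget exhausted: no answer, nothing else happens (case 1b)
        state = frontier.popleft()
        steps += 1
        if is_goal(state):
            return state  # found within budget (case 1a)
        for nxt in neighbors(state):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return None

# Toy problem: reach 10 from 1 using "+1" and "*2" moves.
moves = lambda n: [n + 1, n * 2]
found = bounded_search(1, lambda s: s == 10, moves, max_steps=100)
gave_up = bounded_search(1, lambda s: s == 10, moves, max_steps=5)
print(found, gave_up)  # 10 None
```

The incentive mentioned in case 1c shows up directly: when `gave_up` is `None`, the cheapest fix for the operator is to raise `max_steps`, not to make the search smarter.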
I am confused by the problem statement. What you're asking for is a generic tool, something that doesn't need information about the world to be created, but that I can then feed information about the real world and it will become very useful.
My problem is that the real world is rich, and feeding the tool with all relevant information will be expensive, and the more complicated the math problem is, the more safety issues you get.
I cannot rely on "don't worry if the Task AI is not aligned, we'll just feed it harmless problems", the risk comes from what the AI will do to get to the solution. If the problem is hard and you want to defer the search to a tool powerful enough that you have to choose your inputs carefully or catastrophe happens, you don't want to build that tool.
“Knowledge,” said the Alchemist, “is harder to transmit than anyone appreciates. One can write down the structure of a certain arch, or the tactical considerations behind a certain strategy. But above those are higher skills, skills we cannot name or appreciate. Caesar could glance at a battlefield and know precisely which lines were reliable and which were about to break. Vitruvius could see a great basilica in his mind’s eye, every wall and column snapping into place. We call this wisdom. It is not unteachable, but neither can it be taught. Do you understand?”
Quoted from Ars Longa, Vita Brevis.
I second Charlie Steiner's questions, and add my own: why collaboration? A nice property of an (aligned) AGI would be that we could defer activities to it... I would even say that the full extent of "do what we want" at superhuman level would encompass pretty much everything we care about (assuming, again, alignment).
Hi! Thank you for writing this and suggesting solutions. I have a number of points to discuss. Apologies in advance for all the references to Arbital, it's a really nice resource.
The AI will hack the system and produce outputs that it's not theoretically meant to be able to produce at all.
In the first paragraphs following this, you describe this first kind of misalignment as an engineering problem, where you try to guarantee that the instructions that are run on the hardware correspond exactly to the code you are running, i.e. being robust to hardware tampering.
I argue that this is actually a subset of the second kind of misalignment. You may have solved the initial engineering problem that at the start the hardware does what the software says, but the agent's own hardware is part of the world, and so can plausibly be influenced by whatever the agent outputs.
You can attempt to specifically bar the agent from taking actions that target its hardware; that is not a hardware problem, but your second kind of misalignment. For any sufficiently advanced agent, which may find cleverer strategies than the cleverest hacker, no hardware is safe.
Plus, the agent's hardware may have more parts than you expect as long as it can interact with the outside world. We still have a long way to go before being confident about that part.
Of course the problem with this oracle is that it's far too inefficient. On every single run we can get at most 1 bit of information, but for that one bit of information we're running a superhuman artificial intelligence. By the time it becomes practical, ordinary superhuman AIs will have been developed by someone else and destroyed the world.
There are other problems, for instance how can you be sure that the agent hasn't figured out how to game the automated theorem prover to validate its proofs? Your conclusion seems to be that if we manage to make it safe enough, it will become impractical enough. But if you try to get more than one bit of information, you run into other issues.
This satisfies our second requirement - we can verify the AIs solution, so we can tell if it's lying. There's also some FNP problems which satisfy the first requirement - there's only one right answer. For example, finding the prime factors of an integer.
Here the verification process is no longer an automated process, it's us. You correctly point out that most useful problems have various possible solutions, and the more information we feed the agent, the more likely it will be able to find some solution that exploits our flaws and... start a nuclear war, in your example.
I am confused by your setup, which seems to be trying to make it harder for the agent to harm us, when it shouldn't even be trying to harm us in the first place.
In other words, I'm asking: is there a hidden assumption that, in the process of solving FNP problems, the agent will need to explore dangerous plans?
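For reference, the two requirements in the quoted factoring example (the answer can be checked, and only one answer is accepted) can be sketched as an automated verifier. This is my own minimal illustration, not the post's construction: the function names are hypothetical, and the trial-division primality test is chosen for readability, not efficiency.

```python
from math import prod

def is_prime(n):
    """Trial division; fine for small illustrative inputs."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def verify_factorization(n, factors):
    """Accept only a nondecreasing list of primes whose product is n."""
    return (
        len(factors) > 0
        and list(factors) == sorted(factors)       # canonical order: one accepted answer
        and all(is_prime(p) for p in factors)
        and prod(factors) == n
    )

print(verify_factorization(84, [2, 2, 3, 7]))  # True
print(verify_factorization(84, [2, 6, 7]))     # False: 6 is not prime
print(verify_factorization(84, [2, 3, 7, 2]))  # False: not in canonical order
```

A checker like this leaves the solver no freedom in what it outputs, which is the property that disappears once the verifier is a human reading a free-form solution.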
A superintelligent FNP problem solver would be a huge boon towards building AIs that provably had properties which are useful for alignment. Maybe it's possible to reduce the question "build an aligned AI" to an FNP problem, and even if not, some sub-parts of that problem definitely should be reducible.
I would say that building a safe superintelligent FNP solver requires solving AI alignment in the first place. A less powerful FNP solver could maybe help with sub-parts of the problem... which ones?
If I'm correct and you're talking about
you might want to add spoiler tags.
I'm taking the liberty of pointing to Adam's DBLP page.
All my hopes for this new subscription model! The use of NFTs for posts will, without a doubt, ensure that quality writing remains forever in the Blockchain (it's like the Cloud, but with better structure). Typos included.
Is there a plan to invest in old posts' NFTs that will be minted from the archive? I figure Habryka already holds them all, and selling vintage Sequences NFTs to the highest bidder could be a nice addition to LessWrong's finances (imagine the added value of having a complete set of posts!)
Also, in the event that this model doesn't pan out, will the exclusive posts be released for free? It would be an excruciating loss for the community to have those insights sealed off.
My familiarity with the topic gives me enough confidence to join this challenge!
- Write down your own criticism so it no longer feels fresh
- Have your criticism read aloud to you by someone else
- Argue back to this criticism
- Write down your counter-arguments so they stick
- Document your own progress
- Get testimonials and references even when you don't "need" them
- Praise the competence of other people without adding self-deprecation
- Same as above but in their vicinity so they'll feel compelled to praise you back
- Teach the basics of your field to newcomers
- Teach the basics of your field to experts from other fields
- Write down the basics of your field, for yourself
- Ask someone else to make your beverage of choice
- Ask them to tell you "you deserve it" when they're giving it to you
- If your instinct is to reply "no I don't", consider swapping the roles
- Drink your beverage, because it feels nice
- Build stuff that cannot possibly be built by chance alone
- Stare outside the window, wondering if anybody cares about you
- Consider a world where everyone is as insecure as you
- Ask friends about their insecurities
- Consider you're too stupid to drink a glass of water, then drink some water
- Meditate on the difference between map and territory
- Write instructions for the non-impostor version of you
- Write instructions for whoever replaces you when people find out you're an impostor
- Validate those instructions with other experts, passing it off as project planning
- Follow the instructions to keep the masquerade on
- Refine the instructions since they're "obviously" not perfect
- Publish the whole thing here, get loads of karma
- Document everything you don't know for reference
- Publish the thing as a list of open problems
- Criticize harshly other people's work to see how they take it
- Make amends by letting them criticize you
- Use all this bitterness to create a legendary academic rivalry
- Consider "impostor" as a cheap rhetorical attack that doesn't hold up
- Become very good at explaining why other people are better than you
- Publish the whole thing as in-depth reporting of the life of scientists
- Focus on your deadline, time doesn't care if you're an impostor or not
- Make yourself lunch, balance on one foot, solve a sudoku puzzle
- Meditate on the fact you actually can do several complex things well
- Consider that competence is not about knowing exactly how one does things
- Have motivational pictures near you and argue how they don't apply to you
- Consider the absurdity of arguing with pictures
- Do interesting things instead, not because you have to, but to evade the absurdity
- Practice the "I have no idea what I'm doing, but no one does" stance
- Ask people why they think they know how they do things
- If they start experiencing impostor syndrome as well, support them
- Join a club of impostors, to learn from better impostors than you
- Write an apology letter to everyone you think you've duped
- Simulate the outrage of anyone reading this letter
- Cut ties with everyone who would actually treat you badly after reading
- Sleep well, eat well, exercise, brush your teeth, take care of yourself
I hope this makes the case at least somewhat that these events are important, even if you don’t care at all about the specific politics involved.
I would argue that the specific politics inherent in these events are exactly why I don't want to approach them. From the outside, the mix of corporate politics, reputation management, culture war (even the boring part), all of which belong in the giant near-opaque system that is Google, is a distraction from the underlying (indeed important) AI governance problems.
For that particular series of events, I already got all the governance-relevant information I needed from the paper that apparently made the dominoes fall. I don't want my attention to get caught in the whirlwind. It's too messy (and still is after months). It's too shiny. It's not tractable for me. It would be an opportunity cost. So I take a deep breath and avert my eyes.
My gratitude for the already posted suggestions (keep them coming!) - I'm looking forward to working on the reviews. My personal motivation resonates a lot with the help people navigate the field part; in-depth reviews are a precious resource for this task.
This is one of the rare times I can in good faith use the prefix "as a parent...", so thank you for the opportunity.
So, as a parent, lots of good ideas here. Some I couldn't implement in time, some that are very dependent on living conditions (finding space for the trampoline is a bit difficult at the moment), some that are nice reminders (swamp water, bad indeed), some that are too early (because they can't read yet)...
... but most importantly, some that genuinely blindsided me, because I found myself agreeing with them, and they were outside my thought process! The one-Brilliant-problem a day one, the let-them-eat-more-cookies, mainly.
I appreciate, in particular, the breadth of the ideas. Thanks for sharing, even if you don't practice what you preach, you'll be able to get feedback.