Posts
Comments
The biggest problem here is that it fails to account for other actors using such systems to cause chaos, and for the possibility that the offense-defense balance strongly favours the attacker, particularly if you've placed limitations on your systems that make them safer. Aligned human-ish-level AIs don't provide a victory condition.
The amount of testing required before release is likely subjective, and this might push him to reduce it.
If it wouldn't have felt authentic, then it would have been the wrong choice to say it.
Only 33% confidence? It seems strange to state that X will happen if your odds are below 50%.
Do you have any thoughts on whether it would make sense to push for a rule that forces open-source or open-weight models to be released behind an API for a certain amount of time before they can be released to the public?
Would be very curious to know why people are downvoting this post.
Is it:
a) Too obvious
b) Too pretentious
c) Poorly written
d) Unsophisticated analysis
e) Promoting dishonesty
Or maybe something else.
You say counterfactuals in CLDT should correspond to consistent universes
That's not quite what I wrote in this article:
However, this now seems insufficient as I haven't explained why we should maintain the consistency conditions over comparability after making the ontological shift. In the past, I might have said that these consistency conditions are what define the problem and that if we dropped them it would no longer be Newcomb's Problem... My current approach now tends to put more focus on the evolutionary process that created the intuitions and instincts underlying these incompatible demands as I believe that this will help us figure out the best way to stitch them together.
I'll respond to the other component of your question later.
Just thought I'd add a second follow-up comment.
You'd have a much better idea of what made FHI successful than I would. At the same time, I would bet that in order to make this new project successful - and be its own thing - it'd likely have to break at least one assumption behind what made old FHI work well.
Then much later, when we ran the AI Alignment Prize here on LW, I also noticed that the prize by itself wasn't too important; the interactions between newcomers and old-timers were a big part of what drove the thing.
Could you provide more detail?
Reading your list, a bunch of it seems to be about decisions about what to work on or what locally to pursue.
I think my list appears more this way than I intended because I gave some examples of projects I would be excited by if they happened. I wasn't intending to stake out a strong position on whether these projects should be chosen by the institute itself, vs. being examples of projects that it might be reasonable for a researcher to choose within that particular area.
I'd love your feedback on my thoughts on decision theory.
If you're trying to get a sense of my approach in order to determine whether it's interesting enough to be worth your time, I'd suggest starting with this article (3 minute read).
I'm also considering applying for funding to create a conceptual alignment course.
I strongly agree with Owen's suggestions about figuring out a plan grounded in current circumstances, rather than reproducing what was.
Here are some potentially useful directions to explore.
Just to be clear, I'm not claiming that it should adopt all of these. Indeed, attempting to adopt all of these would likely be incoherent, as it would mean pursuing too many different directions at the same time.
These are just possibilities, some subset of which is hopefully useful:
- Rationality as more of a focus area: Given that Lightcone runs Less Wrong, an obvious path to explore is whether rationality could be further developed by providing people either a fellowship or a permanent position to work on developing the art:
- Being able to offer such paid positions might allow you to draw contributions from people with rare backgrounds. For example, you might decide it would be useful to better understand anthropology as a way of better understanding other cultures and practices and so you could directly hire an anthropologist to help with that.
- It would also help with projects that would be valuable, but which would be a slog and require specific expertise. For example, it would be great to have someone update the sequences in light of more recent psychological research.
- Greater focus on entrepreneurship:
- You've already indicated your potential interest in taking it this direction by adding it as one of the options on your form.
- This likely makes sense given that Lightcone is located in the Bay Area, the region with the highest concentration of entrepreneurs and venture capitalists in the world.
- Insofar as a large part of the impact of FHI was the projects it inspired elsewhere, it may make sense to more directly attempt this kind of incubation.
- Response to the rise of AI:
- One of the biggest shifts in the world since FHI was started has been the dramatic improvements in AI
- One response to this would be to focus more on the risks and impacts from AI. However, there are already a number of institutions focusing on this, so this might simply end up being a worse version of them:
- You may also think that you might be able to find a unique angle. For example, given how Eliezer was motivated to develop rationality in order to help people understand his arguments on AI safety, it might be valuable for there to be a research program which intertwines those two elements.
- Or you might identify areas, such as AI consciousness, that are still massively neglected
- Another response would be to try to figure out how to leverage AI:
- Would it make sense to train an AI agent on Less Wrong content?
- As an example, how could AI be used to develop wisdom?
- Another response would be to decide that better orgs are positioned to pursue these projects.
- Is there anything in the legacy of MIRI, CFAR, or FHI that is particularly ripe for further development?:
- For example, would it make sense for someone to try to publish an explanation of some of the ideas produced by MIRI on decision theory in a mainstream philosophical journal?
- Perhaps some techniques invented by CFAR could be tested with a rigorous academic study?
- Potential new sources of ideas:
- There seems to have been a two-way flow of ideas between LW/EA and FHI.
- While there may still be more ideas within these communities that are deserving of further exploration, it may also make sense to consider whether there are any new communities that could provide a novel source of ideas:
- A few possibilities immediately come to mind: post-rationality, progress studies, sensemaking, meditation, longevity, predictions.
- Less requirement for legibility than FHI:
- While FHI leaned towards the speculative end of academia, there was still a requirement for projects to be at least somewhat academically legible. What is enabled by no longer having that kind of requirement?
- Opportunities for alpha from philosophical rigour:
- This was one of the strengths of FHI - bringing philosophical rigour to new areas. It may be worth exploring how this could be preserved/carried forward.
- One of the strengths of academic philosophy - compared to the more casual writing that is popular on Less Wrong - is its focus on rigour and drawing out distinctions. If this institute were able to recruit people with strong philosophical backgrounds, are there any areas that would be particularly ripe for applying this style of thinking?
- Pursuing this direction might be a mistake if you would struggle to recruit the right people. It may turn out that the placement of FHI within Oxford was vital for drawing the philosophical talent of the calibre that they drew.
"The structure of synchronization is, in general, richer than the world model itself. In this sense, LLMs learn more than a world model" given that I expect this is the statement that will catch a lot of people's attention.
Just in case this claim caught anyone else's attention, what they mean by this is that it contains:
• A model of the world
• A model of the agent's process for updating its belief about which state the world is in
This strongly updates me towards expecting the institute to produce useful work.
Agreed.
I would love to see more thinking about this.
We've already seen one moment dramatically change the strategic landscape: ChatGPT.
This shift could actually be small compared to what would happen if there were a disaster.
I'll provide an example:
People sometimes dismiss exposure when studying GCR, since “everyone is exposed by definition”. This isn’t always true, and even when it is, it still points us towards interesting questions.
Even if there are some edge cases where this applies to existential risks, it doesn't necessarily mean that it is prominent enough to be worth including as an element in an x-risk framework.
Thanks for this post.
My (hot-)take: lots of useful ideas and concepts, but also many examples of people thinking everything is a nail/wanting to fit risks inside their pre-existing framework.
I think that sometimes it’s useful to just discuss a concept in the abstract. I’ll leave it to others to discuss this in the concrete.
My thoughts:
a) Some of the penalties seemed too weak
b) I'm uncertain whether we want license appeals decided by judges. I would want approval to be decided on technical grounds, but for judges to intervene to ensure that the process is fair. Or maybe a committee that is mostly technical, but which contains a non-voting legal expert to ensure compliance.
c) I would prefer a strong stand against dangerous open-weight models.
I currently believe it’s unlikely that Claude-3 will cause OpenAI to release their next model any sooner (they released GPT4 on Pi day after all), nor for future models
There's now a perception that Claude is better than ChatGPT and I don't believe that Sam Altman will allow that to persist for long.
What did he say that was dishonest in the China article? (It's paywalled).
That's an excellent point.
I agree. I think that's probably a better way of clarifying the confusion than what I wrote.
I was under the impression that this meant that a sufficiently powerful AI would be outer-aligned by default, and that this is what enables several of the kinds of deceptions and other dangers we're worried about.
This would be the case if inner alignment meant what you think it does, but it doesn't.
Is the difference between the goal being specified by humans vs being learned and assumed by the AI itself?
Yeah, outer alignment is focused on whether we can define what we want the AI to learn (i.e. write down a reward function). Inner alignment is focused on what the learned artifact (the AI) ends up learning to pursue.
See: DeepMind's How undesired goals can arise with correct rewards for an empirical example of inner misalignment.
From a quick skim, that post seems to only be arguing against scheming due to inner misalignment. Let me know if I'm wrong.
I don't think that's quite accurate: any sufficiently powerful AI will know what we want it to do.
I agree with Gwern. I think it's fairly rare that someone wants to write the whole entry themselves or articles for all concepts in a topic.
It's much more likely that someone just wants to add their own idiosyncratic takes on a topic. For example, I'd love to have a convenient way to write up my own idiosyncratic takes on decision theory. I tried including some of these in the main Wiki, but it (understandably) was reverted.
I expect that one of the main advantages of this style of content would be that you can just write a note without having to bother with an introduction or conclusion.
I also think it would be fairly important (though not at the start) to have a way of upweighting the notes added by particular users.
I agree with Gwern that this may result in more content being added to the main wiki pages when other users are in favour of this.
- Users can just create pages corresponding to their own categories
- Like Notion we could allow two-way links between pages so users would just tag the category in their own custom inclusions.
Cool idea, but before doing this one obvious inclusion would be to make it easier to tag LW articles, particularly your own articles, in posts by @including them.
I just dumped 100 mana on "no".
This comment indicates a major limitation which makes the result much less impressive.
Yeah, that's a pretty sharp limitation on the result.
I'd love to know if any other AI is able to pass this test when we exclude the tag.
Doesn't releasing the weights inherently involve releasing the architecture (unless you're using some kind of encrypted ML)? A closed-source model could release the architecture details as well, but one step at a time. Just to be clear, I'm trying to push things towards a policy that makes sense going forward, so even if what you said about not providing any interesting architectural insight is true, I still think we need to push these groups towards defining a point at which they're going to stop releasing open models.
Doing stuff manually might provide helpful intuitions/experience for automating it?
I would be very interested to know what the monks think about this.
I think it's much easier to talk about boundaries than preferences because true boundaries don't really contradict between individuals
I'm quite curious about this. What if you're stuck on an island with multiple people and limited food?
Very Wittgensteinian:
“What is your aim in Philosophy?”
“To show the fly the way out of the fly-bottle” (Philosophical Investigations)
Oh, they're definitely valid questions. The problem is that the second question is rather vague. You need to either state what a good answer would look like or why existing answers aren't satisfying.
I downvoted this post. I claim it's for the public good; maybe you find this strange, but let me explain my reasoning.
You've come on Less Wrong, a website that probably has more discussion of this than any other website on the internet. If you want to find arguments, they aren't hard to find. It's a bit like walking into a library and saying that you can't find a book to read.
The trouble isn't that you literally can't find any books/arguments; it's that you've got a bunch of unstated requirements that you want satisfied. Now that's perfectly fine; it's good to have standards. At the same time, you've asked the question in a maximally vague way. I don't expect you to be able to list all your requirements. That's probably impossible, and when it is possible, it's often a lot of work. At the same time, I do believe that it's possible to do better than maximally vague.
The problem with maximally vague questions is that they almost guarantee that any attempt to provide an answer will be unsatisfying both for the person answering and the person receiving the answer. Worse, you've framed the question in such a way that some people will likely feel compelled to attempt to answer anyway, lest people who think that there is such a risk come off as unable to respond to critics.
If that's the case, downvoting seems logical. Why support a game where no-one wins?
Sorry if this comes off as harsh, that's not my intent. I'm simply attempting to prompt reflection.
I have access to Gemini 1.5 Pro. Willing to run experiments if you provide me with an exact experiment to run, plus cover what they charge me (I'm assuming it's paid, I haven't used it yet).
“But also this person doesn't know about internal invariances in NN space or the compressivity of the parameter-function map (the latter in particular is crucial for reasoning about inductive biases), then I become extremely concerned”
Have you written about this anywhere?
Have you tried talking to professors about these ideas?
Is there anyone who understands GFlowNets who can provide a high-level summary of how they work?
Another frame that might be useful (rot13'd in the original):
There's a difference between the number of mathematical functions that implement a set of requirements and the number of programs that implement the set of requirements.
Simplicity is about the latter, not the former.
The existence of a large number of programs that produce the exact same mathematical function contributes towards simplicity.
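To make the programs-vs-functions point concrete, here's a toy sketch. The mini-language and op names are invented purely for illustration: many distinct programs compute the same boolean function, so a prior that is uniform over programs gives more mass to functions with many implementations.

```python
from itertools import product

# Toy "programs": sequences of ops applied to a 2-bit state (a, b).
# Many distinct programs compute the same boolean function, which is
# the point above: program count, not function count, is what a
# simplicity prior over programs actually weights.
OPS = {
    "swap":  lambda a, b: (b, a),
    "not_a": lambda a, b: (1 - a, b),
    "and":   lambda a, b: (a & b, b),
    "or":    lambda a, b: (a | b, b),
}

def run(program, a, b):
    for op in program:
        a, b = OPS[op](a, b)
    return a  # the output is read off the first wire

def truth_table(program):
    return tuple(run(program, a, b) for a, b in product((0, 1), repeat=2))

length = 3
programs = list(product(OPS, repeat=length))  # 4^3 = 64 programs
functions = {}
for p in programs:
    functions.setdefault(truth_table(p), []).append(p)

print(f"{len(programs)} programs of length {length}")
print(f"{len(functions)} distinct functions")
# Functions with many implementing programs get more prior mass under a
# uniform-over-programs prior:
for table, ps in sorted(functions.items(), key=lambda kv: -len(kv[1])):
    print(table, len(ps))
```

The gap between the two counts is the multiplicity that the simplicity argument leans on.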
I wrote up my views on the principle of indifference here:
https://www.lesswrong.com/posts/3PXBK2an9dcRoNoid/on-having-no-clue
I agree that it has certain philosophical issues, but I don’t believe that this is as fatal to counting arguments as you believe.
Towards the end I write:
“The problem is that we are making an assumption, but rather than owning it, we're trying to deny that we're making any assumption at all, i.e. ‘I'm not assuming a priori A and B have equal probability based on my subjective judgement, I'm using the principle of indifference.’ Roll to disbelieve.”
I feel less confident in my post than when I wrote it, but it still feels more credible than the position articulated in this post.
Otherwise: this was an interesting post. Well done on identifying some arguments that I need to digest.
Maybe just say that you're tracking the possibility?
Is there going to be a link to this from somewhere to make it accessible?
I think an important crux here is whether you think that we can build institutions which are reasonably good at checking the quality of AI safety work done by humans
Why is this an important crux? Is it necessarily the case that if we can reliably check AI safety work done by humans, then we can reliably check AI safety work done by AIs which may be optimising against us?
Updated
Second, it is also possible to robustly verify the outputs of a superhuman intelligence without superhuman intelligence.
Why do you believe that a superhuman intelligence wouldn't be able to deceive you by producing outputs that look correct instead of outputs that are correct?
I guess the main doubt I have with this strategy is that even if we shift the vast majority of people/companies towards more interpretable AI, there will still be some actors who pursue black-box AI. Wouldn't we just get screwed by those actors? I don't see how CoEm can be of equivalent power to purely black-box automation.
That said, there may be ways to integrate CoEms into the Superalignment strategy.
GPT-J token embeddings inhabit a zone in their 4096-dimensional embedding space formed by the intersection of two hyperspherical shells
You may want to update the TLDR if you agree with the comments that indicate that this might not be accurate.
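For anyone wanting to sanity-check a shell claim like this themselves, here's a rough sketch of the norm check involved. It uses random Gaussian vectors as a stand-in for the actual GPT-J embedding matrix (which you would load yourself), so the numbers it prints reflect generic high-dimensional norm concentration, not anything about GPT-J:

```python
import numpy as np

# Hypothetical sanity check for a "hyperspherical shell" claim: given an
# (n_tokens, 4096) embedding matrix, look at how tightly the vector norms
# concentrate around the centroid. Random Gaussian rows stand in for real
# GPT-J embeddings here, so this only demonstrates the method.
rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, 4096))  # stand-in for a real embedding matrix

centered = emb - emb.mean(axis=0)
norms = np.linalg.norm(centered, axis=1)
lo, hi = np.percentile(norms, [1, 99])
print(f"norms concentrated in [{lo:.1f}, {hi:.1f}]")

# A shell claim predicts norms clustering tightly around one or two radii;
# the fraction inside a narrow annulus quantifies that:
inside = np.mean((norms > lo) & (norms < hi))
print(f"fraction inside the annulus: {inside:.2f}")
```

Even i.i.d. Gaussian data lands in a thin annulus in 4096 dimensions, which is why a shell claim needs to show something tighter than this baseline.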