Yup, there is a working prototype and a programmer who would like to work on it full time if there were funding, but it hasn't progressed much for the past year or so because no one has had the free bandwidth to work on it.
https://aisafety.world/tiles/ has a bunch.
That's fair, I've added a note to the bottom of the post to clarify my intended meaning. I am not arguing for it in a well-backed up way, just stating the output of my models from being fairly close to the situation and having watched a different successful mediation.
Forced or badly done mediation does indeed seem terrible; entering into a conversation facilitated by someone skilled, with an intent to genuinely understand the harms caused and correct the underlying patterns, seems much less bad than the way the situation actually played out.
I was asked to comment by Ben earlier, but have been juggling more directly impactful projects and retreats. I have been somewhat close to parts of the unfolding situation, including spending some in-person time with Alice and Chloe, and (separately) the Nonlinear team, and communicating online on-and-off with most parties.
I can confirm some of the patterns Alice complained about, specifically not reliably remembering or following through on agreements about finances and roles, and Emerson being difficult to talk to about some things. I do not feel notably harmed by these, and was able to work them out with Drew and Kat without much difficulty, but it does back up my perception that there were real grievances which would have been harmful to someone in a less stable position. I also think they've done some excellent work, and would like to see that continue, ideally with clear and well-known steps to mitigate the kinds of harms which set this in motion.
I have consistently attempted to shift Nonlinear away from what appears to me a wholly counterproductive adversarial emotional stance, with limited results. I understand that they feel defected against, especially Emerson, but they were in the position of power and failed to make sure those they were working with did not come out harmed, and the responses to the initial implosion continued to generate harm and distraction for the community. I am unsettled by the threat of legal action towards Lightcone and the focus on controlling the narrative rather than repairing damage.
Emerson: You once said one of the main large failure modes you were concerned about becoming was Stalin's mistake: breaking the networks of information around you so you were unaware things were going so badly wrong. My read is you've been doing this in a way which is a bit more subtle than the gulags, by the intensity of your personality shaping the fragments of mind around you to not give you evidence that in fact you made some large mistakes here. I felt the effects of this indirectly, as well as directly. I hope you can halt, melt, and catch fire, and return to the effort as someone who does not make this magnitude of unforced error.
In a movement with the kind of co-protective nature ours has, you can't just push out someone who is deeply good, in the way you merely shouldn't in some other parts of the world. If there's intense conflict, call in a mediator and try to heal the damage.
Edit: To clarify, this is not intended as a blanket endorsement of mediation, or of avoiding other forms of handling conflict. I do think that going into a process where the parties genuinely try and understand each other's worlds much earlier would have been much less costly for everyone involved as well as the wider community in this case, but I can imagine mediation is often mishandled or forced in ways which are also counterproductive.
How can I better recruit attention and resources to this topic?
Consider finding an event organizer/ops person and running regular retreats on the topic. This will give you exposure to people in a semi-informal setting, and help you find a few people with clear thinking who you might want to form a research group with, and can help structure future retreats.
I've had great success with a similar approach.
We're getting about 20k uniques/month across the different URLs. Expect that to get much higher once we make a push for attention: when Rob Miles signs off on the quality, we'll launch to LW and then in his videos.
AI safety is funding constrained; we win in more timelines if a bunch of people successfully invest to give.
If you're able to spend time in the UK, the EA Hotel offers free food and accommodation for up to two years in low-cost shared living. Relatedly, there should really be one of these in the US and mainland Europe.
feel that I have a bad map of the AI Alignment/Safety community
This is true of many people, and why I built the map of AI safety :)
Next step is to rebuild aisafety.com into a homepage which ties all of this together, and offer AI Safety Info's database via an API for other websites (like aisafety.com, and hopefully lesswrong) to embed.
Why did the Alignment community not prepare tools and plans for convincing the wider infosphere about AI safety years in advance?
I've been organizing the volunteer team who built AI Safety Info for the past two and a half years, alongside building a whole raft of other tools like AI Safety Training and AI Safety World.
But, yes, the movement as a whole has dropped the ball pretty hard on basic prep. The real answer is that things are not done by default, and this subculture has relatively few do-ers compared to thinkers. And the thinkers had very little faith in the wider info-sphere, sometimes actively discouraging most do-ers from trying broad outreach.
Early corporations, like the East India Company, might be a decent reference class?
I'm pretty sure that at some level what sorts of things your brain spits out into your consciousness and how useful that information is in the given situation, is something that you can't fundamentally change. I expect this to be a hard-coded algorithm
Tune Your Cognitive Strategies purports to offer a technique which can improve that class of algorithm significantly.
Edit: Oh, no, you were meaning a different thing, and this probably goes into the inputs to the algorithm category?
Your probabilities are not independent; your estimates mostly flow from a world model which seems to me to be flatly and clearly wrong.
The plainest examples seem to be assigning:

| Claim | Probability |
| --- | --- |
| We invent a way for AGIs to learn faster than humans | 40% |
| AGI inference costs drop below $25/hr (per human equivalent) | 16% |
despite current models learning vastly faster than humans (the training time of LLMs is not a human lifetime, and covers vastly more data), current systems nearing AGI, and inference already being dramatically cheaper and plummeting further with algorithmic improvements. There is a general factor of progress, where progress leads to more progress, which you seem to be missing in the positive factors. For the negatives, a derailment that delays things enough to push us out that far would need to be extreme, on the order of a full-out nuclear exchange, given more reasonable models of progress.
I'll leave you with Yud's preemptive reply:
Taking a bunch of numbers and multiplying them together causes errors to stack, especially when those errors are correlated.
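To make that concrete, here's a toy simulation (my own illustration, not from either post): six steps which each look like roughly 55% bets, but which mostly hinge on one shared driver of progress, have a joint probability near 50% rather than the ~3% you get by multiplying the marginals.

```python
import numpy as np

# Toy illustration: six steps whose success mostly hinges on one shared
# "general factor of progress", vs. treating them as independent estimates.
rng = np.random.default_rng(0)
n = 1_000_000

shared = rng.random(n) < 0.5                # one underlying driver of progress
extra = rng.random((n, 6)) < 0.1            # small step-specific chance of success
steps = shared[:, None] | extra             # each step's marginal is ~55%

p_joint = steps.all(axis=1).mean()          # ~0.50: the errors move together
p_naive = steps.mean(axis=0).prod()         # ~0.03: what multiplying marginals gives
print(f"joint: {p_joint:.2f}, naive product: {p_naive:.2f}")
```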
Nice! Glad to see more funding options entering the space, and excited to see the S-process rolled out to more grantmakers.
Added you to the map of AI existential safety:
One thing which might be nice as part of improving grantee experience would be being able to submit applications as a Google Doc (with a template which gives the sections you list) rather than only via the form. This increases re-usability and decreases stress, as it's easy to make updates later on, so there's less worry about having missed something crucial. Might be more hassle than it's worth though, depending on how entangled your backend is with Airtable.
Cool! Feel free to add it with the form
That was me, for context:
core claim seems reasonable and worth testing, though I'm not very hopeful that it will reliably scale through the sharp left turn
my guess is the intuitions don't hold in the new domain, and that radical superintelligence requires intuitions you can't develop on relatively weak systems, but it's a source of data for our intuition models which might help with other stuff, so it seems reasonable to attempt.
Meta's previous LLM, OPT-175B, seemed good by benchmarks but was widely agreed to be much, much worse than GPT-3 (not even necessarily better than GPT-Neo-20b). It's an informed guess, not a random dunk, and does leave open the possibility that they've turned it around and have a great model this time rather than something which goodharts the benchmarks.
This is a Heuristic That Almost Always Works, and it's the one most likely to cut off our chances of solving alignment. Almost all clever schemes are doomed, but if we as a community let that meme stop us from assessing the object level question of how (and whether!) each clever scheme is doomed then we are guaranteed not to find one.
Security mindset means looking for flaws, not assuming all plans are so doomed that you don't need to look.
If this is, in fact, a utility function which if followed would lead to a good future, that is concrete progress and lays out a new set of true names as a win condition. Not a solution, since we can't train AIs with arbitrary goals, but it's progress in the same way that quantilizers were progress on mild optimization.
Not inspired by them, no. Those did not have, as far as I'm aware, a clear outlet for use of the outputs. We have a whole platform we've been building towards for three years (starting on the FAQ long before those contests), and the ability to point large numbers of people at that platform once it has great content thanks to Rob Miles.
As I said over on your Discord, this feels like it has a shard of hope, and the kind of thing that could plausibly work if we could hand AIs utility functions.
I'd be interested to see the explicit breakdown of the true names you need for this proposal.
Agreed, incentives probably block this from being picked up by megacorps. I had thought to try to get Musk's Twitter to adopt it at one point when he was talking about bots a lot; it would be very effective, but it doesn't allow rent extraction in the way the solution he settled on (paid Twitter Blue) does.
Websites which have the slack to allow users to improve their experience even if it costs engagement might be better adopters; LessWrong has shown they will do this with e.g. batching karma daily by default to avoid dopamine addiction.
Hypothesis #2: These bits of history are wrong for reasons you can check with simpler learned structures.
Maybe these historical patterns are easier to disprove with simple exclusions, like "these things were in different places"?
And if you use common but obviously wrong science or maths, it is less likely to.
Yeah, my guess is if you use really niche and plausible-sounding historical examples it is much more likely to hallucinate.
Maybe the RLHF-selected agent expects the person giving feedback to correct it on the history example, but not to know the latter example is false. If you asked a large sample of humans, more would be able to confidently say the first example is false than the latter one.
Yeah, that makes a lot of sense and fits my experience of what works.
Hell yeah!
This matches my internal experience that caused me to bring a ton of resources into existence in the alignment ecosystem (with various collaborators):
- aisafety.info - Man, there really should be a single point of access that lets people self-onboard into the effort. (Helped massively by Rob Miles's volunteer community, soon to launch a paid distillation fellowship)
- aisafety.training - Maybe we should have a unified place with all the training programs and conferences so people can find what to apply to? (AI Safety Support had a great database that just needed a frontend)
- aisafety.world - Let's make a map of everything in AI existential safety so people know what orgs, blogs, funding sources, resources, etc exist, in a nice sharable format. (Hamish did the coding, Superlinear funded it)
- ea.domains - Wow, there sure are a lot of vital domains that could get grabbed by squatters. Let's step in and save them for good orgs and projects.
- aisafety.community - There's no up-to-date list of online communities. This is an obvious missing resource.
- Rob Miles videos are too rare, almost entirely bottlenecked on the research and scriptwriting process. So I built some infrastructure which allows volunteers to collaborate as teams on scripts for him, which is being tested now.
- Ryan Kidd said there should be a nice professional site which lists all the orgs in a format which helps people leaving SERI MATS decide where to apply. aisafety.careers is my answer, though it's not quite ready yet. Volunteers wanted to help write up descriptions for orgs in the Google Docs we have auto-syncing with the site!
- Nonlinear wanted a prize platform, and that seemed pretty useful as a way to usefully use the firehose of money while FTXFF was still a thing, so I built Superlinear.
- There's a lot of obvious low-hanging fruit here. I need more hands. Let's make a monthly call and project database so I can easily pitch these to all the people who want to help save the world and don't know what to do. A bunch of great devs joined!
- and 6+ more major projects as well as a ton of minor ones, but that's enough to list here.
I do worry I might be neglecting my actual highest EV thing though, which is my moonshot formal alignment proposal (low chance of the research direction working out, but much more direct if it does). Fixing the alignment ecosystem is just so obviously helpful though, and has nice feedback loops.
This is some great advice. Especially 1 and 2 seem foundational for anyone trying to reliably shift the needle by a notable amount in the right direction.
My favorite frame is based on In The Future Everyone Will Be Famous To Fifteen People. If we as a civilization pass this test, we who lived at the turn of history will be outnumbered trillions to one, and the future historians will construct a pretty good model of how everyone contributed. We'll get to read about it, if we decide that that's part of our personal utopia.
I'd like to be able to look back from eternity and see that I shifted things a little in the right direction. That perspective helps defuse some of the local status-seeking drives, I think.
I can verify that the owner of the blaked[1] account is someone I have known for a significant amount of time, that he is a person with a serious, long-standing concern with AI safety (and all other details verifiable by me fit), and that based on the surrounding context I strongly expect him to have presented the story as he experienced it.
This isn't a troll.
[1] (also I get to claim memetic credit for coining the term "blaked" for being affected by this class of AI persuasion)
Agree with Jim, and suggest starting with some Rob Miles videos. The Computerphile ones, and those on his main channel, are a good intro.
Great, thanks!
Nice! Would you be up for putting this in the aisafety.info Google Drive folder too, with a question-shaped title?
Yes, this is a robustly good intervention on the critical path. Have had it on the Alignment Ecosystem Development ideas list for most of a year now.
Some approaches to solving alignment go through teaching ML systems about alignment and getting research assistance from them. Training ML systems needs data, but we might not have enough alignment research to sufficiently fine-tune our models, and we might miss out on many concepts which have not been written up. Furthermore, training on the final outputs (AF posts, papers, etc.) might be less good at capturing the thought processes which go into hashing out an idea or poking holes in proposals, which are exactly what a research assistant would most need to be skilled at.
It might be significantly beneficial to capture many of the conversations between researchers, and use them to expand the dataset of alignment content we train models on. Additionally, some researchers may be fine with having some of their conversations available to the public, in case people want to do a deep dive into their models and research approaches.
The two parts of the system I'm currently imagining to address this are:
- An email address where audio files can be sent, automatically run through Whisper, and added to the alignment dataset GitHub repo (rough sketch of the transcription step below).
- Clear instructions for setting up a tool which captures audio from calls automatically (either a general tool or platform-specific advice), and makes it as easy as possible to send the right calls to the dataset platform.
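As a very rough sketch of just the transcription step (assuming the open-source Whisper package; the email ingestion, file names, and repo upload are placeholders for things that don't exist yet):

```python
import whisper  # pip install openai-whisper

# Load a mid-sized model; the accuracy/speed trade-off would need tuning for real calls.
model = whisper.load_model("medium")

def transcribe_call(audio_path: str) -> str:
    """Transcribe one recorded conversation to plain text."""
    result = model.transcribe(audio_path)
    return result["text"]

if __name__ == "__main__":
    text = transcribe_call("example_call.mp3")  # hypothetical incoming attachment
    # In the imagined system this would be committed to the alignment dataset repo
    # once the researchers confirm they're happy for it to be public.
    with open("example_call.txt", "w") as f:
        f.write(text)
```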
I'm working on it. aisafety.world and /map are still a WIP, but very useful already.
Updates!
- Have been continually adding to this, up to 49 online communities.
- Added a student groups section, with 12 entries and some extra fields (website, contact, calendar, mailing list), based on the AGISF list (in talks with the maintainer to set up painless syncing between the two systems).
- Made a form where you can easily add entries.
Still getting consistent traffic, happy to see it getting used :)
Added to aisafety.training.
I do things, such as aisafety.community, aisafety.info, aisafety.world, aisafety.training, ea.domains, alignment.dev and about a dozen others. Come hang out on Alignment Ecosystem Development.
fwiw, that strange philosophical bullet fits remarkably well with a set of thoughts I had while reading Anthropic Bias about 'amount of existence' being the fundamental currency of reality (a bunch of the anthropic paradoxes felt like they were showing that if you traded sufficiently large amounts of "patterns like me exist more" then you could get counterintuitive results, like bending the probabilities of the world around you without any causal pathway), and infraBayes requiring it actually updated me a little towards infraBayes being on the right track.
My model of why humans seem to prefer non-existence to existence in some cases is that our ancestors faced situations which could reduce their ability to self-propagate to almost zero, and needed to avoid these really hard. Evolution gave us training signals which can easily generate subagents single-mindedly obsessed with avoiding certain kinds of intense suffering. This motivates us to avoid a wide range of realistic things which cost us existence, but as a side-effect of being emphasized so much it makes it possible to tip into suicidality in cases where, in our ancestral environment, that was not too costly because things were already bad enough that the agent wouldn't propagate much anyway (suicide when the cues indicated self-propagation was relatively likely should have been weeded out in on-distribution humans). This strikes me as unintended, the result of a hack which works pretty well on-distribution, and likely not reflectively consistent in the limit. An evolution which could generate brains with unbounded compute would not make agents which ever preferred suicide or non-existence.
Another angle on this is thinking of evolution having set things up for a sign-flipped subagent to be reinforced, which just wants to Not Be. This is not a natural shape for an agent to be, but it's useful enough that the pattern to generate it is common.
This is all pretty handwave-y and I don't claim high confidence that it's correct or useful, but might be interesting babble.
I classified the first as Outer misalignment, and the second as Deceptive outer misalignment, before reading on.
I agree with
Another use of the terms “outer” and “inner” is to describe the situation in which an “outer” optimizer like gradient descent is used to find a learned model that is itself performing optimization (the “inner” optimizer). This usage seems fine to me.
being the worthwhile use of the term inner alignment, as opposed to the ones you argue against, and could imagine that the term is being blurred and used in less helpful ways by many people. But I'd be wary of discouraging the inner vs outer alignment ontology too hard, as the internal-optimizer failure mode feels like a key one, and worth having as a clear category within goal misgeneralization.
As for why so many researchers were tripped up, I imagine that the framing of the pop quiz would make a big difference. The first was obviously outer alignment and the question was inner vs outer, so priors for a non-trick quiz were that the second was inner. People don't like calling out high-status people as pulling trick questions on them without strong evidence (especially in front of a group of peers), and there is a vague story in the direction of internal models not being displayed, one that doesn't hold up if you look too closely, which could cause a person to brush over the error.
For developing my hail mary alignment approach, the dream would be to be able to load enough of the context of the idea into an LLM that it could babble suggestions (since the whole doc won't fit in the context window, maybe randomizing which parts beyond the intro are included, for diversity?), then have it self-critique those suggestions automatically in different threads in bulk and surface the most promising implementations of the idea to me for review. In the perfect case I'd be able to converse with the model about the ideas and have that be not totally useless, and pump good chains of thought back into the fine-tuning set.
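A hedged sketch of what that loop might look like (the model name, prompts, and document variables are all placeholders, not an existing pipeline):

```python
import random
from openai import OpenAI  # pip install openai

client = OpenAI()
MODEL = "gpt-4"  # placeholder model name

def sample_context(intro: str, sections: list[str], k: int = 3) -> str:
    """Always include the intro, plus a random subset of later sections for diversity."""
    return intro + "\n\n" + "\n\n".join(random.sample(sections, k))

def babble(context: str, n: int = 5) -> list[str]:
    """Generate several candidate implementations of the idea."""
    out = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": context + "\n\nSuggest one concrete way to implement this idea."}],
        n=n,
        temperature=1.0,
    )
    return [choice.message.content for choice in out.choices]

def critique(suggestion: str) -> str:
    """Have the model attack a suggestion in a separate thread, for filtering before human review."""
    out = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": "List the strongest objections to this proposal:\n\n" + suggestion}],
    )
    return out.choices[0].message.content
```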
From the outside, it looks like Tesla and SpaceX are doing an unusually good (though likely not perfect) job of resisting mazedom and staying in touch with physical reality. Things like having a rule where any employee can talk to Musk directly, and blocking that is a fireable offence, promoting self-management rather than layers of middle management, over-the-top work ethic as a norm to keep out people who don't actually want to build things, and checking for people who have solved hard technical problems in the hiring process.
I'd be interested to hear from any employees of those organizations how it is on the ground.
There's something extra which is closely connected with this: The time and motivation of central nodes is extremely valuable. Burning the resources of people who are in positions of power and aligned with your values is much more costly than burning other people's resources[1], and saving or giving them extra resources is proportionately more valuable.
People who have not been a central node generally underestimate how many good things those people could unblock or achieve with a relatively small amount of their time/energy/social favors, but which simply don't get done because they're making hard trade-offs. For many people, one of the best ways to contribute is to find things a central node would love to see done but doesn't have the bandwidth to do themselves (or even to manage in any detail), and to step up and do a good job of them.
For example, I noticed that the LW tag portal entry on AI was missing many of the recently created tags about a year ago, and after making a bunch of other contributions to less sensitive tag-wiki pages I brought this up with the LW team and they gave me access to keep it updated. I've been doing that, and taking a small load off the team while maintaining something which should obviously be maintained.
[1] Energy is more often the limiting factor than time, so if a central node gets a motivation boost from mentoring you, this can be great. Don't avoid talking to central nodes; trust them to manage their own time, and just don't pressure them to talk to you if they seem like they don't want to continue the conversation. It's almost certainly not personal, they just have 10 other high-value things they could be doing.
Very excited by this agenda; I was just today discussing my hope that someone fine-tunes LLMs on the alignment archive soon!
Also, from MIT CSAIL and Meta: Gradient Descent: The Ultimate Optimizer
Working with any gradient-based machine learning algorithm involves the tedious task of tuning the optimizer's hyperparameters, such as its step size. Recent work has shown how the step size can itself be optimized alongside the model parameters by manually deriving expressions for "hypergradients" ahead of time.
We show how to automatically compute hypergradients with a simple and elegant modification to backpropagation. This allows us to easily apply the method to other optimizers and hyperparameters (e.g. momentum coefficients). We can even recursively apply the method to its own hyper-hyperparameters, and so on ad infinitum. As these towers of optimizers grow taller, they become less sensitive to the initial choice of hyperparameters. We present experiments validating this for MLPs, CNNs, and RNNs.
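For a sense of the core trick, here's a minimal sketch of my own (not the paper's code): treat the step size as a differentiable parameter and backpropagate through the update rule so it gets adjusted alongside the weights.

```python
import torch

# Toy quadratic objective; the point is the update rule, not the model.
torch.manual_seed(0)
target = torch.tensor([1.0, -2.0])
w = torch.randn(2, requires_grad=True)            # model parameters
log_lr = torch.tensor(-3.0, requires_grad=True)   # learn log(step size) so it stays positive
hyper_lr = 1e-2                                   # fixed hyper-hyperparameter

def loss_fn(w):
    return ((w - target) ** 2).sum()

for step in range(200):
    grad_w, = torch.autograd.grad(loss_fn(w), w, create_graph=True)
    w_next = w - log_lr.exp() * grad_w                          # differentiable SGD update
    hypergrad, = torch.autograd.grad(loss_fn(w_next), log_lr)   # gradient w.r.t. the step size
    with torch.no_grad():
        log_lr -= hyper_lr * hypergrad                          # adjust the step size itself
    w = w_next.detach().requires_grad_(True)
```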
Strong upvote, important work!
Yup, another instance of this is the longtermist census, which likely has the most entries but is not public. Then there's AI Safety Watch, the EA Hub (with the right filters), the mailing list of people who went through AGISF (I'm sure SERI MATS has one too), other mailing lists like AISS's opportunities one, other training programs, student groups, people in various entries on aisafety.community...
Yeah, there's some organizing to do. Maybe the EA forum's proposed new profile features will end up being the killer app?
I used Metaphor to find many entries on the upcoming map of AI safety.
Value alignment here means being focused on improving humanity's long term future by reducing existential risk, not other specific cultural markers (identifying as EA or rationalist, for example, is not necessary). Having people working towards the same goal seems vital for organizational cohesion, and I think alignment orgs would rightly not hire people who are not focused on trying to solve alignment. Upskilling people who are happy to do capabilities jobs without pushing hard internally for capabilities orgs to be more safety focused seems net negative.