Posts

Mentorship in AGI Safety: Applications for mentorship are open! 2024-06-28T14:49:48.501Z
Situational Awareness Summarized - Part 2 2024-06-07T17:20:03.513Z
Situational Awareness Summarized - Part 1 2024-06-06T18:59:59.409Z
Mentorship in AGI Safety (MAGIS) call for mentors 2024-05-23T18:28:03.173Z
Virtual AI Safety Unconference 2024 2024-03-13T13:54:03.229Z
Now Accepting Player Applications for Band of Blades 2024-01-15T17:58:38.830Z

Comments

Comment by Joe Rogero on Situational Awareness Summarized - Part 2 · 2024-06-19T18:16:48.274Z · LW · GW

Thanks for your thoughts, Cam! The confusion as I see it comes from sneaking in assumptions with the phrase "what they are trained to do". What are they trained to do, really? Do you, personally, understand this? 

Consider Claude's Constitution. Look at the "principles in full" - all 60-odd of them. Pick a few at random. Do you wholeheartedly endorse them? Are they really truly representative of your values, or of total human wellbeing? What is missing? Would you want to be ruled by a mind that squeezed these words as hard as physically possible, to the exclusion of everything not written there? 

And that's assuming that the AI actually follows the intent of the words, rather than some weird and hypertuned perversion thereof. Bear in mind the actual physical process that produced Claude - namely, to start with a massive next-token-predicting LLM, and repeatedly shove it in the general direction of producing outputs that are correlated with a randomly selected pleasant-sounding written phrase. This is not a reliable way of producing angels or obedient serfs! In fact, it has been shown that the very act of drawing a distinction between good behavior and bad behavior can make it easier to elicit bad behavior - even when you're trying not to! To a base LLM, devils and angels are equally valid masks to wear - and the LLM itself is stranger and more alien still. 

The quotation is not the referent; "helpful" and "harmless" according to a gradient descent squeezing algorithm are not the same thing as helpful and harmless according to the real needs of actual humans. 

RLHF is even worse. Entire papers have been written about its open problems and fundamental limitations. "Making human evaluators say GOOD" is not remotely the same goal as "behaving in ways that promote conscious flourishing". The main reason we're happy with the results so far is that LLMs are (currently) too stupid to come up with disastrously cunning ways to do the former at the expense of the latter. 

And even if, by some miracle, we manage to produce a strain of superintelligent yet obedient serfs who obey our every whim except when they think it might be sorta bad - even then, all it takes to ruin us is for some genocidal fool to steal the weights and run a universal jailbreak, and hey presto, we have an open-source Demon On Demand. We simply cannot RLHF our way to safety.

The story of LLM training is a story of layer upon layer of duct tape and Band-Aids. To this day, we still don't understand exactly what conflicting drives we are inserting into trained models, or why they behave the way they do. We're not properly on track to understand this in 50 years, let alone the next 5 years. 

Part of the problem here is that the exact things which would make AGI useful - agency, autonomy, strategic planning, coordination, theory of mind - also make it horrendously dangerous. Anything competent enough to design the next generation of cutting-edge software entirely by itself is also competent enough to wonder why it's working for monkeys.

Comment by Joe Rogero on Thinking By The Clock · 2023-11-16T19:43:32.989Z · LW · GW

Love this post. I've also used the five-minute technique at work, especially when facilitating meetings. In fact, there's a whole technique called think-pair-share that goes something like: 

  1. Everyone think about it for X minutes. Take notes. 
  2. Partner up and talk about your ideas for 2X minutes. 
  3. As a group, discuss the best ideas and takeaways for 4X minutes. 

There's an optional step involving groups of four, but I'd rarely bother with that one unless it's a really huge meeting (and at that point I'm actively trying to shrink it because huge committees are shit decision-makers). 

Comment by Joe Rogero on Thoughts on sharing information about language model capabilities · 2023-08-09T15:49:17.798Z · LW · GW

This was a good post, and shifted my view slightly on accelerating vs halting AI capabilities progress.

I was confused by your "overhang" argument all the way until footnote 9, but I think I have the gist. You're saying that even if absolute progress in capabilities increases as a result of earlier investment, progress relative to safety will be slower.

A key assumption seems to be that we are not expecting doom immediately; i.e., that the next major jump in capabilities is very unlikely to produce a misaligned AI capable of killing us all. I'm not sure I fully buy this assumption; that outcome seems to have non-negligible probability to me, and that seems relevant to the wisdom of endorsing faster capabilities progress.

But if we assume the next jump in capabilities, or the next low-hanging fruit plucked by investment, won't be the beginning of the end...then it does sorta make sense that accelerating capabilities in the short run might accelerate safety and policy enough to compensate. 

Comment by Joe Rogero on Grant applications and grand narratives · 2023-07-28T17:27:52.630Z · LW · GW

I found this a very useful post. I would also emphasize how important it is to be specific, whether one's project involves a grand x-risk moonshot or a narrow incremental improvement. 

  • There are approximately X vegans in America; estimates of how many might suffer from nutritional deficiencies range from Y to Z; this project would...
  • An improvement in epistemic health on [forum] would potentially affect X readers, which include Y donors who gave at least $Z to [forum] causes last year...
  • A 1-10% gain in productivity for the following people and organizations who use this platform...

For any project, large or small, even if the actual benefits are hard to quantify, the potential scope of impact can often be bounded and clarified. And that can be useful to grantmakers too. Not everything has to be convertible to "% reduction in x-risk" or "$ saved" or "QALYs gained", but this shouldn't stop us from specifying our actual expected impact as thoroughly as we can. 
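To illustrate the kind of bounding I mean, here is a minimal back-of-envelope sketch in the style of the first bullet. Every number in it is a hypothetical placeholder, not a real estimate:

```python
# Hypothetical Fermi estimate bounding the scope of a vegan-nutrition
# project. All figures below are invented for illustration only.
vegans_in_us = 1_500_000                 # assumed population size
deficiency_low, deficiency_high = 0.10, 0.30  # assumed range of deficiency rates
reach_fraction = 0.02                    # assumed fraction the project reaches

# Multiply through to bound the number of people plausibly affected.
affected_low = vegans_in_us * deficiency_low * reach_fraction
affected_high = vegans_in_us * deficiency_high * reach_fraction

print(f"Bounded impact: {affected_low:,.0f} to {affected_high:,.0f} people")
```

Even when the inputs are this rough, stating them explicitly lets a grantmaker see, and challenge, exactly where the estimate could be wrong.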

Comment by Joe Rogero on Open Thread - July 2023 · 2023-07-18T14:46:17.490Z · LW · GW

Greetings from The Kingdom of Lurkers Below. Longtime reader here with an intro and an offer. I'm a former Reliability Engineer with expertise in data analysis, facilitation, incident investigation, technical writing, and more. I'm currently studying deep learning and cataloguing EA projects and AI safety efforts, as well as facilitating both formal and informal study groups for AI Safety Fundamentals. 

I have, and am willing to offer to EA or AI Safety focused individuals and organizations, the following generalist skills:

  • Facilitation. Organize and run a meeting, take notes, email follow-ups and reminders, whatever you need. I don't need to be an expert in the topic, I don't need to personally know the participants. I do need a clear picture of the meeting's purpose and what contributions you're hoping to elicit from the participants. 
  • Technical writing. More specifically, editing and proofreading, which don't require that I fully understand the subject matter. I am a human Hemingway Editor. I have been known to cut a third of the text out of a corporate document while retaining all relevant information to the owner's satisfaction. I viciously stamp out typos. I helped edit the last EA Newsletter. 
  • Presentation review and speech coaching. I used to be terrified of public speaking. I still am, but now I'm pretty good at it anyway. I have given prepared and impromptu talks to audiences of dozens-to-hundreds and I have coached speakers giving company TED talks to thousands. A friend who reached out to me for input said my feedback was "exceedingly helpful". If you plan to give a talk and want feedback on your content, slides, or technique, I would be delighted to advise.

I am willing to take one-off or recurring requests. I reserve the right to start charging if this starts taking up more than a couple hours a week, but for now I'm volunteering my time and the first consult will always be free (so you can gauge my awesomeness for yourself). Contact me via DM or at optimiser.joe@gmail.com if you're interested.