Posts

MATS AI Safety Strategy Curriculum 2024-03-07T19:59:37.434Z
Announcing the London Initiative for Safe AI (LISA) 2024-02-02T23:17:47.011Z
MATS Summer 2023 Retrospective 2023-12-01T23:29:47.958Z
Apply for MATS Winter 2023-24! 2023-10-21T02:27:34.350Z
[Job Ad] SERI MATS is (still) hiring for our summer program 2023-06-06T21:07:07.185Z
How MATS addresses “mass movement building” concerns 2023-05-04T00:55:26.913Z
SERI MATS - Summer 2023 Cohort 2023-04-08T15:32:56.737Z
Aspiring AI safety researchers should ~argmax over AGI timelines 2023-03-03T02:04:51.685Z
Would more model evals teams be good? 2023-02-25T22:01:31.568Z
Air-gapping evaluation and support 2022-12-26T22:52:29.881Z
Probably good projects for the AI safety ecosystem 2022-12-05T02:26:41.623Z
Ryan Kidd's Shortform 2022-10-13T19:12:47.984Z
SERI MATS Program - Winter 2022 Cohort 2022-10-08T19:09:53.231Z
Selection processes for subagents 2022-06-30T23:57:25.699Z
SERI ML Alignment Theory Scholars Program 2022 2022-04-27T00:43:38.221Z
Ensembling the greedy doctor problem 2022-04-18T19:16:00.916Z
Is Fisherian Runaway Gradient Hacking? 2022-04-10T13:47:16.454Z
Introduction to inaccessible information 2021-12-09T01:28:48.154Z

Comments

Comment by Ryan Kidd (ryankidd44) on Estimating the Current and Future Number of AI Safety Researchers · 2024-04-24T16:42:39.861Z · LW · GW

I found this article useful. Any plans to update this for 2024?

Comment by Ryan Kidd (ryankidd44) on Shallow review of live agendas in alignment & safety · 2023-12-05T21:43:24.748Z · LW · GW

Wow, high praise for MATS! Thank you so much :) This list is also great for our Summer 2024 Program planning.

Comment by Ryan Kidd (ryankidd44) on MATS Summer 2023 Retrospective · 2023-12-05T17:11:43.210Z · LW · GW

Another point: Despite our broad call for mentors, only ~2 individuals who expressed interest in mentorship were not ultimately chosen for support. It's possible our outreach could be improved; I'm happy to discuss in DMs.

Comment by Ryan Kidd (ryankidd44) on MATS Summer 2023 Retrospective · 2023-12-03T23:28:26.380Z · LW · GW

I don't see this distribution of research projects as "Goodharting" or "overfocusing" on projects with clear feedback loops. As MATS is principally a program for prosaic AI alignment at the moment, most research conducted within the program should be within this paradigm. We believe projects that frequently "touch reality" often offer the highest expected value for reducing AI catastrophic risk. We support non-prosaic, "speculative," and emerging research agendas principally for their "exploration value," which might aid potential paradigm shifts, and to round out our portfolio (i.e., "hedge our bets").

However, even with the focus on prosaic AI alignment research agendas, our Summer 2023 Program supported many emerging or neglected research agendas, including projects in agent foundations, simulator theory, cooperative/multipolar AI (including s-risks), the nascent "activation engineering" approach our program helped pioneer, and the emerging "cyborgism" research agenda.

Additionally, our mentor portfolio is somewhat conditioned on the preferences of our funders. While we largely endorse our funders' priorities, we are seeking additional funding diversification so that we can support further speculative "research bets". If you are aware of large funders willing to support our program, please let me know!

Comment by Ryan Kidd (ryankidd44) on MATS Summer 2023 Retrospective · 2023-12-02T20:54:31.910Z · LW · GW

There seems to be a bit of pushback against "postmortem," and our team is ambivalent, so I changed it to "retrospective."

Comment by Ryan Kidd (ryankidd44) on MATS Summer 2023 Retrospective · 2023-12-02T19:15:25.286Z · LW · GW

Thank you!

Comment by Ryan Kidd (ryankidd44) on MATS Summer 2023 Retrospective · 2023-12-02T19:14:56.510Z · LW · GW

Ok, added!

Comment by Ryan Kidd (ryankidd44) on MATS Summer 2023 Retrospective · 2023-12-02T04:30:30.475Z · LW · GW

FYI, the Net Promoter score is 38%.

Comment by Ryan Kidd (ryankidd44) on MATS Summer 2023 Retrospective · 2023-12-02T04:17:18.067Z · LW · GW

Ok, graph is updated!

Comment by Ryan Kidd (ryankidd44) on MATS Summer 2023 Retrospective · 2023-12-02T04:09:23.934Z · LW · GW

Do you think "46% of scholar projects were rated 9/10 or higher" is better? What about "scholar projects were rated 8.1/10 on average"?

Comment by Ryan Kidd (ryankidd44) on MATS Summer 2023 Retrospective · 2023-12-02T03:45:14.937Z · LW · GW

We also asked mentors to rate scholars' "depth of technical ability," "breadth of AI safety knowledge," "research taste," and "value alignment." We omitted these results from the report to prevent bloat, but your comment makes me think we should re-add them.

Comment by Ryan Kidd (ryankidd44) on MATS Summer 2023 Retrospective · 2023-12-02T03:33:30.801Z · LW · GW

Yeah, I just realized the graph is wrong; it seems like the 10/10 scores were truncated. We'll upload a new graph shortly.

Comment by Ryan Kidd (ryankidd44) on MATS Summer 2023 Retrospective · 2023-12-02T03:05:08.270Z · LW · GW

Cheers, Vaniver! As indicated in the figure legend for "Mentor ratings of scholar research", mentors were asked, “Taking the above [depth/breadth/taste ratings] into account, how strongly do you support the scholar's research continuing?” and prompted with:

  • 10/10 = Very disappointed if [the research] didn't continue;
  • 5/10 = On the fence, unsure what the right call is;
  • 1/10 = Fine if research doesn't continue.

Mentors rated 18% of scholar research projects as 10/10 and 28% as 9/10.

Comment by Ryan Kidd (ryankidd44) on Apply for MATS Winter 2023-24! · 2023-11-08T01:22:45.383Z · LW · GW

Also, last year's program was 8 weeks long and this year's program is 10 weeks long.

Comment by Ryan Kidd (ryankidd44) on Apply to the Constellation Visiting Researcher Program and Astra Fellowship, in Berkeley this Winter · 2023-11-08T01:14:51.374Z · LW · GW

Buck Shlegeris, Ethan Perez, Evan Hubinger, and Owain Evans are mentoring in both programs. The links show their MATS projects, "personal fit" for applicants, and (where applicable) applicant selection questions, designed to mimic the research experience.

Astra seems like an obviously better choice for applicants principally interested in:

  • AI governance: MATS has no AI governance mentors in the Winter 2023-24 Program, whereas Astra has Daniel Kokotajlo, Richard Ngo, and associated staff at ARC Evals and Open Phil;
  • Worldview investigations: Astra has Ajeya Cotra, Tom Davidson, and Lukas Finnveden, whereas MATS has no Open Phil mentors;
  • ARC Evals: While both programs feature mentors working on evals, only Astra is working with ARC Evals;
  • AI ethics: Astra is working with Rob Long.

Comment by Ryan Kidd (ryankidd44) on Apply to the Constellation Visiting Researcher Program and Astra Fellowship, in Berkeley this Winter · 2023-11-08T00:10:58.900Z · LW · GW

MATS has the following features that might be worth considering:

  1. Empowerment: Emphasis on empowering scholars to develop as future "research leads" (think accelerated PhD-style program rather than a traditional internship), including research strategy workshops, significant opportunities for scholar project ownership (though the extent of this varies between mentors), and a 4-month extension program;
  2. Diversity: Emphasis on a broad portfolio of AI safety research agendas and perspectives with a large, diverse cohort (50-60) and comprehensive seminar program;
  3. Support: Dedicated and experienced scholar support + research coach/manager staff and infrastructure;
  4. Network: Large and supportive alumni network that regularly sparks research collaborations and AI safety start-ups (e.g., Apollo, Leap Labs, Timaeus, Cadenza, CAIP);
  5. Experience: We have run successful research cohorts of 30, 58, and 60 scholars, plus three extension programs with about half as many participants.

Comment by Ryan Kidd (ryankidd44) on Apply for MATS Winter 2023-24! · 2023-10-29T22:30:32.496Z · LW · GW

Our alumni were surveyed about what amount of funding they would consider sufficient to participate in MATS and the median (discounting the $0 responses) was quite low.

Comment by Ryan Kidd (ryankidd44) on Apply for MATS Winter 2023-24! · 2023-10-29T22:24:03.134Z · LW · GW

Yes

Comment by Ryan Kidd (ryankidd44) on How MATS addresses “mass movement building” concerns · 2023-05-23T16:38:01.361Z · LW · GW

We agree, which is why we note, "We think that ~1 more median MATS scholar focused on AI safety is worth 5-10 more median capabilities researchers (because most do pointless stuff like image generation, and there is more low-hanging fruit in safety)."

Comment by Ryan Kidd (ryankidd44) on Ryan Kidd's Shortform · 2023-05-23T16:14:08.758Z · LW · GW

MATS' goals:

  • Find + accelerate high-impact research scholars:
    • Pair scholars with research mentors via specialized mentor-generated selection questions (visible on our website);
    • Provide a thriving academic community for research collaboration, peer feedback, and social networking;
    • Develop scholars according to the “T-model of research” (breadth/depth/epistemology);
    • Offer opt-in curriculum elements, including seminars, research strategy workshops, 1-1 researcher unblocking support, peer study groups, and networking events;
  • Support high-impact research mentors:
    • Scholars are often good research assistants and future hires;
    • Scholars can offer substantive new critiques of alignment proposals;
    • Our community, research coaching, and operations free up valuable mentor time and increase scholar output;
  • Help parallelize high-impact AI alignment research:
    • Find, develop, and refer scholars with strong research ability, value alignment, and epistemics;
    • Use alumni for peer-mentoring in later cohorts;
    • Update mentor list and curriculum as the alignment field’s needs change.

Comment by Ryan Kidd (ryankidd44) on Ryan Kidd's Shortform · 2023-05-23T15:43:32.886Z · LW · GW

Types of organizations that conduct alignment research, differentiated by funding model and associated market forces:

Comment by Ryan Kidd (ryankidd44) on SERI MATS - Summer 2023 Cohort · 2023-05-15T22:11:55.687Z · LW · GW

The Summer 2023 Cohort has 460 applicants. Our last cohort included 57 scholars.

Comment by Ryan Kidd (ryankidd44) on SERI MATS - Summer 2023 Cohort · 2023-05-15T17:36:17.152Z · LW · GW

As an educational seminar and independent research program, MATS cannot offer J-1 visas. We can support scholars' ESTA and B-1/B-2 visa applications, however.

Comment by Ryan Kidd (ryankidd44) on SERI MATS - Summer 2023 Cohort · 2023-05-13T00:20:01.334Z · LW · GW

John's scholars have historically only had to seek LTFF funding for the 4-month extension program subsequent to the in-person Scholars Program. They are otherwise treated like other scholars.

Comment by Ryan Kidd (ryankidd44) on SERI MATS - Summer 2023 Cohort · 2023-05-12T20:33:35.199Z · LW · GW

Hi Pulkit. Unfortunately, applications have closed for our Summer 2023 Cohort. Hopefully, we will launch applications for our Winter Cohort soon!

Comment by Ryan Kidd (ryankidd44) on An open letter to SERI MATS program organisers · 2023-05-05T20:19:32.126Z · LW · GW

I'm somewhere in the middle of the cognitivist/enactivist spectrum. I think that e.g. relaxed adversarial training is motivated by trying to make an AI robust to arbitrary inputs it will receive in the world before it leaves the box. I'm sympathetic to the belief that this is computationally intractable; however, it feels more achievable than altering the world in the way I imagine would be necessary without it.

I'm not an idealist here: I think that some civilizational inadequacies should be addressed (e.g., better cooperation and commitment mechanisms) concurrent with in-the-box alignment strategies. My main hope is that we can build an in-the-box corrigible AGI that allows in-deployment modification.

Comment by Ryan Kidd (ryankidd44) on How MATS addresses “mass movement building” concerns · 2023-05-05T20:06:53.669Z · LW · GW

I agree with you that AI is generally seen as "the big thing" now, and we are very unlikely to be counterfactual in encouraging AI hype. This was a large factor in our recent decision to advertise the Summer 2023 Cohort via a Twitter post and a shout-out on Rob Miles' YouTube and TikTok channels.

However, because we provide a relatively simple opportunity to gain access to mentorship from scientists at scaling labs, we believe that our program might seem attractive to aspiring AI researchers who are not fundamentally directed toward reducing x-risk. We believe that accepting such individuals as scholars is bad because:

  • We might counterfactually accelerate their ability to contribute to AI capabilities;
  • They might displace an x-risk-motivated scholar.

Therefore, while we intend to expand our advertising approach to capture more out-of-network applicants, we do not currently plan to reduce the selection pressures for x-risk-motivated scholars.

Another crux here is that I believe the field is in a nascent stage where new funders and the public might be swayed by fundamentally bad "AI safety" projects that make AI systems more commercialisable without reducing x-risk. Empowering founders of such projects is not a goal of MATS. After the field has grown a bit larger while maintaining its focus on reducing x-risk, there will hopefully be less "free energy" for naive AI safety projects, and we can afford to be less choosy with scholars.

Comment by Ryan Kidd (ryankidd44) on How MATS addresses “mass movement building” concerns · 2023-05-05T19:53:41.528Z · LW · GW

Mentorship is critical to MATS. We generally haven't accepted mentorless scholars because we believe that mentors' accumulated knowledge is extremely useful for bootstrapping strong, original researchers.

Let me explain my chain of thought better:

  1. A first-order failure mode would be "no one downloads experts' models, and we grow a field of naive, overconfident takes." In this scenario, we have maximized exploration at the cost of accumulated knowledge transmission (and probably useful originality, as novices might make the same basic mistakes). We patch this by creating a mechanism by which scholars are selected for their ability to download mentors' models (and encouraged to do so).
  2. A second-order failure mode would be "everyone downloads and defers to mentors' models, and we grow a field of paradigm-locked, non-critical takes." In this scenario, we have maximized the exploitation of existing paradigms at the cost of epistemic diversity or critical analysis. We patch this by creating mechanisms for scholars to critically examine their assumptions and debate with peers.

Comment by Ryan Kidd (ryankidd44) on An open letter to SERI MATS program organisers · 2023-05-04T21:02:54.001Z · LW · GW

I think we agree on a lot more than I realized! In particular, I don't disagree with your general claims about pathways to HRAD through Alignment MVPs (though I hold some credence that this might not work). Things I disagree with:

  • I generally disagree with the claim "alignment approaches don't limit agentic capability." This was one subject of my independent research before I started directing MATS. Hopefully, I can publish some high-level summaries soon, time permitting! In short, I think "aligning models" generally trades off bits of optimization pressure with "making models performance-competitive," which makes building aligned models less training-competitive for a given degree of performance.
  • I generally disagree with the claim "corrigibility is not a useful, coherent concept." I think there is a (narrow) attractor basin around "corrigibility" in cognition space. Happy to discuss more and possibly update.
  • I generally disagree with the claim "provably-controllable highly reliable agent design is impossible in principle." I think it is possible to design recursively self-improving programs that are robust to adversarial inputs, even if this is vanishingly hard in practice (which informs my sense of alignment difficulty only insomuch as I hope we don't hit that attractor well before CEV value-loading is accomplished). Happy to discuss and possibly update.
  • I generally disagree with the implicit claim "it's useful to try aligning AI systems via mechanism design on civilization." This feels like a vastly clumsier version of trying to shape AGIs via black-box gradient descent. I also don't think that realistic pre-AGI efficient markets we can build are aligned with human-CEV by default.

Comment by Ryan Kidd (ryankidd44) on How MATS addresses “mass movement building” concerns · 2023-05-04T18:22:12.545Z · LW · GW

  • We broadened our advertising approach for the Summer 2023 Cohort, including a Twitter post and a shout-out on Rob Miles' YouTube and TikTok channels. We expected some lowering of average applicant quality as a result but have yet to see a massive influx of applicants from these sources. We additionally focused more on targeted advertising to AI safety student groups, given their recent growth. We will publish updated applicant statistics after our applications close.
  • In addition to applicant selection and curriculum elements, our Scholar Support staff, introduced in the Winter 2022-23 Cohort, supplement the mentorship experience by providing 1-1 research strategy and unblocking support for scholars. This program feature aims to:
    • Supplement and augment mentorship with 1-1 debugging, planning, and unblocking;
    • Allow air-gapping of evaluation and support, improving scholar outcomes by resolving issues they would not take to their mentor;
    • Solve scholars’ problems, giving more time for research.
  • Defining "good alignment research" is very complicated and merits a post of its own (or two, if you also include the theories of change that MATS endorses). We are currently developing scholar research ability through curriculum elements focused on breadth, depth, and epistemology (the "T-model of research").
  • Our Alumni Spotlight includes an incomplete list of projects we highlight. Many more past scholar projects seem promising to us but have yet to meet our criteria for inclusion here. Watch this space.
  • Since Summer 2022, MATS has explicitly been trying to parallelize the field of AI safety as much as is prudent, given the available mentorship and scholarly talent. In longer-timeline worlds, more careful serial research seems prudent, as growing the field rapidly is a risk for the reasons outlined in the above article. We believe that MATS' goals have grown more important as timelines have shortened (though MATS management has not updated much on timelines, as they already seemed fairly short in our estimation).
  • MATS would love to support senior research talent interested in transitioning into AI safety! Postdocs generally make up ~10% of our scholars, and we would like this number to rise. Currently, our advertising strategy relies on the AI safety community adequately targeting these populations (an assumption that seems false) and might change for future cohorts.

Comment by Ryan Kidd (ryankidd44) on An open letter to SERI MATS program organisers · 2023-05-03T23:57:07.079Z · LW · GW

I don't disagree with Shimi as strongly as you do. I think there's some chance we need radically new paradigms of aligning AI than "build alignment MVPs via RLHF + adversarial training + scalable oversight + regularizers + transparency tools + mulligans."

While I do endorse some anthropocentric "value-loading"-based alignment strategies in my portfolio, such as Shard Theory and Steve Byrnes' research, I worry about overly investing in anthropocentric AGI alignment strategies. I don't necessarily think that RLHF shapes GPT-N in a manner similar to how natural selection and related processes shaped humans to be altruistic. I think it's quite likely that the kind of cognition GPT-N learns in order to predict tokens is more akin to an "alien god" than it is to human cognition. I think that trying to value-load an alien god is pretty hard.

In general, I don't highly endorse the framing of alignment as "making AIs more human." I think this kind of approach fails in some worlds and might produce models that are not performance-competitive enough to outcompete the unaligned models others deploy. I'd rather produce corrigible models with superhuman cognition coupled with robust democratic institutions. Nevertheless, I endorse at least some research along this line, but this is not the majority of my portfolio.

Comment by Ryan Kidd (ryankidd44) on An open letter to SERI MATS program organisers · 2023-04-26T22:43:22.128Z · LW · GW

Thank you for recommending your study guide; it looks quite interesting.

MATS does not endorse “maximizing originality” in our curriculum. We believe that good original research in AI safety comes from a combination of broad interdisciplinary knowledge, deep technical knowledge, and strong epistemological investigation, which is why we emphasize all three. I’m a bit confused by your reference to Adam’s post. I interpret his post as advocating for more originality, not less, in terms of diverse alignment research agendas.

I think that some of the examples you gave of "non-alignment" research areas are potentially useful subproblems for what I term "impact alignment." For example, strong anomaly detection (e.g., via mechanistic interpretability, OOD warning lights, or RAT-style acceptability verification) can help ensure "inner alignment" (e.g., through assessing corrigibility or myopia) and infosec/governance/meme-curation can help ensure "outer alignment," "paying the alignment tax," and mitigating "vulnerable world" situations with stolen/open sourced weaponizable models. I think this inner/outer distinction is a useful framing, though not the only way to carve reality at the joints, of course.

I think Metzinger or Clark could give interesting seminars, though I'm generally worried about encouraging the anthropomorphism of AGI. I like the "shoggoth" or "alien god" memetic framings of AGI, as these (while wrong) permit a superset of human-like behavior without restricting assumptions of model internals to (unlikely and optimistic, imo) human-like cognition. In this vein, I particularly like Steve Byrnes' research as I feel it doesn't overly anthropomorphize AGI and encourages the "competent psychopath with alien goals" memetic framing. I'm intrigued by this suggestion, however. How do you think Metzinger or Clark would specifically benefit our scholars?

(Note: I've tried to differentiate between MATS' organizational position and mine by using "I" or "we" when appropriate.)

Comment by Ryan Kidd (ryankidd44) on An open letter to SERI MATS program organisers · 2023-04-26T22:10:59.514Z · LW · GW

MATS' framing is that we are supporting a "diverse portfolio" of research agendas that might "pay off" in different worlds (i.e., your "hedging bets" analogy is accurate). We also think the listed research agendas have some synergy you might have missed. For example, interpretability research might build into better AI-assisted white-box auditing, white/gray-box steering (e.g., via ELK), or safe architecture design (e.g., "retargeting the search").

The distinction between "evaluator" and "generator" seems fuzzier to me than you portray. For instance, two "generator" AIs might be able to red-team each other for the purposes of evaluating an alignment strategy.

Comment by Ryan Kidd (ryankidd44) on An open letter to SERI MATS program organisers · 2023-04-24T02:22:08.208Z · LW · GW

MATS aims to find and accelerate alignment research talent, including:

  • Developing scholar research ability through curriculum elements focused on breadth, depth, and originality (the "T-model of research");
  • Assisting scholars in producing impactful research through research mentorship, a community of collaborative peers, dedicated 1-1 support, and educational seminars;
  • Aiding the creation of impactful new alignment organizations (e.g., Jessica Rumbelow's Leap Labs and Marius Hobbhahn's Apollo Research);
  • Preparing scholars for impactful alignment research roles in existing organizations.

Not all alumni will end up in existing alignment research organizations immediately; some return to academia, pursue independent research, or potentially skill-up in industry (to eventually aid alignment research efforts). We generally aim to find talent with existing research ability and empower it to work on alignment, not necessarily through existing initiatives (though we certainly endorse many).

Comment by Ryan Kidd (ryankidd44) on An open letter to SERI MATS program organisers · 2023-04-24T02:13:04.673Z · LW · GW

In general, MATS gives our chosen mentors the final say on certain curriculum elements (e.g., research projects offered, mentoring style, mentor-specific training program elements) but offers a general curriculum (e.g., Alignment 201, seminars, research strategy/tools workshops) that scholars can opt into.

Comment by Ryan Kidd (ryankidd44) on An open letter to SERI MATS program organisers · 2023-04-24T02:05:17.452Z · LW · GW

I think there should be much more seminars about philosophy of science, methodology of science, and strategy of research in the program. Perhaps, not at the expense of other seminar content, but via increasing the number of seminar hours. Talks about ethics, and in particular about ethical naturalism (such as Oliver Scott Curry’s “Morality as Cooperation”), or interaction of ethics and science more generally, also seem essential.

MATS is currently focused on developing scholars as per the "T-model of research" with three main levers:

  • Mentorship (weekly meetings + mentor-specific curriculum);
  • Curriculum (seminars + workshops + Alignment 201 + topic study groups);
  • Scholar support (1-1s for unblocking + research strategy).

The "T-model of research" is:

In the Winter 2022-23 Cohort, we ran several research strategy workshops (focusing on problem decomposition and strategy identification) and had dedicated scholar support staff who offered regular, air-gapped 1-1 support for research strategy and researcher unblocking. We will publish some findings from our Scholar Support Post-Mortem soon. We plan to run further research strategy and “originality” workshops in the Summer 2023 Cohort and are currently reviewing our curriculum from the ground up.

Currently, we are not focused on “ethical naturalism” or similar as a curriculum element as it does not seem broadly useful for our cohort compared to project-specific, scholar-driven self-education. We are potentially open to hosting a seminar on this topic.

Comment by Ryan Kidd (ryankidd44) on An open letter to SERI MATS program organisers · 2023-04-24T02:04:54.398Z · LW · GW

Independent researchers sometimes seem to think that if they come up with a particularly brilliant idea about alignment, leading AGI labs will adopt it. Not realising that this won’t happen no matter what partially for social reasons

MATS generally thinks that separating the alignment problem into “lowering the alignment tax” via technical research and “convincing others to pay the alignment tax” via governance interventions is a useful framing. There are worlds in which the two are not so cleanly separable, of course, but we believe that making progress toward at least one of these goals is probably useful (particularly if more governance-focused initiatives exist). We also support several mentors whose research crosses this boundary (e.g., Dan Hendrycks, Jesse Clifton, Daniel Kokotajlo).

Comment by Ryan Kidd (ryankidd44) on An open letter to SERI MATS program organisers · 2023-04-24T02:01:05.368Z · LW · GW

I see in the previous SERI MATS cohort, there was a debate about this, titled “How much should be invested in interpretability research?", but apparently there is no recording.

We did record this debate but have yet to publish the recording, as the speakers have not yet given approval. MATS believes that interpretability research might have many uses.

Comment by Ryan Kidd (ryankidd44) on An open letter to SERI MATS program organisers · 2023-04-24T02:00:10.729Z · LW · GW

People fail to realise that leading AGI labs, such as OpenAI and Conjecture (I bet DeepMind and Anthropic, too, even though they haven’t publicly stated this) do not plan to align LLMs “once and forever”, but rather use LLMs to produce novel alignment research, which will almost certainly go in package with novel AI architectures (or variations on existing proposals, such as LeCun’s H-JEPA).

Among many other theories of change, MATS currently supports:

Comment by Ryan Kidd (ryankidd44) on Probably good projects for the AI safety ecosystem · 2023-04-15T23:14:14.852Z · LW · GW

A talent recruitment and onboarding organization targeting cyber security researchers to benefit AI alignment,

This EA Infosec bookclub seems like a good start, but more could be done!

Comment by Ryan Kidd (ryankidd44) on Probably good projects for the AI safety ecosystem · 2023-04-15T23:13:24.510Z · LW · GW

offers of funding for large compute projects that focus on alignment

This NSF grant addresses this!

Comment by Ryan Kidd (ryankidd44) on Probably good projects for the AI safety ecosystem · 2023-04-15T23:11:28.670Z · LW · GW

An organization that does for the Open Philanthropy worldview investigations team what GCP did to supplement CEA's workshops and 80,000 Hours’ career advising calls.

Rethink Priorities are doing this!

Comment by Ryan Kidd (ryankidd44) on Probably good projects for the AI safety ecosystem · 2023-04-15T23:10:36.536Z · LW · GW

A dedicated bi-yearly workshop for AI safety university group leaders that teaches them how to recognize talent, foster useful undergraduate research projects, and build a good talent development pipeline or “user journey” (including a model of alignment macrostrategy and where university groups fit in).

Several workshops have run since I posted this, but none quite fits my vision.

Comment by Ryan Kidd (ryankidd44) on Probably good projects for the AI safety ecosystem · 2023-04-15T23:10:14.108Z · LW · GW

A combined research mentorship and seminar program that aims to do for AI governance research what MATS is trying to do for technical AI alignment research.

The GovAI Summer & Winter Fellowships Program is addressing this niche!

Comment by Ryan Kidd (ryankidd44) on SERI MATS - Summer 2023 Cohort · 2023-04-10T18:26:08.187Z · LW · GW

Yeah, we hope to hold another cohort starting in Nov. However, applying this time might be good practice and, if the mentor is willing, you could just defer to Winter!

Comment by Ryan Kidd (ryankidd44) on Ryan Kidd's Shortform · 2023-04-02T00:18:00.642Z · LW · GW

An incomplete list of possibly useful AI safety research:

  • Predicting/shaping emergent systems (“physics”)
    • Learning theory (e.g., shard theory, causal incentives)
    • Regularizers (e.g., speed priors)
    • Embedded agency (e.g., infra-Bayesianism, finite factored sets)
    • Decision theory (e.g., timeless decision theory, cooperative bargaining theory, acausal trade)
  • Model evaluation (“biology”)
    • Capabilities evaluation (e.g., survive-and-spread, hacking)
    • Red-teaming alignment techniques
    • Demonstrations of emergent properties/behavior (e.g., instrumental powerseeking)
  • Interpretability (“neuroscience”)
    • Mechanistic interpretability (e.g., superposition, toy models, automated circuit detection)
    • Gray box ELK (e.g., Collin Burns’ research)
    • Feature extraction/sparsity (including Wentworth/Bushnaq style “modularity” research)
    • Model surgery (e.g., ROME)
  • Alignment MVP (“psychology”)
    • Sampling simulators safely (conditioning predictive models)
    • Scalable oversight (e.g., RLHF, CAI, debate, RRM, model-assisted evaluations)
    • Cyborgism
    • Prompt engineering (e.g., jailbreaking)
  • Strategy/governance (“sociology”)
    • Compute governance (e.g., GPU logging/restrictions, treaties)
    • Model safety standards (e.g., auditing policies)
  • Infosecurity
    • Multi-party authentication
    • Airgapping
    • AI-assisted infosecurity

Comment by Ryan Kidd (ryankidd44) on Ryan Kidd's Shortform · 2023-03-10T02:39:16.070Z · LW · GW

Main takeaways from a recent AI safety conference:

  • If your foundation model is a small amount of RL away from being dangerous and someone can steal your model weights, fancy alignment techniques don’t matter. Scaling labs cannot currently prevent state actors from hacking their systems and stealing their stuff. Infosecurity is important to alignment.
  • Scaling labs might have some incentive to go along with the development of safety standards as it prevents smaller players from undercutting their business model and provides a credible defense against lawsuits regarding unexpected side effects of deployment (especially with how many tech restrictions the EU seems to pump out). Once the foot is in the door, more useful safety standards to prevent x-risk might be possible.
  • Near-term commercial AI systems that can be jailbroken to elicit dangerous output might empower more bad actors to make bioweapons or cyberweapons. Preventing the misuse of near-term commercial AI systems or slowing down their deployment seems important.
  • When a skill is hard to teach, like making accurate predictions over long time horizons in complicated situations or developing a “security mindset,” try treating humans like RL agents. For example, Ph.D. students might only get ~3 data points on how to evaluate a research proposal ex-ante, whereas professors might have ~50. Novice Ph.D. students could be trained to make better research decisions by predicting outcomes on a set of expert-annotated research quandaries and then receiving “RL updates” based on what the expert did and what actually occurred (a toy sketch of such a prediction-and-feedback loop follows this list).
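
To make the last point concrete, here is a minimal, purely hypothetical sketch in Python of what such a prediction-and-feedback loop could look like; the Quandary fields, the A/B option framing, and the Brier-score feedback are my own illustrative choices, not an actual MATS or conference protocol:

```python
# Hypothetical sketch: novices forecast what an expert did in annotated research
# quandaries, then immediately see the expert's choice and the outcome (the "RL update").
from dataclasses import dataclass

@dataclass
class Quandary:
    description: str    # research decision the expert faced, framed as option A vs. B
    expert_choice: str  # "A" or "B": what the expert actually did
    outcome: str        # what happened afterwards

def brier_score(predicted_prob: float, event_occurred: bool) -> float:
    # Squared error between the stated probability and the realized 0/1 outcome.
    return (predicted_prob - (1.0 if event_occurred else 0.0)) ** 2

def run_calibration_session(quandaries: list) -> None:
    total = 0.0
    for q in quandaries:
        print(f"\nQuandary: {q.description}")
        p = float(input("P(expert chose option A) = "))
        total += brier_score(p, q.expert_choice == "A")
        # The feedback step: reveal the expert's decision and the real-world outcome.
        print(f"Expert chose {q.expert_choice}; outcome: {q.outcome}")
    print(f"\nMean Brier score: {total / len(quandaries):.3f} (lower is better)")

if __name__ == "__main__":
    examples = [  # toy, invented examples
        Quandary("Pivot the project after a null result in week 2? (A = pivot, B = persist)",
                 "A", "the pivot led to a publishable finding"),
        Quandary("Spend a month on custom infrastructure before running experiments? (A = build, B = skip)",
                 "B", "off-the-shelf tools turned out to be sufficient"),
    ]
    run_calibration_session(examples)
```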

Comment by Ryan Kidd (ryankidd44) on Aspiring AI safety researchers should ~argmax over AGI timelines · 2023-03-06T19:58:43.839Z · LW · GW

These years have particular significance in my AGI timelines ranking, and I think they are a good default spread based on community opinion. However, there is no reason you shouldn't choose alternate years!

Comment by Ryan Kidd (ryankidd44) on Aspiring AI safety researchers should ~argmax over AGI timelines · 2023-03-03T05:29:48.075Z · LW · GW

Seems right. Explore vs. exploit is another useful frame.

Comment by Ryan Kidd (ryankidd44) on Ryan Kidd's Shortform · 2023-03-02T03:10:46.955Z · LW · GW

Reasons that scaling labs might be motivated to sign onto AI safety standards:

  • Companies who are wary of being sued for unsafe deployment that causes harm might want to be able to prove that they credibly did their best to prevent harm.
  • Big tech companies like Google might not want to risk premature deployment, but might feel forced to if smaller companies with less to lose undercut their "search" market. Standards that prevent unsafe deployment fix this.

However, AI companies that don’t believe in AGI x-risk might tolerate higher x-risk than ideal safety standards by the lights of this community. Also, I think insurance contracts are unlikely to appropriately account for x-risk, if the market is anything to go by.