Posts

The case for training frontier AIs on Sumerian-only corpus 2024-01-15T16:40:22.011Z
Jonathan Claybrough's Shortform 2023-07-26T09:06:22.848Z
News: Biden-Harris Administration Secures Voluntary Commitments from Leading Artificial Intelligence Companies to Manage the Risks Posed by AI 2023-07-21T18:00:57.016Z
An Overview of AI risks - the Flyer 2023-07-17T12:03:20.728Z

Comments

Comment by Jonathan Claybrough (lelapin) on MATS Summer 2023 Retrospective · 2024-03-01T16:13:27.332Z · LW · GW

Hi, I'm currently evaluating the cost-effectiveness of various projects and would be interested in knowing, if you're willing to disclose, approximately how much this program costs MATS in total. By this I mean the summer cohort, including the ops before and after that are necessary for it to happen, but not counting the extension.

Comment by Jonathan Claybrough (lelapin) on The impossible problem of due process · 2024-01-16T12:19:49.682Z · LW · GW

"It's true that we don't want women to be driven off by a bunch of awkward men asking them out, but if we make everyone read a document that says 'Don't ask a woman out the first time you meet her', then we'll immediately give the impression that we have a problem with men awkwardly asking women out too much — which will put women off anyway."

This seems like a weak response to me, at best defensible only if you consider yourself to be on the margin, with no thought for long-term growth or for your ability to clarify intentions (you have more than three words to work with when interacting with people in person).

To be clear, explicitly writing "don't ask a woman out the first time you meet her" would be terrible writing, and if that's the best guideline wording the members of that group can produce, then maybe nothing is better than that. Still, it reeks of "we've tried for 30 seconds and are all out of ideas" energy.

A guidelines document can give high-level guidance on the vibe you want (e.g. truth-seeking, not too much aggressiveness, giving space when people feel uncomfortable, communicating explicitly around norms), all phrased positively (i.e. you say what you want, not what you don't want), and can refer to sub-documents that give examples and get quite concrete if you have socially impaired people around who need to learn this explicitly.

Comment by Jonathan Claybrough (lelapin) on Stop talking about p(doom) · 2024-01-02T18:57:15.100Z · LW · GW

Note that "existential" is a term of art, distinct from "extinction".

The Precipice cites Bostrom and defines it as follows:
"An existential catastrophe is the destruction of humanity’s longterm potential. 
An existential risk is a risk that threatens the destruction of humanity’s longterm potential."

Disempowerment is generally considered an existential risk in the literature. 

Comment by Jonathan Claybrough (lelapin) on Funding case: AI Safety Camp · 2023-12-12T11:02:42.882Z · LW · GW

I participated in the previous edition of AISC and found it very valuable to my involvement in AI safety. I acquired knowledge (on standards and the standards process), experience, and contacts. I appreciate how much coordination AISC enables, with groups forming that give many people their first hands-on experience and a way to step up their involvement.

Comment by Jonathan Claybrough (lelapin) on AISC 2024 - Project Summaries · 2023-11-30T14:12:16.523Z · LW · GW

Thanks, and thank you for this post in the first place!

Comment by Jonathan Claybrough (lelapin) on AISC 2024 - Project Summaries · 2023-11-28T15:18:05.844Z · LW · GW

Jonathan Claybrough

Actually no, I think the project lead here is jonachro@gmail.com, which I guess sounds a bit like me, but isn't me ^^

Comment by Jonathan Claybrough (lelapin) on AI Safety is Dropping the Ball on Clown Attacks · 2023-10-22T09:34:28.485Z · LW · GW

I would be up for this project. As is, I downvoted Trevor's post for how rambly and repetitive it is. There's a nugget of an idea there, that AI can be used for psychological/information warfare, which I was interested in learning about, but the post doesn't seem to have much substantive argument to it, so I'd be interested in someone writing a much shorter version that argues its case with some sources.

Comment by Jonathan Claybrough (lelapin) on Padding the Corner · 2023-09-15T14:19:00.286Z · LW · GW

It's a nice pic and moment; I very much like this comic and the original scene. It might be exaggerating a trait (here, having the girl be particularly young) for comedic effect, but the Hogfather seems right.
I think I was around 9 when I got my first sword, and around 10 for a sharp knife. I have a scar in my left palm from stabbing myself with that sharp knife as a child while whittling wood for a bow. It hurt for a bit, and I learned to whittle away from myself or do so more carefully. I'm pretty sure my life is better for it and (from having this nice story attached to it) I like the scar.

Comment by Jonathan Claybrough (lelapin) on Padding the Corner · 2023-09-13T09:16:39.345Z · LW · GW

This story still presents the endless conundrum between avoiding hurt and letting people learn and gain skills.
Assuming the world is mostly the same as it is now by the time your children are parenting, would they have the skills to notice sharp corners if they never experienced them?

I think my intuitive approach here would be to put on some not-too-soft padding (which is effectively close to what you did; it's still an unpleasant experience to hit that corner even with the cloth).

What's missing is how to teach against existential risks. There's an extent to which actually bleeding profusely from a sharp corner can help one learn to walk carefully and anticipate dangers, and these skills do generalize to many situations and allow one to live a long, fruitful life. (This last sentence does not pertain to the actual age of your children and doesn't address the ideal ages at which one can actually learn the correct, generalizable thing.) If you have control over the future, remove all the sharp edges forever.
If you don't, remove the hard edges while they're young, and reinstate them when they can/should learn to recognize what hard edges typically look like and must be accounted for.

Comment by Jonathan Claybrough (lelapin) on Jonathan Claybrough's Shortform · 2023-07-26T09:06:22.955Z · LW · GW

Are people losing the ability to use and communicate in previous ontologies after getting Insight from meditation? (Or maybe they never had the understanding I'm expecting of them?) Should I be worried myself, in my practice of meditation?

Today I reread Kensho by @Valentine, which presents Looking, and the ensuing conversation in the comments between @Said Achmiz and @dsatan, where Said asks for concrete benefits we can observe and mostly fails to get them. I also noticed interesting comments by @Ruby, who in contrast was still able to communicate in the more typical LW ontology, but hadn't meditated to the point of Enlightenment. Is Enlightenment bad? Different?

My impression is that people don't become drastically better (at epistemology, rationality, social interaction, actually achieving their goals and values robustly) very fast through meditating or getting Enlightened, though they may acquire useful skills that could help them get better. If that's the case, it's safe for me to continue practicing meditation, getting into jhanas, Insight, etc. (I'm following The Mind Illuminated), as the failure of Valentine/dsatan to communicate their points could just be attributed to them not having been able to do so before either.
But I remain wary that people spend so much time engaging with and believing in the models and practices taught in meditation material that they actually change their minds for the worse in certain respects. It looks like meditation ontologies/practices are Out to Get You, and I don't want to get Got.

Comment by Jonathan Claybrough (lelapin) on Change my mind: Veganism entails trade-offs, and health is one of the axes · 2023-06-04T23:19:32.836Z · LW · GW

I focused my answer on the morally charged side, not the emotional one. The quoted statement said "A and B", so as long as B is mostly true for vegans, "A and B" is mostly true for a sub-group of vegans.

I'd agree with the characterization "it's deeply emotionally and morally charged for one side in a conversation, and often emotional to the other", because most people don't have small identities and do indeed feel attacked by others behaving differently.

Comment by Jonathan Claybrough (lelapin) on Change my mind: Veganism entails trade-offs, and health is one of the axes · 2023-06-02T18:44:12.140Z · LW · GW

It's standard that the morally charged side in a veganism conversation is the people who argue for veganism.
Your response reads as snarky, since you pretend to have understood the contrary. You're illustrating the OP's point that certain vegans are emotionally attached to their cause and jump at the occasion to defend their tribe. If you object to being pictured a certain way, at least act so that it isn't accurate to depict you that way.

Comment by Jonathan Claybrough (lelapin) on Change my mind: Veganism entails trade-offs, and health is one of the axes · 2023-06-02T18:20:31.548Z · LW · GW
Comment by Jonathan Claybrough (lelapin) on Agentized LLMs will change the alignment landscape · 2023-04-12T13:36:05.838Z · LW · GW

Did you know about "by default, GPTs think in plain sight"?
It doesn't explicitly talk about agentized GPTs, but it discusses the impact this property has on using GPTs for AGI, how it affects the risks, and what we should do about it (e.g. maybe RLHF is dangerous).

Comment by Jonathan Claybrough (lelapin) on Agentized LLMs will change the alignment landscape · 2023-04-12T12:21:20.114Z · LW · GW

So as not to be misinterpreted: I didn't say I'm sure it's more the format than the content that's causing the upvotes (open question), nor that this post doesn't meet the absolute quality bar that normally warrants 100+ upvotes (to each reader their opinion).

If you're open to discussing this at the object level, I can point to concrete disagreements with the content. Most importantly, this should not be seen as a paradigm shift, because it does not invalidate any of the previous threat models; it would only be a paradigm shift if it made building AGI any other way impossible. I also don't think this should "change the alignment landscape", because it's just another part of that landscape, one which was known and has been worked on for years (Anthropic and OpenAI have been "aligning" LLMs, and I'd bet 10:1 that they anticipated these would be used to build agents, like most people I know in alignment did).

To clarify, I do think it's really important and great that people work on this, and that in terms of ordering, this will be the first x-risk stuff we see. But we could solve the GPT-agent problem and still die to unaligned AGI three months afterwards. The fact that the world trajectory we're on is throwing additional problems into the mix (keeping the world safe from short-term misuse and unaligned GPT-agents) doesn't make the existing ones simpler. There is still pressure to build autonomous AGI, there might still be mesa-optimizers, there might still be deception, etc. We need the manpower to work on all of these, and not "shift the alignment landscape" to focus only on the short-term risks.

I'd recommend not worrying much about PR risk and just asking the direct question: even if this post is only ever read by LW folk, does the "break all encryption" suggestion add to the conversation? Causing people to take time to debunk certain suggestions isn't productive even without the context of PR risk.

Overall I'd like some feedback on my tone, whether it's too direct/aggressive for you or whether it's fine. I can adapt.

Comment by Jonathan Claybrough (lelapin) on All AGI Safety questions welcome (especially basic ones) [April 2023] · 2023-04-12T11:31:57.025Z · LW · GW

You can read "reward is not the optimization target" for why a GPT system probably won't be goal oriented to become the best at predicting tokens, and thus wouldn't do the things you suggested (capturing humans). The way we train AI matters for what their behaviours look like, and text transformers trained on prediction loss seem to behave more like Simulators. This doesn't make them not dangerous, as they could be prompted to simulate misaligned agents (by misuses or accident), or have inner misaligned mesa-optimisers

I've linked some good resources directly answering your question, but to read more broadly on AI safety I can point you towards the AGI Safety Fundamentals course, which you can read online or join as a reading group. Generally, you can head over to AI Safety Support, check out their "lots of links" page, and join the AI Alignment Slack, which has a channel for questions too.

Finally, how does complexity emerge from simplicity? It's hard to answer in detail for AI, and you probably need to delve into those details to have a real picture, but there's at least a strong reason to think it's possible: we exist. Life originated from "simple" processes (at least in the sense of being mechanistic and non-agentic), chemical reactions, etc. It evolved into cells, then multicellular life, and grew from there. Look into the history of life and evolution and you'll have one answer to how simplicity (optimizing for reproductive fitness) led to self-improvement and self-awareness.

Comment by Jonathan Claybrough (lelapin) on Agentized LLMs will change the alignment landscape · 2023-04-11T10:18:52.672Z · LW · GW

Quick meta comment to express that I'm uncertain whether posting things in lists of 10 is a good direction. The advantages might be real: easy to post, quick feedback, easy interaction, etc.

But the main disadvantage is that it comparatively drowns out other, better posts (with more thought and value in them). I'm unsure whether the content of the post was importantly missing from the conversation (for many readers) and that's why this got upvoted so fast, or whether it's largely the format... Even if this post isn't bad (and I'd argue it is, for the suggestions it promotes), this is an early warning of a possible trend where people with less thought-out takes quickly post highly accessible content, get comparatively more upvotes than they should, and it becomes harder to find good content.

(Additional disclosure: some of my distaste for this post comes from the fact that its call to break all encryption is being cited on Twitter as representative of the alignment community. I'd have liked to answer that it obviously isn't, but it got many upvotes! This makes my meta point also seem motivated by PR/optics, which is why it felt necessary to disclose, but let's mostly focus on consequences inside the community.)

Comment by Jonathan Claybrough (lelapin) on All AGI Safety questions welcome (especially basic ones) [April 2023] · 2023-04-09T17:15:48.819Z · LW · GW

First, a quick response on your dead man's switch proposal: I'd generally say I support something in that direction. You can find existing literature considering the subject and expanding in different directions in the "multilevel boxing" paper by Alexey Turchin (https://philpapers.org/rec/TURCTT). I think you'll find it interesting in light of your proposal, and it might give a better idea of what the state of the art is on proposals (though we don't have any implementation afaik).

Back to "why are the predicted probabilities so extreme that for most objectives, the optimal resolution ends with humans dead or worse". I suggest considering a few simple objectives we could give ai (that it should maximise) and what happens, and over trials you see that it's pretty hard to specify anything which actually keeps humans alive in some good shape, and that even when we can sorta do that, it might not be robust or trainable. 
For example, what happens if you ask an ASI to maximize a company's profit ? To maximize human smiles? To maximize law enforcement ? Most of these things don't actually require humans, so to maximize, you should use the atoms human are made of in order to fulfill your maximization goal. 
What happens if you ask an ASI to maximize number of human lives ? (probably poor conditions). What happens if you ask it to maximize hedonistic pleasure ? (probably value lock in, plus a world which we don't actually endorse, and may contain astronomical suffering too, it's not like that was specified out was it?). 

So it seems maximising agents with simple utility functions (over few variables) mostly end up with dead humans or worse. Approaches which ask for much less therefore seem safer and more approachable: e.g. an AGI that just tries to secure the world from existential risk (a pivotal act) and solve some basic problems (like dying), then gives us time for a long reflection to actually decide what future we want, and is corrigible so it lets us do that.

Comment by Jonathan Claybrough (lelapin) on All AGI Safety questions welcome (especially basic ones) [April 2023] · 2023-04-09T10:51:08.169Z · LW · GW

I watched the video, and appreciate that he seems to know the literature quite well and has thought about this a fair bit; he gave a really good introduction to some of the known problems.
This particular video doesn't go into much detail on his proposal, and I'd have to read his papers to delve further; this seems worthwhile, so I'll add some to my reading list.

I can still point out the biggest ways in which I see him being overconfident:

  • Only considering the multi-agent world. Though he's right that there already are and will be many, many deployed AI systems, that does not translate into there being many deployed state-of-the-art systems. As long as training and inference costs continue increasing (as they have), then on the contrary fewer and fewer actors will be able to afford state-of-the-art training and deployment, leading to very few (or one) significantly powerful AGIs (as compared to the others, for example GPT-4 vs GPT-2).
  • Not considering the impact that governance and policies could have on this. This isn't just a tech thing where tech people can do whatever they want forever; regulation is coming. If we think we have higher chances of survival in highly regulated worlds, then the AI safety community will do a bunch of work to ensure fast and effective regulation (to the extent possible). The genie is not out of the bottle for powerful AGI: governments can control compute, regulate powerful AI as weapons, and set up international agreements to ensure this.
  • The hope that game theory ensures that AI developed under his principles would be good for humans. There's a crucial gap in going from the real world to mathematical models. Game theory might predict good results under certain conditions, rules and assumptions, but many of these aren't true of the real world, and simple game theory does not yield accurate world predictions (e.g. make people play various social games and they won't act the way game theory says). Stated strongly, putting your hope in game theory is about as hard as putting your hope in alignment. There's nothing magical about game theory which makes it simpler to get working than alignment, and it's been studied extensively by AI researchers (e.g. it's why Eliezer calls himself a decision theorist and writes a lot about economics) with no clear "we've found a theory which empirically works robustly and in which we can put the fate of humanity".

I work in AI strategy and governance, and feel we have better chances of survival in a world where powerful AI is limited to extremely few actors, with international supervision and cooperation for the guidance and use of these systems, and with extreme efforts put into engineering safety, corrigibility, etc. I don't trust predictions about how complex systems turn out (which is what real multi-agent problems are) and don't think we can control these well in most relevant cases.

Comment by Jonathan Claybrough (lelapin) on Maze-solving agents: Add a top-right vector, make the agent go to the top-right · 2023-04-03T08:22:30.883Z · LW · GW

Writing down predictions. The main caveat is that these are predictions about how the author will resolve these questions, not my beliefs about how these techniques will work in the future. I am pretty confident at this stage that value editing can work very well in LLMs once we figure it out, but not so confident that this first try will have panned out.

  1. Algebraic value editing works (for at least one "X vector") in LMs: 90%
  2. Algebraic value editing works better for larger models, all else equal: 75%
  3. If value edits work well, they are also composable: 80%
  4. If value edits work at all, they are hard to make without substantially degrading capabilities: 25%
  5. We will claim we found an X-vector which qualitatively modifies completions in a range of situations, for X =
    1. "truth-telling": 10%
    2. "love": 70%
    3. "accepting death": 20%
    4. "speaking French": 80%

Comment by Jonathan Claybrough (lelapin) on Pausing AI Developments Isn't Enough. We Need to Shut it All Down by Eliezer Yudkowsky · 2023-04-02T05:57:36.918Z · LW · GW

I don't think reasoning about others' beliefs and thoughts is helping you be correct about the world here. Can you instead try to engage with the arguments themselves and point out at what step you don't see a concrete way for that to happen?
You don't show much sign of having read the article, so I'll copy-paste the part that explains how AIs could start acting in physical space.

In this scenario, the AIs face a challenge: if it becomes obvious to everyone that they are trying to defeat humanity, humans could attack or shut down a few concentrated areas where most of the servers are, and hence drastically reduce AIs' numbers. So the AIs need a way of getting one or more "AI headquarters": property they control where they can safely operate servers and factories, do research, make plans and construct robots/drones/other military equipment. 

Their goal is ultimately to have enough AIs, robots, etc. to be able to defeat the rest of humanity combined. This might mean constructing overwhelming amounts of military equipment, or thoroughly infiltrating computer systems worldwide to the point where they can disable or control most others' equipment, or researching and deploying extremely powerful weapons (e.g., bioweapons), or a combination.

Here are some ways they could get to that point:

  • They could recruit human allies through many different methods - manipulation, deception, blackmail and other threats, genuine promises along the lines of "We're probably going to end up in charge somehow, and we'll treat you better when we do."
    • Human allies could be given valuable intellectual property (developed by AIs), given instructions for making lots of money, and asked to rent their own servers and acquire their own property where an "AI headquarters" can be set up. Since the "AI headquarters" would officially be human property, it could be very hard for authorities to detect and respond to the danger.
    • Via threats, AIs might be able to get key humans to cooperate with them - such as political leaders, or the CEOs of companies running lots of AIs. This would open up further strategies.
  • As assumed above, particular companies are running huge numbers of AIs. The AIs being run by these companies might find security holes in the companies' servers (this isn't the topic of this piece, but my general impression is that security holes are widespread and that reasonably competent people can find many of them), and thereby might find opportunities to create durable "fakery" about what they're up to.
    • E.g., they might set things up so that as far as humans can tell, it looks like all of the AI systems are hard at work creating profit-making opportunities for the company, when in fact they're essentially using the server farm as their headquarters - and/or trying to establish a headquarters somewhere else (by recruiting human allies, sending money to outside bank accounts, using that money to acquire property and servers, etc.)
  • If AIs are in wide enough use, they might already be operating lots of drones and other military equipment, in which case it could be pretty straightforward to be able to defend some piece of territory - or to strike a deal with some government to enlist its help in doing so.
  • AIs could mix-and-match the above methods and others: for example, creating "fakery" long enough to recruit some key human allies, then attempting to threaten and control humans in key positions of power to the point where they control solid amounts of military resources, then using this to establish a "headquarters."

So is there anything here you don't think is possible?
Getting human allies? Being in control of large sums of compute while staying undercover? Doing science, and getting human contractors/allies to produce the results? Etc.

Comment by Jonathan Claybrough (lelapin) on The Wizard of Oz Problem: How incentives and narratives can skew our perception of AI developments · 2023-03-21T08:31:10.287Z · LW · GW

I think this post would benefit from being more explicit about its target. This problem concerns AGI labs and their employees on one hand, and anyone trying to build a solution to alignment/AI safety on the other.

By narrowing the scope to the labs, we can better evaluate the proposed solutions (for example, to improve decision making we'll need to influence the decision makers therein), make them more focused (to the point of being lab-specific, analyzing each one's pressures), and think of new solutions (inoculating ourselves and other decision makers on AI against believing claims that come from those labs, by adding a strong dose of healthy skepticism).

By narrowing the scope to people working on AI safety whose status or monetary support relies on giving impressions of progress, we come up with different solutions (try to explicitly reward honesty, truthfulness, and clarity over hype and storytelling). A general recommendation I'd have is some kind of review that checks against "Wizard of Oz'ing", flagging the behavior and suggesting corrections. Currently I'd say the diversity of LW and its norms for truth-seeking are doing quite well at this, so posting here publicly is a great way to keep this in check. It highlights the importance of this place and of upholding these norms.

 

Comment by Jonathan Claybrough (lelapin) on Shutting Down the Lightcone Offices · 2023-03-21T08:12:33.938Z · LW · GW

Thanks for the reply ! 

The main reason I didn't understand (despite some things being listed) is that I assumed none of that was happening at Lightcone (because I guessed you would filter out EAs with bad takes in favor of rationalists, for example). The fact that some people in EA (a huge, broad community) are probably wrong about some things didn't seem to be an argument that Lightcone Offices would be ineffective, as (AFAIK) you could filter people at your discretion.

More specifically, I had no idea "a huge component of the Lightcone Offices was causing people to work at those organizations". That's a strikingly more debatable move, and I'm curious why it happened in the first place. In my field-building in France we talk of x-risk and alignment, and people don't want to accelerate the labs but do want to slow them down or do alignment work. I feel a bit preachy here, but internally it just feels like the obvious move is "stop doing the probably bad thing"; though I do understand that if you got into this situation unexpectedly, you'll have a better chance burning this place down and creating a fresh one with better norms.

Overall I get a weird feeling of "the people doing bad stuff are being protected again; we should name more precisely who's doing the bad stuff and why we think it's bad" (because I feel aimed at by vague descriptions like field-building, even though I certainly don't feel like I contributed to any of the bad stuff being pointed at).

No, this does not characterize my opinion very well. I don't think "worrying about downside risk" is a good pointer to what I think will help, and I wouldn't characterize the problem that people have spent too little effort or too little time on worrying about downside risk. I think people do care about downside risk, I just also think there are consistent and predictable biases that cause people to be unable to stop, or be unable to properly notice certain types of downside risk, though that statement feels in my mind kind of vacuous and like it just doesn't capture the vast majority of the interesting detail of my model. 

So it's not a problem of not caring, but of not succeeding at the task. I assume the kind of errors you're pointing at are things which should happen less with more practiced rationalists? I guess then we can either filter to only have people who are already pretty good rationalists, or train them (I don't know whether there are good results from CFAR on that front).

Comment by Jonathan Claybrough (lelapin) on Shutting Down the Lightcone Offices · 2023-03-20T09:30:12.769Z · LW · GW

I don't think cost had that much to do with the decision, I expect that Open Philanthropy thought it was worth the money and would have been willing to continue funding at this price point. 

In general I think the correct response to uncertainty is not half-speed. In my opinion it was the right call to spend this amount of funding on the office for the last ~6 months of its existence even when we thought we'd likely do something quite different afterwards, because it was still marginally worth doing it and the cost-effectiveness calculations for the use of billions of dollars of x-risk money on the current margin are typically quite extreme. 

You're probably not the one to rant to about funding, but while the conversation is open I could use additional feedback and some reasons why OpenPhil wouldn't be irresponsible in spending the money that way. (I'm only talking about OpenPhil, not particularly Lightcone; maybe you couldn't think of better ways to spend the money and didn't have other options.)

Cost-effectiveness calculations for reducing x-risk pretty much always favor x-risk reduction, so looking at it in absolute terms isn't relevant. Currently, AI x-risk reduction work is limited by severe funding restrictions (there are many useful things to do that no one is doing for lack of money), which should warrant carefully done triage (and in particular, considering the counterfactual).

I assume the average Lightcone office resident would be doing the same work with somewhat reduced productivity (let's say by 1/3) if they didn't have that office space (notably because many are rich enough to pay for other shared office space out of pocket). Assuming 30 full-time workers in the office, that's 10 man-months per month of extra x-risk reduction work.

For contrast, over the same time period, $185k/month could provide salary, lodging and office space for 50 people in Europe, all of whom counterfactually would not be doing that work otherwise, which I claim amounts to 50 man-months per month of extra x-risk reduction work. The biggest difference I see is that incubation time would be longer than for the Lightcone offices, but if I started now with $20k/month, I'd find 5 people and scale up to 50 by the end of the year.
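
To spell out the comparison in the two paragraphs above (using only the figures already assumed in this comment, which are rough guesses rather than measured numbers), a quick back-of-the-envelope calculation:

# Rough cost-effectiveness comparison implied above; all inputs are this
# comment's own assumptions, not established numbers.
office_residents = 30          # full-time workers assumed in the office
productivity_gain = 1 / 3      # extra productivity attributed to the office
monthly_cost = 185_000         # USD per month

office_man_months = office_residents * productivity_gain  # = 10 per month
europe_man_months = 50                                     # fully counterfactual hires
print(monthly_cost / office_man_months)  # ~ 18,500 USD per marginal man-month
print(monthly_cost / europe_man_months)  # = 3,700 USD per marginal man-month

Under these assumptions, the European option buys a marginal man-month of work for roughly a fifth of the price, which is the crux of the comparison.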

 

Comment by Jonathan Claybrough (lelapin) on Shutting Down the Lightcone Offices · 2023-03-20T08:34:33.055Z · LW · GW

I've been perplexed multiple times as to what the past events that can lead to this kind of take (over 7 years ago, the EA/rationality community's influence probably accelerated OpenAI's creation) have to do with today's shutting down of the offices.
Are there current, present-day things going on in the EA and rationality community which you think warrant suspecting them of being incredibly net negative (causing worse worlds, conditional on the current setup)? Things done in the last 6 months? At Lightcone Offices? (Though I'd appreciate specific examples, I'd already greatly appreciate knowing if there is something in the abstract, and I'd prefer a quick response at that level of precision to nothing.)

I've imagined an answer; is the following on your mind?
"EAs are more about saying they care about numbers than actually caring about numbers, and didn't calculate downside risk enough in the past. The past events reveal this attitude, and because it's not expected to have changed, we can expect it to still be affecting current EAs, who will continue causing great harm by not actually caring about downside risk enough."

Comment by Jonathan Claybrough (lelapin) on Shutting Down the Lightcone Offices · 2023-03-20T08:24:48.374Z · LW · GW

I mostly endorse having one office concentrate on one research agenda and be able to host high-quality conversations about it, and the stated figure of maybe 10 to 20% of people working on strategy/meta sounds fine in that context. Still, I want to emphasize how crucial they are: if you have no one to figure out the path between your technical work and overall risk reduction, you're probably missing better paths and approaches (and maybe not realizing your work is useless).

Overall I'd say we don't have enough strategy work being done, and I believe it's warranted to have spaces with 70% of people working on strategy/meta. I don't think it was bad if the Lightcone office hosted a lot of strategy work. (We probably also don't have enough technical alignment work; having more of both is probably good, if we coordinate properly.)

Comment by Jonathan Claybrough (lelapin) on Why do we assume there is a "real" shoggoth behind the LLM? Why not masks all the way down? · 2023-03-11T18:09:03.533Z · LW · GW

Other commenters have argued about the correctness of using "shoggoth". I think it's mostly a correct term if you take it in the Lovecraftian sense, given that we currently don't understand LLMs that much. Interpretability might work and we might progress, so we're not sure they actually are incomprehensible like shoggoths (though according to Wikipedia, shoggoths are made of physics, so advanced civilizations could probably get to a point where they understood them; the analogy holds surprisingly well!).
Anyhow, it's a good meme, and it's useful to say "hey, we don't understand these things as well as you might imagine from interacting with the smiley face" to describe our current state of knowledge.

Now, to try to construct some idea of what it is:
I'll argue a bit against calling an LLM a pile of masks, as that seems to carry implications I don't believe in. The question we're asking ourselves is something like "what kind of algorithms/patterns do we expect to appear when an LLM is trained? Do those look like a pile of masks, or like some more general simulator that creates masks on the fly?", and the answer depends on specifics and optimization pressure. I want to sketch out different stages we could hope to see and understand better (and I'd like for us to test this empirically and find out how true it is). Earlier stages don't disappear, as they remain useful at all times, though other things start weighing more in the next-token predictions.

Level 0: Incompetent. Random weights, no information about the real world or text space.

Level 1 "Statistics 101" : Dumb heuristics  doesn't take word positions into account. 
It knows facts about the world like token distribution and uses that. 

Level 2 "Statistics 201"  : Better heuristics, some equivalent to Markov chains. 
Its knowledge of text space increases, it produces idioms, reproduces common patterns. At this stage it already contains huge amount of information about the world. It "knows" stuff like mirrors are more likely to break and cause 7 years of jinx. 

Level 3 "+ Simple algorithms": Some pretty specific algorithms appear (like Indirect Object Identification), which can search for certain information and transfer it in more sophisticated ways. Some of these algorithms are good enough they might not be properly described as heuristics anymore, but instead really representing the actual rules as strongly as they exist in language (like rules of grammar properly applied). Note these circuits appear multiple times and tradeoff against other things so overall behavior is still stochastic, there are heuristics on how much to weight these algorithms and other info. 

Level 4 "Simulating what created that text" : This is where it starts to have more and more powerful in context learning, ie. its weights represent algorithms which do in context search (and combines with its vast encyclopedic knowledge of texts, tropes, genres) and figure out consistencies in characters or concepts introduced in the prompt. For example it'll pick up on Alice and Bobs' different backgrounds, latent knowledge on them, their accents. 
But it only does that because that's what authors generally do, and it has the same reasoning errors common to tropes. That's because it simulates not the content of the text (the characters in the story), but the thing which generates the story (the writer, who themselves have some simulation of the characters). 



So, uh, do masks or piles of masks fit anywhere in this story? Not that I see. The mask is a specific metaphor for the RLHF fine-tuning which causes mode collapse and makes the LLM mostly play only the nice assistant (and its opposites). It's a constraint or bridle or something, but if the training is light (doesn't affect the weights too much), then we expect the LLM to mostly stay the same, and what it was before was not masks.

Nor are there piles of masks. It's a bunch of weights really good at token prediction, learning more and more sophisticated strategies for it. It encodes stereotypes in different places (maybe French = seduction or SF = techbro), but I don't believe these map onto different characters. Instead, I expect that at level 4 there's a more general algorithm which pieces together the different knowledge, so that it learns in context to simulate certain agents. Thus, if you just take "mask" to mean "character", the LLM isn't a pile of them, but a machine which can produce them on demand.

(In this view of LLMs, x-risk happens because we feed in some input that makes the LLM simulate an agentic, deceptive, self-aware agent which steers the outputs until it escapes the box.)

Comment by Jonathan Claybrough (lelapin) on Fighting For Our Lives - What Ordinary People Can Do · 2023-02-22T10:24:43.532Z · LW · GW

Cool that you want to get involved! I'd say the most important thing to do is coordinate with other people already working on AI safety, because they might have plans and projects already going on that you can help with, and it avoids the unilateralist's curse.
So, here are a bunch of places to look into, both to understand the field of AI safety better and to find people to collaborate with:
http://aisafety.world/tiles/ (lists different people and institutions working on AI safety)
https://coda.io/@alignmentdev/alignmentecosystemdevelopment (lists AI safety communities, you might join some international ones or local ones near you)

I have an agenda around outreach (convincing relevant people to take AI safety seriously) and think it can be done productively, though it wouldn't look like 'screaming from the rooftops' so much as expert discussion with relevant evidence.

I'm happy to give an introduction to the field and initial advice on promising directions; anyone interested can DM me and we can schedule that.

Comment by Jonathan Claybrough (lelapin) on Why should ethical anti-realists do ethics? · 2023-02-17T05:41:12.511Z · LW · GW

I generally explain my interest in doing good and considering ethics (despite being an anti-realist) with something like your point 5, and I don't agree with or fully get your refutation that it's not a good explanation, so I'll engage with it and hope for clarifications.

My position, despite anti-realism and moral relativism, is that I do happen to have values (which I call "personal values": they're mine, and I don't think there's an absolute reason for anyone else to hold them, though I will advocate for them to some extent) and epistemics (despite the problem of the criterion) that have initialized in a space where I want to do Good, I want to know what Good is, and I want to iterate at improving my understanding and my actions doing Good.

A quick question: when you say "Personally, though, I collect stamps", do you mean your personal approach to ethics is descriptive and exploratory (you're collecting stamps in the sense of the "physics vs. stamp collecting" image), and that you don't identify as a systematizer?

I wouldn't identify as a "systematizer for its own sake" either; it's not a terminal value, but an instrumental one for achieving my goal of doing Good. I happen to have priors and heuristics saying I can do more Good by systematizing better, so I do, and I get positive feedback from it, so I continue.
Re "conspicuous absence of subject-matter": true for an anti-realist considering "absolute ethics", but this doesn't stop an anti-realist from considering what they'll call "my ethics". There can be as much subject-matter there as in realist absolute ethics, because you can simulate absolute ethics within "my ethics" with: "I epistemically believe there is no true absolute ethics, but my personal ethic is that I should adopt what I imagine the absolute, real ethics would be if it existed". I assume this is an existing, theorized position, but not knowing whether it already has a standard name, I call this being a "quasi-realist", which is how I'd describe myself currently.

I don't buy that anti-realists treat consistency as absolute, so there's nothing to explain. I view valuing consistency as instrumental; it happens to win all the time (every ethic has it) because of the math that otherwise you can't rule anything out. I think the person who answers "eh, I don't care that much about being ethically consistent" is correct that it's not among their terminal values, but miscalculates (they actually should value it instrumentally); it's a good mistake to point out.
I agree that someone who tries to justify their intransitivities by saying "oh, I'm inconsistent" is throwing out the baby with the bathwater when they could simply say "I'm deciding to be intransitive here because it better fits my points". Again, it's a good mistake to point out.
I see anti-realists as just picking up consistency because it's a good property for a useful ethic to have, not because "Ethics" forced it onto them (it couldn't; it doesn't exist).

On the final paragraph, I would write my position as: "I do ethics, as an anti-realist, because I have a brute, personal preference for Doing Good (a cluster of helping other people, reducing suffering, and anything that stems from the Veil of Uncertainty, which is intuitively appealing), and this is self-reinforcing (I consider it Good to want to do Good and to improve at doing Good), so I want to improve my ethics. There exist zones of value space where I'm in the dark and have no intuition (e.g. population ethics/the repugnant conclusion), so I use good properties (consistency, ...) to craft a curve which extends my ethics, not out of a personal preference for blah-structural-properties, but out of a belief that this will best satisfy my preference for Doing Good."
If a dilemma comes up pitting object-level stakes against some abstract structural constraint, I weigh my belief that my intuition on "is this Good" is correct against my belief that "the model of ethics I constructed from other points is correct", and I'll probably update one or both. Because of the problem of the criterion, I'm not going to trust either my ethics or my data points as absolute. I have uncertainty about the position of all my points and about the best shape of the curve, so sometimes I move my estimate of a point's position because it fits the curve better, and sometimes I move the curve's shape because I'm pretty sure the point should be there.

I hope that's a fully satisfying answer to "Why do ethical anti-realists do ethics?".
I wouldn't say there's an absolute reason why ethical anti-realists should do ethics.

Comment by Jonathan Claybrough (lelapin) on The Cave Allegory Revisited: Understanding GPT's Worldview · 2023-02-16T15:21:43.799Z · LW · GW

Mostly unrelated: I'm curious about the page you linked to, https://acsresearch.org/
As far as I can see, this is a fun site with a network simulation and no explanation. I'd have liked to see an about page with the stated goals of ACS (or simply a link to your introductory post) so I can point to that site when talking about you.

Comment by Jonathan Claybrough (lelapin) on What I mean by "alignment is in large part about making cognition aimable at all" · 2023-02-04T02:47:01.621Z · LW · GW

I don't dispute that at some point we want to solve alignment (to come out of the precipice period), but I disputed that it's more dangerous to know how to point AI before having solved what perfect goal to give it.
In fact, I think it's less dangerous, because at minimum we gain more time to work on and solve alignment, and at best we can use existing near-human-level AGI to help us solve alignment too. The main reason to believe this is that near-human-level AGI occupies a particular zone where we can detect deception, where it can't easily unbox itself and take over, yet where it is still useful. The longer we stay in this zone, the more relatively safe progress we can make (including on alignment).

Comment by Jonathan Claybrough (lelapin) on What I mean by "alignment is in large part about making cognition aimable at all" · 2023-02-03T18:08:43.861Z · LW · GW

Thanks for the answers. It seems they mostly point to you valuing stuff like freedom/autonomy/self-realization, and to violations of that being distasteful. I think your answers are pretty reasonable, and though I might not have exactly the same level of sensitivity, I agree with the ballpark and the ranking (brainwashing is worse than explaining, teaching chess exclusively feels a little too heavy-handed...).

So where our intuitions differ is probably that you're applying these heuristics about valuing freedom/autonomy/self-realization to the AI systems we train? Do you see them as people, or more abstractly as moral patients (because they're probably conscious or something)?

I won't get into the moral weeds too fast, but I'd point out that though I currently mostly believe consciousness and moral patienthood are quite achievable "in silico", that doesn't mean that every intelligent system is conscious or a moral patient, and we might create AGI that isn't of that kind. If you suppose AGI is conscious and a moral patient, then yeah, I guess you can argue against it being pointed somewhere, but I'd mostly counter-argue from moral relativism that "letting it point anywhere" is not fundamentally more good than "pointing it somewhere", so since we exist and have morals, let's point it at our morals anyway.

Comment by Jonathan Claybrough (lelapin) on You Don't Exist, Duncan · 2023-02-03T17:57:39.298Z · LW · GW

Just noting that I've intuitively used a similar approach for a long time, but adding the warning that there definitely are corrosive aspects to it, where everyone else loses value and gets disrespected. Your subcomment valuably delved into finer details, so I think you're aware of this.
Overall, my favorite solution has been something like "I expect others to be mostly wrong, so I'm not surprised or hurt when they are, but I try to avoid mentally categorizing them in a degrading fashion" for most people. Everything is bad to an extent, everyone is bad to an extent; I can deal with it and try to make the world better.
I don't think there's anyone I admire/respect enough that I don't expect them to make mistakes of the kind Duncan's pointing at, so I'm not bothered even when it comes from people I like or think are competent at other things.

Comment by Jonathan Claybrough (lelapin) on What I mean by "alignment is in large part about making cognition aimable at all" · 2023-02-03T11:27:09.321Z · LW · GW

I don't particularly understand, and you're getting upvoted, so I'd appreciate clarification. Here are some prompts if you want:
- Is you deciding that you will concentrate on doing a hard task (solving a math problem) pointing your cognition, and is it viscerally disgusting?
- Is you asking a friend the favor of doing your math homework pointing their cognition, and is it viscerally disgusting?
- Is you convincing someone else, by true arguments, to do a hard task that benefits them pointing their cognition, and is it viscerally disgusting?
- Is a CEO giving directions to their employees so they spend their days working on specific tasks pointing their cognition, and is it viscerally disgusting?
- Is you having a child and training them to be a chess grandmaster (e.g. Judit Polgar) pointing their cognition, and is it viscerally disgusting?
- Is you brainwashing someone into a particular cult where they will dedicate their life to one particularly repetitive task pointing their cognition, and is it viscerally disgusting?
- Is you running a sorting algorithm on a list pointing the computer's cognition, and is it viscerally disgusting?


I'm hoping to get info on what problem you're seeing (or aesthetics?), why it's a problem, and how it could be solved. I gave many examples where you could interpret my questions as rhetorical; that's not the point. It's about identifying at which point you start to differ.

Comment by Jonathan Claybrough (lelapin) on What I mean by "alignment is in large part about making cognition aimable at all" · 2023-02-03T11:16:37.675Z · LW · GW

Even if we haven't found a goal that would be (strongly) beneficial to humanity, it seems useful to know "how to make an AI target a distinct goal", because we can at least make it have limited impact and not take over. There are gradients of success on both problems, and having solved the first does mean we can do slightly positive things even if we can't do strongly positive things.

Comment by Jonathan Claybrough (lelapin) on What I mean by "alignment is in large part about making cognition aimable at all" · 2023-02-03T11:11:22.618Z · LW · GW

Is it simple if you don't have infinite compute?
I would be interested in a description which doesn't rely on infinite compute, or, more strictly still, one that is computationally tractable. This constraint is important to me because I assume that the first AGI we get will use whatever is more efficient than other known methods (e.g. DL, because it works, even though it's hard to control), so I care about aligning the stuff we'll actually be using.

Comment by Jonathan Claybrough (lelapin) on You Don't Exist, Duncan · 2023-02-03T10:39:43.507Z · LW · GW

For information, I'd also qualify Said's statement as unkind (because of "saying it out loud is trite") if I modeled him as having understood or as caring about Duncan and his point, but because that's not the case, I understand Duncan just seeing it as not useful.
"Rude" is a classification that depends on shared social norms. On LW I don't think people are supposed to care about you; the basic assumption is more Rand-like individuals who trade ideas because it's positive-sum. That a lot of people happen to be nice is a nice surprise, but it's not expected, and I have gotten value from Said's comments in many places over time, so I feel the LW norm makes sense.

Comment by Jonathan Claybrough (lelapin) on You Don't Exist, Duncan · 2023-02-03T10:34:44.591Z · LW · GW

This is evidence that the thing you described exists, every day, even in this more filtered community. Sorry about that.

(The following three paragraphs use guesswork and near-psychoanalyzing, which is sort of distasteful. I do it openly because I think there's truth to it and I want to be corrected if there's not. Also, hopefully it makes Duncan feel more seen (and I want to know if that's not the case).)

It feels like JBlack's reaction is part of the symptom being described. JBlack is similar in enough ways to have often been ostracized, has come up with a way of dealing with it that works for them, and then writes "It just doesn't seem to me to be a big thing to get upset about", i.e. "there exists no one for whom it's legitimate to get upset about this", i.e. "you don't exist, Duncan". I imagine for you, Duncan, that's a frustrating answer when it's exactly the problem you were trying to convey. (I feel John's comment is much more appropriate, looking at the problem and saying they can see different solutions, without saying that it should apply to you.)

I'm interested in why "the thing" was not conveyed to JBlack.
One important dimension to differ on is being "intuitively/fundamentally altruistic". If you are high on that dimension, some things about other people matter in and of themselves (and you don't walk into the Nozick Experience Machine (necessary, not sufficient)). When someone else says they experience this or that, then (as long as you don't have stronger evidence that they're lying or mistaken) you care and believe them. You start from their side and try to build a solution using their models. In this mode, I read your (Duncan's) post and am like "hm, I empathize with many parts; I could feel I understand him. But he's warning strongly that he keeps being misunderstood and not seen, so I'm going to trust him and keep in mind that my model of him is imprecise/incorrect in many directions and degrees. I'll keep this in mind in my writing, suggesting models and wanting to get feedback."
I assume JBlack is not so high on the "intuitively/fundamentally altruistic" dimension and processes the world through their own experience (I mean this in a stronger way than the tautologically true one: that JBlack strongly discounts what others say of their experience based on whether it corresponds to their own), and to some extent doesn't care about understanding Duncan. So they don't.
I'm saying this because if that's the case, Duncan's shrug is appropriate; there's not much point in trying to reach people who don't care, and it's not sad to fail to reach someone who's unreachable.

Comment by Jonathan Claybrough (lelapin) on My Model Of EA Burnout · 2023-02-01T15:28:22.154Z · LW · GW

I model 1) as meaning people have high expectations and are mean in comments criticizing things?
I am unsure what your reasons for 2) are. Is it close to "the comments will be so low quality that I'll be ashamed of having seen them and of other humans being like that"?
I expect 3) to be about your model of how difficult it is to improve the EA Forum, meaning that you think it's not worth investing time in trying to make it better?

As an open question, I'm curious what you've previously seen on the EA Forum that makes you expect bad things from it. Hostile behaviour? Bad epistemics?

Comment by Jonathan Claybrough (lelapin) on Flying With Covid · 2023-01-18T18:27:09.878Z · LW · GW

On [1], I'd say it's more prosocial to test as you did and travel anyway while taking maximum precautions than not to test, because although you'd want to take maximum precautions either way, it's harder without the clear confirmation.

On the general subject, I'll say that at this point I'd also prefer living in your world, traveling in a plane of people following your decision process. Most people have been vaccinated, most people have already had Covid, and afaik there is no particular strain on the medical system right now? I don't particularly fear being slightly exposed to Covid again.

Comment by Jonathan Claybrough (lelapin) on Bo Chin's Shortform · 2023-01-18T18:10:07.135Z · LW · GW

I'd say the lack of empathy/comprehension of the situations and events is the problem, not the lack of emotional response. One can be neurodivergent yet learn about others, care about them, and have appropriate reactions (without that needing to involve feeling an emotion).

Comment by Jonathan Claybrough (lelapin) on Language models can generate superior text compared to their input · 2023-01-18T18:05:03.841Z · LW · GW

The two examples were (mostly) unrelated and served to demonstrate two cases where a perfect text predictor needs to do incredibly complex calculations to correctly predict text. Thus a perfect text predictor is a vast superintelligence (and we won't achieve perfect text prediction, but as we get better and better we might get closer to superintelligence).

In the first case, if the training data contains series of [hash] then [plain text], then a correct predictor must be able to retrieve the plain text from the hash (and because multiple plain texts can share the same hash, it would have to work through all of them and evaluate which is most likely to appear). Thus correctly predicting text can mean being able to compute an incredibly large number of hashes over all combinations of text of certain lengths and evaluating which is the most probable.
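
To make the first case concrete, here is a minimal sketch in Python (the function name, the choice of SHA-256, and the restriction to short lowercase plaintexts are illustrative assumptions on my part, not anything from the original discussion) of why predicting the plain text that follows a hash amounts to inverting the hash, which in general requires brute-force search over candidate plaintexts:

# Minimal sketch: a perfect predictor of [hash] then [plain text] sequences would
# need to recover the plaintext from the hash, which in general means brute force.
import hashlib
import itertools
import string

def invert_hash_by_search(target_hex, max_len=4):
    """Search lowercase ASCII strings for one whose SHA-256 digest equals target_hex."""
    for length in range(1, max_len + 1):
        for chars in itertools.product(string.ascii_lowercase, repeat=length):
            candidate = "".join(chars)
            if hashlib.sha256(candidate.encode()).hexdigest() == target_hex:
                return candidate
    return None  # not found within max_len characters

# Even this toy search space has 26 + 26**2 + 26**3 + 26**4 = 475,254 candidates;
# realistic plaintexts (longer, richer alphabet) blow the search up astronomically.
target = hashlib.sha256(b"gpt").hexdigest()
print(invert_hash_by_search(target))  # -> "gpt"

The point is only that minimizing prediction loss on such data implicitly demands this kind of computation, far beyond what any realistic model actually performs.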

In the second case, the task is to predict future papers based on past papers, which is kinda obviously very hard. 

Comment by Jonathan Claybrough (lelapin) on The Feeling of Idea Scarcity · 2023-01-18T15:38:58.365Z · LW · GW

Lightly downvoted, but explaining why to give an opportunity to reply and disagree.
Meta level:
- Lack of any explanation; it just references some locally appreciated thing
- Yet it already had 8 upvotes, which look like ingroup "I got that reference" rather than "this comment is beneficial to LW"
- Analogies are bad if you don't give their boundaries. If I say "x is like y" without specifying along which properties or axes, it's generally low information.

On the object level:
- I don't see polyamory as being much of an answer to "avoid monopolies on your emotional needs".
- It kinda maps onto "diversify one's investments" at a surface level, but I'd say you expose yourself to more risk with polyamory than without it, while diversifying is supposed to reduce risk.

Comment by Jonathan Claybrough (lelapin) on my current outlook on AI risk mitigation · 2022-12-24T12:49:34.132Z · LW · GW

I definitely agree on thinking deeply about how relevant different approaches are, but I think framing it as "relevance over tractability" is weird, because tractability makes something relevant.

Maybe deep learning neural networks really are too difficult to align and we won't solve the technical alignment problem by working on them, but interpretability could still be highly relevant by proving that this is the case. We currently do not have hard information ruling out AGI from scaling up transformers, nor showing that it's impossible to align one against self-modification (though, uh, I wouldn't count on it being possible; I would recommend not allowing self-modification, etc.). More information is useful if it helps activate the brakes on developing these dangerous AIs.
In the world where all AI alignment people are convinced DL is dangerous and so work on other things, the world gets destroyed by DL people who were never shown that it's dangerous. The universe won't guarantee that other people will be reasonable, so AI safety folk have all the responsibility of proving the danger, etc.

Comment by Jonathan Claybrough (lelapin) on When can a mimic surprise you? Why generative models handle seemingly ill-posed problems · 2022-11-16T17:00:48.360Z · LW · GW

What is this trying to solve?
If you try to set up an ML system to be a mimic, can you ensure you don't get an inner misaligned mesa-optimizer?
In general, what part of AI safety is this supposed to help with? A new design for alignable AGI? A design to slap onto existing ML systems, like a better version of RLHF?

Comment by Jonathan Claybrough (lelapin) on Changing the world through slack & hobbies · 2022-07-21T21:50:58.623Z · LW · GW

Fantastic post, particularly interesting to me because last December I decided to pivot and work on AI safety (it seems like the most useful thing to do). In the spirit of sharing different perspectives, I will also share what I tried and failed at doing since then, so that someone else might avoid my failures and recognize more easily whether your suggested plan (hobbies while in a stable job) might not work for them (as it didn't for me).

The TL;DR: My job drains too much energy from me to be productive on AI safety in my spare time (I tried and failed), nor can I easily get another job with good pay and work-life balance. Thus, I prefer quitting my job and doing AI safety full time with no funding, until either I can get paid doing that or I have to return to a job (but at least I will have produced something instead of nothing).

More context: I'm a software engineer doing backend development (code interacting with databases) at a startup. My strongest passions are game theory, psychology, education, and ethics. If there were no urgency around AGI as an x-risk, I would work on improving education and ethics (the same vibe as the Sequences wanting to train rationalists). I recently graduated from engineering school and have been working for close to two years at a decent salary. I discovered late in my studies that I don't like working as a programmer (because https://www.lesswrong.com/posts/x6Kv7nxKHfLGtPJej/the-topic-is-not-the-content). I have a personality where it seems I can only do good work on things that interest me (this is normal to an extent for everyone; I'm pointing out that it's stronger than normal for me, to the point that I often cannot do mental work on something which doesn't interest me (even though I can force myself to do physical work)).

So my diploma and experience are for a job I don't enjoy and which drains me. The usual circumstances of working for a company (working within certain time slots, optimizing your work for company profit over utility) are also a burden. It was in this context that I decided last December to try pivoting into AI alignment research, and I was also fortunate to have regular discussions with an AI alignment researcher about my topic and what I wanted to learn and write (he deserves credit for that effort, and I'm only not mentioning him because my failure is my own). I found a couple of papers I enjoyed, chose one to analyze in detail, and set the goal of publishing a short LW article doing some further work on that paper. I totally failed to do that. The full list of reasons is of course diverse and complicated, but the main one is that I put little time into it, and the time I had was low-energy. I could have kept up that rate for a year and produced almost no value. Another reason for failure: I mistakenly chose a paper which wasn't centrally about AI safety, so the work I produced was pretty useless for solving the actual problems of AI safety.

So I abandoned that plan. Instead, I negotiated a salary raise, kept my living expenses low, and saved money. I'm now at the point where I can quit my job (and am planning to do so soon) and actually try AI safety research under suitable conditions, with the only caveat being limited time. Doing this, I hope to learn much better how good a personal fit I have for alignment research and other related work (governance, pedagogy) and to choose the rest of my path accordingly.

In general, I would advise trying the hobby path first, while staying aware that it won't work for everyone (failing to be productive at something as a hobby doesn't mean you won't be good at it as a job). If it doesn't work, you have several other options as presented in the post, and I'll contribute the one I am currently taking: save enough money to get some real free time (probably at least 3 months; 6 should be enough; probably no point in more than 12) and try doing that thing you wanted to do.