Comments
Interesting thoughts, ty.
A difficulty for common understanding I see here is that you're talking about "good" or "bad" paragraphs in absolute terms, but you didn't define "good" or "bad" by any objective standard, so you're relying on your own sense of what's good or bad. If you were defining them relatively, you'd look at 100 paragraphs and post the worst 10 as bad. I'd be interested in seeing the worst paragraphs you found, some 50th-percentile ones, and the best; then I could tell you whether I share your absolute standards.
Enjoyed this post.
Fyi, from the front page I just hovered over this post, "The shallow bench", and was immediately spoiled on Project Hail Mary (which I had started listening to but hadn't gotten far into). Maybe add a spoiler tag or warning directly after the title?
Without taking away from the importance of getting the default right, and at the deliberate risk of feature creep, I think adding a customization option (select colour) to personal profiles is relatively low effort and maintenance, and would solve the accessibility problem.
There's tacit knowledge in Bay Area rationalist conversation norms that I'm discovering and thinking about; here's an observation and a related thought. (I put the example after the generalisation because that's my preferred style; feel free to read in the other order.)
Willingness to argue righteously and hash things out to the end, repeated over many conversations, makes it more salient when you're heading into a dead-end argument. This salience can inspire you to argue more concisely and to the point over time.
Going to the end of things generates ground data on which to update your models of arguing and conversation paths, instead of leaving things unanswered.
So, though it's skilful to know when not to "waste" time on details and unimportant disagreements, the norm of "frequently enough going through until everyone agrees" seems profoundly virtuous.
Short example from today, I say "good morning". They point out it's not morning (it's 12:02). I comment about how 2 minutes is not that much. They argue that 2 minutes is definitely more than zero and that's the important cut-off.
I realize that "2 minutes is not that much" was not my true rebuttal, that this next token my brain generated was mostly defensive reasoning rather than curious exploration of why they disagreed with my statement. Next time I could instead note they're using "morning" to have a different definition/central cluster than I, appreciate that they pointed this out, and decide if I want to explore this discrepancy or not.
Many things don't make sense if you're just doing them for local effect, but do when you consider long term gains. (something something naive consequentialism vs virtue ethics flavored stuff)
I don't strongly disagree but do weakly disagree on some points so I guess I'll answer
Re the first: if you buy into automated alignment work by human-level AGI, then trying to align ASI now seems less worth it. The strongest counterargument to this I see is that "human-level AGI" is impossible to get with our current understanding, as it will be superhuman in some things and weirdly bad at others.
Re the second: disagreements might be nitpicking on "few other approaches" vs "few currently pursued approaches". There are probably a bunch of things that would allow fundamental understanding if they panned out (various agent foundations agendas, probably-safe AI agendas like davidad's), though one can argue they won't apply to deep learning or are less promising to explore than SLT.
I don't think your second footnote sufficiently addresses the large variance in 3D visualization abilities (note that I say visualization, which includes seeing a 2D video in your mind of a 3D object and manipulating it smoothly), and overall I'm not sure what you're getting at if you don't ground your post in specific predictions about what you expect people can and cannot do thanks to their ability to visualize in 3D.
You might be ~conceptually right that our eyes see "2D" and add depth, but *um ackshually*, two eyes each receiving 2D data means you've received 4D input (using ML conventions, you've got 4 input dimensions per time unit, 5 overall in your tensor). It's very redundant, and that redundancy mostly allows you to extract depth using a local algorithm, which lets you build a 3D map in your mental representation. I don't get why you claim we don't have a 3D map at the end.
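A toy illustration of that dimension count, in case it helps (the shapes are made up for the example, not from the post):

```python
import numpy as np

# Two eyes, each seeing a 2D image with colour channels.
H, W, C = 4, 6, 3                      # tiny made-up resolution and channel count
frame = np.zeros((2, H, W, C))         # 4 input dimensions per time unit: (eye, H, W, C)
T = 5                                  # a few time steps
stream = np.zeros((T, 2, H, W, C))     # 5 dimensions overall: (time, eye, H, W, C)

print(frame.ndim, stream.ndim)         # -> 4 5
```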
Back to concrete predictions: are there things you expect a strong human visualizer couldn't do? To give intuition, I'd say a strong visualizer has at least the equivalent visualizing, modifying and measuring capabilities of SolidWorks/Blender in their mind. You tell one to visualize a 3D object they know, and they can tell you anything about it.
It seems to me the most important thing you noticed is that in real life we don't often see past the surface of things (because the spectrum of light we see doesn't penetrate most materials), and thus most people don't know the inside of 3D things very well. But that can be explained by lack of exposure rather than inability to understand 3D.
Fwiw, looking at the spheres I guessed an approximate 2.5 volume ratio. I'm curious: if you visualized yourself picking up these two spheres, imagining them made of a dense metal, one after the other, could you feel that one is 2.3 times heavier than the other?
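(For reference, my own back-of-the-envelope arithmetic, not from the post: volume scales with the cube of the radius, so a 2.3x volume ratio, and mass ratio at equal density, corresponds to a fairly modest radius ratio.)

\[
\frac{V_2}{V_1} = \left(\frac{r_2}{r_1}\right)^{3} = 2.3
\quad\Longrightarrow\quad
\frac{r_2}{r_1} = 2.3^{1/3} \approx 1.32
\]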
I'll give fake internet points to whoever actually follows the instructions and posts photographic proof.
The naming might be confusing because "pivotal act" sounds like a one-time action, but in most cases getting to a stable world without any threat from AI requires constant pivotal processes. This makes almost all the destructive approaches moot (and they're probably already bad on ethical grounds and for many other reasons already discussed) because you'll make yourself a pariah.
The most promising avenue for a pivotal act/pivotal process that I know of is doing good research so that ASI risks are known and proven, doing good outreach and education so most world leaders and decision makers are well aware of this, and helping set up good governance worldwide to monitor and limit the development of AGI and ASI until we can control it.
I recently played Outer Wilds and Subnautica, and the exercise I recommend for both games is: get to the end of the game without ever failing.
In Subnautica, failing means dying once; in Outer Wilds it's a spoiler to describe what failing is (successfully getting to the end could certainly be argued to be a fail).
I failed in both. I played Outer Wilds first and was surprised by my failure, which inspired me to play Subnautica without dying. I got pretty far but also died, from a mix of one unexpected game mechanic, careless measurement of another mechanic, and a lack of redundancy in my contingency plans.
Oh wow, that makes sense. It felt weird that you'd spend so much time on posts, yet if you didn't spend much time, it would mean you write at least as fast as Scott Alexander. Well, thanks for putting in the work. I probably don't publish much because I want good posts to not take much work, so it's reassuring to hear that it's normal that they do.
(aside: I generally like your posts' scope and clarity, mind saying how long it takes you to write something of this length?)
Self-modeling is a really important skill, and you can measure how good you are at it by writing predictions about yourself. A notably important one for people who have difficulty with motivation is predicting your own motivation: will you be motivated to do X in situation Y?
If you can answer that one generally, you can plan to actually do anything you could theoretically do, using the following algorithm (sketched in code below): from current situation A, to achieve wanted outcome Z, find a predecessor situation Y from which you'll be motivated to get to Z (eg. having written 3 of 4 paragraphs of an essay), and a predecessor situation X from which you'll get to Y; iterate until you get back to A (or forward-chain, from A to Z). Check that you'll indeed be motivated at each step of the way.
How can the above plan fail? Either you were mistaken about yourself, or about the world. Figure out which and iterate.
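A minimal sketch of that backward-chaining loop, assuming hypothetical helpers (find_predecessor and predict_motivation are placeholders I made up, not anything from the original):

```python
# Backward-chain from wanted outcome Z to current situation A, checking at each
# step that you predict you'd actually be motivated to take it.
def plan_backwards(current_situation, wanted_outcome,
                   find_predecessor, predict_motivation, max_steps=20):
    chain = [wanted_outcome]
    step = wanted_outcome
    for _ in range(max_steps):
        if step == current_situation:
            return list(reversed(chain))               # A, ..., X, Y, Z
        predecessor = find_predecessor(step)           # e.g. "3 of 4 essay paragraphs written"
        if predecessor is None:
            return None                                # you were mistaken about the world
        if not predict_motivation(predecessor, step):  # will I actually act from there?
            return None                                # you were mistaken about yourself
        chain.append(predecessor)
        step = predecessor
    return None                                        # chain didn't close; revise the plan
```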
Appreciate the highlighting of identity as this important/crucial self-fulfilling prophecy; I use that frame a lot.
What does the title mean? Since they all disagree I don't see one as being more of a minority than the other.
Nice talk!
When you talk about the most important interventions for the three scenarios, I wanna highlight that in the case of nationalization, if you're a citizen of one of the countries nationalizing AI, you can also work for the government and be on those teams, working and advocating for safe AI.
In my case I should have measurable results like higher salary, higher life satisfaction, more activity, and more productivity as measured by myself and friends/flatmates. I was very low, so it'll be easy to see progress. The difficulty was finding something that'd work, not measuring whether it does.
Some people have short AI timelines based on inner models that don't communicate well. They might say "I think if company X trains according to new technique Y it should scale well and lead to AGI, and I expect them to use technique Y in the next few years", and the reasons they think technique Y should work are some kind of deep understanding built from years of reading ML papers that's not particularly easy to transmit or debate.
In those cases, I want to avoid going into details and arguing directly, but would suggest that they use their deep knowledge of ML to predict existing recent results before looking at them. This would be easy to cheat, so I mostly suggest it for people to check themselves, or to check people you trust to be honorable. Concretely, it'd be nice if, when some new ML paper with a new technique comes out, someone compiles a list of questions answered by that paper (eg. is technique A better than technique B for a particular result) and posts it to LW, so people can track how well they understand ML, and thus (to some extent) how much to trust their short timelines.
For example, a recent paper examines how data affects performance on a bunch of benchmarks, and notably tested training either on a duplicated dataset (a bunch of Common Crawls) or a deduplicated one (the same, except removing documents that were shared between crawls). Do you expect deduplication in this case raises or lowers performance on benchmarks? If we could have similar questions when new results come out, it'd be nice.
Thank you for sharing, it really helps to pile up these stories (and it's nice to have some trust that they're real, which is harder to get from Reddit. On which note, are there non-doxxing receipts you can show for this story being true? I have no reason to doubt you in particular, but I guess it's good hygiene on the internet to ask for evidence.)
It also makes me want to share a bit of my story. I read The Mind Illuminated, and did only small amounts of meditation, yet the framing the book offers has been changing my thinking and motivational systems. There aren't many things I'd call infohazards, but in my experience even just reading the book seems to be enough to contribute to profound changes that would not obviously be considered positive by the previous me. (They're not obviously negative either, and I happen to be hopeful, but I'm waiting on results another year out before saying.)
Might be good to have a dialogue format with other people who agree/disagree to flesh out scenarios and countermeasures
Hi, I'm currently evaluating the cost effectiveness of various projects and would be interested in knowing, if you're willing to disclose, approximately how much this program costs MATS in total? By this I mean the summer cohort, including the ops before and after necessary for it to happen, but not counting the extension.
"It's true that we don't want women to be driven off by a bunch of awkward men asking them out, but if we make everyone read a document that says 'Don't ask a woman out the first time you meet her', then we'll immediately give the impression that we have a problem with men awkwardly asking women out too much — which will put women off anyway."
This seems a weak response to me, at best only defensible if you consider yourself to be on the margin and give no thought to long-term growth or your ability to clarify intentions (you have more than 3 words when interacting with people irl).
To be clear, explicitly writing "don't ask a woman out the first time you meet her" would be terrible writing, and if that's the best guideline writing members of that group can do, then maybe nothing is better than that. Still, it reeks of "we've tried for 30 seconds and are all out of ideas" energy.
A guidelines document can give high-level guidance on the vibe you want (eg. truth seeking, not too much aggressiveness, giving space when people feel uncomfortable, communicating around norms explicitly), all phrased positively (eg. you say what you want, not what you don't want), and can refer to sub-documents to give examples and be quite concrete if you have socially impaired people around who need to learn this explicitly.
Note that "existential" is a term of art distinct from "extinction".
The Precipice cites Bostrom and defines it thus:
"An existential catastrophe is the destruction of humanity’s longterm potential.
An existential risk is a risk that threatens the destruction of humanity’s longterm potential."
Disempowerment is generally considered an existential risk in the literature.
I participated in the previous edition of AISC and found it very valuable to my involvement in AI safety. I acquired knowledge (on standards and the standards process), and gained experience and contacts. I appreciate how much coordination AISC enables, with groups forming, which lets many people have their first hands-on experience and step up their involvement.
Thanks, and thank you for this post in the first place!
Jonathan Claybrough
Actually no, I think the project lead here is jonachro@gmail.com which I guess sounds a bit like me, but isn't me ^^
Would be up for this project. As is, I downvoted Trevor's post for how rambly and repetitive it is. There's a nugget of an idea, that AI can be used for psychological/information warfare, that I was interested in learning about, but the post doesn't seem to have much substantive argument to it, so I'd be interested in someone doing a much shorter version which argues its case with some sources.
It's a nice pic and moment, I very much like this comic and the original scene. It might be exaggerating a trait (here by having the girl be particularly young) for comedic effect but the Hogfather seems right.
I think I was around 9 when I got my first sword, around 10 for a sharp knife. I have a scar in my left palm from stabbing myself with that sharp knife as a child while whittling wood for a bow. It hurt for a bit, and I learned to whittle away from me or do so more carefully. I'm pretty sure my life is better for it and (from having this nice story attached to it) I like the scar.
This story still presents the endless conundrum between avoiding hurt and letting people learn and gain skills.
Assuming the world was mostly the same as nowadays, by the time your children are parenting, would they have the skills to notice sharp corners if they never experienced them?
I think my intuitive approach here would be to put on some not-too-soft padding (which is effectively close to what you did; it's still an unpleasant experience hitting that, even with the cloth).
What's missing is how to teach against existential risks. There's an extent to which actually bleeding profusely from a sharp corner can help one learn to walk carefully and anticipate dangers, and these skills do generalize to many situations and allow one to live a long, fruitful life. (This last sentence does not pertain to the actual age of your children and doesn't address the ideal ages at which one can actually learn the correct and generalizable thing.) If you have control over the future, remove all the sharp edges forever.
If you don't, you remove the hard edges when they're young, and reinstate them when they can/should learn to recognize what typically are hard edges that must be accounted for.
Are people losing the ability to use and communicate in previous ontologies after getting Insight from meditation? (Or maybe they never had the understanding I'm expecting of them?) Should I be worried myself, in my practice of meditation?
Today I reread Kensho by @Valentine, which presents Looking, and the ensuing conversation in the comments between @Said Achmiz and @dsatan, where Said asks for concrete benefits we can observe and mostly fails to get them. I also noticed interesting comments by @Ruby, who in contrast was still able to communicate in the more typical LW ontology, but hadn't meditated to the point of Enlightenment. Is Enlightenment bad? Different?
My impression is that people don't become drastically better (at epistemology, rationality, social interaction, actually achieving their goals and values robustly) very fast through meditating or getting Enlightened, though they may acquire useful skills that could help them get better. If that's the case, it's safe for me to continue practicing meditation, getting into Jhanas, Insight etc (I'm following The Mind Illuminated), as the failure of Valentine/dsatan to communicate their points could just be attributed to them not being able to before either.
But I remain wary that people spend so much time engaging and believing in the models and practices taught in meditation material that they actually change their minds for the worse in certain respects. It looks like meditation ontologies/practices are Out to Get You and I don't want to get Got.
I focused my answer on the morally charged side, not the emotional one. The quoted statement said "A and B", so as long as B is mostly true for vegans, "A and B" is mostly true for (a sub-group of) vegans.
I'd agree with the characterization "it’s deeply emotionally and morally charged for one side in a conversation, and often emotional to the other." because most people don't have small identities and do feel attacked by others behaving differently indeed.
It's standard that the morally charged side in a veganism conversation is the people arguing for veganism.
Your response reads as snarky, since you pretend to have understood the contrary. You're illustrating op's point, that certain vegans are emotionally attached to their cause and jump at the occasion to defend their tribe. If you disagree with being pictured a certain way, at least act so that it isn't accurate to depict you that way.
Did you know about "by default, GPTs think in plain sight"?
It doesn't explicitly talk about agentized GPTs, but it discusses the impact this has on GPTs for AGI, how it affects the risks, and what we should do about it (eg. maybe RLHF is dangerous).
To not be misinterpreted: I didn't say I'm sure it's more the format than the content that's causing the upvotes (open question), nor that this post doesn't meet the absolute quality bar that normally warrants 100+ upvotes (to each reader their opinion).
If you're open to discussing this at the object level, I can point to concrete disagreements with the content. Most importantly, this should not be seen as a paradigm shift, because it does not invalidate any of the previous threat models - it would only be so if it made it impossible to build AGI any other way. I also don't think this should "change the alignment landscape", because it's just another part of it, one which was known and has been worked on for years (Anthropic and OpenAI have been "aligning" LLMs, and I'd bet 10:1 they anticipated these would be used to build agents, like most people I know in alignment).
To clarify, I do think it's really important and great that people work on this, and that chronologically this will be the first x-risk stuff we see. But we could solve the GPT-agent problem and still die to unaligned AGI 3 months afterwards. The fact that the world trajectory we're on is throwing additional problems into the mix (keeping the world safe from short-term misuse and unaligned GPT-agents) doesn't make the existing problems simpler. There is still pressure to build autonomous AGI, there might still be mesa-optimizers, there might still be deception, etc. We need the manpower to work on all of these, and not "shift the alignment landscape" to just focus on the short-term risks.
I'd recommend not worrying much about PR risk and just asking the direct question: even if this post is only ever read by LW folk, does the "break all encryption" suggestion add to the conversation? Causing people to take time to debunk certain suggestions isn't productive even without the context of PR risk.
Overall I'd like some feedback on my tone, whether it's too direct/aggressive for you or it's fine. I can adapt.
You can read "reward is not the optimization target" for why a GPT system probably won't be goal oriented to become the best at predicting tokens, and thus wouldn't do the things you suggested (capturing humans). The way we train AI matters for what their behaviours look like, and text transformers trained on prediction loss seem to behave more like Simulators. This doesn't make them not dangerous, as they could be prompted to simulate misaligned agents (by misuses or accident), or have inner misaligned mesa-optimisers.
I've linked some good resources for directly answering your question, but otherwise to read more broadly on AI safety I can point you towards the AGI Safety Fundamentals course which you can read online, or join a reading group. Generally you can head over to AI Safety Support, check out their "lots of links" page and join the AI Alignment Slack, which has a channel for question too.
Finally, how does complexity emerge from simplicity? Hard to answer the details for AI, and you probably need to delve into those details to have real picture, but there's at least strong reason to think it's possible : we exist. Life originated from "simple" processes (at least in the sense of being mechanistic, non agentic), chemical reactions etc. It evolved to cells, multi cells, grew etc. Look into the history of life and evolution and you'll have one answer to how simplicity (optimize for reproductive fitness) led to self improvement and self awareness
Quick meta comment to express that I'm uncertain posting things in lists of 10 is a good direction. The advantages might be real: easy to post, quick feedback, easy interaction, etc.
But the main disadvantage is that this comparatively drowns out other, better posts (with more thought and value in them). I'm unsure if the content of the post was also importantly missing from the conversation (for many readers) and that's why this got upvoted so fast, or if it's largely the format... Even if this post isn't bad (and I'd argue it is, for the suggestions it promotes), this is an early warning of a possible trend where people with less thought-out takes quickly post highly accessible content, get comparatively more upvotes than they should, and it becomes harder to find good content.
(Additional disclosure: some of my bad taste for this post comes from the fact that its call to break all encryption is being cited on Twitter as representative of the alignment community. I'd have liked to answer that obviously it's not, but it got many upvotes! This makes my meta point also seem motivated by PR/optics, which is why it felt necessary to disclose, but let's mostly focus on consequences inside the community.)
First, a quick response on your dead man's switch proposal: I'd generally say I support something in that direction. You can find existing literature considering the subject and expanding in different directions in the "multi-level boxing" paper by Alexey Turchin (https://philpapers.org/rec/TURCTT). I think you'll find it interesting given your proposal, and it might give a better idea of what the state of the art is on proposals (though we don't have any implementation afaik).
Back to "why are the predicted probabilities so extreme that for most objectives, the optimal resolution ends with humans dead or worse". I suggest considering a few simple objectives we could give an AI (that it should maximise) and what happens; over trials you see that it's pretty hard to specify anything which actually keeps humans alive in some good shape, and that even when we can sorta do that, it might not be robust or trainable.
For example, what happens if you ask an ASI to maximize a company's profit? To maximize human smiles? To maximize law enforcement? Most of these things don't actually require humans, so to maximize, you should use the atoms humans are made of to fulfill your maximization goal.
What happens if you ask an ASI to maximize the number of human lives? (Probably poor conditions.) What happens if you ask it to maximize hedonistic pleasure? (Probably value lock-in, plus a world which we don't actually endorse and which may contain astronomical suffering too; it's not like that was specified out, was it?)
So it seems maximising agents with simple utility functions (over few variables) mostly end up with dead humans or worse. Approaches which ask for much less, eg. an AGI that just tries to secure the world from existential risk (a pivotal act), solves some basic problems (like dying), then gives us time for a long reflection to actually decide what future we want, and is corrigible so it lets us do that, seem safer and more approachable.
I watched the video, and appreciate that he seems to know the literature quite well and has thought about this a fair bit - he gave a really good introduction to some of the known problems.
This particular video doesn't go into much detail on his proposal, and I'd have to read his papers to delve further - this seems worthwhile so I'll add some to my reading list.
I can still point out the biggest ways in which I see him being overconfident:
- Only considering the multi-agent world. Though he's right that there already are and will be many, many deployed AI systems, that does not translate to there being many deployed state-of-the-art systems. As long as training costs and inference costs continue increasing (as they have), then on the contrary fewer and fewer actors will be able to afford state-of-the-art system training and deployment, leading to very few (or one) significantly powerful AGIs (as compared to the others, for example GPT-4 vs GPT-2).
- Not considering the impact that governance and policies could have on this. This isn't just a tech thing where tech people can do whatever they want forever; regulation is coming. If we think we have higher chances of survival in highly regulated worlds, then the AI safety community will do a bunch of work to ensure fast and effective regulation (to the extent possible). The genie is not out of the bottle for powerful AGI: governments can control compute, regulate powerful AI as weapons, and set up international agreements to ensure this.
- The hope that game theory ensures that AI developed under his principles would be good for humans. There's a crucial gap in going from the real world to math models. Game theory might predict good results under certain conditions, rules and assumptions, but many of these aren't true of the real world, and simple game theory does not yield accurate world predictions (eg. make people play various social games and they won't act how game theory says). Stated strongly, putting your hope in game theory is about as hard as putting your hope in alignment. There's nothing magical about game theory which makes getting it to work simpler than solving alignment, and it's been studied extensively by AI researchers (eg. why Eliezer calls himself a decision theorist and writes a lot about economics) with no clear "we've found a theory which empirically works robustly and in which we can put the fate of humanity".
I work in AI strategy and governance, and feel we have better chances of survival in a world where powerful AI is limited to extremely few actors, with international supervision and cooperation for the guidance and use of these systems, making extreme efforts in engineering safety, in corrigibility, etc. I don't trust predictions on how complex systems turn out (which is the case for real multi-agent problems) and don't think we can control these well in most relevant cases.
Writing down predictions. The main caveat is that these predictions are predictions about how the author will resolve these questions, not my beliefs about how these techniques will work in the future. I am pretty confident at this stage that value editing can work very well in LLMs when we figure it out, but not so much that the first try will have panned out.
- Algebraic value editing works (for at least one "X vector") in LMs: 90%
- Algebraic value editing works better for larger models, all else equal: 75%
- If value edits work well, they are also composable: 80%
- If value edits work at all, they are hard to make without substantially degrading capabilities: 25%
- We will claim we found an X-vector which qualitatively modifies completions in a range of situations, for X =
  - "truth-telling": 10%
  - "love": 70%
  - "accepting death": 20%
  - "speaking French": 80%
I don't think reasoning about others' beliefs and thoughts is helping you be correct about the world here. Can you instead try to engage with the arguments themselves and point out at what step you don't see a concrete way for that to happen?
You don't show much sign of having read the article, so I'll copy-paste the part with explanations of how AIs start acting in the physical space.
In this scenario, the AIs face a challenge: if it becomes obvious to everyone that they are trying to defeat humanity, humans could attack or shut down a few concentrated areas where most of the servers are, and hence drastically reduce AIs' numbers. So the AIs need a way of getting one or more "AI headquarters": property they control where they can safely operate servers and factories, do research, make plans and construct robots/drones/other military equipment.
Their goal is ultimately to have enough AIs, robots, etc. to be able to defeat the rest of humanity combined. This might mean constructing overwhelming amounts of military equipment, or thoroughly infiltrating computer systems worldwide to the point where they can disable or control most others' equipment, or researching and deploying extremely powerful weapons (e.g., bioweapons), or a combination.
Here are some ways they could get to that point:
- They could recruit human allies through many different methods - manipulation, deception, blackmail and other threats, genuine promises along the lines of "We're probably going to end up in charge somehow, and we'll treat you better when we do."
- Human allies could be given valuable intellectual property (developed by AIs), given instructions for making lots of money, and asked to rent their own servers and acquire their own property where an "AI headquarters" can be set up. Since the "AI headquarters" would officially be human property, it could be very hard for authorities to detect and respond to the danger.
- Via threats, AIs might be able to get key humans to cooperate with them - such as political leaders, or the CEOs of companies running lots of AIs. This would open up further strategies.
- As assumed above, particular companies are running huge numbers of AIs. The AIs being run by these companies might find security holes in the companies' servers (this isn't the topic of this piece, but my general impression is that security holes are widespread and that reasonably competent people can find many of them), and thereby might find opportunities to create durable "fakery" about what they're up to.
- E.g., they might set things up so that as far as humans can tell, it looks like all of the AI systems are hard at work creating profit-making opportunities for the company, when in fact they're essentially using the server farm as their headquarters - and/or trying to establish a headquarters somewhere else (by recruiting human allies, sending money to outside bank accounts, using that money to acquire property and servers, etc.)
- If AIs are in wide enough use, they might already be operating lots of drones and other military equipment, in which case it could be pretty straightforward to be able to defend some piece of territory - or to strike a deal with some government to enlist its help in doing so.
- AIs could mix-and-match the above methods and others: for example, creating "fakery" long enough to recruit some key human allies, then attempting to threaten and control humans in key positions of power to the point where they control solid amounts of military resources, then using this to establish a "headquarters."
So is there anything here you don't think is possible?
Getting human allies? Being in control of large sums of compute while staying undercover? Doing science, and getting human contractors/allies to produce the results? Etc.
I think this post would benefit from being more explicit on its target. This problem concerns AGI labs and their employees on one hand, and anyone trying to build a solution to Alignment/AI Safety on the other.
By narrowing the scope to the labs, we can better evaluate the proposed solutions (for example, to improve decision making we'll need to influence the decision makers therein), make them more focused (to the point of being lab-specific, analyzing each lab's pressures), and think of new solutions (inoculating ourselves/other decision makers on AI against believing stuff that comes from those labs by adding a strong dose of healthy skepticism).
By narrowing the scope to people working on AI safety whose status or monetary support relies on giving impressions of progress, we come up with different solutions (try to explicitly reward honesty, truthfulness and clarity over hype and story-making). A general recommendation I'd have is some kind of review that checks against "Wizard of Oz'ing", flagging the behavior and suggesting corrections. Currently I'd say the diversity of LW and its norms for truth seeking are doing quite well at this, so posting here publicly is a great way to keep this in check. It highlights the importance of this place and of upholding these norms.
Thanks for the reply!
The main reason I didn't understand (despite some things being listed) is I assumed none of that was happening at Lightcone (because I guessed you would filter out EAs with bad takes in favor of rationalists for example). The fact that some people in EA (a huge broad community) are probably wrong about some things didn't seem to be an argument that Lightcone Offices would be ineffective as (AFAIK) you could filter people at your discretion.
More specifically, I had no idea "a huge component of the Lightcone Offices was causing people to work at those organizations". That's strikingly more of a debatable move, but I'm curious: why did that happen in the first place? In my field building in France we talk of x-risk and alignment, and people don't want to accelerate the labs but do want to slow down or do alignment work. I feel a bit preachy here, but internally it just feels like the obvious move is "stop doing the probably bad thing", though I do understand that if you got into this situation unexpectedly, you'll have a better chance burning this place down and creating a fresh one with better norms.
Overall I get a weird feeling of "the people doing bad stuff are being protected again, we should name more precisely who's doing the bad stuff and why we think it's bad" (because I feel aimed at by vague descriptions like field-building, even though I certainly don't feel like I contributed to any of the bad stuff being pointed at)
No, this does not characterize my opinion very well. I don't think "worrying about downside risk" is a good pointer to what I think will help, and I wouldn't characterize the problem that people have spent too little effort or too little time on worrying about downside risk. I think people do care about downside risk, I just also think there are consistent and predictable biases that cause people to be unable to stop, or be unable to properly notice certain types of downside risk, though that statement feels in my mind kind of vacuous and like it just doesn't capture the vast majority of the interesting detail of my model.
So it's not a problem of not caring, but of not succeeding at the task. I assume the kind of errors you're pointing at are things which should happen less with more practiced rationalists? I guess then we can either filter to only have people who are already pretty good rationalists, or train them (I don't know if there are good results on that side from CFAR).
I don't think cost had that much to do with the decision, I expect that Open Philanthropy thought it was worth the money and would have been willing to continue funding at this price point.
In general I think the correct response to uncertainty is not half-speed. In my opinion it was the right call to spend this amount of funding on the office for the last ~6 months of its existence even when we thought we'd likely do something quite different afterwards, because it was still marginally worth doing it and the cost-effectiveness calculations for the use of billions of dollars of x-risk money on the current margin are typically quite extreme.
You're probably not the one to rant to about funding but I guess while the conversation is open I could use additional feedback and some reasons for why OpenPhil wouldn't be irresponsible in spending the money that way. (I only talk about OpenPhil and not particularly Lightcone, maybe you couldn't think of better ways to spend the money and didn't have other options)
Cost effectiveness calculations for reducing x-risk kinda always favor x-risk reduction so looking at it in the absolute isn't relevant. Currently AI x-risk reduction work is limited because of severe funding restrictions (there are many useful things to do that no one is doing for lack of money) which should warrant carefully done triage (and in particular considering the counterfactual).
I assume the average Lightcone office resident would be doing the same work with slightly reduced productivity (let's say by 1/3) if they didn't have that office space (notably because many are rich enough to get other shared office space out of their own pocket). Assuming 30 full-time workers in the office, that's 10 man-months per month of extra x-risk reduction work.
For contrast, over the same time period, $185k/month could provide salary, lodging and office space for 50 people in Europe, all of whom counterfactually would not be doing that work otherwise, for which I claim 50 man-months per month of extra x-risk reduction work. The biggest difference I see is that incubation time would be longer than for the Lightcone offices, but if I started now with $20k/month I'd find 5 people and scale it up to 50 by the end of the year.
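(Spelling out the arithmetic behind those two claims, using only my own rough numbers from above:)

\[
30 \text{ residents} \times \tfrac{1}{3} = 10 \text{ man-months/month},
\qquad
\frac{\$185\text{k/month}}{50 \text{ people}} = \$3.7\text{k per person-month} \;\Rightarrow\; 50 \text{ man-months/month}
\]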
I've multiple times been perplexed as to what the past events that lead to this kind of take (over 7 years ago, the EA/rationality community's influence probably accelerated OpenAI's creation) have to do with today's shutting down of the offices.
Are there current, present-day things going on in the EA and rationality community which you think warrant suspecting them of being incredibly net negative (causing worse worlds, conditioned on the current setup)? Things done in the last 6 months? At Lightcone Offices? (Though I'd appreciate specific examples, I'd already greatly appreciate knowing if there is something in the abstract, and would prefer a quick response at that level of precision to nothing.)
I've imagined an answer; is the following on your mind?
"EAs are more about saying they care about numbers than actually caring about numbers, and didn't calculate downside risk enough in the past. The past events reveal this attitude and because it's not expected to have changed, we can expect it to still be affecting current EAs, who will continue causing great harm because of not actually caring for downside risk enough. "
I mostly endorse having one office concentrate on one research agenda and be able to have high-quality conversations on it, and the stated number of maybe 10 to 20% of people working on strategy/meta sounds fine in that context. Still, I want to emphasize how crucial they are: if you have no one to figure out the path between your technical work and overall reducing risk, you're probably missing better paths and approaches (and maybe not realizing your work is useless).
Overall I'd say we don't have enough strategy work being done, and believe it's warranted to have spaces with 70% of people working on strategy/meta. I don't think it was bad if the Lightcone office had a lot of strategy work. (We probably also don't have enough technical alignment work, having more of both is probably good, if we coordinate properly)
Other commenters have argued about the correctness of using Shoggoth. I think it's mostly a correct term if you take it in the Lovecraftian sense and consider that currently we don't understand LLMs that much. Interpretability might work and we might progress, so we're not sure they actually are incomprehensible like Shoggoths (though according to Wikipedia, Shoggoths are made of physics, so probably advanced civilizations could get to a point where they could understand them; the analogy holds surprisingly well!)
Anyhow it's a good meme and useful to say "hey, we don't understand these things as well as you might imagine from interacting with the smiley face" to describe our current state of knowledge.
Now for trying to construct some idea of what it is.
I'll argue a bit against calling an LLM a pile of masks, as that seems to carry implications I don't believe in. The question we're asking ourselves is something like "what kind of algorithms/patterns do we expect to appear when an LLM is trained? Do those look like a pile of masks, or like a more general simulator that creates masks on the fly?", and the answer depends on specifics and optimization pressure. I want to sketch out different stages we could hope to see and understand better (and I'd like for us to test this empirically and find out how true this is). Earlier stages don't disappear, as they're still useful at all times, though other things start weighing more in the next-token predictions.
Level 0: Incompetent, random weights, no information about the real world or text space.
Level 1 "Statistics 101": Dumb heuristics that don't take word positions into account.
It knows facts about the world like token distribution and uses that.
Level 2 "Statistics 201": Better heuristics, some equivalent to Markov chains.
Its knowledge of text space increases; it produces idioms and reproduces common patterns. At this stage it already contains a huge amount of information about the world. It "knows" stuff like mirrors being likely to break and cause 7 years of bad luck.
Level 3 "+ Simple algorithms": Some pretty specific algorithms appear (like Indirect Object Identification), which can search for certain information and transfer it in more sophisticated ways. Some of these algorithms are good enough that they might not properly be described as heuristics anymore, but instead really represent the actual rules, as strongly as they exist in language (like rules of grammar properly applied). Note these circuits appear multiple times and trade off against other things, so overall behavior is still stochastic; there are heuristics on how much to weight these algorithms and other info.
Level 4 "Simulating what created that text": This is where it starts to have more and more powerful in-context learning, ie. its weights represent algorithms which do in-context search (combined with its vast encyclopedic knowledge of texts, tropes, genres) and figure out consistencies in characters or concepts introduced in the prompt. For example, it'll pick up on Alice's and Bob's different backgrounds, latent knowledge about them, their accents.
But it only does that because that's what authors generally do, and it has the same reasoning errors common to tropes. That's because it simulates not the content of the text (the characters in the story), but the thing which generates the story (the writer, who themselves have some simulation of the characters).
So, uh, do masks or a pile of masks fit anywhere in this story? Not that I see. The mask is a specific metaphor for the RLHF finetuning which causes mode collapse and makes the LLM mostly only play the nice assistant (and its opposites). It's a constraint or bridle or something, but if the training is light (doesn't affect the weights too much), then we expect the LLM to mostly be the same, and that wasn't masks.
Nor is there a pile of masks. It's a bunch of weights really good at token prediction, learning more and more sophisticated strategies for this. It encodes stereotypes in different places (maybe french=seduction or SF=techbro), but I don't believe these map onto different characters. Instead, I expect that at level 4 there's a more general algorithm which pieces together the different knowledge, so that it in-context learns to simulate certain agents. Thus, if you just take mask to mean "character", the LLM isn't a pile of them, but a machine which can produce them on demand.
(In this view of LLMs, x-risk happens because we feed some input where the LLM simulates an agentic, deceptive, self-aware agent which steers the outputs until it escapes the box.)
Cool that you wanna get involved! I recommend the most important thing to do is coordinate with other people already working on AI safety, because they might have plans and projects already going on you can help with, and to avoid the unilateralist's curse.
So, a bunch of places to look into to both understand the field of AI safety better and find people to collaborate with:
http://aisafety.world/tiles/ (lists different people and institutions working on AI safety)
https://coda.io/@alignmentdev/alignmentecosystemdevelopment (lists AI safety communities, you might join some international ones or local ones near you)
I have an agenda around outreach (convincing relevant people to take AI safety seriously) and think it can be done productively, though it wouldn't look like 'screaming on the rooftops', but more expert discussion with relevant evidence.
I'm happy to give an introduction to the field and give initial advice on promising directions, anyone interested dm me and we can schedule that.
I generally explain my interest in doing good and considering ethics (despite being anti-realist) with something like your point 5, and I don't agree with or fully get your refutation that it's not a good explanation, so I'll engage with it and hope for clarifications.
My position, despite anti-realism and moral relativism, is that I do happen to have values (which I call "personal values": they're mine, and I don't think there's an absolute reason for anyone else to have them, though I will advocate for them to some extent) and epistemics (despite the problem of the criterion) that have initialized in a space where I want to do Good, I want to know what is Good, and I want to iterate on improving my understanding and my actions doing Good.
A quick question: when you say "Personally, though, I collect stamps", do you mean your personal approach to ethics is descriptive and exploratory (you're collecting stamps in the sense of the physics-vs-stamp-collecting image), and that you don't identify as a systematizer?
I wouldn't identify as a "systematizer for its own sake" either; it's not a terminal value, but it's an instrumental value for achieving my goal of doing Good. I happen to have priors and heuristics saying I can do more Good by systematizing better, so I do, and I get positive feedback from it so I continue.
Re "conspicuous absence of subject-matter" - true for an anti realist considering "absolute ethics", but this doesn't stop an anti realist considering what they'll call "my ethics". There can be as much subject-matter there as in realist absolute ethics, because you can simulate absolute ethics in "my ethics" with : "I epistemically believe there is no true absolute ethics, but my personal ethics is that I should adopt what I imagine would be the absolute real ethics if it existed". I assume this is an existing theorized position but not knowing if it already has another standard name, I call this being a "quasi realist", which is how I'd describe myself currently.
I don't buy Anti realists treating consistency as absolute, so there's nothing to explain. I view valuing consistency as being instrumental and it happens to win all the time (every ethics has it) because of the math that you can't rule out anything otherwise. I think the person who answers "eh, I don’t care that much about being ethically consistent" is correct that it's not in their terminal values, but miscalculates (they actually should value it instrumentally), it's a good mistake to point out.
I agree that someone who tries to justify their intransitivities by saying "oh I'm inconsistent" is throwing out the baby with the bathwater when they could simply say "I'm deciding to be intransitive here because it better fits my points". Again, it's a good mistake to point out.
I see anti realists as just picking up consistency because it's a good property to have for useful ethics, not because "Ethics" forced it onto them (it couldn't, it doesn't exist).
On the final paragraph, I would write my position as: "I do ethics, as an anti-realist, because I have a brute, personal preference for Doing Good (a cluster of helping other people, reducing suffering, anything that stems from the Veil of Uncertainty and is intuitively appealing), and this is self-reinforcing (I consider it Good to want to do Good and to improve at doing Good), so I want to improve my ethics. There exist zones of value space where I'm in the dark and have no intuition (eg. population ethics/the repugnant conclusion), so I use good properties (consistency, ...) to craft a curve which extends my ethics, not out of personal preference for blah-structural-properties, but out of belief that this will satisfy my preference for Doing Good the best".
If a dilemma comes up pitting object-level stakes against some abstract structural constraint, I weigh my belief that my intuition on "is this Good" is correct against my belief that "the model of ethics I constructed from other points is correct", and I'll probably update one or both. Because of the problem of the criterion, I'm not going to trust either my ethics or my data points as absolute. I have uncertainty on the position of all my points and on the best shape of the curve, so sometimes I move my estimate of a point's position because it fits the curve better, and sometimes I move the curve's shape because I'm pretty sure the point should be there.
I hope that's a fully satisfying answer to "Why do ethical anti-realists do ethics".
I wouldn't say there's an absolute reason why ethical anti-realists should do ethics.
Mostly unrelated - I'm curious about the page you linked to https://acsresearch.org/
As far as I can see, this is a fun site with a network simulation and no explanation. I'd have liked to see an about page with the stated goals of ACS (or simply a link to your introductory post) so I can point to that site when talking about you.
I don't dispute that at some point in time we want to solve alignment (to come out of the precipice period), but I disputed that it's more dangerous to know how to point AI before having solved what perfect goal to give it.
In fact, I think it's less dangerous, because we at minimum gain more time to work on and solve alignment, and at best can use existing near-human-level AGI to help us solve alignment too. The main reason to believe this is that near-human-level AGI is a particular zone where we can detect deception, where it can't easily unbox itself and take over, yet it is still useful. The longer we stay in this zone, the more relatively safe progress we can make (including on alignment).
Thanks for the answers. It seems they mostly point to you valuing stuff like freedom/autonomy/self-realization, and finding violations of that distasteful. I think your answers are pretty reasonable, and though I might not have the exact same level of sensitivity, I agree with the ballpark and ranking (brainwashing is worse than explaining, teaching chess exclusively feels a little too heavy-handed...).
So where our intuitions differ is probably that you're applying these heuristics about valuing freedom/autonomy/self-realization to the AI systems we train? Do you see them as people, or more abstractly as moral patients (because they're probably conscious or something)?
I won't get into the moral weeds too fast. I'd point out that though I do currently mostly believe consciousness and moral patienthood are quite achievable "in silico", that doesn't mean every intelligent system is conscious or a moral patient, and we might create AGI that isn't of that kind. If you suppose AGI is conscious and a moral patient, then yeah, I guess you can argue against it being pointed somewhere, but I'd mostly counter-argue from moral relativism that "letting it point anywhere" is not fundamentally more good than "pointed somewhere", so since we exist and have morals, let's point it to our morals anyway.