Posts

Non-loss of control AGI-related catastrophes are out of control too 2023-06-12T12:01:26.682Z
Is there a way to sort LW search results by date posted? 2023-03-12T04:56:06.786Z
A newcomer’s guide to the technical AI safety field 2022-11-04T14:29:46.873Z
Embedding safety in ML development 2022-10-31T12:27:13.055Z
aisafety.community - A living document of AI safety communities 2022-10-28T17:50:12.535Z
My Thoughts on the ML Safety Course 2022-09-27T13:15:03.000Z
Summary of ML Safety Course 2022-09-27T13:05:39.828Z
Levels of goals and alignment 2022-09-16T16:44:50.486Z
What if we approach AI safety like a technical engineering safety problem 2022-08-20T10:29:38.691Z
I missed the crux of the alignment problem the whole time 2022-08-13T10:11:24.826Z

Comments

Comment by zeshen on Using axis lines for good or evil · 2024-03-19T08:56:38.893Z · LW · GW

My first impression was also that axis lines are a matter of aesthetics. But then I browsed The Economist's visual style guide and realized they do something similar, i.e. omit the y-axis line (in fact, they omit it on basically all their line / scatter plots, but almost always keep the gridlines).

Here's also an article they ran about their errors in data visualization, though it's probably fairly introductory for the median LW reader.

Comment by zeshen on Good taxonomies of all risks (small or large) from AI? · 2024-03-07T12:53:48.187Z · LW · GW

I'm pretty sure you have come across this already, but just in case you haven't:

https://incidentdatabase.ai/taxonomy/gmf/

Comment by zeshen on Funding case: AI Safety Camp · 2023-12-23T10:37:36.358Z · LW · GW

Strong upvoted. I was a participant in AISC8, in the team that went on to launch AI Standards Lab, which I think counterfactually would not have been launched if not for AISC.

Comment by zeshen on How should we think about the decision relevance of models estimating p(doom)? · 2023-05-11T07:02:42.205Z · LW · GW

Why is this question getting downvoted?

Comment by zeshen on Support me in a Week-Long Picketing Campaign Near OpenAI's HQ: Seeking Support and Ideas from the LessWrong Community · 2023-05-01T16:16:15.148Z · LW · GW

This seems to be another one of those instances where I wish there were a dual-voting system for posts. I would've liked to strong-disagree with the contents of the post without discouraging well-intentioned people from posting here.

Comment by zeshen on [SEE NEW EDITS] No, *You* Need to Write Clearer · 2023-04-30T17:00:11.184Z · LW · GW

I feel like a substantial amount of disagreement between alignment researchers is not object-level but semantic, and I remember seeing instances where person X writes a post about how he/she disagrees with a point that person Y made, with person Y responding that that wasn't even the point at all. In many cases, it appears that simply saying what you don't mean could have prevented a lot of unnecessary misunderstanding.

Comment by zeshen on Catching the Eye of Sauron · 2023-04-27T05:25:59.454Z · LW · GW

I'm curious if there are specific parts of the usual arguments that you find logically inconsistent.

Comment by zeshen on LLM Basics: Embedding Spaces - Transformer Token Vectors Are Not Points in Space · 2023-04-21T07:23:12.120Z · LW · GW

I Googled 'how are tokens embedded' and this post came up third in the results - thanks for the post!

Comment by zeshen on "Carefully Bootstrapped Alignment" is organizationally hard · 2023-04-08T16:25:16.152Z · LW · GW

If this interests you, there is a proposal in the Guideline for Designing Trustworthy Artificial Intelligence by Fraunhofer IAIS which includes the following:

[AC-R-TD-ME-06] Shutdown scenarios
Requirement: Do
Scenarios should be identified, analyzed and evaluated in which the live AI application must be completely or partially shut down in order to maintain the ability of users and affected persons to perceive situations and take action. This includes shutdowns due to potential bodily injury or damage to property and also due to the violation of personal rights or the autonomy of users and affected persons. Thus, depending on the application context, this point involves analyzing scenarios that go beyond the accidents/safety incidents discussed in the Dimension: Safety and Security (S). For example, if it is possible that the AI application causes discrimination that cannot be resolved immediately, this scenario should be considered here. When evaluating the scenarios, the consequences of the shutdown for the humans involved, work processes, organization and company, as well as additional time and costs, should also be documented. This is compared with the potential damage that could arise if the AI application were not shut down. Documentation should be available on the AI application shutdown strategies that were developed based on the identified scenarios – both short-term, mid-term and permanent shutdown. Similarly, scenarios for shutting down subfunctions of the AI application should also be documented. Reference can be made to shutdown scenarios that may have already been covered in the Risk area: functional safety (FS) (see [S-RFS-ME-10]). A shutdown scenario documents
– the setting and the resulting decision-making rationale for the shutdown,
– the priority of the shutdown,
– by which persons or roles the shutdown is implemented and how it is done,
– how the resulting outage can be compensated,
– the expected impact for individuals or for the affected organization.

[AC-R-TD-ME-07] Technical provision of shutdown options
Requirement: Do
Documentation should be available on the technical options for shutting down specific subfunctions of the AI application as well as the entire AI application. Here, reference can be made to [S-R-FS-ME-10] or [S-RFS-ME-12] if necessary. It is outlined that other system components or business processes that use (sub)functionality that can be shutdown have been checked and (technical) measures that compensate for negative effects of shutdowns are prepared. If already covered there, reference can be made to [S-R-FS-ME-10].

Comment by zeshen on "Carefully Bootstrapped Alignment" is organizationally hard · 2023-04-04T05:04:08.364Z · LW · GW

Everyone in any position of power (which includes engineers who are doing a lot of intellectual heavy-lifting, who could take insights with them to another company), thinks of it as one of their primary jobs to be ready to stop

In some industries, Stop Work Authorities are implemented, where any employee at any level in the organisation has the power to stop any work they deem unsafe at any time. I wonder if something similar in spirit would be feasible to implement at top AI labs.

Comment by zeshen on The hot mess theory of AI misalignment: More intelligent agents behave less coherently · 2023-03-12T06:58:34.405Z · LW · GW

Without thinking about it too much, this fits my intuitive sense. An amoeba can't possibly demonstrate a high level of incoherence because it simply can't do a lot of things, and whatever it does would have to be very much in line with its goal (?) of survival and reproduction. 

Comment by zeshen on Rationality-related things I don't know as of 2023 · 2023-02-12T08:31:50.072Z · LW · GW

Thanks for this post. I've always had the impression that everyone around LW has been familiar with these concepts since they were kids and now knows them by heart, while I've been struggling with some of these concepts for the longest time. It's comforting to me that there are long-time LWers who don't necessarily fully understand all of this stuff either.

Comment by zeshen on You Don't Exist, Duncan · 2023-02-08T06:05:23.037Z · LW · GW

Browsing through the comments section, it seems that everyone relates to this pretty well. I do, too. But I'm wondering if this applies mostly to a LW subculture, or whether it's a Barnum/Forer effect that every neurotypical person would also relate to.

Comment by zeshen on A newcomer’s guide to the technical AI safety field · 2023-02-06T13:42:20.040Z · LW · GW

With regard to the Seed AI paradigm, most of the publications seem to have come from MIRI (especially the earlier ones, when they were called the Singularity Institute), with many discussions happening both here on LessWrong and at events like the Singularity Summit. I'd say most of the thinking around this paradigm happened before the era of deep learning. Nate Soares' post might provide more context.

You're right that brain-like AI has not had much traction yet, but it seems to me that there is growing interest in this research area lately (albeit growing much more slowly than interest in the Prosaic AI paradigm), and I don't think it falls squarely under either the Seed AI paradigm or the Prosaic AI paradigm. Of course there may be considerable overlap between these 'paradigms', but I felt they were sufficiently distinct for brain-like AI to warrant a category of its own, even though I may not think of it as a critical concept in the AI literature.

Comment by zeshen on Why I hate the "accident vs. misuse" AI x-risk dichotomy (quick thoughts on "structural risk") · 2023-01-31T11:38:22.239Z · LW · GW

AI is highly non-analogous with guns.

Yes, especially for consequentialist AIs that don't behave like tool AIs. 

Comment by zeshen on Why I hate the "accident vs. misuse" AI x-risk dichotomy (quick thoughts on "structural risk") · 2023-01-31T05:12:07.693Z · LW · GW

I feel like I broadly agree with most of the points you make, but I also feel like accident vs misuse are still useful concepts to have. 

For example, disasters caused by guns could be seen as:

  • Accidents, e.g. killing people by mistaking real guns for prop guns, which may be mitigated with better safety protocols
  • Misuse, e.g. school shootings, which may be mitigated with better legislation and better security etc.
  • Other structural causes (?), e.g. guns used in wars, which may be mitigated with better international relations

Nevertheless, all of the above are complex and structural in different ways, and it is often counterproductive or plain misleading to assign blame (or credit) to the causal node directly upstream of the disaster (in this case, guns).

While I agree that the majority of AI risks are caused by neither accidents nor misuse, and that they shouldn't be seen as a dichotomy, I do feel that the distinction may still be useful in some contexts, e.g. for thinking about what the mitigation approaches could look like.

Comment by zeshen on Recursive Middle Manager Hell · 2023-01-26T07:29:56.162Z · LW · GW

Upvoted. Though as someone who has been in the corporate world for close to a decade, this is probably one of the rare LW posts that I didn't learn anything new from. And because every point is so absolutely true and extremely common in my experience, when reading the post I was just wondering the whole time how this is even news.

Comment by zeshen on Models Don't "Get Reward" · 2023-01-18T06:39:44.407Z · LW · GW

There are probably enough comments here already, but thanks again for the post, and thanks to the mods for curating it (I would've missed it otherwise).

Comment by zeshen on Be less scared of overconfidence · 2022-12-01T02:24:35.838Z · LW · GW

This is a nice post that echoes many points in Eliezer's book Inadequate Equilibria. In short, it is entirely possible to outperform 'experts' or 'the market' if there are reasons to believe that these systems converge to a sub-optimal equilibrium, and even more so when you have more information than the 'experts', as in your Wave vs Theorem example.

Comment by zeshen on Don't design agents which exploit adversarial inputs · 2022-11-26T10:58:37.583Z · LW · GW

Thanks for the explanation!

Comment by zeshen on Don't design agents which exploit adversarial inputs · 2022-11-24T08:55:17.786Z · LW · GW

In every scenario, if you have a superintelligent actor which is optimizing the grader's evaluations while searching over a large real-world plan space, the grader gets exploited.

Similar to the evaluator-child who's trying to win his mom's approval by being close to the gym teacher, how would grader exploitation be different from specification gaming / reward hacking? In theory, wouldn't a perfect grader solve the problem? 

Comment by zeshen on A newcomer’s guide to the technical AI safety field · 2022-11-21T06:23:39.570Z · LW · GW

In case anyone comes across this post trying to understand the field, Scott Aaronson did a better job than me at describing the "seed AI" and "prosaic AI" paradigms here, which he calls "Orthodox" vs "Reform".

Comment by zeshen on Don't design agents which exploit adversarial inputs · 2022-11-21T04:10:31.715Z · LW · GW

I'm probably missing something, but doesn't this just boil down to "misspecified goals lead to reward hacking"?

Comment by zeshen on Reflective Consequentialism · 2022-11-19T10:27:54.835Z · LW · GW

This post makes sense to me, though it feels almost trivial. I'm puzzled by the backlash against consequentialism; it just feels like people are overreacting. Or maybe the 'backlash' isn't actually as strong as I'm reading it to be.

I'd think of virtue ethics as some sort of equilibrium that society has landed itself in after all these years of being a species capable of thinking about ethics. It's not the best, but you'd need more than naive utilitarianism to beat it (this EA Forum post feels like common sense to me too), which is what you describe as reflective consequentialism. It seems like it all boils down to: be a consequentialist, as long as you 1) account for second-order and higher effects, and 2) account for bad calculations due to corrupted hardware.

Comment by zeshen on 2-D Robustness · 2022-11-19T07:36:30.641Z · LW · GW

Thanks - this helps.

Comment by zeshen on 2-D Robustness · 2022-11-17T07:08:27.703Z · LW · GW

Thanks for the reply! 

But I think you can come up with clean examples of capabilities failures if you look at, say, robots that use search to plan; they often do poorly according to the manually specified reward function on new domains because optimizing the reward is too hard for its search algorithm. 

I'd be interested to see actual examples of this, if there are any. But also, how would this not be an objective robustness failure if we frame the objective as "maximize reward"? 

if you perform Inverse Optimal Control on the behavior of the robot and derive a revealed reward function, you'll find that its 

Do you mean to say that its reward function will be indistinguishable from its policy?

there doesn't seem to be a super principled way of dividing up capabilities and preferences in the first place

Interesting paper, thanks! If a policy cannot be decomposed into a planning algorithm and a reward function anyway, it's unclear to me why 2D-robustness would be a better framing of robustness than just 1D-robustness.

Comment by zeshen on 2-D Robustness · 2022-11-16T03:57:48.528Z · LW · GW

Thanks for the example, but why is this a capabilities robustness problem and not an objective robustness problem, if we think of the objective as 'classify pandas accurately'?

Comment by zeshen on Thoughts on AGI safety from the top · 2022-11-08T18:07:14.237Z · LW · GW

I don't know how I even got here after so long but I really like this post. Looking forward to next year's post.

Comment by zeshen on Has anyone increased their AGI timelines? · 2022-11-08T11:28:05.155Z · LW · GW

I'd love to see a post with your reasoning.

Comment by zeshen on 4 Key Assumptions in AI Safety · 2022-11-08T10:31:09.008Z · LW · GW

I think these are fair assumptions for the alignment field in general. There is, however, work done outside this community that has different assumptions but also calls itself AI safety, e.g. this one.

(I've written more about these assumptions here).

Comment by zeshen on Instead of technical research, more people should focus on buying time · 2022-11-07T23:40:52.863Z · LW · GW

Buying time could also mean implementing imperfect solutions that don't work against strong AGIs but might help prevent us from being destroyed by the first AGI, which might be relatively weak.

(I wrote about it recently)

Comment by zeshen on Discussion: Objective Robustness and Inner Alignment Terminology · 2022-11-07T11:20:02.488Z · LW · GW

For example, although our results show CoinRun models failed to learn the general capability of pursuing the coin, the more natural interpretation is that the model has learned a robust ability to avoid obstacles and navigate the levels,[7] but the objective it learned is something like “get to the end of the level,” instead of “go to the coin.”

It seems to me that every robustness failure can be interpreted as an objective robustness failure (as aptly titled in your other post). Do you have examples of a capability robustness failure that is not an objective robustness failure?

Comment by zeshen on 2-D Robustness · 2022-11-07T11:08:36.468Z · LW · GW

Are there any examples of capability robustness failures that aren't objective robustness failures? 

Comment by zeshen on [AN #112]: Engineering a Safer World · 2022-11-02T11:08:06.977Z · LW · GW

I got the book (thanks to Conjecture) after doing the Intro to ML Safety course, where the book was recommended. I then browsed through the book and thought of writing a review of it - but I found this post instead, which is a much better review than I would have written, so thanks a lot for this!

Let me just put down a few thoughts that might be relevant for someone else considering picking up this book.

Target audience: Right at the beginning of the book, the author says "This book is written for the sophisticated practitioner rather than the academic researcher or the general public." I think this is relevant, as the book goes to a level of detail way beyond what's needed to get a good overview of engineering safety.

Relevance to AI safety: I feel like most engineering safety concepts are not applicable to alignment, firstly because an AGI would likely not have any human involvement in its optimization process, and secondly because the basic underlying STAMP constructs of safety constraints, hierarchical safety control structures, and process models are simply more applicable to engineering systems. As stated on p. 100, "STAMP focuses particular attention on the role of constraints in safety management," and I highly doubt an AGI can be bounded by constraints. Nevertheless, Chapter 8 (STPA: A New Hazard Analysis Technique), which describes STPA (System-Theoretic Process Analysis), may be somewhat relevant to designing safety interlocks. Also, the final chapter (13), on Managing Safety and the Safety Culture, is broadly applicable to any field that involves safety.

Criticisms of conventional techniques: The book often claims that techniques like STAMP and STPA are superior to conventional techniques like HAZOP, and gives quotes by reviewers attesting to their superiority. I don't know if those criticisms are really fair, given that these techniques are not really adopted, at least in the oil and gas industry, which, for all its flaws, takes safety very seriously. Perhaps the criticisms would be fair for very outdated safety practices. Nevertheless, the general concepts of engineering safety feel quite similar whether one uses the 'conventional' techniques or the 'new' techniques described in the book.

Overall, I think this book provides a good overview of engineering safety concepts, but for the general audience (or alignment researchers) it goes into too much detail on specific case studies and arguments. 

Comment by zeshen on Epistemic modesty and how I think about AI risk · 2022-10-27T12:47:15.803Z · LW · GW

This is actually what my PhD research is largely about: Are these risks actually likely to materialize? Can we quantify how likely, at least in some loose way? Can we quantify our uncertainty about those likelihoods in some useful way? And how do we make the best decisions we can if we are so uncertain about things?

I'd be really interested in your findings.

Comment by zeshen on aisafety.community - A living document of AI safety communities · 2022-10-24T09:48:58.417Z · LW · GW

If there's an existing database of university groups already, it would be great to include a link to that database, perhaps under "Local EA Group". Thanks!

Comment by zeshen on 'Utility Indifference' (2010) by FHI researcher Stuart Armstrong · 2022-10-18T14:53:46.513Z · LW · GW

The link in the post no longer works. Here's one that works.

Comment by zeshen on Charitable Reads of Anti-AGI-X-Risk Arguments, Part 1 · 2022-10-07T14:14:10.643Z · LW · GW

I thought this was a reasonable view and I'm puzzled by the downvotes. But I'm also confused by the conclusion - are you arguing that x-risk from AGI is predictable, or that it isn't? Or is the post just meant to give examples of the merits of both arguments?

Comment by zeshen on Alignment Org Cheat Sheet · 2022-10-05T14:09:25.059Z · LW · GW

(see Zac's comment for some details & citations)

Just letting you know the link doesn't work although the comment was relatively easy to find. 

Comment by zeshen on My Thoughts on the ML Safety Course · 2022-10-03T11:26:15.381Z · LW · GW

Thanks for the comment!

You can read more about how these technical problems relate to AGI failure modes and how they rank on importance, tractability, and crowdedness in Pragmatic AI Safety 5. I think the creators included this content in a separate forum post for a reason.

I felt some of the content in the PAIS series would've been great for the course, though the creators probably had a reason to exclude it, and I'm not sure what that reason was.

The second group doesn't necessarily care about why each research direction relates to reducing X-risk.

In that case I feel it might be better for the chapter on x-risk to be removed entirely: better to not include it at all than to include it and mostly show quotes from famous people without properly engaging with the arguments.

Comment by zeshen on Fun with +12 OOMs of Compute · 2022-09-23T15:31:32.518Z · LW · GW

Ah that's clear, thanks! I must've overlooked the "In 2016" right at the top of the post. 

Comment by zeshen on Fun with +12 OOMs of Compute · 2022-09-23T11:25:19.173Z · LW · GW

Very minor thing, but I was confused for a while when you said 'end of 2020' - I took it to mean the year rather than the decade (the 2020s).

Comment by zeshen on Levels of goals and alignment · 2022-09-22T17:08:52.883Z · LW · GW

Your position makes sense. Part of it was just me paraphrasing (what seems to me to be) the 'consensus view' that preventing AIs from wiping us out is much more urgent / important than preventing AIs from keeping us alive in a far-from-ideal state.

Comment by zeshen on Levelling Up in AI Safety Research Engineering · 2022-09-02T10:16:12.630Z · LW · GW

This is a great guide - thank you. However, in my experience as someone completely new to the field, 100-200 hours per level is very optimistic. I've easily spent double or triple that on the first two levels without getting to a comfortable level.

Comment by zeshen on Complex Systems for AI Safety [Pragmatic AI Safety #3] · 2022-08-29T17:45:34.133Z · LW · GW

For those who prefer not to spend 3 hours (or 1.5 hours on 2x speed) watching the video, the lecture notes are here. They seem fairly self-explanatory.

Comment by zeshen on What if we approach AI safety like a technical engineering safety problem · 2022-08-27T14:35:22.815Z · LW · GW

This is great, thanks!

Comment by zeshen on Toni Kurz and the Insanity of Climbing Mountains · 2022-08-23T20:18:42.637Z · LW · GW

"Because it's there" - George Mallory in 1923, when asked why he wanted to climb Everest. He died in his summit attempt the following year. 

Comment by zeshen on Announcing the Distillation for Alignment Practicum (DAP) · 2022-08-18T22:44:38.950Z · LW · GW

Comment by zeshen on Utility ≠ Reward · 2022-08-18T21:55:47.192Z · LW · GW

A part of me is worried that the terminology invites viewing mesa-optimisers as a description of a very specific failure mode, instead of as a language for the general worry described above.

I have been very confused about the term for a very long time, and have always thought of mesa-optimisers as a very specific failure mode.

This post helped me clear things up.

Comment by zeshen on Reward is not the optimization target · 2022-08-15T21:59:53.703Z · LW · GW

Are you just noting that the model won't necessarily find the global maxima, and only reach some local maxima?

That was my takeaway as well, but I'm also somewhat confused.