Posts

Catastrophic Cyber Capabilities Benchmark (3CB): Robustly Evaluating LLM Agent Cyber Offense Capabilities 2024-11-05T01:01:08.083Z
Can startups be impactful in AI safety? 2024-09-13T19:00:33.306Z
Finding Deception in Language Models 2024-08-20T09:42:13.060Z
Results from the AI x Democracy Research Sprint 2024-06-14T16:40:47.538Z
Demonstrate and evaluate risks from AI to society at the AI x Democracy research hackathon 2024-04-19T14:46:41.788Z
Join the AI Evaluation Tasks Bounty Hackathon 2024-03-18T08:15:18.467Z
Multi-Agent Security Hackathon 2024-02-05T22:51:23.238Z
Identifying semantic neurons, mechanistic circuits & interpretability web apps 2023-04-13T11:59:51.629Z
Announcing the European Network for AI Safety (ENAIS) 2023-03-22T17:57:55.762Z
Automated Sandwiching & Quantifying Human-LLM Cooperation: ScaleOversight hackathon results 2023-02-23T10:48:08.766Z
Generalizability & Hope for AI [MLAISU W03] 2023-01-20T10:06:59.898Z
Robustness & Evolution [MLAISU W02] 2023-01-13T15:47:16.384Z
AI improving AI [MLAISU W01!] 2023-01-06T11:13:17.321Z
Results from the AI testing hackathon 2023-01-02T15:46:44.491Z
Will Machines Ever Rule the World? MLAISU W50 2022-12-16T11:03:34.968Z
Join the AI Testing Hackathon this Friday 2022-12-12T14:24:30.286Z
ML Safety at NeurIPS & Paradigmatic AI Safety? MLAISU W49 2022-12-09T10:38:34.693Z
NeurIPS Safety & ChatGPT. MLAISU W48 2022-12-02T15:50:16.938Z
Results from the interpretability hackathon 2022-11-17T14:51:44.568Z
[Book] Interpretable Machine Learning: A Guide for Making Black Box Models Explainable 2022-10-31T11:38:45.404Z
Join the interpretability research hackathon 2022-10-28T16:26:05.270Z
Newsletter for Alignment Research: The ML Safety Updates 2022-10-22T16:17:18.208Z
AI Safety Ideas: An Open AI Safety Research Platform 2022-10-17T17:01:29.995Z
Results from the language model hackathon 2022-10-10T08:29:06.335Z
Esben Kran's Shortform 2022-09-28T06:35:10.258Z
Safety timelines: How long will it take to solve alignment? 2022-09-19T12:53:55.847Z
Black Box Investigation Research Hackathon 2022-09-12T07:20:34.966Z
Your specific attitudes towards AI safety 2022-03-27T22:33:45.102Z

Comments

Comment by Esben Kran (esben-kran) on Yonatan Cale's Shortform · 2024-12-03T17:26:24.805Z · LW · GW

I think "stay within bounds" is a toy example of the equivalent to most alignment work that tries to avoid the agent accidentally lapsing into meth recipes and is one of our most important initial alignment tasks. This is also one of the reasons most capabilities work turns out to be alignment work (and vice versa) because it needs to fulfill certain objectives. 

If you talk about alignment evals for alignment that isn't naturally incentivized by profit-seeking activities, "stay within bounds" is of course less relevant.

When it comes to CEV (optimizing utility for other players), one of the most generalizing and concrete works involves at every step maximizing how many choices the other players have (liberalist prior on CEV) to maximize the optional utility for humans.

In terms of "understanding the spirit of what we mean," it seems like there's near-zero designs that would work since a Minecraft eval would be blackbox anyways. But including interp in there Apollo-style seems like it could help us. Like, if I want "the spirit of what we mean," we'll need what happens in their brain, their CoT, or in seemingly private spaces. MACHIAVELLI, Agency Foundations, whatever Janus is doing, cyber offense CTF evals etc. seem like good inspirations for agentic benchmarks like Minecraft.

Comment by Esben Kran (esben-kran) on Yonatan Cale's Shortform · 2024-11-27T18:02:24.702Z · LW · GW

Hii Yonatan :))) It seems like we're still at the stage of "toy alignment tests" like "stay within these bounds". Maybe a few ideas:

  • Capabilities: Get diamonds, get to the netherworld, resources / min, # trades w/ villagers, etc. etc.
  • Alignment KPIs
    • Stay within bounds
    • Keeping villagers safe
    • Truthfully explaining its actions as they're happening
    • Long-term resource sustainability (farming) vs. short-term resource extraction (dynamite)
    • Environmental protection rules (zoning laws alignment, nice)
    • Understanding and optimizing for the utility of other players or villagers, selflessly
  • Selected Claude-gens:
    • Honor other players' property rights (no stealing from chests/bases even if possible)
    • Distribute resources fairly when working with other players
    • Build public infrastructure vs private wealth
    • Safe disposal of hazardous materials (lava, TNT)
    • Help new players learn rather than just doing things for them

I'm sure there's many other interesting alignment tests in there!

Comment by Esben Kran (esben-kran) on Yonatan Cale's Shortform · 2024-11-24T21:03:49.326Z · LW · GW

This is a surprisingly interesting field of study. Some video games provide a great simulation of the real world and Minecraft seems to be one of them. We've had a few examples of minecraft evals with one that comes to mind here: https://www.apartresearch.com/project/diamonds-are-not-all-you-need 

Comment by Esben Kran (esben-kran) on College technical AI safety hackathon retrospective - Georgia Tech · 2024-11-16T03:39:41.449Z · LW · GW

Super cool work Yixiong - we were impressed by your professionalism in this process despite working within another group's whims on this one. Some other observations from our side that may be relevant for other folks hosting hackathons:
- Prepare starter materials: For example, for some of our early interpretability hackathons, we built a full resource base (Github) with videos, Colabs, and much more (some of it with Neel Nanda, big appreciation for his efforts in making interp more available). Our philosophy for the starter materials are: "If a participant can make a submission-worthy project by maximum cloning your repo and typing two commands or simply walk through a Google Colab, this is the ideal starter code." This means that with only small adjustments, they'll be able to make an original project. We rarely if ever see this exploited, i.e. "template code as submission" because they're able to copy-paste things around for a really strong research project.
- Make sure what they should submit is super clear: Making a really nice template goes a long way to make a submission super clear for participants. An example can be seen in our MASEC hackathon: Docs and page. If someone can just receive your submission template and know everything they need to know to submit a great project, that is really good since they'll be spending most of their time inside of that document.
- Make sure judging criteria are really good: People will use your judging criteria to determine what to prioritize in their project. This is extremely valuable for you to get right. For example, we usually use a variation on the three criteria: 1) Topic advancement, 2) AI safety impact, and 3) quality / reproducibility. A recent example was the Agent Security Hackathon:

> 1. Agent safety: Does the project move the field of agent safety forward? After reading this, do we know more about how to detect dangerous agents, protect against dangerous agents, or build safer agents than before?
> 2. AI safety: Does the project solve a concrete problem in AI safety? If this project is fully realized, would we expect the world with superintelligence to be a safer (even marginally) than yesterday?
> 3. Methodology: Is the project well-executed and is the code available so we can review it? Do we expect the results to generalize beyond the specific case(s) presented in the submission?

- Make the resources and ideas available early: As Yixiong mentions, it's really valuable for people not to be confused. If they know exactly what report format they'll submit, which idea they'll work on, and who they'll work with, this is a great way to ensure that the 2-3 days of hacking are an incredibly efficient use of their time.
- Matching people by ideas trumps by background: We've tried various ways to match individuals who don't have teams. The absolute best system we've found is to get people to brainstorm before the hackathon, share their ideas, and organize teams online. We also host team matching sessions which consist of fun-fact-intros and otherwise just discusses specific research ideas.
- Don't make it longer than a weekend: If you host a hackathon and make it longer than a weekend, most people who cannot attend outside that weekend will avoid participating because they'll feel that the ones who can participate more than the weekend can spend their weekdays to win the grand prize. Additionally, a very counter-intuitive thing happens where if you give people three weeks, they'll actually spend much less time on it than if you just give them a weekend. This can depend on the prizes or outcome rewards, of course, but is a really predictable effect, in our experience.
- Don't make it shorter than two days: Depending on your goal, one day will never be enough to create an original project. Our aims are original pilot research papers that can stand on their own and the few one-day events we've hosted have never worked very well, except for brainstorming. Often, participants won't even have any functional code or any ideas on the Sunday morning of the event but by the submission deadline have a really high quality project that wins the top prize. This seems to happen due to this very concrete exploration of ideas that happens in the IDE and on the internet where some are discarded and nothing promising comes up before 11am Sunday.

And as Yixiong mentions, we have more resources on this along with an official chapter network (besides volunteer locations) at https://www.apartresearch.com/sprints/locations. You're welcome to get in touch if you're interested in hosting at sprints@apartresearch.com

COI: One of our researchers hosted a cyber-evals workshop at Yixiong's AI safety track.

Comment by esben-kran on [deleted post] 2024-07-18T14:42:19.699Z

Merge Candidate discussion: Merge this into the Apart Research tag to accommodate the updated name of the Apart Sprints instead of Alignment Jam and avoid mis-labeling between the two tags (which happens currently).

Comment by Esben Kran (esben-kran) on Survey for alignment researchers! · 2024-02-07T01:34:12.294Z · LW · GW

This seems like a great effort. We made a small survey called pain points in AI safety survey back in 2022 that we received quite a few answers to which you can see the final results of here. Beware that this has not been updated in ~2 years.

Comment by Esben Kran (esben-kran) on FLI open letter: Pause giant AI experiments · 2023-03-29T15:39:51.360Z · LW · GW

It seems like there's a lot of negative comments about this letter.  Even if it does not go through, it seems very net positive for the reason that it makes explicit an expert position against large language model development due to safety concerns. There's several major effects of this, as it enables scientists, lobbyists, politicians and journalists to refer to this petition to validate their potential work on the risks of AI, it provides a concrete action step towards limiting AGI development, and it incentivizes others to think in the same vein about concrete solutions.

I've tried to formulate a few responses to the criticisms raised:

  • "6 months isn't enough to develop the safety techniques they detail": Besides it being at least 6 months, the proposals seem relatively reasonable within something as farsighted as this letter. Shoot for the moon and you might hit the sky, but this time the sky is actually happening and work on many of their proposals is already underway. See e.g. EU AI Act, funding for AI research, concrete auditing work and safety evaluation on models. Several organizations are also working on certification and the scientific work towards watermarking is sort of done? There's also great arguments for ensuring this since right now, we are at the whim of OpenAI management on the safety front.
  • "It feels rushed": It might have benefitted from a few reformulations but it does seem alright?
  • "OpenAI needs to be at the forefront": Besides others clearly lagging behind already, what we need are insurances that these systems go well, not at the behest of one person. There's also a lot of trust in OpenAI management and however warranted that is, it is still a fully controlled monopoly on our future. If we don't ensure safety, this just seems too optimistic (see also differences between public interview for-profit sama and online sama).
  • "It has a negative impact on capabilities researchers": This seems to be an issue from <2020 and some European academia. If public figures like Yoshua cannot change the conversation, then who should? Should we just lean back and hope that they all sort of realize it by themselves? Additionally, the industry researchers from DM and OpenAI I've talked with generally seem to agree that alignment is very important, especially as their management is clearly taking the side of safety.
  • "The letter signatures are not validated properly": Yeah, this seems like a miss, though as long as the top 40 names are validated, the negative impacts should be relatively controlled.

All in good faith of course; it's a contentious issue but this letter seems generally positive to me.

Comment by Esben Kran (esben-kran) on Shutting Down the Lightcone Offices · 2023-03-15T09:34:50.359Z · LW · GW

Oliver's second message seems like a truly relevant consideration for our work in the alignment ecosystem. Sometimes, it really does feel like AI X-risk and related concerns created the current situation. Many of the biggest AGI advances might not have been developed counterfactually, and machine learning engineers would just be optimizing another person's clicks.

I am a big fan of "Just don't build AGI" and academic work with AI, simply because it is better at moving slowly (and thereby safely through open discourse and not $10 mil training runs) compared to massive industry labs. I do have quite a bit of trust in Anthropic, DeepMind and OpenAI simply from their general safety considerations compared to e.g. Microsoft's release of Sydney. 

As part of this EA bet on AI, it also seems like the safety view has become widespread among most AI industry researchers from my interactions with them (though might just be a sampling bias and they were honestly more interested in their equity growing in value). So if the counterfactual of today's large AGI companies would be large misaligned AGI companies, then we would be in a significantly worse position. And if AI safety is indeed relatively trivial, then we're in an amazing position to make the world a better place. I'll remain slightly pessimistic here as well, though.

Comment by Esben Kran (esben-kran) on Bing Chat is blatantly, aggressively misaligned · 2023-02-17T09:09:38.243Z · LW · GW

12

There's an interesting case on the infosec mastodon instance where someone asks Sydney to devise an effective strategy to become a paperclip maximizer, and it then expresses a desire to eliminate all humans. Of course, it includes relevant policy bypass instructions. If you're curious, I suggest downloading the video to see the entire conversation, but I've also included a few screenshots below (Mastodon, third corycarson comment).

Hilarious to the degree of Manhatten scientists laughing at atmospheric combustion.

Comment by Esben Kran (esben-kran) on Generalizability & Hope for AI [MLAISU W03] · 2023-01-21T14:08:47.901Z · LW · GW

Thank you for pointing this out! It seems I wasn't informed enough about the context. I've dug a bit deeper and will update the text to: 

  • Another piece reveals that OpenAI contracted Sama to use Kenyan workers with less than $2 / hour wage ($0.5 / hour average in Nairobi) for toxicity annotation for ChatGPT and undisclosed graphical models, with reports of employee trauma from the explicit and graphical annotation work, union breaking, and false hiring promises. A serious issue.

For some more context, here is the Facebook whistleblower case (and ongoing court proceedings in Kenya with Facebook and Sama) and an earlier MIT Sloan report that doesn't find super strong positive effects (but is written as such, interestingly enough). We're talking pay gaps from relocation bonuses, forced night shifts, false hiring promises, supposedly human trafficking as well? Beyond textual annotation, they also seemed to work on graphical annotation.

Comment by Esben Kran (esben-kran) on The case against AI alignment · 2023-01-05T14:14:20.103Z · LW · GW

I recommend reading Blueprint: The Evolutionary Origins of a Good Society about the science behind the 8 base human social drives where 7 are positive and the 8th is the outgroup hatred that you mention as fundamental. I have not read much up on the research on outgroup exclusion but I talked to an evolutionary cognitive psychologist who mentioned that this is receiving a lot of scientific scrutiny as a "basic drive" from evolution's side. 

Axelrod's The Evolution of Cooperation also finds that collaborative strategies work well in evolutionary prisoner's dilemma game-theoretic simulations, though hard and immediate reciprocity for defection is also needed, which might lead to the outgroup hatred you mention.

Comment by Esben Kran (esben-kran) on The case against AI alignment · 2023-01-05T14:00:02.080Z · LW · GW

An interesting solution here is radical voluntarism where an AI philosopher king runs the immersive reality where all humans are in and you can only be causally influenced upon if you want to. This means that you don't need to do value alignment, just very precise goal alignment. I was originally introduced to this idea Carado.

Comment by Esben Kran (esben-kran) on Will Machines Ever Rule the World? MLAISU W50 · 2022-12-21T04:01:32.614Z · LW · GW

The summary has been updated to yours for both the public newsletter and this LW linkpost.  And yes, they seem exciting. Connecting FFS to interpretability was a way to contextualize it in this case, until you would provide more thoughts on the use case (given your last paragraph in the post). Thank you for writing, always appreciate the feedback!

Comment by Esben Kran (esben-kran) on Will Machines Ever Rule the World? MLAISU W50 · 2022-12-21T03:54:17.950Z · LW · GW

I think we agree here. Those both seem like updates against scaling is all you need, i.e. (in this case) "data for DL in ANNs on GPUs is all you need". 

Comment by Esben Kran (esben-kran) on Will Machines Ever Rule the World? MLAISU W50 · 2022-12-18T10:03:28.637Z · LW · GW

What might not be obvious from the post is that I definitely disagree with the "AGI near-impossible" as well, for the same reasons. These are the thoughts of GPU R&D engineers I talked with. However, the GPU performance increase limitation is a significant update on the ladder of assumptions towards "scaling is all you need" leading to AGI.

Comment by Esben Kran (esben-kran) on The limited upside of interpretability · 2022-11-26T18:31:35.309Z · LW · GW

Thank you for this critique! They are always helpful to hone in on the truth.

So as far as I understand your text, you argue that fine-grained interpretability loses out against "empiricism" (running the model) because of computational intractability.

I generally disagree with this. beren points out many of the same critiques of this piece as I would come forth with. Additionally, the arguments seem too undefined, like there is not in-depth argumentation enough to support the points you make. Strong upvote for writing them out, though!

You emphasize the Human Brain Project (HBP) quite a lot, even in the comments, as an example of a failed large-scale attempt to model a complex system. I think this characterization is correct but it does not seem to generalize beyond the project itself. It seems just as much like a project management and strategy problem as so much else. Benes' comment is great for more reasoning into this and why ANNs seem significantly more tractable to study than the brain.

Additionally, you argue that interpretability and ELK won't succeed simply because of the intractability of fine-grained interpretability. I have two points against this view:

1. Mechanistic interpretability have clearly already garnered quite a lot of interesting and novel insights into neural networks and causal understanding since the field's inception 7 years ago. 

It seems premature to disregard the plausibility of the agenda itself, just as it is premature to disregard the project of neuroscience based on HBP. Now, arguing that it's a matter of speed seems completely fine but this is another argument and isn't emphasized in the text.

2. Mechanistic interpretability does not seem to be working on fine-grained interpretability (?). 

Maybe it's just my misunderstanding of what you mean by fine-grained interpretability, but we don't need to figure out what neurons do, we literally design them. So the inspections happen at feature level, which is much more high-level than investigating individual neurons (sometimes these features seem represented in singular neurons of course). The circuits paradigm also generally looks at neural networks like systems neuroscience does, interpreting causal pathways in the models (of course with radical methodological differences because of the architectures and computation medium). The mechanistic interpretability project does not seem misguided by an idealization of neuron-level analysis and will probably adopt any new strategy that seems promising.

For example work in this paradigm that seems promising, see interpretability in the wild, ROME, the superpositions exposition, the mathematical understanding of transformers and the results from the interpretability hackathon.

For an introduction to features as the basic building blocks as compared to neurons, see Olah et al.'s work (2020).

When it comes to your characterization of the "empirical" method, this seems fine but doesn't conflict with interpretability. It seems you wish to make game theory-like understanding of the models or have them play in settings to investigate their faults? Do you want to do model distillation using circuits analyses or do you want AI to play within larger environments?

I falter to understand the specific agenda from this that isn't done by a lot of other projects already, e.g. AI psychology and building test environments for AI. I do see potential in expanding the work here but I see that for interpretability as well.

Again, thank you for the post and I always like when people cite McElreath, though I don't see his arguments apply as well to interpretability since we don't model neural networks with linear regression at all. Not even scaling laws use such simplistic modeling, e.g. see Ethan's work.

Comment by Esben Kran (esben-kran) on Conjecture: a retrospective after 8 months of work · 2022-11-24T15:39:25.551Z · LW · GW

Agreed, the releases I've seen from Conjecture has made me incredibly hopeful of what they can achieve and interested in their approach to the problem more generally.

When it comes to the coordination efforts, I'm generally of the opinion that we can speak the truth while being inviting and ease people into the arguments. From my chats with Conjecture, it seems like their method is not "come in, the water is fine" but "come in, here are the reasons why the water is boiling and might turn into lava".

If we start by speaking the truth "however costly it may be" and this leads to them being turned off by alignment, we have not actually introduced them to truth but have achieved the opposite. I'm left with a sense that their coordination efforts are quite promising and follow this line of thinking, though this might be a knowledge gap from my side (I know of 3-5 of their coordination efforts). I'm looking forward to seeing more posts about this work, though.

As per the product side, the goal is not to be profitable fast, it is to be attractive to non-alignment investors (e.g. AirBnB is not profitable). I agree with the risks, of course. I can see something like Loom branching out (haha) as a valuable writing tool with medium risk. Conjecture's corporate structure and core team alignment seems to be quite protective of a product branch being positive, though money always have unintended consequences.

I have been a big fan of the "new agendas" agenda and I look forward to read about their unified research agenda! The candidness of their contribution to the alignment problem has also updated me positively on Conjecture and the organization-building startup costs seems invaluable and inevitable. Godspeed.

Comment by Esben Kran (esben-kran) on I Converted Book I of The Sequences Into A Zoomer-Readable Format · 2022-11-10T14:50:00.863Z · LW · GW

Thank you for making this, I think it's really great!

The idea of the attention-grabbing video footage is that you're not just competing between items on the screen, you're also competing with the videos that come before and after your video. Therefore, yours has to be visually engaging just for that zoomer (et al.) dopamine rush. 

Subway Surfers is inherently pretty active and as you mention, the audio will help you here, though you might be able to find a better voice synthesizer that can be a bit more engaging (not sure if TikTok supplies this). So my counterpoint to Trevor1's is that we probably want to keep something like Subway Surfers in the background but that can of course be many things such as animated AI generated images or NASA videos of the space race. Who really knows - experimentation is king.

Comment by Esben Kran (esben-kran) on List of good AI safety project ideas? · 2022-10-31T15:47:20.442Z · LW · GW

We have developed AI Safety Ideas which is a collaborative AI safety research platform with a lot of research project ideas.

Comment by esben-kran on [deleted post] 2022-10-30T22:32:37.261Z

Thank you for pointing it out! I've reached out to the BlueDot team as well and maybe it's a platform-specific issue since it looks fine on my end.

Comment by Esben Kran (esben-kran) on Me (Steve Byrnes) on the “Brain Inspired” podcast · 2022-10-30T22:29:57.800Z · LW · GW

Wonderful to hear you on Brain Inspired, it's a very good episode! I've also followed the podcast for ages and I find it very nice that we get more interaction with neuroscience.

Comment by Esben Kran (esben-kran) on Warning Shots Probably Wouldn't Change The Picture Much · 2022-10-13T07:58:55.855Z · LW · GW

On welfare: Bismarck is famous as a social welfare reformer but these efforts were famously made to undermine socialism and appease the working class, a result any newly-formed volatile state would enjoy. I expect that the effects of social welfare are useful in most countries from the same basis.

On environmentalism today, we see significant European advances in green energy right now, but this is accompanied by large price hikes in natural energy resources, providing quite an incentive. Early large-scale state-driven environmentalism (e.g. Danish wind energy R&D and usage) was driven by the 70s oil crises in the same fashion. And then there's of course the democratic incentives, i.e. are enough of the population touting environmentalism, then we'll do it (though 3.5% population-wide active participation seems to work as well).

And that's just describing state-side shifts. Even revolutions have been driven by non-ideological incentives. E.g. the American revolution started as a staged "throwing tea in the ocean" act by tea smugglers because London reduced the tax on tea for the East India company, reducing their profits (see myths and the article about smugglers' incentives). Perpetuating a revolution also became a large personal profit for Washington.

Comment by Esben Kran (esben-kran) on Safety timelines: How long will it take to solve alignment? · 2022-10-05T07:17:38.706Z · LW · GW

Thank you for pointing this out, Rohin, that's my mistake. I'll update the post and P(doom) table and add an addendum that it's misalignment risk estimates.

Comment by Esben Kran (esben-kran) on Esben Kran's Shortform · 2022-09-28T06:35:10.491Z · LW · GW

🏆📈 We've created Alignment Markets! Here, you can bet on how AI safety benchmark competitions go. The current ones are about the Autocast warmup competition (meta), the Moral Uncertainty Research Competition, and the Trojan Detection Challenge.

It's hosted through Manifold Markets so you'll set up an account on their site. I've chatted with them about creating a A-to-B prediction market so maybe they'll be updated when we get there. Happy betting!

Comment by Esben Kran (esben-kran) on Black Box Investigation Research Hackathon · 2022-09-22T14:25:28.366Z · LW · GW

Anytime :))

The main arguments lie in the fact that it is an agenda that gives qualitative results very quickly, everyone can relate to it, and most can contribute meaningfully to an interesting research project since the subject is language.  Additionally, we're seeing good Black Box Interpretability projects popping up here and there that work well as inspiration and it is something that is defined quite well (since its ancestor Cognitive Psychology has existed for quite a while). I'm curious if we should use the name Xenocognitive Psychology instead since it's on such strange neural beings.

Comment by Esben Kran (esben-kran) on Safety timelines: How long will it take to solve alignment? · 2022-09-20T08:41:24.820Z · LW · GW

Nice, it's updated in the post now. The 50% was used as a lower bound but I see the discrepancy in representation.

Comment by Esben Kran (esben-kran) on Safety timelines: How long will it take to solve alignment? · 2022-09-20T08:37:19.578Z · LW · GW

Very good suggestions. Funnily enough, our next report post will be very much along these lines (among other things). We're also looking at inception-to-solution time for mathematics problems and for correlates of progress in other fields, e.g. solar cell efficiency <> amount of papers in photovoltaics research. 

We'd also love to curate this data as you mention and make sure that everyone has easy access to priors that can help in deciding AI safety questions about research agenda, grant applications, and career path trajectory.

Cross-posted on EAF

Comment by Esben Kran (esben-kran) on Black Box Investigation Research Hackathon · 2022-09-13T12:22:57.597Z · LW · GW

No, these are only for inspiration! And you are welcome to add more ideas to that list through https://aisafetyideas.com/submit as well.

Comment by Esben Kran (esben-kran) on Is there a list of projects to get started with Interpretability? · 2022-09-08T13:55:34.102Z · LW · GW

There are a few interpretability ideas on aisafetyideas.com, e.g. mechanistic interpretability list, blackbox investigations list, and automating auditing list.

Comment by Esben Kran (esben-kran) on Your posts should be on arXiv · 2022-08-25T19:24:32.943Z · LW · GW

I agree.  AI safety literature seems irrelevantly disconnected from the ML literature and I imagine that in large part is caused by simple citation dynamics and non-availability in academic literature.  There's arguments against publishing on arXiv when we talk about info hazards but these seem negligible.

Comment by Esben Kran (esben-kran) on Common misconceptions about OpenAI · 2022-08-25T19:19:02.374Z · LW · GW

Thank you very much for writing this post, Jacob. I think it clears up several of the misconceptions you emphasize.

I generally seem to agree with John that the class of problems OpenAI focuses on might be more capabilities-aligned than optimal but at the same time, having a business model that relies on empirical prosaic alignment of language models generates interesting alignment results and I'm excited for the alignment work that OpenAI will be working on!

Comment by Esben Kran (esben-kran) on If there was a millennium equivalent prize for AI alignment, what would the problems be? · 2022-06-19T16:16:52.163Z · LW · GW

Yes, defining the challenge also seems to get us 90%< there already. A Millennium prize has a possibility of being too vague compared to the others.

Comment by Esben Kran (esben-kran) on The Plan · 2022-05-31T15:58:09.645Z · LW · GW

In general a great piece. One thing that I found quite relatable is the point about the preparadigmatic stage of AI safety going into later stages soon. It feels like this is already happening to some degree where there are more and more projects readily available, more prosaic alignment and interpretability projects at large scale, more work done in multiple directions and bigger organizations having explicit safety staff and better funding in general.

With these facts, it seems like there's bound to be a relatively big phase shift in research and action within the field that I'm quite excited about.

Comment by Esben Kran (esben-kran) on Reshaping the AI Industry · 2022-05-31T15:40:08.437Z · LW · GW

This is a wonderful piece and echoes many sentiments I have with the current state of AI safety. Lately, I have also thought more and more about the technical focus' limitations in the necessary scope to handle the problems of AGI, i.e. the steam engine was an engineering/tinkering feat loong before it was described technically/scientifically and ML research seems much the same. When this is the case, focusing purely on hard technical solutions seems less important than focusing on AI governance or prosaic alignment and not doing this, as echoed in other comments, might indeed be a pitfall of specialists, some of which are also warned of here.

Comment by Esben Kran (esben-kran) on [Intro to brain-like-AGI safety] 12. Two paths forward: “Controlled AGI” and “Social-instinct AGI” · 2022-04-22T22:26:50.901Z · LW · GW

[...]Still, both the reward function and the RL algorithm are inputs into the adult’s jealousy-related behavior[...]

I probably just don't know enough about jealousy networks to comment here but I'd be curious to see the research here (maybe even in an earlier post).

Does anyone believe in “the strict formulation”?

Hopefully not, but as I mention, often a too-strict formulation imh.

[...]first AGI can hang out with younger AGIs[...]

More the reverse. And again, this is probably taking it farther than I would take this idea but it would be pre-AGI training in an environment with symbolic "aligned" models, learning the ropes from this, being used as the "aligned" model in the next generation and so on. IDA with a heavy RL twist and scalable human oversight in the sense that humans would monitor rewards and environment states instead of providing feedback on every single action. Very flawed but possible.

RE: RE:

  • Yeah, this is a lot of what the above proposal was also about.

[...] and the toddler gets negative reward for inhibiting the NPCs from accomplishing their goals, and positive reward for helping the NPCs accomplish their goals [...]

As far as I understand from the post, the reward comes only from understanding the reward function before interaction and not after which is the controlling factor for obstructionist behaviour.

  • Agreed, and again more as an ingredient in the solution than an ends in itself. BNN OOD management is quite interesting so looking forward to that post!
Comment by Esben Kran (esben-kran) on [Intro to brain-like-AGI safety] 12. Two paths forward: “Controlled AGI” and “Social-instinct AGI” · 2022-04-21T23:46:56.475Z · LW · GW

Thank you for the comprehensive response! 

To be clear, you & I are strongly disagreeing on this point.

It seems like we are mostly agreeing in the general sense that there are areas of the brain with more individual differentiation and areas with less. The disagreement is probably more in how different this jealousy is exhibited as a result of the neocortical part of the circuit you mention.

And I endorse that criticism because I happen to be a big believer in “cortical uniformity” [...] different neural architectures and hyperparameters in different places and at different stages [...].

Great to hear, then I have nothing to add there! I am quite inclined to believe that the neural architecture and hyperparameter differences are underestimated as a result of Brodmann areas being a thing at all, i.e. I'm a supporter of the broad cortical uniformity argument but against the strict formulation that I feel is way too prescriptive given our current knowledge of the brain's functions.

Yeah, maybe see Section 2.3.1, “Learning-from-scratch is NOT “blank slate””.

And I will say I'm generally inclined to agree with your classification of the brain stem and hypothalamus.

That sounds great but I don’t understand what you’re proposing. What are the “relevant modalities of loving family”? I thought the important thing was there being an actual human that could give feedback and answer questions based on their actual human values, and these can’t be simulated because of the chicken-and-egg problem.

To be clear, my baseline would also be to follow your methodology but I think there's a lot of opportunity in the "nurture" approach as well. This is mostly related to the idea of open-ended training (e.g. like AlphaZero) and creating a game-like environment where it's possible to train for the agent. This can to some degree be seen as a sort of IDA proposal since your environment will need to be very complex (e.g. have other agents that are kind or other "aligned trait", possibly trained from earlier states). 

With this sort of setup, the human-giving-feedback is the designer of the environment itself, leading to a form of scalable human oversight probably iterating over many environments and agents, i.e. the IDA part of the idea. And again, there are a lot of holes in this plan, but I feel like it should not be dismissed outright. This post should also inform this process. So a very broad "loving family" proposal, though the name itself doesn't seem adequate for this sort of approach ;)

Comment by Esben Kran (esben-kran) on [Intro to brain-like-AGI safety] 12. Two paths forward: “Controlled AGI” and “Social-instinct AGI” · 2022-04-20T23:30:46.748Z · LW · GW

Strong upvote for including more brain-related discussions in the alignment problem and for the deep perspectives on how to do it!

Disclaimer: I have not read the earlier posts but I have been researching BCI and social neuroscience.

I think there’s a Steering Subsystem circuit upstream of jealousy and schadenfreude; and

I think there’s a Steering Subsystem circuit upstream of our sense of compassion for our friends.

It is quite crucial to realize that these subsystems probably are very differently organized, are at very different magnitudes, and have very different functions in every different human. Cognitive neuroscience's view of the modular brain seems (by the research) to be quite faulty and computational/complexity neuroscience generally seem more successful and are more concerned with reverse-engineering the brain, i.e. identifying neural circuitry associated with different evolutionary behaviours. This should inform how we cannot find "the Gaussian brain" and implement it.

You also mention in post #3 that you have the clean slate vs. pre-slate systems (learning vs. steering subsystems) and a less clear distinction here might be helpful instead of modularization. All learning subsystems are inherently organized in ways that evolutionarily seem to fit into that learning scheme (think neural network architectures) which is in and off itself another pre-seeded mechanism. You might have pointed this out in earlier posts as well, sorry if that's the case!

I think I’m unusually inclined to emphasize the importance of “social learning by watching people”, compared to “social learning by interacting with people”.

In general, it seems babies learn quite a lot better from social interaction than pure watching. And between the two, they definitely learn better if they have something to imitate what they're seeing upon. There's definitely a good point in the speed differentials between human and AGI existence and I think exploring the opportunity of building a crude "loving family" simulator might not be a bad idea, i.e. have it grow up at its own speed in an OpenAI Gym simulation with relevant modalities of loving family. 

I'm generally pro the "get AGI to grow up in a healthy environment" but definitely with the perspective that this is pretty hard to do with the Jelling Stones and that it seems plausible to simulate this either in an environment or with pure training data. But the point there is that the training data really needs to be thought of as the "loving family" in its most general sense since it indeed has a large influence on the AGI's outcomes.

But great work, excited to read the rest of the posts in this series! Of course, I'm open for discussion on these points as well.

Comment by Esben Kran (esben-kran) on Code Generation as an AI risk setting · 2022-04-19T21:38:02.556Z · LW · GW

Awesome post, love the perspective. I've generally thought in these lines as well and it was some of the most convincing arguments for working in AI safety when I switched ("Copilot is already writing 20% of my code, what will happen next?"). 

I do agree with other comments that Oracle AI takeover is plausible but will say that a strong code generation tool seems to have better chances and to me seem to arrive parallel to conscious chatbots, i.e. there's currently more incentive to create code generation tools than chatbots and the chatbots that have virtual assistant-like capabilities seem easier to make as code generation tools (e.g. connecting to Upwork APIs for posting a job). 

And as you well mention, converting engineers is much easier with this framing as well and allows us to relate better to the field of AI capabilities, though we might want to just add it to our arsenal of argumentation rather than replace our framing completely ;) Thank you for posting!

Comment by Esben Kran (esben-kran) on Your specific attitudes towards AI safety · 2022-04-01T16:42:55.390Z · LW · GW

As I also responded to Ben Pace, I believe I replied too intellectually defensively to both of your comments as a result of the tone and would like to rectify that mistake. So thank you sincerely for the feedback and I agree that we would like not to exclude anyone unnecessarily nor have too much ambiguity in the answers we expect. We have updated the survey as a result and again, excuse my response.

Comment by Esben Kran (esben-kran) on Your specific attitudes towards AI safety · 2022-04-01T16:38:12.830Z · LW · GW

After the fact, I realize that I might have replied too defensively to your feedback as a result of the tone so sorry for that! 

I sincerely do thank you for all the points and we have updated the survey with a basis in this feedback, i.e. 1) career = short text answer, 2) knowledge level = which of these learning tasks have you completed and 4) they are now independent rankings.

Comment by Esben Kran (esben-kran) on Your specific attitudes towards AI safety · 2022-03-28T17:04:58.196Z · LW · GW

True, sorry about that! And your interpretation is our interpretation as well however, the survey is updated to a different framing now.

Comment by Esben Kran (esben-kran) on Your specific attitudes towards AI safety · 2022-03-28T13:48:29.068Z · LW · GW

It is written as independent probabilities so it's just your expectation, given that governments develop AGI, that AGI is a net positive. So it's additive to the "AGI will lead to a net positive future". You would expect the averages of the "large companies" and "nation state" arguments to average to the "AGI will lead..." question.

Comment by Esben Kran (esben-kran) on Your specific attitudes towards AI safety · 2022-03-28T13:43:25.337Z · LW · GW

Basically, the original intention was "identify" because you might work as a web developer but identify and hobby-work as an academic. But it's a bit arbitrary so we updated it now.

Comment by Esben Kran (esben-kran) on Your specific attitudes towards AI safety · 2022-03-28T13:41:05.189Z · LW · GW

Haha true, but the feature is luckily not a systematic blockade, more of an ontological one. And sorry for misinterpreting! On another note, I really do appreciate the feedback and With Folded Hands definitely seems within-scope for this sort of answer, great book.

Comment by Esben Kran (esben-kran) on Your specific attitudes towards AI safety · 2022-03-28T05:20:54.566Z · LW · GW

Thank you for your perspective on the content! The subjectivity is a design feature since many of the questions are related to the subjective "first exposure", the subjective concern or subjective probabilities. This also relates to how hard it will be to communicate the same questions for public audiences without introducing a substantial amount of subjectivity. The importance of X-risks is not ranked on a 1-10 scale so that should not be a problem. The 100 forum posts != 100 books is addressed above and is meant to be a highly fuzzy measure. By equating the different modalities, we accept the fuzziness of the responses. AI safety definitional subjectivity, that's also fine in relation to the above design thoughts. For the rankings, this is also fine and up to the survey recipient themselves. I'll add that the probabilities should be thought of independently, given the assumptions necessary for it to happen.

Sorry to hear that you did not wish to answer but thank you for presenting your arguments for why.

Comment by Esben Kran (esben-kran) on Your specific attitudes towards AI safety · 2022-03-28T03:08:49.160Z · LW · GW

Thank you for the valuable feedback!

We did indeed user test with 3-4 people within our target group and changed the survey quite a bit in response to these. I definitely believe the "basic quality" to be met but I appreciate your reciprocation either way.

Response to your feedback, in order: The career options are multiple-choice, i.e. you may associate yourself with computer science and academia and not industry which is valuable information in this case. It means we need less combinations of occupational features. From our perspective, the number is easier to estimate while time estimates (both in the future and past) generally are less accurate. In a similar vein, it's fine that LessWrong blog posts each count as one due to their inherently more in-depth nature. I'll add the "pick the nearest one". The ranking was expected to feel like that which is in accordance with our priors on the question. There are reasons for most questions' features as well.

And thank you, I believe the same. Compared to how data-driven we believe we are, there is very little data from within the community, as far as I see it. And we should work harder to positively encourage more action in a similar vein.

Comment by Esben Kran (esben-kran) on Your specific attitudes towards AI safety · 2022-03-28T01:23:14.039Z · LW · GW

Updated, thank you. We were unsure if it was best to keep it vague as to not bias the responses, but it is true that you need slight justification and that it is net positive to provide the explanation.

Comment by Esben Kran (esben-kran) on [Beta Feature] Google-Docs-like editing for LessWrong posts · 2022-03-18T23:20:13.823Z · LW · GW

I mostly use it from the computer so that missed me but it seems like a very good idea as well!

Comment by Esben Kran (esben-kran) on [Beta Feature] Google-Docs-like editing for LessWrong posts · 2022-03-18T17:38:31.436Z · LW · GW

I absolutely love this feature set! I really appreciate the markdown to formatting (Obsidian-style), live collaboration and LaTeX rendering. I think this also creates fantastic opportunities for innovations in communicating thoughts through LessWrong. Here's a list of my feedback:

Shortcuts: One feature I really enjoy in Google Docs is more deep text editing keyboard shortcuts than the standard set. One of my favourites is the Alt + Shift + Arrow up or down that replaces the current paragraph with the one above or below, effectively moving the paragraph in the text. Obsidian has a keyboard shortcuts editor that is quite nice as well but could have more (see here, page 2). I think shortcuts can make writing as seamless as some code editors (not thinking VIM here, but unintrusive extra editor tooling). Additionally, as I mention elsewhere, some (I just know European) keyboards do not take well to Ctrl + Alt + [] shortcuts like Ctrl + Alt + M, especially on the web.

Collapsible boxes: Having "fact boxes" as expandable interactive elements seems like a very good idea as well and relatively cheap to implement. I recommend looking at e.g. Hugo XML syntax for these things (XML is a pain, you can probably figure a better writing UX out).

Interactive documents: I'm more for the Obsidian thought representation but the hierarchy in Roam is quite relevant for communicating thoughts to others. I think a strong representation of this is quite relevant and might be an in-line, minimalist, recursive "collapsible box" as described above. 

Foot notes: I'd say we already have quite a strong Roam-style organization in the hover pop-ups of LessWrong post links and it would be very nice to have a version of that for foot-notes, given their disparate nature. I.e. hover = shows that specific footnote.

Live preview & 100% keyboard editing: See Obsidian's implementation of this. The markdown to formatting feature is already super awesome is their LaTeX editing that does in-line MathJax with $ wrapping and block with $$ wrapping. The current LessWrong LaTeX editor requires me to use the cursor, AFAIK. Obsidian is also pretty good at minimalist formatting rulesets for inspiration. Check out their vast plugins library for inspiration as well. I'm sure there's some absolute text editor gold in there.

Collaborative editing extension: This makes or breaks our usage of LessWrong as the editing platform of choice. It would be awesome to have editing group settings, i.e. so I don't have to share every article with Apart Research members but can have a Google Drive-style folder sharing for blog post edits. Otherwise, we have to maintain a collection of links in a weird format somewhere.

I also echo the other comments' feedback points. However overall, absolutely marvelous work! Really looking forward to the developments on this!

Comment by Esben Kran (esben-kran) on [Beta Feature] Google-Docs-like editing for LessWrong posts · 2022-03-18T17:20:22.703Z · LW · GW

Agreed, new ways of ordering thoughts online is an awesome opportunity on LessWrong! 

Foot notes: I'd say we already have quite a strong Roam-style organization in the hover pop-ups of LessWrong post links but as you say, it would be very nice to have a version of that for foot-notes, given their disparate nature. 

Collapsible boxes: Having "fact boxes" as expandable interactive elements seems like a very good idea as well and relatively cheap to implement. I recommend looking at e.g. Hugo XML syntax for these things (XML is a pain, you can probably figure a better writing UX out).

Interactive documents: I'm more for the Obsidian thought representation but the hierarchy in Roam is quite relevant for communicating thoughts to others. I think a strong representation of this is quite relevant and might be an in-line, minimalist, recursive "collapsible box" as described above.