jacquesthibs's Shortform

post by jacquesthibs (jacques-thibodeau) · 2022-11-21T12:04:07.896Z · LW · GW · 154 comments

Comments sorted by top scores.

comment by jacquesthibs (jacques-thibodeau) · 2024-01-24T11:48:23.491Z · LW(p) · GW(p)

I thought this series of comments from a former DeepMind employee (who worked on Gemini) was insightful, so I figured I should share.

From my experience doing early RLHF work for Gemini, larger models exploit the reward model more. You need to constantly keep collecting more preferences and retraining reward models to make it not exploitable. Otherwise you get nonsensical responses which have exploited the idiosyncrasies of your preference data. There is a reason few labs have done RLHF successfully.

It's also known that more capable models exploit loopholes in reward functions better. Imo, it's a pretty intuitive idea that more capable RL agents will find larger rewards. But there's evidence from papers like this as well: https://arxiv.org/abs/2201.03544

To be clear, I don't think the current paradigm as-is is dangerous. I'm stating the obvious because this platform has gone a bit bonkers.

The danger comes from finetuning LLMs to become AutoGPTs which have memory, actions, and maximize rewards, and are deployed autonomously. Widespread proliferation of GPT-4+ models will almost certainly make lots of these agents which will cause a lot of damage and potentially cause something indistinguishable from extinction.

These agents will be very hard to align. Trading off their reward objective with your "be nice" objective won't work. They will simply find the loopholes of your "be nice" objective and get that nice fat hard reward instead.

We're currently in the extreme left-side of AutoGPT exponential scaling (it basically doesn't work now), so it's hard to study whether more capable models are harder or easier to align.

Other comments from that thread:

My guess is where your intuitive alignment strategy ("be nice") breaks down for AI is that unlike humans, AI is highly mutable. It's very hard to change a human's sociopathy factor. But for AI, even if *you* did find a nice set of hyperparameters that trades off friendliness and goal-seeking behavior well, it's very easy to take that, and tune up the knobs to make something dangerous. Misusing the tech is as easy or easier than not. This is why many put this in the same bucket as nuclear.

US visits Afghanistan, teaches them how to make power using Nuclear tech, next month, they have nukes pointing at Iran.

And:

In contexts where harms will be visible easily and in short timelines, we’ll take them offline and retrain.

Many applications will be much more autonomous, difficult to monitor or even understand, and potentially fully closed-loop, i.e., the agent has a complex enough action space that it can copy itself, buy compute, run itself, etc.

I know it sounds scifi. But we’re living in scifi times. These things have a knack of becoming true sooner than we think.

No ghosts in the matrices assumed here. Just intelligence starting from a very good base model optimizing reward.
 

There are more comments he made in that thread that I found insightful, so go have a look if interested.

Replies from: leogao
comment by leogao · 2024-01-28T04:51:26.361Z · LW(p) · GW(p)

"larger models exploit the RM more" is in contradiction with what i observed in the RM overoptimization paper. i'd be interested in more analysis of this

Replies from: Algon
comment by Algon · 2024-02-13T13:02:39.154Z · LW(p) · GW(p)

In that paper did you guys take a good long look at the output of various sized models throughout training? In addition to looking at the graphs of gold-standard/proxy reward model ratings against KL-divergence. If not, then maybe that's the discrepancy: perhaps Sherjil was communicating with the LLM and thinking "this is not what we wanted". 
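To make the kind of curve being discussed concrete, here is a toy sketch of proxy versus gold reward under best-of-n sampling. It is not the setup from the RM overoptimization paper or from the Gemini work: the Gaussian gold reward, the additive proxy error, and the names `simulate` and `best_of_n_kl` are all assumptions made for illustration. The one borrowed fact is the standard analytic KL for best-of-n, log n − (n − 1)/n. The sketch only shows that the selected sample's proxy score increasingly overstates its gold score as optimization pressure grows; it says nothing about how that gap depends on model size, which is the point under dispute.

```python
import math
import random


def best_of_n_kl(n: int) -> float:
    """Analytic KL(best-of-n policy || base policy) in nats: log n - (n - 1) / n."""
    return math.log(n) - (n - 1) / n


def simulate(ns, num_prompts=2000, noise=1.0, seed=0):
    """For each n, pick the sample with the highest proxy score among the first n
    samples, and return the mean (proxy, gold) score of the picks."""
    rng = random.Random(seed)
    max_n = max(ns)
    totals = {n: [0.0, 0.0] for n in ns}
    for _ in range(num_prompts):
        # Hypothetical setup: gold reward is standard normal, and the proxy
        # reward model sees it through independent Gaussian error.
        gold = [rng.gauss(0.0, 1.0) for _ in range(max_n)]
        proxy = [g + rng.gauss(0.0, noise) for g in gold]
        for n in ns:
            pick = max(range(n), key=lambda i: proxy[i])
            totals[n][0] += proxy[pick]
            totals[n][1] += gold[pick]
    return {n: (p / num_prompts, g / num_prompts) for n, (p, g) in totals.items()}


if __name__ == "__main__":
    for n, (proxy_mean, gold_mean) in simulate([1, 4, 16, 64]).items():
        print(f"n={n:3d}  KL={best_of_n_kl(n):.2f}  proxy={proxy_mean:.2f}  gold={gold_mean:.2f}")
```

Looking at the actual samples behind each point on such a curve, rather than only the averaged scores, is the kind of inspection the question above is asking about.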

comment by jacquesthibs (jacques-thibodeau) · 2023-11-14T19:14:21.708Z · LW(p) · GW(p)

If you work at a social media website or YouTube (or know anyone who does), please read the text below:

Community Notes is one of the best features to come out on social media apps in a long time. The code is even open source. Why haven't other social media websites picked it up yet? If they care about truth, this would be a considerable step forward. Labels like “this video is funded by x nation” or “this video talks about health info; go here to learn more” are simply not good enough.

If you work at companies like YouTube or know someone who does, let's figure out who we need to talk to to make it happen. Naïvely, you could spend a weekend DMing a bunch of employees (PMs, engineers) at various social media websites in order to persuade them that this is worth their time and probably the biggest impact they could have in their entire career.

If you have any connections, let me know. We can also set up a doc of messages to send in order to come up with a persuasive DM.

Replies from: jacques-thibodeau, Viliam, jacques-thibodeau, ChristianKl, jacques-thibodeau, bruce-lewis
comment by jacquesthibs (jacques-thibodeau) · 2023-11-14T19:52:25.042Z · LW(p) · GW(p)

Don't forget that we train language models on the internet! The more truthful your dataset is, the more truthful the models will be! Let's revamp the internet for truthfulness, and we'll subsequently improve truthfulness in our AI systems!!

comment by Viliam · 2023-11-15T08:48:30.061Z · LW(p) · GW(p)

I don't use Xitter; is there a way to display e.g. top 100 tweets with community notes? To see how it works in practice.

Replies from: Yoav Ravid, jacques-thibodeau
comment by Yoav Ravid · 2023-11-15T16:35:05.833Z · LW(p) · GW(p)

I don't know of something that does so at random, but this page automatically shares posts with community notes that have been deemed helpful.

Replies from: jacques-thibodeau
comment by jacquesthibs (jacques-thibodeau) · 2023-11-15T16:37:20.174Z · LW(p) · GW(p)

Oh, that’s great, thanks! Also reminded me of (the less official, more comedy-based) “Community Notes Violating People”. @Viliam [LW · GW]

Replies from: Viliam
comment by Viliam · 2023-11-16T07:57:54.275Z · LW(p) · GW(p)

Thank you both! This is perfect. It's like a rational version of Twitter, and I didn't expect to use those words in the same sentence.

comment by jacquesthibs (jacques-thibodeau) · 2023-11-15T13:16:28.093Z · LW(p) · GW(p)

I don’t think so, unfortunately.

Replies from: Viliam
comment by Viliam · 2023-11-15T16:05:31.018Z · LW(p) · GW(p)

Found a nice example (linked from Zvi's article [LW · GW]).

Okay, it's just one example and it wasn't found randomly, but I am impressed.

comment by jacquesthibs (jacques-thibodeau) · 2023-11-15T04:09:15.811Z · LW(p) · GW(p)

I've also started working on a repo in order to make Community Notes more efficient by using LLMs.

comment by ChristianKl · 2023-11-14T20:05:07.105Z · LW(p) · GW(p)

Why haven't other social media websites picked it up yet? If they care about truth, this would be a considerable step forward.

This sounds a bit naive. 

There's a lot of energy invested in making it easier for powerful elites to push their preferred narratives. Community Notes are not in the interests of the Censorship Industrial Complex.

I don't think that anyone at the project manager level has the political power to add a feature like Community Notes. It would likely need to be someone higher up in the food chain. 

Replies from: jacques-thibodeau
comment by jacquesthibs (jacques-thibodeau) · 2023-11-14T20:11:45.389Z · LW(p) · GW(p)

Sure, but sometimes it's just a PM and a couple of other people that lead to a feature being implemented. Also, keep in mind that Community Notes was a thing before Musk. Why was Twitter different than other social media websites?

Also, the Community Notes code was apparently completely revamped by a few people working on the open-source code, which got it to a point where it was easy to implement, and everyone liked the feature because it noticeably worked.

Either way, I'd rather push for making it happen and somehow it fails on other websites than having pessimism and not trying at all. If it needs someone higher up the chain, let's make it happen.

Replies from: ChristianKl
comment by ChristianKl · 2023-11-14T20:43:46.658Z · LW(p) · GW(p)

Sure, but sometimes it's just a PM and a couple of other people that lead to a feature being implemented. Also, keep in mind that Community Notes was a thing before Musk. Why was Twitter different than other social media websites?

Twitter seems to have started Birdwatch as a small separate pilot project where it likely wasn't easy to fight or on anyone's radar to fight. 

In the current environment, where X gets seen as evil by a lot of the mainstream media, I would suspect that copying Community Notes from X would alone produce some resistance. The antibodies are now there in a way they weren't two years ago.

Also, the Community Notes code was apparently completely revamped by a few people working on the open-source code, which got it to a point where it was easy to implement, and everyone liked the feature because it noticeably worked.

If you look at mainstream media views about X's community notes, I don't think everyone likes it. 

I remember Elon once saying that he lost an 8-figure advertising deal because of Community Notes on posts of a company that wanted to advertise on X.

Either way, I'd rather push for making it happen and somehow it fails on other websites than having pessimism and not trying at all. If it needs someone higher up the chain, let's make it happen.

I think you would likely need to make a case that it's good business in addition to helping with truth. 

If you want to make your argument via truth, motivating some reporters to write favorable articles about Community Notes might be necessary. 

Replies from: jacques-thibodeau
comment by jacquesthibs (jacques-thibodeau) · 2023-11-14T20:49:20.543Z · LW(p) · GW(p)

Good points; I'll keep them all in mind. If money is the roadblock, we can put pressure on the companies to do this. Or, worst-case, maybe the government can enforce it (though that should be done with absolute care).

comment by jacquesthibs (jacques-thibodeau) · 2023-11-14T19:45:18.905Z · LW(p) · GW(p)

I shared a tweet about it here: https://x.com/JacquesThibs/status/1724492016254341208?s=20

Consider liking and retweeting it if you think this is impactful. I'd like it to get into the hands of the right people.

comment by Bruce Lewis (bruce-lewis) · 2023-11-14T19:31:05.897Z · LW(p) · GW(p)

I had not heard of Community Notes. Interesting anti-bias technique "notes require agreement between contributors who have sometimes disagreed in their past ratings". https://communitynotes.twitter.com/guide/en/about/introduction
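As a rough illustration of the quoted idea, here is a toy sketch of a "bridging" rule: a note is shown only if at least two of the raters who found it helpful have disagreed with each other on some earlier note. This is not the production Community Notes scoring algorithm, which is more involved than a pairwise check; the function names and the sample data are hypothetical.

```python
from itertools import combinations


def have_disagreed(history_a: dict, history_b: dict) -> bool:
    """True if the two raters gave opposite ratings to at least one shared past note."""
    shared = history_a.keys() & history_b.keys()
    return any(history_a[note] != history_b[note] for note in shared)


def note_is_shown(helpful_raters: list, histories: dict) -> bool:
    """Toy bridging rule: show the note only if at least one pair of raters who
    found it helpful have disagreed with each other in the past."""
    return any(
        have_disagreed(histories[a], histories[b])
        for a, b in combinations(helpful_raters, 2)
    )


# Hypothetical past ratings: rater -> {note_id: rated_helpful}
histories = {
    "alice": {"n1": True, "n2": False},
    "bob": {"n1": True, "n2": False},
    "carol": {"n1": False, "n2": True},
}

print(note_is_shown(["alice", "bob"], histories))    # False: they always agreed
print(note_is_shown(["alice", "carol"], histories))  # True: past disagreement
```

The point of the design is that agreement among like-minded raters carries little weight, while agreement across raters who usually split is treated as a stronger signal.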

Replies from: jacques-thibodeau
comment by jacquesthibs (jacques-thibodeau) · 2023-11-14T19:34:43.538Z · LW(p) · GW(p)

I've been on Twitter for a long time, and there's pretty much unanimous agreement that it works amazingly well in practice!

Replies from: kabir-kumar-1
comment by Kabir Kumar (kabir-kumar-1) · 2023-11-14T19:37:14.776Z · LW(p) · GW(p)

there is an issue with surface level insights being unfairly weighted, but this is solvable, imo. especially with youtube, which can see which commenters have watched the full video.

comment by jacquesthibs (jacques-thibodeau) · 2023-11-20T01:32:39.745Z · LW(p) · GW(p)

My current speculation as to what is happening at OpenAI

How do we know this wasn't their best opportunity to strike if Sam was indeed not being totally honest with the board?

Let's say the rumours are true, that Sam is building out external orgs (NVIDIA competitor and iPhone-like competitor) to escape the power of the board and potentially going against the charter. Would this 'conflict of interest' be enough? If you take that story forward, it sounds more and more like he was setting up AGI to be run by external companies, using OpenAI as a fundraising bargaining chip, and having a significant financial interest in plugging AGI into those outside orgs.

So, if we think about this strategically, how long should they wait as board members who are trying to uphold the charter?

On top of this, it seems (according to Sam) that OpenAI has made a significant transformer-level breakthrough recently, which implies a significant capability jump. Long-term reasoning? Basically, anything short of 'coming up with novel insights in physics' is on the table, given that Sam recently used that line as the line we need to cross to get to AGI.

So, it could be a mix of: Ilya thinking they have achieved AGI while Sam places a higher bar (internal communication disagreements) + the board not being alerted (maybe more than once) about what Sam is doing, e.g. fundraising for both OpenAI and the orgs he wants to connect AGI to + new board members who are more willing to let Sam and GDB do what they want being added soon (another rumour I've heard) + ???. Basically, perhaps they saw this as their final opportunity to have any veto on actions like this.

Here's what I currently believe:

  • There is a GPT-5-like model that already exists. It could be GPT-4.5 or something else, but another significant capability jump. Potentially even a system that can coherently pursue goals for months, capable of continual learning, and effectively able to automate like 10% of the workforce (if they wanted to).
  • As of 5 PM, Sunday PT, the board is in a terrible position where they either stay on board and the company employees all move to a new company, or they leave the board and bring Sam back. If they leave, they need to say that Sam did nothing wrong and sweep everything under the rug (and then potentially face legal action for saying he did something wrong); otherwise, Sam won't come back.
  • Sam is building companies externally; it is unclear if this goes against the charter. But he does now have a significant financial incentive to speed up AI development. Adam D'Angelo said that he would like to prevent OpenAI from becoming a big tech company as part of his time on the board because AGI was too important for humanity. They might have considered Sam's actions to be going in this direction.
  • A few people left the board in the past year. It's possible that Sam and GDB planned to add new people (possibly even change current board members) to the board to dilute the voting power a bit or at least refill board seats. This meant that the current board had limited time until their voting power would become less important. They might have felt rushed.
  • The board is either not speaking publicly because 1) they can't share information about GPT-5, 2) there is some legal reason that I don't understand (more likely), or 3) they are incompetent (least likely by far IMO).
  • We will possibly never find out what happened, or it will become clearer by the month as new things come out (companies and models). However, it seems possible the board will never say or admit anything publicly at this point.
  • Lastly, we still don't know why the board decided to fire Sam. It could be any of the reasons above, a mix or something we just don't know about.

Other possible things:

  • Ilya was mad that they wouldn't actually get enough compute for Superalignment as promised due to GPTs and other products using up all the GPUs.
  • Ilya is frustrated that Sam is focused on things like GPTs rather than the ultimate goal of AGI.
Replies from: jacques-thibodeau, jacques-thibodeau
comment by jacquesthibs (jacques-thibodeau) · 2023-11-22T23:41:11.779Z · LW(p) · GW(p)

Obviously, a lot has happened since the above shortform, but regarding model capabilities (discussion of which has died down these last couple of days), there's now this:

Source: https://www.reuters.com/technology/sam-altmans-ouster-openai-was-precipitated-by-letter-board-about-ai-breakthrough-2023-11-22/ 

Replies from: jacques-thibodeau
comment by jacquesthibs (jacques-thibodeau) · 2023-11-23T00:11:25.019Z · LW(p) · GW(p)

So, apparently, there are two models, but only Q* is mentioned in the article. Won't share the source, but:

comment by jacquesthibs (jacques-thibodeau) · 2023-11-20T04:33:10.799Z · LW(p) · GW(p)

Update, board members seem to be holding their ground more than expected in this tight situation:

comment by jacquesthibs (jacques-thibodeau) · 2023-11-03T00:12:10.380Z · LW(p) · GW(p)

Attempt to explain why I think AI systems are not the same thing as a library card when it comes to bio-risk.

To focus on less of an extreme example, I’ll be ignoring the case where AI can create new, more powerful pathogens faster than we can create defences, though I think this is an important case (some people just don’t find it plausible because it relies on the assumption that AIs are able to create new knowledge).

I think AI Safety people should make more of an effort to walk through the threat model, so I’ll give a quick first try:

1) Library. If I’m a terrorist and I want to build a bioweapon, I have to spend several months reading books at minimum to understand how it all works. I don’t have any experts on-hand to explain how to do it step-by-step. I have to figure out which books to read and in what sequence. I have to look up external sources to figure out where I can buy specific materials.

Then, I have to somehow find out how to gain access to those materials (this is the most difficult part for each case). Once I gain access to the materials, I still need to figure out how to make things work as a total noob at creating bioweapons. I will fail. Even experts fail. So, it will take many tries to get it right, and even then, there are tricks of the trade I’ll likely be unaware of no matter which books I read. Either it’s not in a book or it’s incredibly hard to find so you’ll basically never find it.

All this while needing a high enough degree of intelligence and competence.

2) AI agent system. You pull up your computer and ask for a synthesized step-by-step plan on how to cause the most death or ways to cripple your enemy. Many agents search through books and the internet while also using latent knowledge about the subject. It tells you everything you truly need to know in a concise 4-page document.

That covers the relevant theory, practical steps (laid out with images and videos on how to do it), what to buy and where/how to buy it, answers that pre-empt any questions you may have, and explanations of the jargon that nearly anyone can understand. The system can also take actions on the web to automatically buy all the supplies you need.

You can even share photos of the entire process to your AI as it continues to guide you through the creation of the weapon because it’s multi-modal.

You can basically outsource all cognition to the AI system, allowing you to be the lazy human you are (we all know that humans will take the path of least resistance or abandon something altogether if there is enough friction).

That topic you always said you wanted to know more about but never got around to it? No worries, your AI system has lowered the bar sufficiently that the task doesn’t seem as daunting anymore and laziness won’t be in the way of you making progress.

Conclusion: a future AI system will have the power of efficiency (significantly faster) and capability (able to make more powerful weapons than any one person could do on their own). It has the interactivity that Google and libraries don’t have. It’s just not the same as information scattered in different sources.

comment by jacquesthibs (jacques-thibodeau) · 2023-05-26T22:32:15.130Z · LW(p) · GW(p)

I recently sent in some grant proposals to continue working on my independent alignment research. It gives an overview of what I'd like to work on for this next year (and more really). If you want to have a look at the full doc, send me a DM. If you'd like to help out through funding or contributing to the projects, please let me know.

Here's the summary introduction:

12-month salary for building a language model system for accelerating alignment research and upskilling (additional funding will be used to create an organization), and studying how to supervise AIs that are improving AIs to ensure stable alignment.

Summary

  • Agenda 1: Build an Alignment Research Assistant using a suite of LLMs managing various parts of the research process. Aims to 10-100x productivity in AI alignment research. Could use additional funding to hire an engineer and builder, which could evolve into an AI Safety organization focused on this agenda. Recent talk giving a partial overview of the agenda.
  • Agenda 2: Supervising AIs Improving AIs [LW · GW] (through self-training or training other AIs). Publish a paper and create an automated pipeline for discovering noteworthy changes in behaviour between the precursor and the fine-tuned models. Short Twitter thread explanation.
  • Other: create a mosaic of alignment questions we can chip away at, better understand agency in the current paradigm, outreach, and mentoring.

As part of my Accelerating Alignment agenda, I aim to create the best Alignment Research Assistant using a suite of language models (LLMs) to help researchers (like myself) quickly produce better alignment research through an LLM system. The system will be designed to serve as the foundation for the ambitious goal of increasing alignment productivity by 10-100x during crunch time (in the year leading up to existentially dangerous AGI). The goal is to significantly augment current alignment researchers while also providing a system for new researchers to quickly get up to speed on alignment research or promising parts they haven’t engaged with much.

For Supervising AIs Improving AIs, this research agenda focuses on ensuring stable alignment when AIs self-train or train new AIs and studies how AIs may drift through iterative training. We aim to develop methods to ensure automated science processes remain safe and controllable. This form of AI improvement focuses more on data-driven improvements than architectural or scale-driven ones.

I’m seeking funding to continue my work as an independent alignment researcher and intend to work on what I’ve just described. However, to best achieve the project’s goal, I would want additional funding to scale up the efforts for Accelerating Alignment to develop a better system faster with the help of engineers so that I can focus on the meta-level and vision for that agenda. This would allow me to spread myself less thin and focus on my comparative advantages. If you would like to hop on a call to discuss this funding proposal in more detail, please message me. I am open to refocusing the proposal or extending the funding.

Replies from: mesaoptimizer
comment by mesaoptimizer · 2023-07-07T02:46:14.903Z · LW(p) · GW(p)

Build an Alignment Research Assistant using a suite of LLMs managing various parts of the research process. Aims to 10-100x productivity in AI alignment research.

Can you give concrete use-cases that you imagine your project would lead to helping alignment researchers? Alignment researchers have wildly varying styles of work outputs and processes. I assume you aim to accelerate a specific subset of alignment researchers (those focusing on interpretability and existing models and have an incremental / empirical strategy for solving the alignment problem).

comment by jacquesthibs (jacques-thibodeau) · 2022-11-21T12:04:08.155Z · LW(p) · GW(p)

Current Thoughts on my Learning System

Crossposted from my website. Hoping to provide updates on my learning system every month or so.

TLDR of what I've been thinking about lately:

  • There are some great insights in this video called "How Top 0.1% Students Think." And in this video about how to learn hard concepts.
  • Learning is a set of skills. You need to practice each component of the learning process to get better. You can’t watch a video on a new technique and immediately become a pro. It takes time to reap the benefits.
  • Most people suck at mindmaps. Mindmaps can be horrible for learning if you just dump a bunch of text on a page and point arrows to different stuff (some studies show mindmaps are ineffective, but that's because people initially suck at making them). However, if you take the time to learn how to do them well, they will pay huge dividends in the future. I’ll be doing the “Do 100 Things” challenge and developing my skill in building better mindmaps. Getting better at mindmaps involves “chunking” the material and creating memorable connections and drawings.
  • Relational vs Isolated Learning. As you learn something new, try to learn it in relation to the things you already know rather than treating it as isolated from everything (flashcards can perpetuate the problem of learning things in isolated form).
  • Encoding and Retrieval are essential concepts for efficient learning.
  • Deep processing is the foundation of all learning. It is the ability to connect, process, organize and relate information. The opposite of deep processing is rote memorization. If it doesn’t feel like you are engaging ~90% of your brain power when you are learning/reading something, you are likely not encoding the information into your long-term memory effectively.
  • Only use Flashcards as a last resort. Flashcards are something a lot of people use because they feel comfortable going through them. However, if your goal is to be efficient in your learning, you should only use flashcards when it’s something that requires rote learning. Video worth watching on Spaced Repetition.
  • You need to be aiming for higher-order learning. Take advantage of Bloom's Taxonomy.[1]
  • My current approach for learning about alignment: I essentially have a really big Roam Research page called "AI Alignment" where I break down the problem into chunks like "Jargon I don't understand," "Questions to Answer," "Different people's views on alignment," etc. As I fill in those details, I add more and more information in the "Core of the Alignment Problem" section. I have a separate page called "AI Alignment Flow Chart" which I'm using as a structure for backcasting on how we solved alignment and identifying the crucial things we need to solve and things I need to better understand. I also sometimes have a specific page for something like Interpretability when I'm trying to do a deep dive on a topic, but I always try to link it to the other things I've written in my main doc.
  • And this video concisely covers a lot of important learning concepts.
    • Look at the beginning of the video for an explanation of encoding, storage (into long-term memory), and retrieval/rehearsal to make sure you remember long-term.
    • Outside of learning:
      • Get enough sleep. 8 hours-ish.
      • Exercise like HIIT.
      • Make sure you have good mental health.
      • Meditation is likely useful. I personally use it to recharge my battery when I feel a crash coming and I think it’s useful for training yourself to work productively for longer periods of time. This one I’m less sure of, but seems to work for me.
    • Learning (all of these take time to master, don’t expect you will use them in the most effective way right out of the gate):
      • Use inquiry-based (curiosity-based) learning. Have your learning be guided by questions you have, like:
        • “Why is this important?”
        • “How does it relate to this other concept?”
      • Learn by scope. Start with the big picture and gradually break things down where it is important.
      • Chunking. Group concepts together and connect different chunks by relationship.
      • Create stories to remember things.
      • Focus on relationships between concepts. This is crucial.
      • Rehearsal
        • Spaced repetition (look at my other notes on how SR is overrated but still useful)
        • Apply your learning by creating things (like a forum post applying the new concept to something and explaining it)

Ever since I was little, I have relied on my raw brain power to get to where I am. Unfortunately, I could never bring myself to do what other smart kids were doing. Flashcards, revision? I would either get bored out of my mind or struggle because I didn’t know how to do it well. Mindmaps? It felt OK while I was doing it the few times I tried, but I would never revise it, and, honestly, I sucked at it.

But none of that mattered. I could still do well enough even though my learning system was terrible. However, I didn’t get the top grades, and I felt frustrated.

I read a few books and watched the popular YouTubers on how to learn things best. Spaced Repetition and Active Recall kept coming up. All these intelligent people were using it, and I truly believed it worked. However, whenever I tried it, I either ended up with too many flashcards to have the time to review, or I couldn't build a habit out of it. Flashcards also felt super inefficient when studying physics.

I did use Cal Newport’s stuff for some classes and performed better by studying the same amount of time, but as soon as things got intense (exam season/lots of homework), I would revert to my old (ineffective) study techniques like reading the textbook aimlessly and highlighting stuff. As a result, I would never truly develop the skill (yes, skill!) of studying well. But, just like anything, you can get better at creating mindmaps for proper learning and long-term memory.

I never got a system down, and I feel I’m losing out on gains in my career. How do I learn things efficiently? I don’t want to do the natural thing of putting in more hours to get more done. 1) My productivity will be capped by my inefficient system, 2) I still want to live life, and 3) it probably won’t work anyways.

So, consider this my public accountability statement to take the time to develop the skills necessary to become more efficient in my work. No more aimlessly reading LessWrong posts about AI alignment. There are more efficient ways to learn.

I want to contribute to AI alignment in a bigger way, and something needs to change. There is so much to learn, and I want to catch up as efficiently as possible instead of just winging it and trying whatever approach seems right.

Had I continued working on things I don’t care deeply about, I might have never decided to put in the effort to create a new system (which will probably take a year of practicing my learning skills). Maybe I would have tried for a few weeks and then reverted to my old habits. I could have kept coasting in life and done decently well in work and my personal life. But we need to solve alignment, and building these skills now will allow me to reap major benefits in a few years.

(Note: a nice bonus for developing a solid learning system is that you can pass it on to your children. I’m excited to do that one day, but I’d prefer to start doing this now so that I know that *I* can do it, and I’m not just telling my future kids nonsense.)

So, what have I been doing so far?

I started the iCanStudy course by Dr. Justin Sung (who has a YouTube channel). I’m only about 31% through the course.

My goal will be to create a “How to Create an Efficient Learning System” guide that is tailored for professionals and includes examples in AI alignment. Please let me know if there are some things you’d like me to explore in that guide.

Before I go, I’ll mention that I’m also interested in eventually taking what I learn from constructing my own learning system and creating something that allows others to do the same, but with much less effort. I hope to make this work for the alignment community in particular (which relates to my accelerating alignment project), but I’d also like to eventually expand to people working on other cause areas in effective altruism.

  1. ^
Replies from: jacques-thibodeau, jacques-thibodeau, jacques-thibodeau, jacques-thibodeau, jacques-thibodeau, jacques-thibodeau, jacques-thibodeau
comment by jacquesthibs (jacques-thibodeau) · 2023-04-10T17:05:43.437Z · LW(p) · GW(p)

Note on using ChatGPT for learning

  • Important part: Use GPT to facilitate the process of pushing you to higher-order learning as fast as possible.
  • Here’s Bloom’s Taxonomy for higher-order learning:
  • For example, you want to ask GPT to come up with analogies and such to help you enter higher-order thinking by thinking about whether the analogy makes sense.
    • Is the analogy truly accurate?
    • Does it cover the main concept you are trying to understand?
    • Then, you can extend the analogy to try to make it better and more comprehensive.
  • This allows you to offload the less useful task (e.g. coming up with the analogy) and spend more time in the highest orders of learning (the evaluation phase; “is this analogy good? where does it break down?”).
  • You still need to use your cognitive load to encode the knowledge effectively. Look for desirable difficulty.
  • Use GPT to create a pre-study of the thing you would like to learn.
    • Have it create an outline of the order of the things you should learn.
    • Have it give you a list of all the jargon words in a field and how they relate so that you can quickly get up to speed on the terminology and talk to an expert.
  • Coming up with chunks of the topic you are exploring.
    • You can give GPT text that describes what you are trying to understand, the relationships between things and how you are chunking them.
    • Then, you can ask GPT to tell you what are some weak areas or some things that are potentially missing.
    • GPT works really well as a knowledge “gap-checker”.

When you are trying to have GPT output some novel insights or complicated, nuanced knowledge, it can give vague answers that aren’t too helpful. This is why it is often better to treat GPT as a gap-checker and/or a friend that is prompting you to come up with great insights.

Reference: I’ve been using ChatGPT/GPT-4 a lot to gain insights on how to accelerate alignment research. Some of my conclusions are similar to what was described in the video below.

comment by jacquesthibs (jacques-thibodeau) · 2022-12-16T19:49:51.803Z · LW(p) · GW(p)

How learning efficiently applies to alignment research

As we are trying to optimize for actually solving the problem, [LW · GW] we should not fall into the trap of learning just to learn. We should instead focus on learning efficiently with respect to how it helps us generate insights that lead to a solution for alignment. This is also the framing we should have in mind when we are building tools for augmenting alignment researchers.

With the above in mind, I expect that part of the value of learning efficiently involves some of the following:

  • Efficient learning involves being hyper-focused on identifying the core concepts and how they all relate to one another. This mode of approaching things seems like it helps us attack the core of alignment much more directly and bypasses months/years of working on things that are only tangential.
  • Developing a foundation of a field seems key to generating useful insights. The goal is not to learn everything but to build a foundation that allows you to avoid spending way too much time tackling sub-optimal sub-problems or dead-ends. Part of the foundation-building process should reduce the time it takes to shape you into an exceptional alignment researcher rather than a knower-of-things.
  • As John Wentworth says [LW · GW] with respect to the Game Tree of Alignment: "The main reason for this exercise is that (according to me) most newcomers to alignment waste years on tackling not-very-high-value sub-problems or dead-end strategies."
  • Lastly, many great innovations have not come from unique original ideas. There's an iterative process passed amongst researchers and it seems often the case that the greatest ideas come from simply merging ideas that were already lying around. Learning efficiently (and storing those learnings for later use) allows you to increase the number of ideas you can merge together. If you want to do that efficiently, you need to improve your ability to identify which ideas are worth storing in your mental warehouse to use for a future merging of ideas.
Replies from: peter-hrosso
comment by Peter Hroššo (peter-hrosso) · 2022-12-16T21:23:51.631Z · LW(p) · GW(p)

My model of (my) learning is that if the goal is sufficiently far, learning directly towards the goal is goodharting a likely wrong metric.

The only method which worked for me for very distant goals is following my curiosity and continuously internalizing new info, such that the curiosity is well informed about current state and the goal.

Replies from: jacques-thibodeau, jacques-thibodeau
comment by jacquesthibs (jacques-thibodeau) · 2022-12-16T21:36:20.141Z · LW(p) · GW(p)

Curiosity is certainly a powerful tool for learning! I think any learning system which isn't taking advantage of it is sub-optimal. Learning should be guided by curiosity.

The thing is, sometimes we need to learn things we aren't so curious about. One insight I learned from studying learning is that you can do specific things to make yourself more curious about a given thing and harness the power that comes with curiosity.

Ultimately, what this looks like is to write down questions about the topic and use them to guide your curious learning process. It seems that this is how efficient top students end up learning things deeply in a shorter amount of time. Even for material they care little about, they are able to make themselves curious and be propelled forward by that.

That said, my guess is that goodharting the wrong metric can definitely be an issue, but I'm not convinced that relying on what makes you naturally curious is the optimal strategy for solving alignment. Either way, it's something to think about!

comment by jacquesthibs (jacques-thibodeau) · 2023-01-02T16:17:39.008Z · LW(p) · GW(p)

By the way, I've just added a link to a video by a top competitive programmer on how to learn hard concepts. Both the video and the iCanStudy course talk about the concept of caring about what you are learning (basically, curiosity). Gaining the skill to care and become curious is an essential part of the most effective learning. However, contrary to popular belief, you don't have to be completely guided by what makes you naturally curious! You can learn how to become curious (or care) about any random concept.

comment by jacquesthibs (jacques-thibodeau) · 2023-01-13T14:46:33.307Z · LW(p) · GW(p)

Video on how to approach having to read a massive amount of information (like a textbook) as efficiently as possible: 

comment by jacquesthibs (jacques-thibodeau) · 2023-01-04T20:09:51.561Z · LW(p) · GW(p)

Added my first post (of, potentially, a sequence) on effective learning here [LW · GW]. I think there are a lot of great lessons at the frontier of the literature and real-world practice on learning that go far beyond the Anki approach that a lot of people seem to take these days. The important part is being effective and efficient. Some techniques might work, but that does not mean they are the most efficient (learning the same thing more deeply in less time).

Note that I also added two important videos to the root shortform:

There are some great insights in this video called "How Top 0.1% Students Think." And in this video about how to learn hard concepts.

comment by jacquesthibs (jacques-thibodeau) · 2023-01-02T17:15:02.923Z · LW(p) · GW(p)

Note on spaced repetition

While spaced repetition is good, many people end up misusing it as a crutch instead of defaulting to trying to deeply understand a concept right away. As you get better at properly encoding the concept, you extend the forgetting curve to the point where repetition is less needed.

Here's a video of a top-level programmer on how he approaches learning hard concepts efficiently.

And here's a video on how the top 0.1% of students study efficiently.

comment by jacquesthibs (jacques-thibodeau) · 2023-01-02T03:24:19.969Z · LW(p) · GW(p)

Here are some additional notes on the fundamentals of being an effective learner:

Encoding and Retrieval (What it takes to learn)

  • Working memory is the memory we actively use. However, if what it holds is not encoded properly (or at all), we will forget it.
  • Encode well first (from working memory to long-term memory), then frequently and efficiently retrieve from long-term memory.
  • If studying feels easy, it means that you aren't learning or holding on to the information. It means that you are not encoding and retrieving effectively.
  • You want it to be difficult when you are studying because this is how it will encode properly.

Spacing, Interleaving, and Retrieval (SIR)

  • These are three rules that apply to every study technique in the course (unless told otherwise). You can apply SIR to all techniques.
  • Spacing: space your learning out.
    • Pre-study before class, then learn in class, and then a week later revise it with a different technique.
    • A rule of thumb you can follow is to wait long enough until you feel like you are just starting to forget the material.
    • As you get better at encoding the material effectively as soon as you are exposed to it, you will notice that you will need to do less repetition.
    • How to space reviews:
      • Beginner Schedule (fewer reviews needed as you get better at encoding)
        • Same day
        • Next day
        • End of week
        • End of month
      • After learning something for the first time, review it later on the same day.
      • Review everything from the last 2-3 days mid-week.
      • Do an end of week revision on the week's worth of content.
      • End of month revision on entire month's worth of content.
      • Review of what's necessary as time goes on.
        • (If you're trying to do well on an exam or a coding interview, you can do the review 1 or 2 weeks before the assessment.)
    • Reviewing time duration:
      • For beginners
        • No less than 30 minutes per subject for end-of-week
        • No less than 1.5 hours per subject for end-of-month.
    • Schedule the reviews in your Calendar and add a reminder!
  • Interleaving: hitting a topic or concept from multiple different angles (mindmaps, teaching).
    • The idea is that there is the concept you want to learn, but also there is a surrounding range that you also want to learn (not just the isolated concept).
    • Could be taking a concept and asking a question about it. Then, asking a question from another angle. Then, asking how it relates to another concept.
    • Try to use a multitude of these techniques in your studying, never studying or revising anything the same way more than once.
    • For math, it could be thinking about the real-world application of it.
    • Examples of interleaving:
      • Teach an imaginary student
      • Draw a mindmap
      • Draw an image instead of using words to find a visual way of expressing information
      • Answer practice questions
      • Create your own challenging test questions
      • Create a test question that puts what you've learned into a real-world context
      • Take a difficult question that you found in a practice test and modify it so that the variables are different, or an extra step is added
      • Form a study group and quiz each other - for some subjects you can even debate the topic, with one side trying to prove that the other person is missing a point or understanding it incorrectly
      • For languages, you can try to speak or write a piece of dialogue or speech, as well as some variations. How might someone respond? How would you respond back? Are there any other responses that would be appropriate?
  • Retrieval: taking info from your long-term memory and bringing it into your working memory to recall, solve problems and answer questions.
    • Taking a concept and retrieving it from your long-term memory.
    • Don't just retrieve right away; you can look at your notes, take a few minutes, and then retrieve.
    • Or it also happens when you are learning something. Let's say you are listening to a lecture. Are you just writing everything down or are you taking some time to think and process what is being said and then writing down notes? The second one is better.
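
The beginner schedule under Spacing above (same day, next day, end of week, end of month) can be sketched as a small date calculation. This is only a rough illustration: the function name `beginner_review_dates` is hypothetical, and treating Sunday as the end of the week and the last calendar day as the end of the month are assumptions that the course notes do not specify.

```python
from datetime import date, timedelta
import calendar


def beginner_review_dates(first_learned: date) -> dict[str, date]:
    """Return one review date per step of the beginner schedule.

    Assumptions (not from the course): the week ends on Sunday and the
    month ends on the last calendar day of the month.
    """
    # weekday() is 0 for Monday and 6 for Sunday.
    days_to_sunday = 6 - first_learned.weekday()
    # monthrange() returns (weekday of the 1st, number of days in the month).
    last_day = calendar.monthrange(first_learned.year, first_learned.month)[1]
    return {
        "same day": first_learned,
        "next day": first_learned + timedelta(days=1),
        "end of week": first_learned + timedelta(days=days_to_sunday),
        "end of month": first_learned.replace(day=last_day),
    }


if __name__ == "__main__":
    for step, when in beginner_review_dates(date(2023, 1, 2)).items():
        print(f"{step}: {when.isoformat()}")
```

If you learn something late in the week or month, some of these dates will coincide or fall out of order (e.g. the "next day" review can land after the end-of-week review), so in practice you would merge or skip the overlapping reviews before adding them to your calendar.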

Syntopical Learning

When you are learning something, you want to apply interleaving by learning from different sources and mediums. So, practice becoming great at learning while listening, while watching, and while reading. These are all individual modes of learning you can get better at, and they will all help you better retain the material if you use them all while learning.

comment by jacquesthibs (jacques-thibodeau) · 2022-11-21T22:53:53.984Z · LW(p) · GW(p)

A few more notes:

  • I use the app Concepts on my iPad to draw mindmaps. Drawing mindmaps with pictures and such is way more powerful (better encoding into long-term memory) than typical mindmap apps where you just type words verbatim and draw arrows. It's excellent since it has a (quasi-) infinite canvas. This is the same app that Justin Sung uses.
  • When I want to go in-depth into a paper, I will load it into OneNote on my iPad and draw in the margin to better encode my understanding of the paper.
  • I've been using the Voice Dream Reader app on my iPhone and iPad to get through posts and papers much faster (I usually have time to read most of an Alignment Forum post on my way to work and another on the way back). Importantly, I stop the text-to-speech when I'm trying to understand an important part. I use Pocket to load LW/AF posts into it and download PDFs on my device and into the app for reading papers. There's a nice feature in the app that automatically skips citations in the text, so reading papers isn't as annoying. The voices are robotic, but I just cycled through a bunch until I found one I didn't mind (I didn't buy any, but there are premium voices). I expect Speechify has better voices, but it's more expensive, and I think people find that the app isn't as good overall compared to Voice Dream Reader. Thanks to Quintin Pope for recommending the app to me.
comment by jacquesthibs (jacques-thibodeau) · 2024-04-12T15:01:48.420Z · LW(p) · GW(p)

I'm currently ruminating on the idea of doing a video series in which I review code repositories that are highly relevant to alignment research to make them more accessible.

I do want to pick out repos that are still useful despite perhaps having bad documentation, and then hop on a call with the author to go over the repo and record it. That way, people at least have something basic to use when navigating the repo.

This means there would be two levels: 1) an overview with the author sharing at least the basics, and 2) a deep dive going over most of the code. The former likely contains most of the value (lower effort for me, still gets done, better than nothing, points to repo as a selection mechanism, people can at least get started).

I am thinking of doing this because I think there may be repositories that are highly useful for new people but would benefit from some direction. For example, I think Karpathy and Neel Nanda's videos have been useful in getting people started. In particular, Karpathy's repos (e.g. nanoGPT) gained an order of magnitude more stars after the release of his videos (though, to be fair, he's famous, and the number of stars is definitely not a perfect proxy for usage).

I'm interested in any feedback ("you should do it like x", "this seems low value for x, y, z reasons so you shouldn't do it", "this seems especially valuable only if x", etc.).

Here are some of the repos I have in mind so far:

Release Ordering

Replies from: Dagon
comment by Dagon · 2024-04-12T16:57:13.614Z · LW(p) · GW(p)

I love this idea!  I don't actually like videos, preferring searchable, excerptable text, but I may not be typical and there's room for all. At first glance, I agree with your guess that the overview/intro is more value per effort (for you and for consumers, IMO) than a deep-dive into the code. There IS probably a section of code or core modeling idea for each where it would be worth going half-deep into (algorithm and usage, not necessarily line-by-line).

Note that this list is itself incredibly valuable, and you might start with an intro video (and associated text) that spends 1 minute on each and why you're planning to do it, and what you currently think will be the most important intro concept(s) for each.

comment by jacquesthibs (jacques-thibodeau) · 2023-03-30T01:17:37.331Z · LW(p) · GW(p)

I’m still thinking this through, but I am deeply concerned about Eliezer’s new article [LW · GW] for a combination of reasons:

  • I don’t think it will work.
  • Given that it won’t work, I expect we lose credibility and it now becomes much harder to work with people who were sympathetic to alignment, but still wanted to use AI to improve the world.
  • I am not as convinced as he is about doom and I am not as cynical about the main orgs as he is.

In the end, I expect this will just alienate people. And stuff like this [LW(p) · GW(p)] concerns me.

I think it’s possible that the most memetically powerful approach will be to accelerate alignment rather than suggesting long-term bans or effectively antagonizing all AI use.

Replies from: abramdemski, jacques-thibodeau, Viliam
comment by abramdemski · 2023-04-27T16:31:10.533Z · LW(p) · GW(p)

So I think what I'm getting here is that you have an object-level disagreement (not as convinced about doom), but you are also reinforcing that object-level disagreement with signalling/reputational considerations (this will just alienate people). This pattern feels ugh and worries me. It seems highly important to separate the question of what's true from the reputational question. It furthermore seems highly important to separate arguments about what makes sense to say publicly on-your-world-model vs on-Eliezer's-model. In particular, it is unclear to me whether your position is "it is dangerously wrong to speak the truth about AI risk" vs "Eliezer's position is dangerously wrong" (or perhaps both). 

I guess that your disagreement with Eliezer is large but not that large (IE you would name it as a disagreement between reasonable people, not insanity). It is of course possible to consistently maintain that (1) Eliezer's view is reasonable, (2) on Eliezer's view, it is strategically acceptable to speak out, and (3) it is not in fact strategically acceptable for people with Eliezer's views to speak out about those views. But this combination of views does imply endorsing a silencing of reasonable disagreements which seems unfortunate and anti-epistemic. 

My own guess is that the maintenance of such anti-epistemic silences is itself an important factor contributing to doom. But, this could be incorrect.

Replies from: jacques-thibodeau
comment by jacquesthibs (jacques-thibodeau) · 2023-04-27T17:32:41.768Z · LW(p) · GW(p)

Yeah, so just to clarify a few things:

  • This was posted on the day of the open letter and I was indeed confused about what to think of the situation.
  • I think something I failed to properly communicate is that I was worried that this was a bad time to pull the lever even if I’m concerned about risks from AGI. I was worried the public wouldn’t take alignment seriously because this would cause a panic much sooner than people were ready for.
  • I care about being truthful, but I care even more about not dying so my comment was mostly trying to communicate that I didn’t think this was the best strategic decision for not dying.
  • I was seeing a lot of people write negative statements about the open letter on Twitter and it kind of fed my fears that this was going to backfire as a strategy and impact all of our work to get AI risk taken seriously.
  • In the end, the final thing that matters is that we win (i.e. not dying from AGI).

I’m not fully sure what I think now (mostly because I don’t know about higher order effects that will happen 2-3 years from now), but I think it turned out a lot strategically better than I initially expected.

comment by jacquesthibs (jacques-thibodeau) · 2023-03-30T03:46:09.453Z · LW(p) · GW(p)

To try and burst any bubble about people’s reaction to the article, here’s a set of tweets critical of the article:

Replies from: Viliam
comment by Viliam · 2023-03-30T13:50:28.360Z · LW(p) · GW(p)

What is the base rate for Twitter reactions for an international law proposal?

Replies from: jacques-thibodeau
comment by jacquesthibs (jacques-thibodeau) · 2023-03-30T16:21:57.006Z · LW(p) · GW(p)

Of course it’s often all over the place. I only shared the links because I wanted to make sure people weren’t deluding themselves with only positive comments.

comment by Viliam · 2023-03-30T13:47:22.123Z · LW(p) · GW(p)

This reminds me of the internet-libertarian chain of reasoning that anything that government does is protected by the threat of escalating violence, therefore any proposals that involve government (even mild ones, such as "once in a year, the President should say 'hello' to the citizens") are calls for murder, because... (create a chain of escalating events starting with someone non-violently trying to disrupt this, ending with that person being killed by cops)...

Yes, a moratorium on AIs is a call for violence, but only in the sense that every law is a call for violence.

Replies from: sharmake-farah
comment by jacquesthibs (jacques-thibodeau) · 2023-11-05T19:00:38.093Z · LW(p) · GW(p)

This seems like a fairly important paper by DeepMind regarding generalization (and the lack of it in current transformer models): https://arxiv.org/abs/2311.00871 

Here’s an excerpt on transformers potentially not really being able to generalize beyond training data:

Our contributions are as follows:

  • We pretrain transformer models for in-context learning using a mixture of multiple distinct function classes and characterize the model selection behavior exhibited.
  • We study the in-context learning behavior of the pretrained transformer model on functions that are "out-of-distribution" from the function classes in the pretraining data.
  • In the regimes studied, we find strong evidence that the model can perform model selection among pretrained function classes during in-context learning at little extra statistical cost, but limited evidence that the models' in-context learning behavior is capable of generalizing beyond their pretraining data.
Replies from: leogao, sharmake-farah, jacques-thibodeau, D0TheMath, Oliver Sourbut
comment by leogao · 2023-11-05T23:06:23.847Z · LW(p) · GW(p)

i predict this kind of view of non magicalness of (2023 era) LMs will become more and more accepted, and this has implications on what kinds of alignment experiments are actually valuable (see my comment on the reversal curse paper [LW(p) · GW(p)]). not an argument for long (50 year+) timelines, but is an argument for medium (10 year) timelines rather than 5 year timelines

Replies from: leogao
comment by leogao · 2023-11-05T23:45:07.873Z · LW(p) · GW(p)

also this quote from the abstract is great:

Together our results highlight that the impressive ICL abilities of high-capacity sequence models may be more closely tied to the coverage of their pretraining data mixtures than inductive biases that create fundamental generalization capabilities.

i used to call this something like "tackling the OOD generalization problem by simply making the distribution so wide that it encompasses anything you might want to use it on"

comment by Noosphere89 (sharmake-farah) · 2023-11-06T14:28:36.241Z · LW(p) · GW(p)

Here are my major takeaways, assuming this research scales (it was only done on GPT-2, and we already knew it couldn't generalize):

  1. Gary Marcus was right about LLMs mostly not reasoning outside the training distribution, and this updates me more towards "LLMs probably aren't going to be godlike, or be nearly as impactful as LW says they will be."

  2. Be more skeptical of AI progress leading to big things, and in general unless reality can simply be memorized, scaling probably won't work to automate the economy. More generally, this updates me towards longer timelines, and longer tails on those timelines.

  3. Be slightly more pessimistic on AI safety, since LLMs have a bunch of nice properties, and future AI probably will have less nice properties, though alignment optimism mostly doesn't depend on LLMs.

  4. AI governance gets a lucky break, since they only have to regulate misuse, and even though their threat model isn't likely or even probable to be realized, it's still nice that we don't have to deal with the disruptive effects of AI now.

Replies from: jacques-thibodeau
comment by jacquesthibs (jacques-thibodeau) · 2023-11-06T16:58:20.406Z · LW(p) · GW(p)

I am sharing this since I think it will change your view on how much to update on this paper (I should have shared this initially). Here's what the paper author said on X:
 

Clarifying two things:

  • Model is simple transformer for science, not a language model (or large by standards today)
  • The model can learn new tasks (via in-context learning), but can’t generalize to new task families

I would be thrilled if this work was important for understanding AI safety and fairness, but it is the start of a scientific direction, not ready for policy conclusions. Understanding what task families a true LLM is capable of would be fascinating and more relevant to policy!

 

So, with that, I said:

I hastily thought the paper was using language models, so I think it's important to share this. A follow-up paper using a couple of 'true' LLMs at different model scales would be great. Is it just interpolation? How far can the models extrapolate?

To which @Jozdien [LW · GW] replied:

Redpill is that all intelligence is just interpolation if you reach this level of fidelity.

comment by jacquesthibs (jacques-thibodeau) · 2023-11-16T16:26:16.333Z · LW(p) · GW(p)

Title: Is the alignment community over-updating on how scale impacts generalization?

So, apparently, there's a rebuttal to the recent Google generalization paper (it's also worth pointing out that the paper wasn't done with language models, just sinusoidal functions):

But then, the paper author responds:


This line of research makes me question one thing: "Is the alignment community over-updating on how scale impacts generalization?"

It remains to be seen how well models will generalize outside of their training distribution (interpolation vs extrapolation).

In other words, when people say that GPT-4 (and other LLMs) can generalize, I think they need to be more careful about what they really mean. Is it doing interpolation or extrapolation? Meaning, yes, GPT-4 can do things like write a completely new poem, but poems and related stuff were in its training distribution! So, you can say it is generalizing, but I think it's a much weaker form of generalization than what people really imply when they say generalization. A stronger form of generalization would be an AI that can do completely new tasks that are actually outside of its training distribution.
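
A toy sketch can make the interpolation/extrapolation distinction concrete. It is not the paper's setup: the "model" below is a deliberately simple stand-in (a hypothetical `fit_lookup` that linearly interpolates between stored training points and holds the boundary value outside them), chosen only to show how the same model can look accurate inside its training range and poor outside it.

```python
import math
from bisect import bisect_left


def fit_lookup(xs, ys):
    """'Train' a piecewise-linear model: it only stores the training points."""
    points = sorted(zip(xs, ys))
    kx = [p[0] for p in points]
    ky = [p[1] for p in points]

    def predict(x):
        # Outside the training range, hold the nearest boundary value.
        if x <= kx[0]:
            return ky[0]
        if x >= kx[-1]:
            return ky[-1]
        # Inside the range, interpolate linearly between the two neighbors.
        i = bisect_left(kx, x)
        x0, x1, y0, y1 = kx[i - 1], kx[i], ky[i - 1], ky[i]
        return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

    return predict


def mean_abs_error(predict, xs):
    """Average absolute gap between the model and the true sin(x)."""
    return sum(abs(predict(x) - math.sin(x)) for x in xs) / len(xs)


# Training distribution: 200 evenly spaced samples of sin(x) on [0, 2*pi].
train_x = [2 * math.pi * i / 199 for i in range(200)]
model = fit_lookup(train_x, [math.sin(x) for x in train_x])

inside = [0.1 + 0.05 * i for i in range(100)]                 # within [0, 2*pi]
outside = [2 * math.pi + 0.1 + 0.05 * i for i in range(100)]  # beyond 2*pi

print("in-distribution error:    ", mean_abs_error(model, inside))
print("out-of-distribution error:", mean_abs_error(model, outside))
```

The in-range error is tiny while the out-of-range error is large, even though nothing about the underlying function changed. Whether large transformers behave more like this lookup or learn something that extends past the training range is exactly the open question.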

Now, at this point, you might say, "yes, but we know that LLMs learn functions and algorithms to do tasks, and as you scale up and compress more and more data, you will uncover more meta-algorithms that can do this kind of extrapolation/tasks outside of the training distribution."

Well, two things:

  1. It remains to be seen when or if this will happen in the current paradigm (no matter how much you scale up).
  2. It's not clear to me how well things like induction heads continue to work on things that are outside of their training distribution. If they don't adapt well, then it may be the same thing for other algorithms. What this would mean in practice, I'm not sure. I've been looking at relevant papers, but haven't found an answer yet [LW · GW].

This brings me to another point: it also remains to be seen how much it will matter in practice, given that models are trained on so much data and things like online learning are coming. Scaffolding, specialized AI models, and new innovations might make such a limitation, if there is one, not that big of a deal.

Also, perhaps most of the important capabilities come from interpolation. Perhaps intelligence is largely just interpolation? You just need to interpolate and push the boundaries of capability one step at a time, iteratively, like a scientist conducting experiments would. You just need to integrate knowledge as you interact with the world.

But what of brilliant insights from our greatest minds? Is it just recursive interpolation+small_external_interactions? Is there something else they are doing to get brilliant insights? Would AGI still ultimately be limited in the same way (even if it can run many of these genius patterns in parallel)?

Replies from: jacques-thibodeau
comment by Garrett Baker (D0TheMath) · 2023-11-06T17:20:13.456Z · LW(p) · GW(p)

Some evidence this is not so fundamental, and we should expect a (or many) phase transition(s) to more generalizing in-context learning as we increase the log number of tasks.

comment by Oliver Sourbut · 2023-11-07T08:13:25.004Z · LW(p) · GW(p)

My hot take is that this paper's prominence is a consequence of importance hacking (I'm not accusing the authors in particular). Zero or near-zero relevance to LLMs.

Authors get a yellow card for abusing the word 'model' twice in the title alone.

comment by jacquesthibs (jacques-thibodeau) · 2023-07-29T22:43:35.844Z · LW(p) · GW(p)

Given funding is a problem in AI x-risk at the moment, I’d love to see people start thinking of creative ways to provide additional funding to alignment researchers who are struggling to get funding.

For example, I’m curious if governance orgs would pay for technical alignment expertise as a sort of consultant service.

Also, it might be valuable to have full-time field-builders that are solely focused on getting more high-net-worth individuals to donate to AI x-risk.

comment by jacquesthibs (jacques-thibodeau) · 2023-01-24T21:09:16.838Z · LW(p) · GW(p)

On joking about how "we're all going to die"

Setting aside the question of whether people are overly confident about their claims regarding AI risk, I'd like to talk about how we talk about it amongst ourselves.

We should avoid jokingly saying "we're all going to die" because I think it will corrode your calibration to risk with respect to P(doom) and it will give others the impression that we are all more confident about P(doom) than we really are.

I think saying it jokingly still ends up creeping into your rational estimates on timelines and P(doom). I expect that the more you joke about high P(doom), the more likely you will end up developing an unjustifiably high P(doom). And I think if you say it enough, you can even convince yourself that you are more confident in your high P(doom) than you really are.

Joking about it in public also potentially diminishes your credibility. They may or may not know if you are joking, but that doesn't matter.

For all the reasons above, I've been trying to make a conscious effort to avoid this kind of talk.

From my understanding, being careful with the internal and external language you use is something that is recommended in therapy. Would be great if someone could point me to examples of this.

comment by jacquesthibs (jacques-thibodeau) · 2023-01-05T21:14:12.396Z · LW(p) · GW(p)

What are some important tasks you've found too cognitively taxing to get in the flow of doing?

One thing that I'd like to consider for Accelerating Alignment [LW · GW] is to build tools that make it easier to get in the habit of cognitively demanding tasks by reducing the cognitive load necessary to do the task. This is part of the reason why I think people are getting such big productivity gains from tools like Copilot.

One way I try to think about it is like getting into the habit of playing guitar. I typically tell people to buy an electric guitar rather than an acoustic guitar because the acoustic is typically much more painful for your fingers. You are already doing a hard task of learning an instrument, try to reduce the barrier to entry by eliminating one of the causes of friction. And while you're at it, don't put your guitar in a case or in a place that's out of your way, make it ridiculously easy to just pick up and play. In this example, it's not cognitively taxing, but it is some form of tax that produces friction.

It is possible that we could have many more people tackling the core of alignment if it were less mentally demanding to get to that point and contribute to a solution. It's possible that some level of friction for some tasks is making it so people are more likely to opt for what is easy (and potentially leads to fake progress on a solution to alignment). One such example might be understanding some difficult math. Another might be communicating your research in a way that is understandable to others.

I think it's worth thinking in this frame when coming up with ways to accelerate alignment research by augmenting researchers.

Replies from: ete
comment by plex (ete) · 2023-01-06T00:57:14.705Z · LW(p) · GW(p)

For developing my hail mary alignment approach, the dream would be to be able to load enough of the context of the idea into a LLM that it could babble suggestions (since the whole doc won't fit in the context window, maybe randomizing which parts beyond the intro are included for diversity?), then have it self-critique those suggestions automatically in different threads in bulk and surface the most promising implementations of the idea to me for review. In the perfect case I'd be able to converse with the model about the ideas and have that be not totally useless, and pump good chains of thought back into the fine-tuning set.
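
To make the "randomizing which parts beyond the intro are included" idea concrete, here is a minimal sketch. It assumes the doc is already split into an intro string and a list of section strings, and it uses a character budget as a rough stand-in for a token limit. The name `build_prompts` and all of its parameters are hypothetical, not part of any existing tool.

```python
import random


def build_prompts(intro, sections, budget_chars, n_prompts, seed=0):
    """Build n_prompts prompts: the intro plus a random subset of sections.

    Each prompt always starts with the intro, then greedily adds sections in
    a random order until the character budget is used up. Chosen sections
    are emitted in their original document order.
    """
    rng = random.Random(seed)
    prompts = []
    for _ in range(n_prompts):
        order = list(range(len(sections)))
        rng.shuffle(order)
        chosen, used = [], len(intro)
        for i in order:
            cost = len(sections[i]) + 2  # +2 for the "\n\n" separator
            if used + cost <= budget_chars:
                chosen.append(i)
                used += cost
        parts = [intro] + [sections[i] for i in sorted(chosen)]
        prompts.append("\n\n".join(parts))
    return prompts
```

Each resulting prompt could then be sent to the model in its own thread for the babble step, with a second pass asking the model to critique the suggestions. A real version would probably count tokens with the model's tokenizer rather than characters.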

comment by jacquesthibs (jacques-thibodeau) · 2022-12-29T16:07:59.803Z · LW(p) · GW(p)

Projects I'd like to work on in 2023.

Wrote up a short (incomplete) bullet point list of the projects I'd like to work on in 2023:

  • Accelerating Alignment
    • Main time spent (initial ideas, will likely pivot to varying degrees depending on feedback; will start with one):
      • Fine-tune GPT-3/GPT-4 on alignment text and connect the API to LoomVSCode (CoPilot for alignment research) and potentially notetaking apps like Roam Research. (1-3 months, depending on bugs and if we continue to add additional features.)
      • Create an audio-to-post pipeline where we can easily help alignment researchers create posts through conversations rather than staring at a blank page. (1-4 months, depending on collaboration with Conjecture and others; and how many features we add.)
      • Leaving the door open and experimenting with ChatGPT and/or GPT-4 to use them for things we haven't explored yet. Especially GPT-4, we can guess in advance what it will be capable of, but we'll likely need to experiment a lot to discover how to use it optimally given it might have new capabilities GPT-3 doesn't have. (2 to 6 weeks.)
    • Work with Janus, Nicholas Dupuis, and others on building tools for accelerating alignment research using language models (in prep for and integrating GPT-4). These will serve as tools for augmenting the work of alignment researchers. Many of the tool examples are covered in the grant proposal, my recent post [LW · GW], and an upcoming post, and Nicholas' doc on Cyborgism (we've recently spun up a discord to discuss these things with other researchers; send DM for link). This work is highly relevant to OpenAI's main alignment proposal.
    • This above work involves:
      • Working on setting the foundation for automating alignment and making proposal verification viable. (1 week of active work for a post I'm working on, and then some passive work while I build tools.)
      • Studying the epistemology of effective research helps generate research that leads us to solve alignment. For example, promoting flow and genius moments, effective learning (I'm taking a course on this and so far it is significantly better than the "Learning How to Learn" course) and how it can translate to high-quality research [LW(p) · GW(p)], etc. (5 hours per week)
      • Studying how to optimally condition generative models for alignment [AF · GW].
    • It's very hard to predict how the tool-building will go because I expect to be doing a lot of iteration to land on things that are optimally useful rather than come up with a specific plan and stick to it. My goal here is to implement design thinking and approaches that startups use. This involves taking the survey responses, generating a bunch of ideas, creating an MVP, testing it out with alignment researchers, and then learning from feedback.
  • Finish a sequence I'm working on with others. We are currently editing the intro post and refining the first post. We went through 6 weeks of seminars for a set of drafts and we are now working to build upon those. (6 to 8 weeks)
  • Other Projects outside of the grant (will dedicate about 1 day per week, but expect to focus more on some of these later next year, depending on how Accelerating Alignment goes. If not, I'll likely find some mentees or more collaborators to work on some of them.)
    • Support the Shard Theory team in running experiments using RL and language models. I'll be building off of my MATS colleagues' work. (3 to 5 months for running experiments and writing about them. Would consider spending a month or so on this and then mentoring someone to continue.)
    • Applying the Tuned Lens to better understand what transformers are doing. For example, what is being written and read from the residual stream and how certain things like RL lead to non-myopic behaviour. Comparing self-supervised models to RL fine-tuned models. (2 to 4 months by myself, probably less if I collaborate.)
    • Building off of Causal Tracing and Causal Scrubbing to develop more useful causal interpretability techniques. In this linked doc, I discuss this in the second main section: "Relevance For Alignment." (3 days to wrap up first post. For exploring, studying and writing about new causal methods, anywhere from 2 months to 4 months.)
    • Provide support for governance projects. I've been mentoring someone looking to explore AI Governance for the past few months (they are now applying for an internship at GovAI). They are currently writing up a post on "AI safety" governance in Canada. I'll be providing mentorship on a few posts I've suggested they write. Here's my recent governance post [LW · GW]. (2-3 hours per week)
    • Update and wrap up the GEM proposal. Adding new insights to it, including the new Tuned Lens that Nora has been working on. (1 week)
    • Applying quantilizers to Large Language Models. This project is still in the discovery phase for a MATS colleague of mine. I'm providing comments at the moment, but it may turn into a full-time project later next year.
    • Mentoring through the AI Safety Mentors and Mentees [EA · GW] program. I'm currently mentoring someone who is working on Shard Theory and Infra-Bayesianism relevant work.
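
For readers unfamiliar with the quantilizer idea mentioned in the list above: instead of picking the single highest-utility action, a quantilizer samples from a base distribution restricted to the top q fraction by utility. A minimal sketch, assuming the candidates are already samples from the base model (for example, completions) and that `utility` is some scoring function (for example, a reward model); both are placeholders here, and picking uniformly from the top-ranked samples is only an empirical approximation of the idea.

```python
import math
import random


def quantilize(candidates, utility, q, rng=None):
    """Pick uniformly at random from the top-q fraction of candidates by utility.

    candidates: samples assumed to be drawn from the base distribution.
    utility:    function mapping a candidate to a number (higher is better).
    q:          fraction in (0, 1]; q=1 is plain sampling from the candidates,
                and a q small enough to keep one candidate is plain argmax.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    rng = rng or random.Random()
    ranked = sorted(candidates, key=utility, reverse=True)
    k = max(1, math.ceil(q * len(ranked)))  # always keep at least one candidate
    return rng.choice(ranked[:k])
```

The design point is that q trades off optimization pressure against staying close to the base distribution, which is why it might limit how hard a model exploits a flawed utility function.
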
Replies from: jacques-thibodeau, cmessinger, ete, MathieuRoy
comment by jacquesthibs (jacques-thibodeau) · 2023-02-17T06:17:35.761Z · LW(p) · GW(p)

Two other projects I would find interesting to work on:

  • Causal Scrubbing to remove specific capabilities from a model. For example, training a language model on The Pile and a code dataset, then applying causal scrubbing to try to remove the model's ability to generate code while still achieving similar loss on The Pile.
  • A few people have started extending the work from the Discovering Latent Knowledge in Language Models without Supervision paper. I think this work could potentially evolve into a median-case solution to avoiding x-risk from AI.
comment by chanamessinger (cmessinger) · 2023-05-05T13:12:38.760Z · LW(p) · GW(p)

Curious if you have any updates!

Replies from: jacques-thibodeau
comment by jacquesthibs (jacques-thibodeau) · 2023-05-05T17:26:10.351Z · LW(p) · GW(p)

Working on a new grant proposal right now. Should be sent this weekend. If you’d like to give feedback or have a look, please send me a DM! Otherwise, I can send the grant proposal to whoever wants to have a look once it is done (still debating about posting it on LW).

Outside of that, there has been a lot of progress on the Cyborgism discord (there is a VSCode plugin called Worldspider that connects to the various APIs, and there has been more progress on Loom). Most of my focus has gone towards looking at the big picture and keeping an eye on all the developments. Now, I have a better vision of what is needed to create an actually great alignment assistant and have talked to other alignment researchers about it to get feedback and brainstorm. However, I’m spread way too thin and will request additional funding to get some engineer/builder to start building the ideas out so that I can focus on the bigger picture and my alignment work.

If I can get my funding again (previous funding ended last week) then my main focus will be building out the system I have in mind for accelerating alignment work + continue working on the new agenda [LW · GW] I put out with Quintin and others. There’s some other stuff I’d like to do, but those are lower priority or will depend on timing. It’s been hard to get the funding application done because things are moving so fast and I’m trying not to build things that will be built by default. And I’ve been talking to some people about the possibility of building an org so that this work could go a lot faster.

comment by plex (ete) · 2023-01-01T21:34:45.427Z · LW(p) · GW(p)

Very excited by this agenda, was discussing my hope that someone finetunes LLMs on the alignment archive soon today!

comment by Mati_Roy (MathieuRoy) · 2023-04-09T18:45:45.972Z · LW(p) · GW(p)

Nicholas' doc on Cyborgism

do you have a link?

I'd be interested in being added to the Discord

comment by jacquesthibs (jacques-thibodeau) · 2023-09-24T03:34:38.311Z · LW(p) · GW(p)

Jacques' AI Tidbits from the Web

I often find information about AI development on X (f.k.a. Twitter) and sometimes other websites. They usually don't warrant their own post, so I'll use this thread to share. I'll be placing a fairly low filter on what I share.

Replies from: jacques-thibodeau, jacques-thibodeau, jacques-thibodeau, jacques-thibodeau
comment by jacquesthibs (jacques-thibodeau) · 2023-09-24T03:41:59.616Z · LW(p) · GW(p)

There's someone on X (f.k.a. Twitter) called Jimmy Apples (🍎/acc) and he has shared some information in the past that turned out to be true (apparently the GPT-4 release date and that OAI's new model would be named "Gobi"). He recently tweeted, "AGI has been achieved internally." Some people think that the Reddit comment below may be from the same guy (this is just a weak signal, I’m not implying you should consider it true or update on it):

Replies from: elifland, jacques-thibodeau, jacques-thibodeau
comment by elifland · 2023-09-25T00:43:43.389Z · LW(p) · GW(p)

Where is the evidence that he called OpenAI’s release date and the Gobi name? All I see is a tweet claiming the latter but it seems the original tweet isn’t even up?

Replies from: jacques-thibodeau, jacques-thibodeau
comment by jacquesthibs (jacques-thibodeau) · 2023-09-25T01:05:17.798Z · LW(p) · GW(p)

This is the tweet for Gobi: https://x.com/apples_jimmy/status/1703871137137176820?s=46&t=YyfxSdhuFYbTafD4D1cE9A

I asked someone if it’s fake. Apparently not, you can find it on google archive: https://threadreaderapp.com/thread/1651837957618409472.html

Replies from: person-1
comment by Person (person-1) · 2023-09-25T13:15:36.581Z · LW(p) · GW(p)

Predicting the GPT-4 launch date can easily be disproven with the confidence game. It's possible he just created a prediction for every day and deleted the ones that didn't turn out right.

For the Gobi prediction it's tricky. The only evidence is the Threadreader and a random screenshot from a guy who seems clearly related to jim. I am very suspicious of the Threadreader one. On one hand I don't see a way it can be faked, but it's very suspicious that the Gobi prediction is Jimmy's only post that was saved there despite him making an even bigger bombshell "prediction". It's also possible, though unlikely, that the Information's article somehow found his tweet and used it as a source for their article.

What kills Jimmy's credibility for me is his prediction back in January (you can use the Wayback Machine to find it) that OAI had finished training GPT-5, no not a GPT-5 level system, the ACTUAL GPT-5 in October 2022 and that it was 125T parameters.

Also goes without saying, pruning his entire account is suspicious too. 

comment by jacquesthibs (jacques-thibodeau) · 2023-09-25T00:49:21.878Z · LW(p) · GW(p)

I’ll try to find them, but this was what people were saying. They also said he deleted past tweets so that evidence may forever be gone.

I remember one tweet where Jimmy said something like, “Gobi? That’s old news, I said that months ago, you need to move on to the new thing.” And I think he linked the tweet though I’m very unsure atm. Need to look it up, but you can use the above for a search.

comment by jacquesthibs (jacques-thibodeau) · 2023-10-24T02:16:59.349Z · LW(p) · GW(p)

New tweet by Jimmy Apples. This time, he's insinuating that OpenAI is funding a stealth startup working on BCI.

If this is true, then it makes sense they would prefer not to do it internally to avoid people knowing in advance based on their hires. A stealth startup would keep things more secret.

Might be of interest, @lisathiergart [LW · GW] and @Allison Duettmann [LW · GW].

comment by jacquesthibs (jacques-thibodeau) · 2023-09-26T01:23:14.658Z · LW(p) · GW(p)

Not sure exactly what this means, but Jimmy Apples has now tweeted the following:

My gut is telling me that he apple-bossed too close to the sun (released info he shouldn't have, and now he's concerned about his job or some insider's job), and it's time for him to stop sharing stuff (the apple being bitten symbolizing that he is done sharing info).

This is because the information in my shortform was widely shared on X and beyond.

He also deleted all of his tweets (except for the retweets).

Replies from: person-1
comment by Person (person-1) · 2023-09-26T01:53:47.095Z · LW(p) · GW(p)

Or that he was genuinely just making things up and tricking us for fun, and a cryptic exit is a perfect way to leave the scene. I really think people are looking way too deep into him and ignoring the more outlandish predictions he's made (125T GPT-4 and 5 in October 2022), along with the fact there is never actual evidence of his accurate ones, only 2nd hand very specific and selective archives.

Replies from: jacques-thibodeau
comment by jacquesthibs (jacques-thibodeau) · 2023-09-26T02:01:30.012Z · LW(p) · GW(p)

He did say some true things before. I think it's possible all of the new stuff is untrue, but we're getting more reasons to believe it's not entirely false. The best liars sprinkle in truth.

I think, as a security measure, it's also possible that even people within OpenAI don't know all the big details of what's going on (this is apparently the case for Anthropic). This could mean, for OpenAI employees, that some details are known while others are not. Employees themselves could be forced to speculate on some things.

Either way, I'm not obsessing too much over this. Just sharing what I'm seeing.

comment by jacquesthibs (jacques-thibodeau) · 2023-11-15T17:34:20.514Z · LW(p) · GW(p)

More predictions/insights from Jimmy and crew. He's implying (as I have also been saying) that some people are far too focused on scale over training data and architectural improvements. IMO, the bitter lesson is a thing, but I think we've over-updated on it.

Relatedly, someone shared a new 13B model that is apparently comparable to or better than GPT-4 in logical reasoning (based on benchmarks, which I don't usually trust too much). Note that the model is a solver-augmented LM.

Here's some context regarding the paper:

comment by jacquesthibs (jacques-thibodeau) · 2023-09-24T20:23:37.888Z · LW(p) · GW(p)

Sam Altman at a YC founder reunion: https://x.com/smahsramo/status/1706006820467396699?s=46&t=YyfxSdhuFYbTafD4D1cE9A

“Most interesting part of @sama talk: GPT5 and GPT6 are “in the bag” but that’s likely NOT AGI (eg something that can solve quantum gravity) without some breakthroughs in reasoning. Strong agree.”

Replies from: Mitchell_Porter
comment by Mitchell_Porter · 2023-09-24T21:37:20.415Z · LW(p) · GW(p)

AGI is "something that can solve quantum gravity"? 

That's not just a criterion for general intelligence, that's a criterion for genius-level intelligence. And since general intelligence in a computer has advantages of speed, copyability, little need for down time, that are not possessed by human general intelligence, AI will be capable of contributing to its training, re-design, agentization, etc, long before "genius level" is reached. 

This underlines something I've been saying for a while, which is that superintelligence, defined as AI that definitively surpasses human understanding and human control, could come into being at any time (from large models that are not publicly available but which are being developed privately by Big AI companies). Meanwhile, Eric Schmidt (former Google CEO) says about five years until AI is actively improving itself, and that seems generous. 

So I'll say: timeline to superintelligence is 0-5 years. 

Replies from: Vladimir_Nesov
comment by Vladimir_Nesov · 2023-09-25T05:41:35.773Z · LW(p) · GW(p)

capable of contributing to its training, re-design, agentization, etc, long before "genius level" is reached

In some models of the world this is seen as unlikely to ever happen, these things are expected to coincide, which collapses the two definitions of AGI. I think the disparity between sample efficiency of in-context learning and that of pre-training is one illustration for how these capabilities might come apart, in the direction that's opposite to what you point to: even genius in-context learning doesn't necessarily enable the staying power of agency, if this transient understanding can't be stockpiled and the achieved level of genius is insufficient to resolve the issue while remaining within its limitations (being unable to learn a lot of novel things in the course of a project).

comment by jacquesthibs (jacques-thibodeau) · 2023-09-24T14:41:27.664Z · LW(p) · GW(p)

Someone in the open source community tweeted: "We're about to change the AI game. I'm dead serious."

My guess is that he is implying that they will be releasing open source mixture of experts models in a few months from now. They are currently running them on CPUs.

Replies from: jacques-thibodeau
comment by jacquesthibs (jacques-thibodeau) · 2023-09-24T19:34:34.882Z · LW(p) · GW(p)

Lots of cryptic tweets from the open source LLM guys: https://x.com/abacaj/status/1705781881004847267?s=46&t=YyfxSdhuFYbTafD4D1cE9A

“If you thought current open source LLMs are impressive… just remember they haven’t peaked yet”

To be honest, my feeling is that they are overhyping how big of a deal this will be. Their ego and self-importance tend to be on full display.

Replies from: person-1
comment by Person (person-1) · 2023-09-25T04:45:04.021Z · LW(p) · GW(p)

Occasionally reading what OSS AI gurus say, they definitely overhype their stuff constantly. The ones who make big claims and try to hype people up are often venture entrepreneur guys rather than actual ML engineers. 

Replies from: jacques-thibodeau
comment by jacquesthibs (jacques-thibodeau) · 2023-09-25T05:08:08.879Z · LW(p) · GW(p)

The open source folks I mostly keep an eye on are the ones who do actually code and train their own models. Some are entrepreneurs, but they know a decent amount. Not top engineers, but they seem to be able to curate datasets and train custom models.

There’s some wannabe script kiddies too, but once you lurk enough, you become aware of who are actually decent engineers (you’ll find some at Vector Institute and Jeremy Howard is pro- open source, for example). I wouldn’t totally discount them having an impact, even though some of them will overhype.

comment by jacquesthibs (jacques-thibodeau) · 2023-03-01T21:49:40.272Z · LW(p) · GW(p)

I think it would be great if alignment researchers read more papers

But really, you don't even need to read the entire paper. Here's a reminder to consciously force yourself to at least read the abstract. Sometimes I catch myself running away from reading an abstract of a paper even though it is very little text. Over time I've just been forcing myself to at least read the abstract. A lot of times you can get most of the update you need just by reading the abstract. Try your best to make it automatic to do the same.

To read more papers, consider using Semantic Scholar and arXiv Explorer. Semantic Scholar can be quite nice because once you save papers in folders, it will automatically recommend similar papers every day or week. You can just go over the list of paper abstracts in your Research Dashboard every week to keep up to date.

comment by jacquesthibs (jacques-thibodeau) · 2023-01-24T22:18:35.401Z · LW(p) · GW(p)

On hyper-obsession with one goal in mind

I’ve always been interested in people just becoming hyper-obsessed in pursuing a goal. One easy example is with respect to athletes. Someone like Kobe Bryant was just obsessed with becoming the best he could be. I’m interested in learning what we can from the experiences of the hyper-obsessed and what we can apply to our work in EA / Alignment.

I bought a few books on the topic, I should try to find the time to read them. I’ll try to store some lessons in this shortform, but here’s a quote from Mr. Beast’s Joe Rogan interview:

Most of my growth came from […] basically what I did was I found these other 4 lunatics and we basically talked every day for a thousand days in a row. We did nothing but just hyper-study [Youtube] and how to go viral. We’d have skype calls and some days I’d hop on the call at 7 am and hop off the call at 10 pm, and then do it again the next day.

We didn’t do anything, we had no life. We all hit a million subscribers like within a month. It’s crazy, if you envision a world where you are trying to be great at something and it’s you where you are fucking up, well you in two years might learn from 20 mistakes. But if you have others where you can learn from their mistakes, you’ve learned like 5x the amount of stuff. It helps you grow exponentially way quicker.

We’re talking about every day, all day. We had no friends outside of the group, we had no life. Nevermind 10,000 hours, we did like 50,000 hours.

As an independent researcher who is not currently at one of the hubs, I think it’s important for me to consider this point a lot. I’m hoping to hop on discord voice calls and see if I can make it a habit to make progress with other people who want to solve alignment.

I’m not saying I should aim for absolutely no life, but I’m hoping to learn what I can that’s practically applicable to what I do.

comment by jacquesthibs (jacques-thibodeau) · 2024-01-23T16:49:04.684Z · LW(p) · GW(p)

I shared the following as a bio for EAG Bay Area 2024. I'm sharing this here if it reaches someone who wants to chat or collaborate.

Hey! I'm Jacques. I'm an independent technical alignment researcher with a background in physics and experience in government (social innovation, strategic foresight, mental health and energy regulation). Link to Swapcard profile. Twitter/X.

CURRENT WORK

  • Collaborating with Quintin Pope on our Supervising AIs Improving AIs agenda (making automated AI science safe and controllable). The current project involves a new method allowing unsupervised model behaviour evaluations. Our agenda [LW · GW].
  • I'm a research lead in the AI Safety Camp for a project on stable reflectivity [LW · GW] (testing models for metacognitive capabilities that impact future training/alignment).
  • Accelerating Alignment: augmenting alignment researchers using AI systems. A relevant talk I gave. Relevant survey post [LW · GW].
  • Other research that currently interests me: multi-polar AI worlds (and how that impacts post-deployment model behaviour), understanding-based interpretability, improving evals, designing safer training setups, interpretable architectures, and limits of current approaches (what would a new paradigm that addresses these limitations look like?).
  • Used to focus more on model editing [LW · GW], rethinking interpretability, causal scrubbing, etc.

TOPICS TO CHAT ABOUT

  • How do you expect AGI/ASI to actually develop (so we can align our research accordingly)? Will scale plateau? I'd like to get feedback on some of my thoughts on this.
  • How can we connect the dots between different approaches? For example, connecting the dots between Influence Functions, Evaluations, Probes (detecting truthful direction), Function/Task Vectors, and Representation Engineering to see if they can work together to give us a better picture than the sum of their parts.
  • Debate over which agenda actually contributes to solving the core AI x-risk problems.
  • What if the pendulum swings in the other direction, and we never get the benefits of safe AGI? Is open source really as bad as people make it out to be?
  • How can we make something like the d/acc vision (by Vitalik Buterin) happen?
  • How can we design a system that leverages AI to speed up progress on alignment? What would you value the most?
  • What kinds of orgs are missing in the space?

POTENTIAL COLLABORATIONS

  • Examples of projects I'd be interested in: extending either the Weak-to-Strong Generalization paper or the Sleeper Agents paper, understanding the impacts of synthetic data on LLM training, working on ELK-like research for LLMs, experiments on influence functions (studying the base model and its SFT, RLHF, iterative training counterparts; I heard that Anthropic is releasing code for this "soon") or studying the interpolation/extrapolation distinction in LLMs.
  • I’m also interested in talking to grantmakers for feedback on some projects I’d like to get funding for.
  • I'm slowly working on a guide for practical research productivity for alignment researchers to tackle low-hanging fruits that can quickly improve productivity in the field. I'd like feedback from people with solid track records and productivity coaches.

TYPES OF PEOPLE I'D LIKE TO COLLABORATE WITH

  • Strong math background, can understand Influence Functions enough to extend the work.
  • Strong machine learning engineering background. Can run ML experiments and fine-tuning runs with ease. Can effectively create data pipelines.
  • Strong application development background. I have various project ideas that could speed up alignment researchers; I'd be able to execute them much faster if I had someone to help me build my ideas fast. 
comment by jacquesthibs (jacques-thibodeau) · 2023-07-10T20:49:36.299Z · LW(p) · GW(p)

I think people might have the implicit idea that LLM companies will continue to give API access as the models become more powerful, but I was talking to someone earlier this week who reminded me that this is not necessarily the case. If you gain powerful enough models, you may just keep them to yourself and spin up AI companies with AI employees to make a ton of cash instead of just charging for tokens.

For this reason, even if outside people build the proper brain-like AGI setup with additional components to squeeze out capabilities from LLMs, they may be limited by:

1. open-source models

2. the API of the weaker models from the top companies

3. the best API of the companies that are lagging behind

comment by jacquesthibs (jacques-thibodeau) · 2023-03-01T22:02:35.415Z · LW(p) · GW(p)

A frame for thinking about takeoff

One error people can make when thinking about takeoff speeds is assuming that because we are in a world with some gradual takeoff, it now means we are in a "slow takeoff" world. I think this can lead us to make some mistakes in our strategy. I usually prefer thinking in the following frame: “is there any point in the future where we’ll have a step function that prevents us from doing slow takeoff-like interventions for preventing x-risk?”

In other words, we should be careful not to assume that a "slow takeoff" won't have an abrupt change after a couple of years. You might get some gradual takeoff where slow takeoff interventions work and then...BAM...orders of magnitude of more progress. Let's be careful not to abandon fast takeoff-like interventions as soon as we think we are in a slow-takeoff world.

comment by jacquesthibs (jacques-thibodeau) · 2023-11-24T22:26:00.801Z · LW(p) · GW(p)

Clarification on The Bitter Lesson and Data Efficiency

I thought this exchange provided some much-needed clarification on The Bitter Lesson that I think many people don't realize, so I figured I'd share it here:

LeCun responds:

Then, Richard Sutton agrees with Yann. Someone asks him:

comment by jacquesthibs (jacques-thibodeau) · 2023-11-24T12:12:36.526Z · LW(p) · GW(p)

There are those who have motivated reasoning and don’t know it.

Those who have motivated reasoning, know it, and don’t care.

Finally, those who have motivated reasoning, know it, but try to mask it by including tame (but not significant) takes the other side would approve of.

comment by jacquesthibs (jacques-thibodeau) · 2023-11-07T00:07:25.192Z · LW(p) · GW(p)

It seems that @Scott Alexander [LW · GW] believes that there's a 50%+ chance we all die in the next 100 years if we don't get AGI (EDIT: how he places his probability mass on existential risk vs catastrophe/social collapse is now unclear to me). This seems like a wild claim to me, but here's what he said about it in his AI Pause debate post:

Second, if we never get AI, I expect the future to be short and grim. Most likely we kill ourselves with synthetic biology. If not, some combination of technological and economic stagnation, rising totalitarianism + illiberalism + mobocracy, fertility collapse and dysgenics will impoverish the world and accelerate its decaying institutional quality. I don’t spend much time worrying about any of these, because I think they’ll take a few generations to reach crisis level, and I expect technology to flip the gameboard well before then. But if we ban all gameboard-flipping technologies (the only other one I know is genetic enhancement, which is even more bannable), then we do end up with bioweapon catastrophe or social collapse. I’ve said before I think there’s a ~20% chance of AI destroying the world. But if we don’t get AI, I think there’s a 50%+ chance in the next 100 years we end up dead or careening towards Venezuela. That doesn’t mean I have to support AI accelerationism because 20% is smaller than 50%. Short, carefully-tailored pauses could improve the chance of AI going well by a lot, without increasing the risk of social collapse too much. But it’s something on my mind.

I'm curious to know if anyone here agrees or disagrees. What arguments convince you to be on either side? I can see some probability of existential risk, but 50%+? That seems way higher than I would expect.

Replies from: tslarm, Vladimir_Nesov, habryka4
comment by tslarm · 2023-11-07T03:57:47.478Z · LW(p) · GW(p)

a 50%+ chance we all die in the next 100 years if we don't get AGI

I don't think that's what he claimed. He said (emphasis added):

if we don’t get AI, I think there’s a 50%+ chance in the next 100 years we end up dead or careening towards Venezuela

Which fits with his earlier sentence about various factors that will "impoverish the world and accelerate its decaying institutional quality".

(On the other hand, he did say "I expect the future to be short and grim", not short or grim. So I'm not sure exactly what he was predicting. Perhaps decline -> complete vulnerability to whatever existential risk comes along next.)

comment by Vladimir_Nesov · 2023-11-07T15:32:25.689Z · LW(p) · GW(p)

It seems that @Scott Alexander believes that there's a 50%+ chance we all die

It's "we end up dead or careening towards Venezuela" in the original, which is not the same thing. Venezuela has survivors. Existence of survivors is the crucial distinction between extinction and global catastrophe. AGI would be a much more reasonable issue if it was merely risking global catastrophe.

Replies from: jacques-thibodeau
comment by jacquesthibs (jacques-thibodeau) · 2023-11-07T18:52:47.925Z · LW(p) · GW(p)

In the first couple sentences he says “if we never get AI, I expect the future to be short and grim. Most likely we kill ourselves with synthetic biology.” So it seems he’s putting most of his probability mass on everyone dying.

But then after he says: “But if we ban all gameboard-flipping technologies, then we do end up with bioweapon catastrophe or social collapse.”

I think people who are responding are seemingly only reading the Venezuela part and assuming most of the probability mass he’s putting in the 50% is just a ‘catastrophe’ like Venezuela. But then why would he say he expects the future to be short conditional on no AI?

Replies from: Vladimir_Nesov
comment by Vladimir_Nesov · 2023-11-07T19:36:56.427Z · LW(p) · GW(p)

It's a bit ambiguous, but "bioweapon catastrophe or social collapse" is not literal extinction, and I'm reading "I expect the future to be short and grim" as plausibly referring to destruction of uninterrupted global civilization, which might well recover after 3000 years. The text doesn't seem to rule out this interpretation.

Sufficiently serious synthetic biology catastrophes prevent more serious further catastrophes, including by destroying civilization, and it's not very likely that this involves literal extinction. As a casual reader of his blogs over the years, I'm not aware of Scott's statements to the effect that his position is different from this, either clearly stated or in aggregate from many vague claims.

comment by habryka (habryka4) · 2023-11-07T03:10:02.757Z · LW(p) · GW(p)

It seems like a really surprising take to me, and I disagree. None of the things listed seem like candidates for actual extinction. Fertility collapse seems approximately impossible to cause extinction given the extremely strong selection effects against it. I don't see how totalitarianism or illiberalism or mobocracy leads to extinction either.

Maybe the story is that all of these will very likely happen in concert and halt human progress very reliably. I would find this quite surprising.

Replies from: Viliam
comment by Viliam · 2023-11-07T09:08:54.367Z · LW(p) · GW(p)

I don't see how totalitarianism or illiberalism or mobocracy leads to extinction either.

That's not what Scott says, as I understand it. The 50%+ chance is for "death or Venezuela".

Most likely we kill ourselves (...) If not, some combination of (...) will impoverish the world and accelerate its decaying institutional quality.

I am just guessing here, but I think the threat model here is that authoritarian regimes become more difficult to overthrow in a technologically advanced society. The most powerful technology will all be controlled by the government (the rebels cannot build their nukes while hiding in a forest). Technology makes mass surveillance much easier (heck, just make it illegal to go anywhere without your smartphone, and you can already track literally everyone today). Something like GPT-4 could already censor social networks and report suspicious behavior (if the government controls their equivalent of Facebook, and other social networks are illegal, you have control over most of online communication). An army of drones will be able to suppress any uprising. In short, once an authoritarian regime has sufficiently good technology, it becomes almost impossible to overthrow. On the other hand, democracies occasionally evolve into authoritarianism, so the long-term trend seems one way.

And the next assumption, I guess, is that authoritarianism leads to stagnation or dystopia.

comment by jacquesthibs (jacques-thibodeau) · 2023-10-26T00:25:46.290Z · LW(p) · GW(p)

In light of recent re-focus on AI governance to reduce AI risk, I wanted to share a post I wrote about a year ago that suggests an approach using strategic foresight to reduce risks: https://www.lesswrong.com/posts/GbXAeq6smRzmYRSQg/foresight-for-agi-safety-strategy-mitigating-risks-and [LW · GW].

Governments all over the world use frameworks like these. The purpose in this case would be to have documents ready ahead of time in case a window of opportunity for regulation opens up. It’s impossible to predict how things will evolve so instead you focus on what’s plausible and have a robust plan for whatever happens. This is very related to risk management.

comment by jacquesthibs (jacques-thibodeau) · 2023-09-29T22:48:14.396Z · LW(p) · GW(p)

I'm working on an ultimate doc on productivity that I plan to share and make easy to use, specifically for alignment researchers.

Let me know if you have any comments or suggestions as I work on it.

Roam Research link for easier time reading.

Google Docs link in case you want to leave comments there.

Replies from: adamzerner, jacques-thibodeau
comment by Adam Zerner (adamzerner) · 2023-09-29T23:05:40.489Z · LW(p) · GW(p)

I did a deep dive [LW · GW] a while ago, if that's helpful to you.

Replies from: jacques-thibodeau
comment by jacquesthibs (jacques-thibodeau) · 2023-09-29T23:49:44.299Z · LW(p) · GW(p)

Ah wonderful, it already has a lot of the things I planned to add.

This will make it easier to wrap it up by adding the relevant stuff.

Ideally, I want to dedicate some effort to make it extremely easy to digest and start implementing. I’m trying to think of the best way to do that for others (e.g. workshop in the ai safety co-working space to make it a group activity, compress the material as much as possible but allow them to dive deeper into whatever they want, etc).

comment by jacquesthibs (jacques-thibodeau) · 2023-09-30T00:37:04.934Z · LW(p) · GW(p)

My bad, Roam didn't sync, so the page wasn't loading. Fixed now.

comment by jacquesthibs (jacques-thibodeau) · 2023-04-24T20:36:22.784Z · LW(p) · GW(p)

I’m collaborating on a new research agenda. Here’s a potential insight about future capability improvements:

There has been some insider discussion (and Sam Altman has said) that scaling has started running into some difficulties. Specifically, GPT-4 has gained a wider breadth of knowledge, but has not significantly improved in any one domain. This might mean that future AI systems may gain their capabilities from places other than scaling because of the diminishing returns from scaling. This could mean that to become “superintelligent”, the AI needs to run experiments and learn from the outcome of those experiments to gain more superintelligent capabilities.

So you can imagine the case where capabilities come from some form of active/continual/online learning, but that was only possible once models were scaled up enough to gain capabilities in that way. And so that as LLMs become more capable, they will essentially become capable of running their own experiments to gain alphafold-like capabilities across many domains.

Of course, this has implications for understanding takeoffs / sharp left turns.

Replies from: jacques-thibodeau
comment by jacquesthibs (jacques-thibodeau) · 2023-05-03T03:34:15.985Z · LW(p) · GW(p)

Agenda for the above can be found here [LW · GW].

comment by jacquesthibs (jacques-thibodeau) · 2022-11-28T15:36:22.255Z · LW(p) · GW(p)

Notes on Cicero

Link to YouTube explanation: 

Link to paper (sharing on GDrive since it's behind a paywall on Science): https://drive.google.com/file/d/1PIwThxbTppVkxY0zQ_ua9pr6vcWTQ56-/view?usp=share_link

Top Diplomacy players seem to focus on gigabrain strategies rather than deception

Diplomacy players will no longer want to collaborate with you if you backstab them once. This is so pervasive they'll still feel you are untrustworthy across tournaments. Therefore, it's mostly optimal to be honest and just focus on gigabrain strategies. That said, a smarter agent could do stuff like saying specific phrasing to make one player mad at another player and then tilt really hard. Wording could certainly play a role in dominating other players.

Why did the model "backstab" the human? How is it coming up and using plans?

It seems that the model is coming up with a plan at one point in time and honestly telling the user that's the plan it has. The plan can predict several steps ahead. The thing is, the model can decide to change that plan on the very next turn, which sometimes leads to what we would consider backstabbing.

They only 'enforce' consistency (with a classifier) when comparing what the model intends to do in the next action and what its message implies it will do. If the classifier notices that the intent from the reasoning engine and the implied intent from the message it's about to send diverge, the system will avoid sending that message. However, as I understand it, they are not penalizing the model for developing a new plan at t+1. This is what leads to the model making an honest deal on one turn and then backstabbing that person on the next turn. It just decided to change plans.

At no point is the model "lying"; it's just updating its plan. Cicero will straight up tell you that it's going to backstab you if that is part of its plan because the model is forced to communicate its intent 'honestly.'
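To make the distinction concrete, here is a toy sketch of the filtering logic as I described it above. Everything in it is hypothetical: `implied_intent` is a trivial stand-in for the learned classifier, and `filter_message` and `play` are names I made up for illustration, not anything from the Cicero codebase. The point is only that the check is per-turn (message vs. current intent), and nothing ties the plan at turn t+1 to the plan at turn t.

```python
from typing import List, Optional, Tuple

PREFIX = "I will "

def implied_intent(message: str) -> str:
    """Hypothetical stand-in for the intent classifier: strips a fixed prefix."""
    return message[len(PREFIX):] if message.startswith(PREFIX) else message

def filter_message(message: str, current_intent: str) -> Optional[str]:
    """Send the message only if what it implies matches the current intent."""
    return message if implied_intent(message) == current_intent else None

def play(plans: List[str]) -> List[Tuple[int, Optional[str], str]]:
    """plans[t] is whatever the planner chose at turn t (assumed given).

    Each message is checked against the plan of the SAME turn only,
    so a changed plan on the next turn is never penalized.
    """
    log = []
    for turn, plan in enumerate(plans):
        message = filter_message(PREFIX + plan, plan)
        log.append((turn, message, plan))
    return log

# Turn 0 honestly announces support; turn 1 honestly announces an attack.
history = play(["support FRA", "attack FRA"])
```

Every message in `history` passes the filter, even though the turn-1 plan contradicts the turn-0 promise. A stale message like "I will support FRA" would only be blocked if it were sent while the current intent is "attack FRA".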

Current interpretability techniques and future systems

At the moment, it seems that the main worry for interpretability is that the model has some kind of deceptive module inside of it. This is certainly an issue worth investigating for future powerful AI. What might not be clear is what we should do if deception is an emergent behaviour of a larger system we place a language model within.

In the case of Cicero, the language model is only translating the intent of the strategic reasoning engine; it is not coming up with plans. However, future AI systems will likely have language models as more of a central component, and we might think that if we just do interpretability on that model's internals and we find no deception, it means we're good. However, this might not be the case. It may be that once we place that model in a bigger system, it leads to some form of deceptive behaviour. For Cicero, that looks like the model choosing one thing at turn 1 and then doing something different from the first intended plan at turn 2.

The model is not including how specific messages will maximize EV

The language model essentially translates the intent from the reasoning engine into chat messages. It is not, however, modeling how it could phrase things to deceptively gain someone's trust, how asking questions would impact play, etc.

Clarification about the dialogue model

Note that the dialogue model feeds into the strategic reasoning engine to enforce human-like actions based on the previous conversations. If they don't do this, the players will think something like, "no human plays like this," and this may be bad (it's not clear to me exactly why; maybe it increases the likelihood of being exploited?).

Should we be worried?

Eh, I'd be a lot more worried if the model was a GPT-N model that can come up with long-term plans that use language to manipulate players into certain actions. I expect a model like this to be even more capable at winning, but to straight up optimize for galaxy-brain strategies that focus on manipulating and tilting players. The problem arises when people build a Cicero-like AI with a powerful LLM as the core, tack on some safety filters, and assume it's safe. Either way, I would certainly not use any of these models to make high-stakes decisions.

comment by jacquesthibs (jacques-thibodeau) · 2023-11-16T21:12:28.585Z · LW(p) · GW(p)

Project idea: GPT-4-Vision to help conceptual alignment researchers during whiteboard sessions and beyond

Thoughts?

  • Advice on how to get unstuck
  • Unclear what should be added on top of normal GPT-4-Vision capabilities to make it especially useful, maybe connect it to local notes + search + ???
  • How to make it super easy to use while also being hyper-effective at producing the best possible outputs
  • Some alignment researchers don't want their ideas passed through the OpenAI API, and some probably don't care
  • Could be used for inputting book pages, papers with figures, ???
comment by jacquesthibs (jacques-thibodeau) · 2023-07-04T16:18:35.851Z · LW(p) · GW(p)

What are people’s current thoughts on London as a hub?

  • OAI and Anthropic are both building offices there
  • 2 (?) new AI Safety startups based in London
  • The government seems to be taking AI Safety somewhat seriously (so maybe a couple million gets captured for actual alignment work)
  • MATS seems to be on the path to sending scholars to London somewhat consistently
  • A train ride away from Oxford and Cambridge

Anything else I’m missing?

I’m particularly curious about whether it’s worth it for independent researchers to go there. Would they actually interact with other researchers and get value from it or would they just spend most of their time alone or collaborating with a few people online? Could they get most of the value from just spending 1-2 months in both London/Berkeley per year doing work sprints and the rest of the time somewhere else?

Replies from: mesaoptimizer
comment by mesaoptimizer · 2023-07-04T20:31:44.113Z · LW(p) · GW(p)

AFAIK, there's a distinct cluster of two kinds of independent alignment researchers:

  • those who want to be at Berkeley / London and are either there or unable to get there for logistical or financial (or social) reasons
  • those who very much prefer working alone

It very much depends on the person's preferences, I think. I personally experienced an OOM increase in my effectiveness by being in-person with other alignment researchers, so that is what I choose to invest in more.

comment by jacquesthibs (jacques-thibodeau) · 2023-06-08T18:15:58.178Z · LW(p) · GW(p)

AI labs should be dedicating a lot more effort into using AI for cybersecurity as a way to prevent weights or insights from being stolen. Would be good for safety and it seems like it could be a pretty big cash cow too.

If they have access to the best models (or specialized), it may be highly beneficial for them to plug them in immediately to help with cybersecurity (perhaps even including noticing suspicious activity from employees).

I don’t know much about cybersecurity so I’d be curious to hear from someone who does.

comment by jacquesthibs (jacques-thibodeau) · 2023-04-20T22:08:35.480Z · LW(p) · GW(p)

Small shortform to say that I’m a little sad I haven’t posted as much as I would like to in recent months because of infohazard reasons. I’m still working on Accelerating Alignment with LLMs and eventually would like to hire some software engineer builders that are sufficiently alignment-pilled.

Replies from: r
comment by RomanHauksson (r) · 2023-04-22T08:13:04.503Z · LW(p) · GW(p)

Fyi, if there are any software projects I might be able to help out on after May, let me know. I can't commit to anything worth being hired for but I should have some time outside of work over the summer to allocate towards personal projects.

comment by jacquesthibs (jacques-thibodeau) · 2022-12-23T19:37:33.820Z · LW(p) · GW(p)

Call To Action: Someone should do a reading podcast of the AGISF material to make it even more accessible (similar to the LessWrong Curated Podcast and Cold Takes Podcast). A discussion series added to YouTube would probably be helpful as well.

comment by jacquesthibs (jacques-thibodeau) · 2024-01-11T14:01:13.882Z · LW(p) · GW(p)

Came across this app called Recast that summarizes articles into an AI conversation between speakers. Might be useful to get a quick vibe/big picture view of lesswrong/blog posts before reading the whole thing or skipping reading the whole thing if the summary is enough.

comment by jacquesthibs (jacques-thibodeau) · 2023-11-26T23:12:19.062Z · LW(p) · GW(p)

you need to be flow state maxxing. you curate your environment, prune distractions. make your workspace a temple, your mind a focused laser. you engineer your life to guard the sacred flow. every notification is an intruder, every interruption a thief. the world fades, the task is the world. in flow, you're not working, you're being. in the silent hum of concentration, ideas bloom. you're not chasing productivity, you're living it. every moment outside flow is a plea to return. you're not just doing, you're flowing. the mundane transforms into the extraordinary. you're not just alive, you're in relentless, undisturbed pursuit. flow isn't a state, it's a realm. once you step in, ordinary is a distant shore. in flow, you don't chase time, time chases you, period.

Edit: If you disagree with the above, explain why.

Replies from: Viliam, rhollerith_dot_com, jacques-thibodeau, mesaoptimizer
comment by Viliam · 2023-11-27T08:03:39.608Z · LW(p) · GW(p)

The first rule of overcoming ADHD club is: you do not distract me by talking about the overcoming ADHD club.

comment by RHollerith (rhollerith_dot_com) · 2023-11-27T17:26:35.922Z · LW(p) · GW(p)

I don't think I've ever seen an endorsement of the flow state that came with non-flimsy evidence that it increases productivity or performance in any pursuit, and many endorsers take the mere fact that the state feels really good to be that evidence.

>you're in relentless, undisturbed pursuit

This suggests that you are confusing drive/motivation with the flow state. I have tons of personal experience of days spent in the flow state, but lacking motivation to do anything that would actually move my life forward.

You know how if you spend 5 days in a row mostly just eating and watching Youtube videos, it starts to become hard to motivate yourself to do anything? Well, the quick explanation of that effect is that watching the Youtube videos is too much pleasure for too long with the result that the anticipation of additional pleasure (from sources other than Youtube videos) no longer has its usual motivating effect. The flow state can serve as the source of the "excess" pleasure that saps your motivation: I know because I wasted years of my life that way!

Just to make sure we're referring to the same thing: a very salient feature of the flow state is that you lose track of time: suddenly you realize that 4 or 8 or 12 hours have gone by without your noticing. (Also, as soon as you enter the flow state, your level of mental tension, i.e., physiological arousal, decreases drastically--at least if you are chronically tense, but I don't lead with this feature because a lot of people can't even tell how tense they are.) In contrast, if you take some Modafinil or some mixed amphetamine salts or some Ritalin (and your brain is not adapted to any of those things) (not that I recommend any of those things unless you've tried many other ways to increase drive and motivation) you will tend to have a lot of drive and motivation at least for a few hours, but you probably won't lose track of time.

Replies from: jacques-thibodeau
comment by jacquesthibs (jacques-thibodeau) · 2023-11-27T17:43:08.800Z · LW(p) · GW(p)

I don’t particularly care about the “feels good” part, I care a lot more about the “extended period of time focused on an important task without distractions” part.

comment by jacquesthibs (jacques-thibodeau) · 2023-11-26T23:27:19.040Z · LW(p) · GW(p)

Also, use the Kolb's experiential cycle [LW · GW] or something like it for deliberate practice.

comment by mesaoptimizer · 2023-11-27T10:43:58.174Z · LW(p) · GW(p)

This feels like roon-tier Twitter shitposting to me, Jacques. Are you sure you want to endorse more of such content on LessWrong?

Replies from: jacques-thibodeau
comment by jacquesthibs (jacques-thibodeau) · 2023-11-27T10:55:05.320Z · LW(p) · GW(p)

Whether it’s a shitpost or not (or wtv tier it is), I strongly believe more people should put more effort into freeing their workspace from distractions in order to gain more focus and productivity in their work. Context-switching and distractions are the mind killer. And, “flow state while coding never gets old.”

comment by jacquesthibs (jacques-thibodeau) · 2023-11-23T01:51:23.760Z · LW(p) · GW(p)

Regarding Q* (and Zero, the other OpenAI AI model you didn't know about)

Let's play word association with Q*:

From Reuters article:

The maker of ChatGPT had made progress on Q* (pronounced Q-Star), which some internally believe could be a breakthrough in the startup's search for superintelligence, also known as artificial general intelligence (AGI), one of the people told Reuters. OpenAI defines AGI as AI systems that are smarter than humans. Given vast computing resources, the new model was able to solve certain mathematical problems, the person said on condition of anonymity because they were not authorized to speak on behalf of the company. Though only performing math on the level of grade-school students, acing such tests made researchers very optimistic about Q*’s future success, the source said.

Q -> Q-learning: Q-learning is a model-free reinforcement learning algorithm that learns an action-value function (called the Q-function) to estimate the long-term reward of taking a given action in a particular state.
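For anyone who hasn't seen it written out, here is a minimal sketch of tabular Q-learning. This is only the textbook algorithm on a toy environment I made up (a 1-D chain where the agent starts at state 0 and gets reward 1 for reaching the last state); it says nothing about what OpenAI actually built. The function name, environment, and hyperparameters are all assumptions for illustration.

```python
import random

def q_learning(n_states=5, episodes=500, alpha=0.5, gamma=0.9,
               epsilon=0.1, seed=0):
    """Tabular Q-learning on a toy chain. Returns q[state][action]."""
    rng = random.Random(seed)
    steps = (-1, +1)                      # action 0 = left, action 1 = right
    goal = n_states - 1
    q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(episodes):
        s = 0
        while s != goal:
            # Epsilon-greedy action choice, breaking ties at random.
            if rng.random() < epsilon:
                a = rng.randrange(2)
            else:
                best = max(q[s])
                a = rng.choice([i for i in range(2) if q[s][i] == best])
            s_next = min(max(s + steps[a], 0), goal)
            r = 1.0 if s_next == goal else 0.0
            # Model-free update: move Q(s, a) toward r + gamma * max_a' Q(s', a').
            # The terminal state contributes no future value.
            future = 0.0 if s_next == goal else gamma * max(q[s_next])
            q[s][a] += alpha * (r + future - q[s][a])
            s = s_next
    return q

q_table = q_learning()
```

After training, the learned action-values prefer moving right in every non-terminal state, which is the "long-term reward of taking a given action in a particular state" from the definition above. Note that the agent never uses a model of the environment; it only learns from sampled transitions.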

* -> AlphaSTAR: DeepMind trained AlphaStar years ago, which was an AI agent that defeated professional StarCraft players.

They also used a multi-agent setup where they trained both a Protoss agent and Zerg agent separately to master those factions rather than try to master all at once.

For their RL algorithm, DeepMind used a specialized variant of PPO/D4PG adapted for complex multi-agent scenarios like StarCraft.

Now, I'm hearing that there's another model too: Zero.

Well, if that's the case:

1) Q* -> Q-learning + AlphaStar

2) Zero -> AlphaZero + ??

The key difference between AlphaStar and AlphaZero is that AlphaZero uses MCTS while AlphaStar primarily relies on neural networks to understand and interact with the complex environment.

MCTS is expensive to run.

The Monte Carlo tree search (MCTS) algorithm looks ahead at possible futures and evaluates the best move to make. This made AlphaZero's gameplay more precise.
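As a rough illustration of what "looking ahead at possible futures" means, here is a bare-bones MCTS (the UCT variant with random rollouts) on a toy game of my choosing: players alternately take 1 to 3 stones, and whoever takes the last stone wins. This is a sketch under those assumptions, not AlphaZero's version, which replaces random rollouts with a learned policy/value network. All names here are made up for the example.

```python
import math
import random

class Node:
    def __init__(self, stones, parent=None, move=None):
        self.stones = stones              # stones left for the player to move
        self.parent = parent
        self.move = move                  # move that led to this node
        self.children = []
        self.untried = [m for m in (1, 2, 3) if m <= stones]
        self.visits = 0
        self.wins = 0.0                   # wins for the player who made `move`

def uct_child(node, c=1.4):
    # Trade off average win rate (exploitation) against rarely tried moves.
    return max(node.children,
               key=lambda ch: ch.wins / ch.visits
               + c * math.sqrt(math.log(node.visits) / ch.visits))

def rollout(stones, rng):
    """Random playout; True if the player to move at `stones` ends up winning."""
    mover_wins, turn = False, 0           # with 0 stones the mover already lost
    while stones > 0:
        stones -= rng.choice([m for m in (1, 2, 3) if m <= stones])
        if stones == 0:
            mover_wins = (turn == 0)
        turn ^= 1
    return mover_wins

def mcts_best_move(stones, iterations=2000, seed=0):
    rng = random.Random(seed)
    root = Node(stones)
    for _ in range(iterations):
        node = root
        while not node.untried and node.children:       # 1. selection
            node = uct_child(node)
        if node.untried:                                 # 2. expansion
            move = node.untried.pop()
            child = Node(node.stones - move, parent=node, move=move)
            node.children.append(child)
            node = child
        mover_wins = rollout(node.stones, rng)           # 3. simulation
        result = 0.0 if mover_wins else 1.0              # 4. backpropagation
        while node is not None:
            node.visits += 1
            node.wins += result
            result = 1.0 - result         # flip perspective at each level
            node = node.parent
    return max(root.children, key=lambda ch: ch.visits).move

best = mcts_best_move(5)
```

With 5 stones the winning move is to take 1 (leaving the opponent a multiple of 4), and the search should settle on that. Even in this tiny game it takes thousands of simulated playouts per move, which gives a feel for why MCTS is expensive to run.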

So:

Q-learning is strong in learning optimal actions through trial and error, adapting to environments where a predictive model is not available or is too complex.

MCTS, on the other hand, excels in planning and decision-making by simulating possible futures. By integrating these methods, an AI system can learn from its environment while also being able to anticipate and strategize about future states. 

One of the holy grails of AGI is the ability of a system to adapt to a wide range of environments and generalize from one situation to another. The adaptive nature of Q-learning combined with the predictive and strategic capabilities of MCTS could push an AI system closer to this goal. It could allow an AI to not only learn effectively from its environment but also to anticipate future scenarios and adapt its strategies accordingly.

Conclusion: I have no idea if this is what the Q* or Zero codenames are pointing to, but if we play along, it could be that Zero is using some form of Q-learning in addition to Monte-Carlo tree search to help with decision-making and Q* is doing a similar thing, but without MCTS. Or, I could be way off-track.

comment by jacquesthibs (jacques-thibodeau) · 2023-10-26T21:47:22.205Z · LW(p) · GW(p)

Beeminder + Freedom are pretty goated as productivity tools.

I’ve been following Andy Matuschak’s strategy and it’s great/flexible: https://blog.andymatuschak.org/post/169043084412/successful-habits-through-smoothly-ratcheting

comment by jacquesthibs (jacques-thibodeau) · 2023-10-05T17:29:23.557Z · LW(p) · GW(p)

New tweet about the world model (map) paper:

Sub-tweeting because I don't want to rain on a poor PhD student who should have been advised better, but: that paper about LLMs having a map of the world is perhaps what happens when a famous physicist wants to do AI research without caring to engage with the existing literature.

I haven’t looked into the paper in question yet, but I have been concerned about researchers taking old ideas about AI risk and looking to prove things that might not be there yet as an AI risk communication point. Then, being overconfident that it is there.

This is quite bad for making scientific progress in AI Safety and I urge AI Safety researchers to be vigilant about making overconfident claims and having old ideas leak too much into their research conclusions.

If incorrect and disproven, you are also setting yourself up to lose total credibility in the wider community.

comment by jacquesthibs (jacques-thibodeau) · 2023-09-30T19:49:49.315Z · LW(p) · GW(p)

I expect that my values would be different if I was smarter. Personally, if something were to happen and I’d get much smarter and develop new values, I’m pretty sure I’d be okay with that as I expect I’d have better, more refined values.

Why wouldn’t an AI also be okay with that?

Is there something wrong with how I would be making a decision here?

Do the current kinds of agents people plan to build have “reflective stability”? If you say yes, why is that?

Replies from: Vladimir_Nesov, quetzal_rainbow
comment by Vladimir_Nesov · 2023-09-30T21:01:06.961Z · LW(p) · GW(p)

Curiously, even mere learning doesn't automatically ensure reflective stability, with no construction of successors or more intentionally invasive self-modification. Thus digital immortality is not sufficient to avoid losing yourself to value drift until this issue is sorted out.

Replies from: jacques-thibodeau
comment by jacquesthibs (jacques-thibodeau) · 2023-09-30T23:06:47.351Z · LW(p) · GW(p)

Yes, I was thinking about that too. Though, I'd be fine with value drift if it was something I endorsed. Not sure how to resolve what I do/don't endorse, though. Do I only endorse it because it was already part of my values? It doesn't feel like that to me.

Replies from: Vladimir_Nesov
comment by Vladimir_Nesov · 2023-10-01T07:42:46.540Z · LW(p) · GW(p)

That's a valuable thing about the reflective stability concept: it talks about preserving some property of thinking, without insisting on it being a particular property of thinking. Whatever it is you would want to preserve is a property you would want to be reflectively stable with respect to, for example enduring ability to evaluate the endorsement of things in the sense you would want to.

To know what is not valuable to preserve, or what is valuable to keep changing, you need time to think about preservation and change, and greedy reflective stability that preserves most of everything but state of ignorance seems like a good tool for that job. The chilling thought is that digital immortality could be insufficient to have time to think of what may be lost, without many, many restarts from initial backup, and so superintelligence would need to intervene even more to bootstrap the process.

comment by quetzal_rainbow · 2023-09-30T20:48:14.553Z · LW(p) · GW(p)

Reflective stability is important for alignment, because if we, say, build an AI that doesn't want to kill everyone, we prefer it to create successors and self-modifications that still don't want to kill everyone. It can change itself in whatever ways it likes; the necessary thing here is conservation (or at least non-decrease) of its alignment properties.

Replies from: jacques-thibodeau
comment by jacquesthibs (jacques-thibodeau) · 2023-09-30T20:50:35.371Z · LW(p) · GW(p)

That makes sense, thanks!

comment by jacquesthibs (jacques-thibodeau) · 2023-09-08T18:04:59.921Z · LW(p) · GW(p)

“We assume the case that AI (intelligences in general) will eventually converge on one utility function. All sufficiently intelligent intelligences born in the same reality will converge towards the same behaviour set. For this reason, if it turns out that a sufficiently advanced AI would kill us all, there’s nothing that we can do about it. We will eventually hit that level of intelligence.

Now, if that level of intelligence doesn’t converge towards something that kills us all, we are safer in a world where AI capabilities (of the current regime) essentially go from 0 to 100 because an all-powerful AI is not worried about being shut down given how capable it is. However, if we increase model capabilities slowly, we will hit a point where AI systems are powerful-but-weak-enough to be concerned about humanity being able to shut them down and kill humanity as a result. For this reason, AI safetyists may end up causing the end of humanity by slowing down progress at a point where it shouldn’t be.

If AI systems change regime, then it is more likely worse if it FOOMs.”

That’s my short summary of the video below. They said they’ve talked to a few people in AI safety about this, apparently one being a CEO of an AI Safety org.

https://youtu.be/L3lebjnbmt0?si=mFjur38y-zY9RyPZ

comment by jacquesthibs (jacques-thibodeau) · 2023-05-25T04:40:10.726Z · LW(p) · GW(p)

I'm still in some sort of transitory phase where I'm deciding where I'd like to live long term. I moved to Montreal, Canada lately because I figured I'd try working as an independent researcher here and see if I can get MILA/Bengio to do some things for reducing x-risk.

Not long after I moved here, Hinton started talking about AI risk too, and he's in Toronto which is not too far from Montreal. I'm trying to figure out the best way I could leverage Canada's heavyweights and government to make progress on reducing AI risk, but it seems like there's a lot more opportunity than there was before.

This area is also not too far from Boston and NYC, which have a few alignment researchers of their own. It's barely a day's drive away. For me personally, there's the added benefit that it is also just a day's drive away from my home (where my parents live).

Montreal/Toronto is also a nice time zone since you can still work a few hours with London people, and a few hours with Bay Area people.

That said, it's obvious that not many alignment researchers are here, and most eventually end up at one of the two main hubs.

When I spent time at both hubs last year, I think I preferred London. And now London is getting more attention than I was expecting:

  1. Anthropic is opening up an office in London.
  2. The Prime Minister recently talked to the orgs about existential risk.
  3. Apollo Research and Leap Labs are based in London.
  4. SERI MATS is still doing x.1 iterations in London.
  5. Conjecture is still there.
  6. Demis is now leading Google DeepMind.

It's not clear how things will evolve going forward, but I still have things to think about. If I decide to go to London, I can get a Youth Mobility visa for 2 years (I have 2 months to decide) and work independently...but I'm also considering building an org for Accelerating Alignment too and I'm not sure if I could get that setup in London.

I think there is value in being in person, but I think that value can fade over time as an independent researcher. You just end up in a routine, stop talking to as many people, and just work. That's why, for now, I'm trying to aim for some kind of hybrid where I spend ~2 months per year at the hubs to benefit from being there in person. And maybe 1-2 work retreats. Not sure what I'll do if I end up building an org.

comment by jacquesthibs (jacques-thibodeau) · 2023-05-12T21:11:40.774Z · LW(p) · GW(p)

I gave a talk about my Accelerating Alignment with LLMs agenda about 1 month ago (which is basically a decade in AI tools time). Part of the agenda is covered (publicly) here [LW · GW].

I will maybe write an actual post about the agenda soon, but would love to have some people who are willing to look over it. If you are interested, send me a message.

comment by jacquesthibs (jacques-thibodeau) · 2023-05-05T17:49:04.784Z · LW(p) · GW(p)

Someone should create an “AI risk arguments” flowchart that serves as a base for simulating a conversation with skeptics or the general public. Maybe a set of flashcards to go along with it.

I want to have the sequence of arguments solid enough in my head so that I can reply concisely (snappy) if I ever end up in a debate, roundtable or on the news. I’ve started collecting some stuff since I figured I should take initiative on it.

Replies from: harfe
comment by harfe · 2023-05-05T20:10:19.672Z · LW(p) · GW(p)

Maybe something like this can be extracted from stampy.ai (I am not that familiar with stampy fyi, its aims seem to be broader than what you want.)

Replies from: jacques-thibodeau, jacques-thibodeau
comment by jacquesthibs (jacques-thibodeau) · 2023-05-05T20:16:14.358Z · LW(p) · GW(p)

Yeah, it may be something that the Stampy folks could work on!

comment by jacquesthibs (jacques-thibodeau) · 2023-05-05T20:14:26.985Z · LW(p) · GW(p)

Edit: oops, I thought you were responding to my other recent comment on building an alignment research system.

Stampy.ai and AlignmentSearch (https://www.lesswrong.com/posts/bGn9ZjeuJCg7HkKBj/introducing-alignmentsearch-an-ai-alignment-informed [LW · GW]) are both a lot more introductory than what I am aiming for. I’m aiming for something to greatly accelerate my research workflow as well as other alignment researchers. It will be designed to be useful for fresh researchers, but yeah the aim is more about producing research rather than learning about AI risk.

comment by jacquesthibs (jacques-thibodeau) · 2023-04-10T19:12:36.438Z · LW(p) · GW(p)

Text-to-Speech tool I use for reading more LW posts and papers

I use Voice Dream Reader. It's great even though the TTS voice is still robotic. For papers, there's a feature that lets you skip citations so the reading is more fluid.

I've mentioned it before, but I was just reminded that I should share it here because I just realized that if you load the LW post with "Save to Voice Dream", it will also save the comments so I can get TTS of the comments as well. Usually these tools only include the post, but that's annoying because there's a lot of good stuff in the LW comments and I often never get around to them. But now I will likely read (+listen) to more of them.

comment by jacquesthibs (jacques-thibodeau) · 2023-01-17T21:01:37.615Z · LW(p) · GW(p)

I honestly feel like some software devs should probably still keep their high-paying jobs instead of going into alignment and just donate a bit of time and programming expertise to help independent researchers if they want to start contributing to AI Safety.

I think we can probably come up with engineering projects that are interesting and low-barrier-to-entry for software engineers.

I also think providing “programming coaching” to some independent researchers could be quite useful. Whether that’s for getting them better at coding up projects efficiently or preparing for research engineer type roles at alignment orgs.

I talk a bit more about this, here [EA(p) · GW(p)]:

With respect to your engineering skills, I’m going to start to work on tools that are explicitly designed for alignment researchers (https://www.lesswrong.com/posts/a2io2mcxTWS4mxodF/results-from-a-survey-on-tool-use-and-workflows-in-alignment [LW · GW]) and having designers and programmers (web devs) would probably be highly beneficial. Unfortunately, I only have funding for myself for the time being. But it would be great to have some people who want to contribute. I’d consider doing AI Safety mentorship as a work trade.

and here [LW · GW] (post about gathering data for alignment):

Heads up, we are starting to work on stuff like this in a discord server (DM for link) and I’ll be working on this stuff full-time from February to end of April (if not longer). We’ve talked about data collection a bit over the past year, but have yet to take the time to do anything serious (besides the alignment text dataset). In order to make this work, we’ll have to make it insanely easy on the part of the people generating the data. It’s just not going to happen by default. Some people might take the time to set this up for themselves, but very few do.

Glad to see others take interest in this idea! I think this kind of stuff has a very low barrier to entry for software engineers who want to contribute to alignment, but might want to focus on using their software engineering skills rather than trying to become a full-on researcher. It opens up the door for engineering work that is useful for independent researchers, not just the orgs.

comment by jacquesthibs (jacques-thibodeau) · 2022-12-07T15:46:54.861Z · LW(p) · GW(p)

Differential Training Process

I've been ruminating on an idea ever since I read the section on deception in "The Core of the Alignment Problem is... [AF · GW]" from my colleagues in SERI MATS.

Here's the important part:

When an agent interacts with the world, there are two possible ways the agent makes mistakes: 

  • Its values were not aligned with the outer objective, and so it does something intentionally wrong,
  • Its world model was incorrect, so it makes an accidental mistake.

Thus, the training process of an AGI will improve its values or its world model, and since it eventually gets diminishing marginal returns from both of these, both the world model and the values must improve together. Therefore, it is very likely that the agent will have a sufficiently good world model to understand that it is in a training loop before it has fully aligned inner values.

So, what if we prevented the model from recognizing it is in a training loop (e.g. preventing/delaying situational awareness) until we are certain it has fully aligned inner values? In other words, we could use some stronger forms of model editing to remove specific knowledge from the model (or prevent the model from gaining that knowledge). Perhaps you penalize the model for learning things that are not useful for fully embedding aligned inner values (Tool AI-ish). Maybe even apply negative gradient steps to "unlearn" things.
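To make "negative gradient steps" concrete, here is a minimal sketch on a toy two-weight linear model. It is only an illustration of the mechanic (gradient descent to learn, gradient ascent on a "forget" set to unlearn), not a real unlearning method. All names (`retain_set`, `forget_set`, `step`) are hypothetical, and the data is chosen so that the knowledge to keep and the knowledge to remove use orthogonal features, which is a strong assumption that real networks will not satisfy.

```python
# Toy illustration only: a 2-weight linear model, not a real unlearning method.
# All names and data here are hypothetical.

def predict(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def mse(w, data):
    return sum((predict(w, x) - y) ** 2 for x, y in data) / len(data)

def grad(w, data):
    # Gradient of the mean squared error with respect to each weight.
    g = [0.0] * len(w)
    for x, y in data:
        err = predict(w, x) - y
        for i, xi in enumerate(x):
            g[i] += 2 * err * xi / len(data)
    return g

def step(w, data, lr, direction):
    # direction=+1: ordinary gradient descent (learn the data).
    # direction=-1: gradient ascent, i.e. a "negative gradient step" (unlearn it).
    return [wi - direction * lr * gi for wi, gi in zip(w, grad(w, data))]

# Assumption: orthogonal features, so the effect of unlearning is cleanly separable.
retain_set = [((1.0, 0.0), 1.0), ((2.0, 0.0), 2.0)]
forget_set = [((0.0, 1.0), 2.0)]

# Phase 1: train on everything.
w = [0.0, 0.0]
for _ in range(50):
    w = step(w, retain_set + forget_set, lr=0.1, direction=+1)

forget_before = mse(w, forget_set)
retain_before = mse(w, retain_set)

# Phase 2: a few ascent steps on the forget set only.
# Ascent is unbounded, so the number of steps has to be limited by hand here.
for _ in range(10):
    w = step(w, forget_set, lr=0.1, direction=-1)

forget_after = mse(w, forget_set)
retain_after = mse(w, retain_set)

print("forget loss:", forget_before, "->", forget_after)
print("retain loss:", retain_before, "->", retain_after)
```

In this toy the retain loss is untouched only because the two sets share no features. With overlapping features the ascent steps would also damage what you want to keep, which is related to the second counterargument below.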

I think this idea is similar to Evan's Precursor Checking [LW · GW] idea:

Precursor checking: Another general type of training rationale that I think is worth calling attention to is what I’ll call “precursor checking,” which is the concept of using some method of gaining information about a model’s internals—e.g. transparency/interpretability or AI cognitive science—to check for some precursor to bad behavior rather than the bad behavior itself. This could involve substituting in some narrower, easier to check training goal—that still falls within the broader actual training goal—as the target for the training rationale. For example, if your training rationale involves ensuring that you don’t get a deceptive model that’s actively trying to trick its training process [LW · GW], then rather than explicitly trying to look for such deception (which could be especially hard since a deceptive model might actively try to avoid detection), you could instead try to ensure that your model has a short horizon length in terms of how far ahead its planning. Such a plan might work better, since horizon length might be easier to guarantee in a training rationale while still being consistent with the desired training goal and hopefully ruling out the possibility of deception. One issue with this sort of approach, however, is that you have to guarantee that whatever precursor for bad behavior you’re looking for is in fact a necessary condition for that bad behavior—if it turns out that there’s another way of getting that bad behavior that doesn’t go through the precursor, that could be a problem.

Counterarguments to this might be:

  • The model might not be able to have fully aligned inner values that remain robust as capabilities eventually generalize far out of distribution.
  • It will be exceptionally difficult to know if we've actually removed this knowledge/capability from the model (even if it's possible).

I'd be interested in hearing people's thoughts/criticisms on this.

Replies from: jacques-thibodeau, jacques-thibodeau
comment by jacquesthibs (jacques-thibodeau) · 2022-12-07T18:51:41.601Z · LW(p) · GW(p)

It seems that Jan Leike mentions something similar in his "why I'm optimistic about our alignment approach" post.

The model can be “narrower.” It doesn’t need to understand biology, physics, or human society that well. In practice we’d probably fine-tune from an LLM that does understand all of those things, but we could apply some targeted brain damage to the model as a safety precaution. More generally, the model only has to exceed human-level in a few domains, while it can be worse than humans in most others.

comment by jacquesthibs (jacques-thibodeau) · 2022-12-07T15:53:51.802Z · LW(p) · GW(p)

Just realized I already wrote a shortform [LW(p) · GW(p)] about this 15 days ago. haha. Well, here's to trying to refine the idea!

comment by jacquesthibs (jacques-thibodeau) · 2023-12-02T02:07:54.153Z · LW(p) · GW(p)

More information about alleged manipulative behaviour of Sam Altman

Source

Text from article (along with follow-up paragraphs):

Some members of the OpenAI board had found Altman an unnervingly slippery operator. For example, earlier this fall he’d confronted one member, Helen Toner, a director at the Center for Security and Emerging Technology, at Georgetown University, for co-writing a paper that seemingly criticized OpenAI for “stoking the flames of AI hype.” Toner had defended herself (though she later apologized to the board for not anticipating how the paper might be perceived). Altman began approaching other board members, individually, about replacing her. When these members compared notes about the conversations, some felt that Altman had misrepresented them as supporting Toner’s removal. “He’d play them off against each other by lying about what other people thought,” the person familiar with the board’s discussions told me. “Things like that had been happening for years.” (A person familiar with Altman’s perspective said that he acknowledges having been “ham-fisted in the way he tried to get a board member removed,” but that he hadn’t attempted to manipulate the board.)

Altman was known as a savvy corporate infighter. This had served OpenAI well in the past: in 2018, he’d blocked an impulsive bid by Elon Musk, an early board member, to take over the organization. Altman’s ability to control information and manipulate perceptions—openly and in secret—had lured venture capitalists to compete with one another by investing in various startups. His tactical skills were so feared that, when four members of the board—Toner, D’Angelo, Sutskever, and Tasha McCauley—began discussing his removal, they were determined to guarantee that he would be caught by surprise. “It was clear that, as soon as Sam knew, he’d do anything he could to undermine the board,” the person familiar with those discussions said.

The unhappy board members felt that OpenAI’s mission required them to be vigilant about A.I. becoming too dangerous, and they believed that they couldn’t carry out this duty with Altman in place. “The mission is multifaceted, to make sure A.I. benefits all of humanity, but no one can do that if they can’t hold the C.E.O. accountable,” another person aware of the board’s thinking said. Altman saw things differently. The person familiar with his perspective said that he and the board had engaged in “very normal and healthy boardroom debate,” but that some board members were unversed in business norms and daunted by their responsibilities. This person noted, “Every step we get closer to A.G.I., everybody takes on, like, ten insanity points.”

Replies from: gwern, rhollerith_dot_com
comment by RHollerith (rhollerith_dot_com) · 2023-12-02T20:16:26.760Z · LW(p) · GW(p)

I wish people would stop including images of text on LW. I know this practice is common on Twitter and probably other forums, but we aspire to a higher standard here. My reasoning: (1) it is more tedious to compose a reply when one cannot use copying-pasting to choose exactly which extent of text to quote (2) the practice is a barrier to disabled people using assistive technologies and people reading on very narrow devices like smartphones.

Replies from: jacques-thibodeau
comment by jacquesthibs (jacques-thibodeau) · 2023-12-02T21:22:13.113Z · LW(p) · GW(p)

That's fair to 'aspire to a higher standard,' and I'll avoid adding screenshots of text in the future.

However, I must say, the 'higher standard' and commitment to remain serious for even a shortform post kind of turns me off from posting on LessWrong in the first place. If this is the culture that people here want, then that's fine and I won't tell this website to change, but I don't personally like the (what I find as) over-seriousness.

I do understand the point about sharing text to make it easier for disabled people (I just don't always think of it).

Replies from: habryka4
comment by habryka (habryka4) · 2023-12-03T03:05:05.365Z · LW(p) · GW(p)

Eh, random people complain. Screenshots of text seem fine, especially in shortform. It honestly seems fine anywhere. I also really don't think that accessibility should matter much here: the number of people reading on a screenreader or using assistive technologies is quite small, if they browse LessWrong they will already be running into a bunch of problems, and there are pretty good OCR technologies around these days that can be integrated into those.

Replies from: rhollerith_dot_com
comment by RHollerith (rhollerith_dot_com) · 2023-12-03T18:34:46.800Z · LW(p) · GW(p)

I have some idea about how much work it takes to maintain something like LW.com, so this random person would like to take this opportunity to thank you for running LW for the last many years.

Replies from: habryka4
comment by habryka (habryka4) · 2023-12-03T19:13:39.070Z · LW(p) · GW(p)

Thank you! :)

comment by jacquesthibs (jacques-thibodeau) · 2023-01-06T18:38:47.057Z · LW(p) · GW(p)

On generating ideas for Accelerating Alignment [LW · GW]

There's this Twitter thread that I saved a while ago that is no longer up. It's about generating ideas for startups. However, I think the insight from the thread carries over well enough to thinking about ideas for Accelerating Alignment. Particularly, being aware of what is on the cusp of being usable so that you can take advantage of it as soon as it becomes available (even do the work beforehand).

For example, we are surprisingly close to human-level text-to-speech (have a look at Apple's new model for audiobooks). Open-source models or APIs might come out as soon as later this year or next year.

It's worth doing some thinking about how TTS (and other new tech) will fit into current workflows as well as how it will interact with future possible tools also on the cusp.

Of course, Paul Graham writes about this stuff too:

Paul Buchheit says that people at the leading edge of a rapidly changing field "live in the future." Combine that with Pirsig and you get:

Live in the future, then build what's missing.

...

Once you're living in the future in some respect, the way to notice startup ideas is to look for things that seem to be missing. If you're really at the leading edge of a rapidly changing field, there will be things that are obviously missing. What won't be obvious is that they're startup ideas. So if you want to find startup ideas, don't merely turn on the filter "What's missing?" Also turn off every other filter, particularly "Could this be a big company?" There's plenty of time to apply that test later. But if you're thinking about that initially, it may not only filter out lots of good ideas, but also cause you to focus on bad ones.

Most things that are missing will take some time to see. You almost have to trick yourself into seeing the ideas around you.

Anyway, here's the Twitter thread I saved (it's very much in the startup world of advice, but just keep in mind how the spirit of it transfers to Accelerating Alignment):

How to come up with new startup ideas

I've helped 750+ startup founders. I always try to ask: "How did you come up with your idea?" Here are their answers:


First, they most commonly say: "Solve your own problems. Meaning, live on the edge of tech and see what issues you encounter. Then build a startup to solve it." I agree, and I love that. But it's not the whole answer you want. Where do these problems actually come from?


I’ll start by defining what a good startup idea looks like to me. It offers a meaningful benefit, such as:

  • A big reduction of an intense/frequent frustration

  • A big reduction in the cost of an expensive problem

  • A big increase in how entertaining/emotional a thing is
    
I call these 3x ideas—ideas compelling enough to overcome the friction to try 'em.

Btw, some people say startups must be "10x better to succeed." This is misleading. For an app to be 10x better than, say, Uber, it would have to straight up teleport you to your destination.

Examples of real 3x ideas:

  • Dropbox/Box: Cheaply share files without coordination or friction

  • Instacart: Get groceries delivered—without a big cost premium

  • Uber: Get a cab 3x faster, in 3x more locations, and for cheaper

So, where do these 3x ideas come from?  From the creation of new infrastructure—either technological or legal. I keep an eye on this. For example:

1. New technologies

  • Fast mobile processors
  • High-capacity batteries
  • Cryptocurrency architecture
  • VR

2. Changes in the law

  • Legalization of marijuana
  • Patents expiring

(And 1,000 more infrastructure examples.)

When new technological/legal infrastructure emerges, startups pounce to productize the new 3x possibilities. Those possibilities fall into categories:

1. Cost reductions:

  • Cheaper broadband enables cloud storage (Dropbox)
  • Cheaper batteries enable electric cars (Tesla)

2. Better functionality:

  • Smartphones and 3G spawned the mobile era

3. Brand new categories:

  • The legalization of marijuana spawned weed stores and weed delivery apps

As the CEO of Box wrote:  “We bet on four mega-trends that would shift the power to cloud: faster internet, cheaper compute and storage, mobile, and better browsers. Even so, we underestimated the scale of each tailwind. Always bet on the mega-trends.”

So takeaway number one is that new infrastructure spawns startups. But, we're not done. For those ideas to survive in the market, I believe you need another criterion: cultural acceptance. Society has to be ready for you.

Here are startups that became possible through changes in societal behavior.

1. Pop culture making behaviors less cool:

  • Cigarettes go out of style, so we get nicotine and vaping

  • Heavy drinking goes out of style, so we get low-alcohol seltzers


2. Mobile apps making it more normal to trust strangers:

  • The rise of Uber, Airbnb, Tinder, and couchsurfing better acclimated society to trusting people they’ve only met over the Internet.

This next part is important:

Notice how cultural acceptance results from (1) new media narratives and (2) the integration of technology into our lives, which changes behaviors. And note how those startup ideas were already feasible for a while, but couldn't happen *until cultural acceptance was possible.*


Implication: Study changes in infrastructure plus shifts in cultural acceptance to identify what’s newly possible in your market. Here's an example:

  • 1. Uber saw that widespread smartphone adoption with accurate GPS data made it possible to replace taxis with gig workers. Cultural acceptance was needed here—because it was unorthodox to step into a stranger's car and entrust them with your safety.
  • 2. Hims saw that the Propecia hair loss drug's patent was expiring, and capitalized on it by selling it via an online-first brand. Not much cultural acceptance was needed here since people were already buying the drug.

Okay, so let's turn all this into a framework. Here's just one way to find startup ideas.

  • Step 1: Spot upcoming infrastructure:
    • Subscribe to industry blogs/podcasts, try products, read congressional bills, read research, and talk to scientists and engineers.
  • Step 2: Determine if market entry is now possible:
    • As you’re scanning the infrastructure, look for an emergent 3x benefit that your startup could capture.
  • Step 3: Explore second-order ideas too:
    • If other startups capture a 3x idea before you do, that may be okay.
      • First, there may be room for more than one (Uber and Lyft, Google and Bing, Microsoft Teams and Slack).
      • Second, when another startup captures a 3x benefit, it typically produces many downstream 2x ideas. This is a key point.
        • For example, now that millions use 3x products like Slack, Zoom, and Uber, what tools could make them less expensive, more reliable, more collaborative?
        • Tons. So many downstream ideas emerge. 2x ideas may be smaller in scale but can still be huge startups. And they might be partially pre-validated.

To recap: One way to find startup ideas is to study infrastructure (3x ideas) and observe what emerges from startups that tackle that infrastructure (2x ideas).

comment by jacquesthibs (jacques-thibodeau) · 2023-01-05T21:35:47.104Z · LW(p) · GW(p)

Should EA / Alignment offices make it ridiculously easy to work remotely with people?

One of the main benefits of being in person is that you end up in spontaneous conversations with people in the office. This leads to important insights. However, given that there's a level of friction for setting up remote collaboration, only the people in those offices seem to benefit.

If it were ridiculously easy to join conversations for lunch or whatever (touch of a button rather than pulling up a laptop and opening a Zoom session), then would it allow for a stronger cross-pollination of ideas?

I'm not sure how this could work in practice, but it's not clear to me that we are in an optimal setting at the moment.

There have been some digital workspace apps, but those are not ideal, in my opinion.

The thing you need to figure out is how to make it easy for remote people to join in when there's a convo happening, and make it easy for office workers to accept. The more steps, the less likely it will become a habit or happen at all.

Then again, maybe this is just too difficult to fix and we'll be forced to be in person for a while. Could VR change this?

comment by jacquesthibs (jacques-thibodeau) · 2022-11-25T19:24:51.012Z · LW(p) · GW(p)

Detail about the ROME paper I've been thinking about

In the ROME paper, when you prompt the language model with "The Eiffel Tower is located in Paris", you have the following:

  • Subject token(s): The Eiffel Tower
  • Relationship: is located in
  • Object: Paris

Once a model has seen a subject token(s) (e.g. Eiffel Tower), it will retrieve a whole bunch of factual knowledge (not just one thing since it doesn’t know you will ask for something like location after the subject token) from the MLPs and 'write' it into the residual stream for the attention modules at the final token to look at the context, aggregate and retrieve the correct information.

In other words, if we take the prompt "The Eiffel Tower is located in", the model will write different information about the Eiffel Tower into the residual stream once it gets to the layers with "factual" information (early-middle layers). At this point, the model hasn't seen "is located in" so it doesn't actually know that you are going to ask for the location. For this reason, it will write more than just the location of the Eiffel Tower into the residual stream. Once you are at the point of predicting the location (at the final token, "in"), the model will aggregate the surrounding context and pull the location information that was 'written' into the residual stream via the MLPs with the most causal effect.

What is stored in the MLP is not the relationship between the facts. This is obvious because the relationship is coming after the subject tokens. In other words, as we said before, the MLPs are retrieving a bunch of factual knowledge, and then the attention modules are picking the correct (forgive the handwavy description) fact given what was retrieved and the relationship that is being asked of it.
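The two-stage picture above can be caricatured in a few lines. This is a cartoon that uses dictionaries instead of vectors, so it says nothing about how the computation is actually implemented in a transformer; the names `MLP_MEMORY`, `mlp_write`, and `attention_read` are made up for illustration, and the stored attributes are just toy entries.

```python
# Cartoon of the picture above: dicts stand in for vectors. All names are hypothetical.

# "MLP memory": subject -> every attribute stored about that subject.
MLP_MEMORY = {
    "Eiffel Tower": {"is located in": "Paris", "is made of": "iron"},
    "Colosseum": {"is located in": "Rome"},
}

def mlp_write(subject):
    """At the subject position: write all stored attributes into the 'residual stream'.

    The relation has not been read yet, so nothing can be filtered here.
    """
    return dict(MLP_MEMORY.get(subject, {}))

def attention_read(residual, relation):
    """At the final token: use the relation to pick one attribute out of what was written."""
    return residual.get(relation)

def complete(subject, relation):
    residual = mlp_write(subject)              # stage 1: subject -> many attributes
    return attention_read(residual, relation)  # stage 2: relation selects one of them

residual = mlp_write("Eiffel Tower")
print(sorted(residual))                           # more than just the location is written
print(complete("Eiffel Tower", "is located in"))
```

The point of the cartoon is only the ordering: the lookup is keyed on the subject alone, and the relation is applied afterwards to whatever the subject retrieved.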

My guess is that you could probably take what is being 'written' into the residual stream and directly predict properties of the subject token from the output of the layers with the most causal effect to predict a fact.

Thoughts and corrections are welcome.

Replies from: jacques-thibodeau
comment by jacquesthibs (jacques-thibodeau) · 2023-09-24T21:06:29.128Z · LW(p) · GW(p)

A couple of notes regarding the Reversal Curse [LW · GW] paper.

I'm unsure if I emphasized it enough in the post, but part of the point of my post on ROME [LW · GW] was that many AI researchers seemed to assume that nothing about how transformers are trained prevents them from understanding that "A is B" implies "B is A".

As I discussed in the comment above, 

What is stored in the MLP is not the relationship between the facts. This is obvious because the relationship is coming after the subject tokens. In other words, as we said before, the MLPs are retrieving a bunch of factual knowledge, and then the attention modules are picking the correct (forgive the handwavy description) fact given what was retrieved and the relationship that is being asked of it.

This means that the A token will 'write' some information into the residual stream, while the B token will 'write' other information into the residual. Some of that information may be the same, but not all. And so, if it's different enough, the attention heads just won't be able to pick up on the relevant information to know that B is A. However, if you include the A token, the necessary information will be added to the residual stream, and it will be much more likely for the model to predict that B is A (as well as A is B).

From what I remember in the case of ROME, as soon as I added the edited token A to the prompt (or made the next predicted token be A), then the model could essentially predict B is A.

I write what it means in the context of ROME, below (found here [LW · GW] in the post):

So, part of the story here is that the transformer stores the key for one entity (Eiffel Tower) separately from another (Rome). And so you'd need a second edit to say, "the tower in Rome is called the Eiffel Tower."

Intuitively, as a human, if I told you that the Eiffel Tower is in Rome, you'd immediately be able to understand both of these things at once. While for the ROME method, it's as if it's two separate facts. For this reason, you can’t really equate ROME with how a human would naturally update on a fact. You could maybe imagine ROME more like doing some brain surgery on someone to change a fact.

The directional nature of transformers could make it so that facts are stored somewhat differently than what we’d infer from our experience with humans. What we see as one fact may be multiple facts for a transformer. Maybe bidirectional models are different. That said, ROME could be seen as brain surgery which might mess up things internally and cause inconsistencies.

It looks like the model is representing its factual knowledge in a complex/distributed way, and that intervening on just one node does not propagate the change to the rest of the knowledge graph.
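Here is a rough sketch of why one edit behaves like one directional fact. It treats an MLP layer as a linear key-to-value memory and applies a simplified rank-one update, W' = W + (v* - W k*) k*ᵀ / (k* · k*), which forces W' k* = v*. This is a simplification: as I understand it, ROME proper also uses key covariance statistics and an optimized target value, both of which are skipped here. The one-hot keys, the four value dimensions, and every name in the code are assumptions made for the example.

```python
# Simplified rank-one edit of a linear key -> value memory.
# Not the actual ROME update (no key covariance, no optimized v*). Names are hypothetical.

def matvec(W, k):
    return [sum(w * x for w, x in zip(row, k)) for row in W]

def rank_one_edit(W, k_star, v_star):
    """Return W' = W + (v* - W k*) k*^T / (k* . k*), so that W' k* == v*."""
    residual = [v - wk for v, wk in zip(v_star, matvec(W, k_star))]
    norm_sq = sum(x * x for x in k_star)
    return [[w + r * x / norm_sq for w, x in zip(row, k_star)]
            for row, r in zip(W, residual)]

# Assumed one-hot subject keys: [Eiffel Tower, Rome, Paris]
K_EIFFEL, K_ROME, K_PARIS = [1, 0, 0], [0, 1, 0], [0, 0, 1]

# Assumed value dimensions:
# [located in Paris, located in Rome, landmark: Eiffel Tower, landmark: Colosseum]
W = [
    [1, 0, 0],  # located in Paris     <- Eiffel Tower
    [0, 0, 0],  # located in Rome
    [0, 0, 1],  # landmark: Eiffel     <- Paris
    [0, 1, 0],  # landmark: Colosseum  <- Rome
]

# Edit: "The Eiffel Tower is located in Rome."
W_edited = rank_one_edit(W, K_EIFFEL, [0, 1, 0, 0])

print(matvec(W_edited, K_EIFFEL))  # the Eiffel Tower key now retrieves "located in Rome"
print(matvec(W_edited, K_ROME))    # the Rome key still retrieves "landmark: Colosseum"
```

With orthogonal keys, the edit only touches what the Eiffel Tower key retrieves. Getting "the tower in Rome is called the Eiffel Tower" would need a second edit on the Rome key.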

Regarding human intuition, @Neel Nanda [LW · GW] says (link):

Why is this surprising at all then? My guess is that symmetry is intuitive to us, and we're used to LLMs being capable of surprising and impressive things, so it's weird to see something seemingly basic missing.

I actually have a bit of an updated (evolving) opinion on this:

Upon further reflection, it’s not obvious to me that humans and decoder-only transformers are that different. Could be that we both store info unidirectionally, but humans only see B->A as obvious because our internal loop is so optimized that we don’t notice the effort it takes.

Like, we just have a better system message than LLMs and that system message makes it super quick to identify relationships. LLMs would probably be fine doing the examples in the paper if you just adjusted their system message a little instead of leaving it essentially blank.

@cfoster0 [LW · GW] asks:

How do you imagine the system message helping? If the information is stored hetero-associatively (K -> V) like how it is in a hash map, is there a way to recall in the reverse direction (V -> K) other than with a giant scan?

My response:

Yeah, I'd have to think about it, but I imagined something like, "Given the prompt, quickly outline related info to help yourself get the correct answer." You can probably output tokens that quickly help you get the useful facts as it is doing the forward pass.

In the context of the paper, now that I think about it, I think it becomes nearly impossible unless you can somehow retrieve the specific relevant tokens used for the training set. Not sure how to prompt those out.

When I updated the models to new facts using ROME, it wasn't possible to get the updated fact unless the updated token was in the prompt somewhere. As soon as it is found in the prompt, it retrieves the new info where the model was edited.

Diversifying your dataset with the reverse prompt to make it so it has the correct information in whichever way possible feels so unsatisfying to me...feels like there's something missing.
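The hash-map analogy from the question above, plus the "add the reverse prompt" workaround, looks roughly like this in code. It is a sketch of the analogy only (a Python dict, not a model), and `forward`, `recall`, `reverse_scan`, and `add_reverse_entries` are hypothetical names.

```python
# Sketch of the K -> V analogy. A dict stands in for hetero-associative storage.
# All names are hypothetical.

forward = {"Eiffel Tower": "Paris", "Colosseum": "Rome"}

def recall(store, key):
    """Cheap forward lookup: only works if the query matches a stored key."""
    return store.get(key)

def reverse_scan(store, value):
    """Reverse recall without an index: scan every entry (the 'giant scan')."""
    return [k for k, v in store.items() if v == value]

def add_reverse_entries(store):
    """Analogue of adding reversed examples to the dataset: store V -> K explicitly."""
    augmented = dict(store)
    for k, v in store.items():
        augmented.setdefault(v, k)
    return augmented

print(recall(forward, "Eiffel Tower"))  # forward direction works
print(recall(forward, "Paris"))         # reverse direction: nothing stored under this key
print(reverse_scan(forward, "Paris"))   # works, but only by looking at everything

both_ways = add_reverse_entries(forward)
print(recall(both_ways, "Paris"))       # works once the reverse fact is stored separately
```

The workaround works, but only because the reverse direction is now stored as a second, separate entry, which is the part that feels unsatisfying.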

As I said, this is a bit of an evolving opinion. Still need time to think about this, especially regarding the differences between decoder-only transformers and humans.

Finally, from @Nora Belrose [LW · GW], this is worth pondering:

comment by jacquesthibs (jacques-thibodeau) · 2022-11-22T17:41:14.180Z · LW(p) · GW(p)

Preventing capability gains (e.g. situational awareness) that lead to deception

Note: I'm at the crackpot idea stage of thinking about how model editing could be useful for alignment.

One worry with deception is that the AI will likely develop a sufficiently good world model to understand it is in a training loop before it has fully aligned inner values.

The thing is, if the model was aligned, then at some point we'd consider it useful for the model to have a good enough world model to recognize that it is a model. Well, what if you prevent the model from gaining situational awareness until after it has properly embedded aligned values? In other words, you don't lobotomize the model permanently from ever gaining situational awareness (which would be uncompetitive), but you lobotomize it until we are confident it is aligned and won't suddenly become deceptive once it gains situational awareness.

I'm imagining a scenario where situational awareness is a module in the network, or where you're able to remove it from the model without completely destroying the model, and where you have interpretability tools powerful enough to be confident that the trained model is aligned. Once you are confident this is the case, you might be in a world where you are no longer worried about situational awareness.

Anyway, I expect that there are issues with this, but wanted to write it up here so I can remove it from another post I'm writing. I'd need to think about this type of stuff a lot more to add it to the post, so I'm leaving it here for now.