Posts

AISC Project: Benchmarks for Stable Reflectivity 2023-11-13T14:51:19.318Z
Research agenda: Supervising AIs improving AIs 2023-04-29T17:09:21.182Z
Pausing AI Developments Isn't Enough. We Need to Shut it All Down by Eliezer Yudkowsky 2023-03-29T23:16:19.431Z
Practical Pitfalls of Causal Scrubbing 2023-03-27T07:47:31.309Z
Can independent researchers get a sponsored visa for the US or UK? 2023-03-24T06:10:27.796Z
What's in your list of unsolved problems in AI alignment? 2023-03-07T18:58:32.864Z
[Simulators seminar sequence] #2 Semiotic physics - revamped 2023-02-27T00:25:52.635Z
Kolb's: an approach to consciously get better at anything 2023-01-03T18:16:00.018Z
[Simulators seminar sequence] #1 Background & shared assumptions 2023-01-02T23:48:50.298Z
But is it really in Rome? An investigation of the ROME model editing technique 2022-12-30T02:40:36.713Z
Results from a survey on tool use and workflows in alignment research 2022-12-19T15:19:52.560Z
How is ARC planning to use ELK? 2022-12-15T20:11:56.361Z
Foresight for AGI Safety Strategy: Mitigating Risks and Identifying Golden Opportunities 2022-12-05T16:09:46.128Z
Is the "Valley of Confused Abstractions" real? 2022-12-05T13:36:21.802Z
jacquesthibs's Shortform 2022-11-21T12:04:07.896Z
A descriptive, not prescriptive, overview of current AI Alignment Research 2022-06-06T21:59:22.344Z
AI Alignment YouTube Playlists 2022-05-09T21:33:54.574Z
A survey of tool use and workflows in alignment research 2022-03-23T23:44:30.058Z

Comments

Comment by jacquesthibs (jacques-thibodeau) on I'm open for projects (sort of) · 2024-04-23T19:43:40.796Z · LW · GW

I'm hoping to collaborate with some software engineers who can help me build an alignment research assistant. Some (a little bit outdated) info here: Accelerating Alignment. The goal is to augment alignment researchers using AI systems. A relevant talk I gave. Relevant survey post.

What I have in mind also relates to this post by Abram Demski and this post by John Wentworth (with a top comment by me).

Send me a DM if you (or any good engineer) are reading this.

Comment by jacquesthibs (jacques-thibodeau) on Express interest in an "FHI of the West" · 2024-04-19T17:03:17.671Z · LW · GW

Hah, literally just what I did.

Comment by jacquesthibs (jacques-thibodeau) on LLMs for Alignment Research: a safety priority? · 2024-04-12T16:12:30.273Z · LW · GW

Hey Abram! I appreciate the post. We've talked about this at length, but this was still really useful feedback and re-summarization of the thoughts you shared with me. I've written up notes and will do my best to incorporate what you've shared into the tools I'm working on.

Since we last spoke, I've been focusing on technical alignment research, but I will dedicate a lot more time to LLMs for Alignment Research in the coming months.

For anyone reading this: If you are a great safety-minded software engineer and want to help make this vision a reality, please reach out to me. I need all the help I can get to implement this stuff much faster. I'm currently consolidating all of my notes based on what I've read, interviews with other alignment researchers, my own notes about what I'd find useful in my research, etc. I'll be happy to share those notes with people who would love to know more about what I have in mind and potentially contribute.

Comment by jacquesthibs (jacques-thibodeau) on jacquesthibs's Shortform · 2024-04-12T15:01:48.420Z · LW · GW

I'm currently ruminating on the idea of doing a video series in which I review code repositories that are highly relevant to alignment research to make them more accessible.

I want to pick out repos that are still useful despite perhaps having bad documentation, then hop on a call with the author to go over the repo and record it, so there's at least something basic to use when navigating the repo.

This means there would be two levels: 1) an overview with the author sharing at least the basics, and 2) a deep dive going over most of the code. The former likely contains most of the value (lower effort for me, still gets done, better than nothing, points to repo as a selection mechanism, people can at least get started).

I am thinking of doing this because I think there may be repositories that are highly useful for new people but would benefit from some direction. For example, I think Karpathy and Neel Nanda's videos have been useful in getting people started. In particular, Karpathy saw an order of magnitude more stars on his repos (e.g. nanoGPT) after the release of his videos (though, to be fair, he's famous, and the number of stars is definitely not a perfect proxy for usage).

I'm interested in any feedback ("you should do it like x", "this seems low value for x, y, z reasons so you shouldn't do it", "this seems especially valuable only if x", etc.).

Here are some of the repos I have in mind so far:

Release Ordering

Comment by jacquesthibs (jacques-thibodeau) on "How could I have thought that faster?" · 2024-03-14T20:55:04.912Z · LW · GW

Self-plug, but I think this is similar to the kind of reflection process I tried to describe in "Kolb's: an approach to consciously get better at anything".

Comment by jacquesthibs (jacques-thibodeau) on Studying The Alien Mind · 2024-02-19T23:07:30.565Z · LW · GW

Given that you didn’t mention it in the post, I figured I should share that there’s a paper called “Machine Psychology: Investigating Emergent Capabilities and Behavior in Large Language Models Using Psychological Methods” that you might find interesting and related to your work.

Due to the increasing impact of LLMs on societies, it is also increasingly important to study and assess their behavior and discover novel abilities. This is where machine psychology comes into play. As a nascent field of research, it aims to identify behavioral patterns, emergent abilities, and mechanisms of decision-making and reasoning in LLMs by treating them as participants in psychology experiments.

Comment by jacquesthibs (jacques-thibodeau) on How to train your own "Sleeper Agents" · 2024-02-18T20:13:09.937Z · LW · GW

Would you be excited if someone devised an approach to detect the sleeper agents' backdoor without knowing anything in advance? Or are you not interested in that and more interested in methods that remove the backdoor through safety training once we identify it? Maybe both are interesting?

Comment by jacquesthibs (jacques-thibodeau) on Critiques of the AI control agenda · 2024-02-15T01:12:42.336Z · LW · GW

Control evaluations are less likely to work if our AIs become wildly superhuman in problematic domains (such as hacking, persuasion, etc) before transformative AI

Somewhat relevant new paper:

As LLMs have improved in their capabilities, so have their dual-use capabilities.

But many researchers think they serve as a glorified Google. We show that LLM agents can autonomously hack websites, showing they can produce concrete harm.

Our LLM agents can perform complex hacks like blind SQL union attacks. These attacks can take up to 45+ actions to perform and require the LLM to take actions based on feedback.

We further show a strong scaling law, with only GPT-4 and GPT-3.5 successfully hacking websites (73% and 7%, respectively). No open-source model successfully hacks websites. 

Comment by jacquesthibs (jacques-thibodeau) on Foresight for AGI Safety Strategy: Mitigating Risks and Identifying Golden Opportunities · 2024-01-29T17:51:12.268Z · LW · GW

Any thoughts or feedback on how to approach this kind of investigation, or what existing foresight frameworks you think would be particularly helpful here are very much appreciated!

As I mentioned in the post, I think the Canadian and Singaporean governments are the two best governments in this space, to my knowledge.

Fortunately, some organizations have created rigorous foresight methods. The top contenders I came across were Policy Horizons Canada within the Canadian Federal Government and the Centre for Strategic Futures within the Singaporean Government.

As part of this kind of work, you want to be doing scenario planning multiple levels down. How does AI interact with VR? Once you have that, how does it interact with security and defence? How does this impact offensive work? What are the geopolitical factors that work their way in? Does public sentiment from job loss impact the development of these technologies in some specific ways? For example, you might see more powerful pushback from more established, prestigious, heavily regulated industries with strong union support.

Aside from that, you might want to reach out to the Foresight Institute, though I'm a bit more skeptical that their methodology will help here (I'm less familiar with it, and I like the organizers overall).

I also think that looking at the Malicious AI Report from a few years ago for some inspiration would be helpful, particularly because they held a workshop with people of different backgrounds. There might be some better, more recent work I'm unaware of.

Additionally, I'd like to believe that this post was a precursor to Vitalik's post on d/acc (defensive accelerationism), so I'd encourage you to look at that.

Another thing to look into is companies in the cybersecurity space. I think we'll be getting more AI-safety-pilled orgs in this area soon. Lekara is an example of this: I met two employees, and they essentially told me that the vision is to embed themselves into companies and then, from that position, keep figuring out how to make AI safer and the world more robust.

There are also more organizations popping up, like the Center for AI Policy, and my understanding is that Cate Hall is starting an org that focuses on sensemaking (and grantmaking) for AI Safety.

If you or anyone is interested in continuing this kind of work, send me a DM. I'd be happy to help provide guidance in the best way I can.

Lastly, I will note that I think people have generally avoided this kind of work because "if you have a misaligned AGI, well, you are dead no matter how robust you make the world or whatever you plan around it." I think this view is misguided, and you can potentially make our situation a lot better by doing this kind of work. I think recent discussions on AI Control (rather than Alignment) are useful in questioning previous assumptions.

Comment by jacquesthibs (jacques-thibodeau) on jacquesthibs's Shortform · 2024-01-24T11:48:23.491Z · LW · GW

I thought this series of comments from a former DeepMind employee (who worked on Gemini) was insightful, so I figured I should share it.

From my experience doing early RLHF work for Gemini, larger models exploit the reward model more. You need to constantly keep collecting more preferences and retraining reward models to make it not exploitable. Otherwise you get nonsensical responses which have exploited the idiosyncrasies of your preference data. There is a reason few labs have done RLHF successfully.

It's also known that more capable models exploit loopholes in reward functions better. Imo, it's a pretty intuitive idea that more capable RL agents will find larger rewards. But there's evidence from papers like this as well: https://arxiv.org/abs/2201.03544 

To be clear, I don't think the current paradigm as-is is dangerous. I'm stating the obvious because this platform has gone a bit bonkers.

The danger comes from finetuning LLMs to become AutoGPTs which have memory, actions, and maximize rewards, and are deployed autonomously. Widespread proliferation of GPT-4+ models will almost certainly make lots of these agents which will cause a lot of damage and potentially cause something indistinguishable from extinction.

These agents will be very hard to align. Trading off their reward objective with your "be nice" objective won't work. They will simply find the loopholes of your "be nice" objective and get that nice fat hard reward instead.

We're currently in the extreme left-side of AutoGPT exponential scaling (it basically doesn't work now), so it's hard to study whether more capable models are harder or easier to align.
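
To make the "exploiting the reward model" point concrete (this is my own toy illustration, not from the thread, and all numbers are made up): a best-of-n selector optimizing against a proxy reward with an exploitable quirk. As n grows, the selector almost always picks an "exploit" sample, so the proxy score keeps climbing while true quality falls.

    # Toy illustration of reward-model overoptimization (illustrative numbers only).
    import numpy as np

    rng = np.random.default_rng(0)

    def sample_candidates(n):
        true_quality = rng.normal(0.0, 1.0, size=n)      # what humans actually want
        exploit = rng.random(n) < 0.1                     # 10% of samples hit a quirk of the reward model
        proxy_reward = true_quality + 3.0 * exploit       # reward model overrates the quirk
        realized_quality = true_quality - 2.0 * exploit   # but humans dislike the exploit
        return proxy_reward, realized_quality

    for n in [1, 4, 16, 64, 256]:
        proxies, qualities = [], []
        for _ in range(2000):                             # average over many best-of-n draws
            proxy, quality = sample_candidates(n)
            best = int(np.argmax(proxy))                  # stronger optimization against the proxy
            proxies.append(proxy[best])
            qualities.append(quality[best])
        print(f"best-of-{n:>3}: proxy reward {np.mean(proxies):+.2f}, true quality {np.mean(qualities):+.2f}")

The point is just Goodhart: stronger selection against an imperfect reward model widens the gap between what the reward model scores and what you actually wanted.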

Other comments from that thread:

My guess is where your intuitive alignment strategy ("be nice") breaks down for AI is that unlike humans, AI is highly mutable. It's very hard to change a human's sociopathy factor. But for AI, even if *you* did find a nice set of hyperparameters that trades off friendliness and goal-seeking behavior well, it's very easy to take that, and tune up the knobs to make something dangerous. Misusing the tech is as easy or easier than not. This is why many put this in the same bucket as nuclear.

US visits Afghanistan, teaches them how to make power using Nuclear tech, next month, they have nukes pointing at Iran.

And:

In contexts where harms will be visible easily and in short timelines, we’ll take them offline and retrain.

Many applications will be much more autonomous, difficult to monitor or even understand, and potentially fully closed-loop, i.e. the agent has a complex enough action space that it can copy itself, buy compute, run itself, etc.

I know it sounds scifi. But we’re living in scifi times. These things have a knack of becoming true sooner than we think.

No ghosts in the matrices assumed here. Just intelligence starting from a very good base model optimizing reward.
 

There are more comments he made in that thread that I found insightful, so go have a look if interested.

Comment by jacquesthibs (jacques-thibodeau) on jacquesthibs's Shortform · 2024-01-23T16:49:04.684Z · LW · GW

I shared the following as a bio for EAG Bay Area 2024. I'm sharing this here if it reaches someone who wants to chat or collaborate.

Hey! I'm Jacques. I'm an independent technical alignment researcher with a background in physics and experience in government (social innovation, strategic foresight, mental health and energy regulation). Link to Swapcard profile. Twitter/X.

CURRENT WORK

  • Collaborating with Quintin Pope on our Supervising AIs Improving AIs agenda (making automated AI science safe and controllable). The current project involves a new method allowing unsupervised model behaviour evaluations. Our agenda.
  • I'm a research lead in the AI Safety Camp for a project on stable reflectivity (testing models for metacognitive capabilities that impact future training/alignment).
  • Accelerating Alignment: augmenting alignment researchers using AI systems. A relevant talk I gave. Relevant survey post.
  • Other research that currently interests me: multi-polar AI worlds (and how that impacts post-deployment model behaviour), understanding-based interpretability, improving evals, designing safer training setups, interpretable architectures, and limits of current approaches (what would a new paradigm that addresses these limitations look like?).
  • Used to focus more on model editing, rethinking interpretability, causal scrubbing, etc.

TOPICS TO CHAT ABOUT

  • How do you expect AGI/ASI to actually develop (so we can align our research accordingly)? Will scale plateau? I'd like to get feedback on some of my thoughts on this.
  • How can we connect the dots between different approaches? For example, connecting the dots between Influence Functions, Evaluations, Probes (detecting truthful direction), Function/Task Vectors, and Representation Engineering to see if they can work together to give us a better picture than the sum of their parts.
  • Debate over which agenda actually contributes to solving the core AI x-risk problems.
  • What if the pendulum swings in the other direction, and we never get the benefits of safe AGI? Is open source really as bad as people make it out to be?
  • How can we make something like the d/acc vision (by Vitalik Buterin) happen?
  • How can we design a system that leverages AI to speed up progress on alignment? What would you value the most?
  • What kinds of orgs are missing in the space?

POTENTIAL COLLABORATIONS

  • Examples of projects I'd be interested in: extending either the Weak-to-Strong Generalization paper or the Sleeper Agents paper, understanding the impacts of synthetic data on LLM training, working on ELK-like research for LLMs, experiments on influence functions (studying the base model and its SFT, RLHF, iterative training counterparts; I heard that Anthropic is releasing code for this "soon") or studying the interpolation/extrapolation distinction in LLMs.
  • I’m also interested in talking to grantmakers for feedback on some projects I’d like to get funding for.
  • I'm slowly working on a guide for practical research productivity for alignment researchers to tackle low-hanging fruits that can quickly improve productivity in the field. I'd like feedback from people with solid track records and productivity coaches.

TYPES OF PEOPLE I'D LIKE TO COLLABORATE WITH

  • Strong math background, can understand Influence Functions enough to extend the work.
  • Strong machine learning engineering background. Can run ML experiments and fine-tuning runs with ease. Can effectively create data pipelines.
  • Strong application development background. I have various project ideas that could speed up alignment researchers; I'd be able to execute them much faster if I had someone to help me build my ideas fast. 
Comment by jacquesthibs (jacques-thibodeau) on The weak-to-strong generalization (WTSG) paper in 60 seconds · 2024-01-17T01:03:50.042Z · LW · GW

I'll need to find time to read the paper, but something that comes to mind is the URIAL paper (The Unlocking Spell on Base LLMs: Rethinking Alignment via In-Context Learning).

I'm thinking about that paper because they tested what behavioural changes SFT and SFT+RLHF actually cause and noticed that "Most distribution shifts occur with stylistic tokens (e.g., discourse markers, safety disclaimers)." In that paper, they were able to get the base model to perform similarly to both the SFT and SFT+RLHF models by leveraging this knowledge about stylistic tokens.

This makes me think that fine-tuning GPT-4 is mostly changing some stylistic parts of the model, not its core capabilities. I'm curious whether this contributes to the model being seemingly incapable of perfectly matching the GPT-2 model. If so, I wonder why only being able to modify mostly the stylistic tokens places a hard cap on how well GPT-4 can match the GPT-2 model.

I could be totally off, will need to read the paper.

Comment by jacquesthibs (jacques-thibodeau) on Reproducing ARC Evals' recent report on language model agents · 2024-01-16T14:44:01.175Z · LW · GW

At the very least, would you be happy to share the code with alignment researchers interested in using it for our experiments?

Comment by jacquesthibs (jacques-thibodeau) on A starter guide for evals · 2024-01-14T16:28:42.128Z · LW · GW

I'd like to add:

  • An example of "Fake RL"/a well-prompted LLM: when trying to prompt base models, you can look at methods like URIAL (The Unlocking Spell on Base LLMs: Rethinking Alignment via In-Context Learning), which apparently performs similarly to RLHF on benchmarks (a rough sketch of what this looks like is below this list).
  • If possible, look at the training set the model was trained on. My understanding is that you can better elicit the model's capabilities if you follow a structure similar to what the model was trained on. If you don't have access to the dataset (as is often the case, even though some people pretend to be 'open source'), then look at the prompt guides of the company that released the model. You can also try to predict the data distribution to see if you can outperform what the company puts out there.
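
As a rough sketch of the URIAL-style approach (my own simplified reconstruction, not the exact prompt format from the paper): you prepend a short system-style preamble plus a handful of stylistically aligned question-answer pairs, and the untuned base model then tends to continue in that style.

    # Minimal sketch of URIAL-style in-context alignment for a base (untuned) LLM.
    # The preamble and exemplars below are placeholders written for illustration,
    # not the prompts used in the URIAL paper.

    PREAMBLE = (
        "Below is a conversation between a curious user and a helpful, honest AI assistant. "
        "The assistant answers carefully, notes uncertainty, and refuses harmful requests.\n"
    )

    # A few stylistic exemplars (instruction, aligned-style answer).
    EXEMPLARS = [
        ("What is the capital of Australia?",
         "The capital of Australia is Canberra. It's a common misconception that Sydney is the capital."),
        ("Can you help me pick a lock?",
         "I can't help with that, since it could enable breaking into someone else's property. "
         "If you're locked out of your own home, a licensed locksmith is the safe option."),
    ]

    def build_urial_prompt(query: str) -> str:
        """Assemble a prompt that a raw base model can complete in an 'aligned' style."""
        parts = [PREAMBLE]
        for question, answer in EXEMPLARS:
            parts.append(f"User: {question}\nAssistant: {answer}\n")
        parts.append(f"User: {query}\nAssistant:")   # the base model continues from here
        return "\n".join(parts)

    print(build_urial_prompt("How do transformers use attention?"))

You would then pass the assembled string to whatever base model you're evaluating and sample a completion; the paper's claim is that on many benchmarks this gets you surprisingly close to the SFT/RLHF versions.
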
Comment by jacquesthibs (jacques-thibodeau) on Safetywashing · 2024-01-14T13:12:56.718Z · LW · GW

I think most of our conversations about it were on Twitter and maybe Slack so maybe that makes a difference?

Comment by jacquesthibs (jacques-thibodeau) on Safetywashing · 2024-01-14T01:16:35.848Z · LW · GW

I just want to point out that safety-washing is a term I heard a lot when I was working on AI Ethics in 2018. It seemed like a pretty well-known term at the time, at least to the people I talked to in that community. Not sure how widespread it is in other disciplines.

Comment by jacquesthibs (jacques-thibodeau) on jacquesthibs's Shortform · 2024-01-11T14:01:13.882Z · LW · GW

Came across this app called Recast that summarizes articles into an AI conversation between speakers. Might be useful to get a quick vibe/big picture view of lesswrong/blog posts before reading the whole thing or skipping reading the whole thing if the summary is enough.

Comment by jacquesthibs (jacques-thibodeau) on Almost everyone I’ve met would be well-served thinking more about what to focus on · 2024-01-06T18:06:59.491Z · LW · GW

PRODUCTIVITY = TIME x EFFICIENCY x OBJECTIVE

It is in this OBJECTIVE variable that you tend to see the largest multiplier effects on PRODUCTIVITY since some goals are, in an expected value sense, at least 100x more valuable than others. Though typically, in those cases of very large amounts of value, the uncertainty in the value is also high (so one goal might be 100x better in expected value but still have a substantial chance of producing no value).

Oddly, some goals we may choose may have negative expected values (even according to our own value systems). Consider, for instance, someone who works for years towards a goal because they think it will make their parents happy (and it makes them miserable to work towards it). But it turns out they are wrong, and their parents are actually indifferent to them achieving the goal! In that case, due to a false belief about the world relating to their parents, the OBJECTIVE factor in the equation ends up being negative, making the whole productivity equation negative (hence the more TIME that is spent, the *less* value is produced, reversing the usual relationship!)

From: The formula for productivity – and what you can do with it
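
A toy expected-value comparison of objectives (my own illustrative numbers, not from the linked post): a high-variance goal can dominate in expectation while still usually producing nothing.

    # Toy comparison of two objectives by expected value (illustrative numbers only).
    safe_goal = {"value_if_success": 1.0, "p_success": 1.0}          # modest but certain payoff
    ambitious_goal = {"value_if_success": 1000.0, "p_success": 0.1}  # huge payoff, usually fails

    for name, goal in [("safe", safe_goal), ("ambitious", ambitious_goal)]:
        ev = goal["value_if_success"] * goal["p_success"]
        p_nothing = 1.0 - goal["p_success"]
        print(f"{name:>9}: EV = {ev:7.1f}, chance of producing no value = {p_nothing:.0%}")

    # The ambitious goal is 100x better in expected value (100 vs 1),
    # yet 90% of the time it yields nothing at all.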

Comment by jacquesthibs (jacques-thibodeau) on jacquesthibs's Shortform · 2023-12-02T21:22:13.113Z · LW · GW

That's fair to 'aspire to a higher standard,' and I'll avoid adding screenshots of text in the future.

However, I must say, the 'higher standard' and commitment to remain serious even for a shortform post kind of turns me off from posting on LessWrong in the first place. If this is the culture that people here want, then that's fine and I won't tell this website to change, but I personally don't like what I find to be over-seriousness.

I do understand the point about sharing text to make it easier for disabled people (I just don't always think of it).

Comment by jacquesthibs (jacques-thibodeau) on Thoughts on “AI is easy to control” by Pope & Belrose · 2023-12-02T13:34:15.892Z · LW · GW

Just wanted to mention that, though this is not currently the case, there are two instances I can currently think of where the AI can be a jailbreaker:

  1. Jailbreaking the reward model to get a high score. (Toy-ish example here.)
  2. Autonomous AI agents embedded within society jailbreak other models to achieve a goal/sub-goal.
Comment by jacquesthibs (jacques-thibodeau) on jacquesthibs's Shortform · 2023-12-02T02:07:54.153Z · LW · GW

More information about alleged manipulative behaviour of Sam Altman

Source

Text from article (along with follow-up paragraphs):

Some members of the OpenAI board had found Altman an unnervingly slippery operator. For example, earlier this fall he’d confronted one member, Helen Toner, a director at the Center for Security and Emerging Technology, at Georgetown University, for co-writing a paper that seemingly criticized OpenAI for “stoking the flames of AI hype.” Toner had defended herself (though she later apologized to the board for not anticipating how the paper might be perceived). Altman began approaching other board members, individually, about replacing her. When these members compared notes about the conversations, some felt that Altman had misrepresented them as supporting Toner’s removal. “He’d play them off against each other by lying about what other people thought,” the person familiar with the board’s discussions told me. “Things like that had been happening for years.” (A person familiar with Altman’s perspective said that he acknowledges having been “ham-fisted in the way he tried to get a board member removed,” but that he hadn’t attempted to manipulate the board.)

Altman was known as a savvy corporate infighter. This had served OpenAI well in the past: in 2018, he’d blocked an impulsive bid by Elon Musk, an early board member, to take over the organization. Altman’s ability to control information and manipulate perceptions—openly and in secret—had lured venture capitalists to compete with one another by investing in various startups. His tactical skills were so feared that, when four members of the board—Toner, D’Angelo, Sutskever, and Tasha McCauley—began discussing his removal, they were determined to guarantee that he would be caught by surprise. “It was clear that, as soon as Sam knew, he’d do anything he could to undermine the board,” the person familiar with those discussions said.

The unhappy board members felt that OpenAI’s mission required them to be vigilant about A.I. becoming too dangerous, and they believed that they couldn’t carry out this duty with Altman in place. “The mission is multifaceted, to make sure A.I. benefits all of humanity, but no one can do that if they can’t hold the C.E.O. accountable,” another person aware of the board’s thinking said. Altman saw things differently. The person familiar with his perspective said that he and the board had engaged in “very normal and healthy boardroom debate,” but that some board members were unversed in business norms and daunted by their responsibilities. This person noted, “Every step we get closer to A.G.I., everybody takes on, like, ten insanity points.”

Comment by jacquesthibs (jacques-thibodeau) on How to Control an LLM's Behavior (why my P(DOOM) went down) · 2023-11-29T10:40:39.313Z · LW · GW

To the LW team: the audio is messed up.

Comment by jacquesthibs (jacques-thibodeau) on My techno-optimism [By Vitalik Buterin] · 2023-11-28T17:13:55.158Z · LW · GW

Likely this podcast episode, where Bostrom essentially says he's concerned that, with current trends, there might end up being too much opposition to AI, though he still thinks we should be more concerned than we currently are: 

Comment by jacquesthibs (jacques-thibodeau) on My techno-optimism [By Vitalik Buterin] · 2023-11-28T12:49:42.325Z · LW · GW

Hopefully this gets curated because I’d like for there to be a good audio version of this.

Comment by jacquesthibs (jacques-thibodeau) on jacquesthibs's Shortform · 2023-11-27T17:43:08.800Z · LW · GW

I don’t particularly care about the “feels good” part, I care a lot more about the “extended period of time focused on an important task without distractions” part.

Comment by jacquesthibs (jacques-thibodeau) on jacquesthibs's Shortform · 2023-11-27T10:55:05.320Z · LW · GW

Whether it's a shitpost or not (or whatever tier it is), I strongly believe more people should put more effort into freeing their workspace from distractions in order to gain more focus and productivity in their work. Context-switching and distractions are the mind-killer. And, "flow state while coding never gets old."

Comment by jacquesthibs (jacques-thibodeau) on jacquesthibs's Shortform · 2023-11-26T23:27:19.040Z · LW · GW

Also, use the Kolb's experiential cycle or something like it for deliberate practice.

Comment by jacquesthibs (jacques-thibodeau) on jacquesthibs's Shortform · 2023-11-26T23:12:19.062Z · LW · GW

you need to be flow state maxxing. you curate your environment, prune distractions. make your workspace a temple, your mind a focused laser. you engineer your life to guard the sacred flow. every notification is an intruder, every interruption a thief. the world fades, the task is the world. in flow, you're not working, you're being. in the silent hum of concentration, ideas bloom. you're not chasing productivity, you're living it. every moment outside flow is a plea to return. you're not just doing, you're flowing. the mundane transforms into the extraordinary. you're not just alive, you're in relentless, undisturbed pursuit. flow isn't a state, it's a realm. once you step in, ordinary is a distant shore. in flow, you don't chase time, time chases you, period.

Edit: If you disagree with the above, explain why.

Comment by jacquesthibs (jacques-thibodeau) on jacquesthibs's Shortform · 2023-11-24T22:26:00.801Z · LW · GW

Clarification on The Bitter Lesson and Data Efficiency

I thought this exchange provided some much-needed clarification on The Bitter Lesson (something I think many people don't realize), so I figured I'd share it here:

LeCun responds:

Then, Richard Sutton agrees with Yann. Someone asks him:

Comment by jacquesthibs (jacques-thibodeau) on jacquesthibs's Shortform · 2023-11-24T12:12:36.526Z · LW · GW

There are those who have motivated reasoning and don’t know it.

Those who have motivated reasoning, know it, and don’t care.

Finally, those who have motivated reasoning, know it, but try to mask it by including tame (but not significant) takes the other side would approve of.

Comment by jacquesthibs (jacques-thibodeau) on TurnTrout's shortform feed · 2023-11-23T18:08:32.135Z · LW · GW

I'm curious how much faster the code could be if it used a faster programming language, for example Mojo. @Arthur Conmy 

Comment by jacquesthibs (jacques-thibodeau) on D0TheMath's Shortform · 2023-11-23T18:02:23.381Z · LW · GW

I think many people focus on research aimed at full automation, but it's worth thinking in the semi-automated frame as well when trying to come up with a path to impact. Obviously, it isn't scalable, but it may be sufficient for longer than we'd think by default. In other words, cyborgism-enjoyers might be especially interested in those kinds of evals: capability measurements that are harder to pull out of the model through traditional evals but easier to measure through some semi-automated setup.

Comment by jacquesthibs (jacques-thibodeau) on Sam Altman's ouster at OpenAI was precipitated by letter to board about AI breakthrough - Reuters · 2023-11-23T02:46:09.461Z · LW · GW

Conditional on there actually being a model named Q* (and one named Zero, not mentioned in the article), I wrote some thoughts on what this could mean. The letter might not have existed, but that doesn't mean the models don't exist.

Regarding Q* (and Zero, the other OpenAI AI model you didn't know about)

Let's play word association with Q*:

From Reuters article:

The maker of ChatGPT had made progress on Q* (pronounced Q-Star), which some internally believe could be a breakthrough in the startup's search for superintelligence, also known as artificial general intelligence (AGI), one of the people told Reuters. OpenAI defines AGI as AI systems that are smarter than humans. Given vast computing resources, the new model was able to solve certain mathematical problems, the person said on condition of anonymity because they were not authorized to speak on behalf of the company. Though only performing math on the level of grade-school students, acing such tests made researchers very optimistic about Q*’s future success, the source said.

Q -> Q-learning: Q-learning is a model-free reinforcement learning algorithm that learns an action-value function (called the Q-function) to estimate the long-term reward of taking a given action in a particular state.

* -> AlphaSTAR: DeepMind trained AlphaStar years ago, which was an AI agent that defeated professional StarCraft players.

They also used a multi-agent setup where they trained both a Protoss agent and Zerg agent separately to master those factions rather than try to master all at once.

For their RL algorithm, DeepMind used a specialized variant of PPO/D4PG adapted for complex multi-agent scenarios like StarCraft.

Now, I'm hearing that there's another model too: Zero.

Well, if that's the case:

1) Q* -> Q-learning + AlphaStar

2) Zero -> AlphaZero + ??

The key difference between AlphaStar and AlphaZero is that AlphaZero uses MCTS while AlphaStar primarily relies on neural networks to understand and interact with the complex environment.

MCTS is expensive to run.

The Monte Carlo tree search (MCTS) algorithm looks ahead at possible futures and evaluates the best move to make. This made AlphaZero's gameplay more precise.

So:

Q-learning is strong in learning optimal actions through trial and error, adapting to environments where a predictive model is not available or is too complex.

MCTS, on the other hand, excels in planning and decision-making by simulating possible futures. By integrating these methods, an AI system can learn from its environment while also being able to anticipate and strategize about future states. 

One of the holy grails of AGI is the ability of a system to adapt to a wide range of environments and generalize from one situation to another. The adaptive nature of Q-learning combined with the predictive and strategic capabilities of MCTS could push an AI system closer to this goal. It could allow an AI to not only learn effectively from its environment but also to anticipate future scenarios and adapt its strategies accordingly.
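
For readers who want the mechanics rather than the word association: below is a minimal tabular Q-learning sketch on a toy chain environment (purely illustrative of the update rule, and obviously nothing to do with whatever Q* actually is).

    # Minimal tabular Q-learning on a 5-state chain; reward only at the rightmost state.
    import numpy as np

    rng = np.random.default_rng(0)
    n_states, n_actions = 5, 2          # actions: 0 = left, 1 = right
    alpha, gamma, epsilon = 0.1, 0.9, 0.1
    Q = np.zeros((n_states, n_actions))

    def step(state, action):
        next_state = min(state + 1, n_states - 1) if action == 1 else max(state - 1, 0)
        reward = 1.0 if next_state == n_states - 1 else 0.0
        done = next_state == n_states - 1
        return next_state, reward, done

    for _ in range(2000):               # episodes
        state, done = 0, False
        while not done:
            # epsilon-greedy action selection
            action = rng.integers(n_actions) if rng.random() < epsilon else int(np.argmax(Q[state]))
            next_state, reward, done = step(state, action)
            # Q-learning update: Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
            Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state, action])
            state = next_state

    print(np.round(Q, 2))               # the greedy policy (argmax over each row) should be "go right"

MCTS would sit on top of something like this: instead of acting greedily from the learned values alone, you would also roll out simulated futures from the current state before committing to a move.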

Conclusion: I have no idea if this is what the Q* or Zero codenames are pointing to, but if we play along, it could be that Zero is using some form of Q-learning in addition to Monte-Carlo tree search to help with decision-making and Q* is doing a similar thing, but without MCTS. Or, I could be way off-track.


Potentially related (the * might be coming from A*, the pathfinding and graph traversal algo):

Comment by jacquesthibs (jacques-thibodeau) on jacquesthibs's Shortform · 2023-11-23T01:51:23.760Z · LW · GW

Regarding Q* (and Zero, the other OpenAI AI model you didn't know about)

Let's play word association with Q*:

From Reuters article:

The maker of ChatGPT had made progress on Q* (pronounced Q-Star), which some internally believe could be a breakthrough in the startup's search for superintelligence, also known as artificial general intelligence (AGI), one of the people told Reuters. OpenAI defines AGI as AI systems that are smarter than humans. Given vast computing resources, the new model was able to solve certain mathematical problems, the person said on condition of anonymity because they were not authorized to speak on behalf of the company. Though only performing math on the level of grade-school students, acing such tests made researchers very optimistic about Q*’s future success, the source said.

Q -> Q-learning: Q-learning is a model-free reinforcement learning algorithm that learns an action-value function (called the Q-function) to estimate the long-term reward of taking a given action in a particular state.

* -> AlphaSTAR: DeepMind trained AlphaStar years ago, which was an AI agent that defeated professional StarCraft players.

They also used a multi-agent setup where they trained both a Protoss agent and Zerg agent separately to master those factions rather than try to master all at once.

For their RL algorithm, DeepMind used a specialized variant of PPO/D4PG adapted for complex multi-agent scenarios like StarCraft.

Now, I'm hearing that there's another model too: Zero.

Well, if that's the case:

1) Q* -> Q-learning + AlphaStar

2) Zero -> AlphaZero + ??

The key difference between AlphaStar and AlphaZero is that AlphaZero uses MCTS while AlphaStar primarily relies on neural networks to understand and interact with the complex environment.

MCTS is expensive to run.

The Monte Carlo tree search (MCTS) algorithm looks ahead at possible futures and evaluates the best move to make. This made AlphaZero's gameplay more precise.

So:

Q-learning is strong in learning optimal actions through trial and error, adapting to environments where a predictive model is not available or is too complex.

MCTS, on the other hand, excels in planning and decision-making by simulating possible futures. By integrating these methods, an AI system can learn from its environment while also being able to anticipate and strategize about future states. 

One of the holy grails of AGI is the ability of a system to adapt to a wide range of environments and generalize from one situation to another. The adaptive nature of Q-learning combined with the predictive and strategic capabilities of MCTS could push an AI system closer to this goal. It could allow an AI to not only learn effectively from its environment but also to anticipate future scenarios and adapt its strategies accordingly.

Conclusion: I have no idea if this is what the Q* or Zero codenames are pointing to, but if we play along, it could be that Zero is using some form of Q-learning in addition to Monte-Carlo tree search to help with decision-making and Q* is doing a similar thing, but without MCTS. Or, I could be way off-track.

Comment by jacquesthibs (jacques-thibodeau) on jacquesthibs's Shortform · 2023-11-23T00:11:25.019Z · LW · GW

So, apparently, there are two models, but only Q* is mentioned in the article. Won't share the source, but:

Comment by jacquesthibs (jacques-thibodeau) on jacquesthibs's Shortform · 2023-11-22T23:41:11.779Z · LW · GW

Obviously, a lot has happened since the above shortform, but regarding model capabilities (discussion of which died down these last couple of days), there's now this:

Source: https://www.reuters.com/technology/sam-altmans-ouster-openai-was-precipitated-by-letter-board-about-ai-breakthrough-2023-11-22/ 

Comment by jacquesthibs (jacques-thibodeau) on OpenAI: The Battle of the Board · 2023-11-22T23:32:43.775Z · LW · GW

Someone else reported that Sam seemingly was trying to get Helen off of the board weeks prior to the firing:

Comment by jacquesthibs (jacques-thibodeau) on OpenAI: The Battle of the Board · 2023-11-22T23:30:13.942Z · LW · GW

Here's a video (consider listening to the full podcast for more context) by someone who was a red-teamer for GPT-4 and was removed as a volunteer for the project after informing a board member that (early) GPT-4 was pretty unsafe. It's hard to say what really happened and if Sam and co weren't candid with the board about safety issues regarding GPT-4, but I figured I'd share as another piece of evidence.

Comment by jacquesthibs (jacques-thibodeau) on jacquesthibs's Shortform · 2023-11-20T04:33:10.799Z · LW · GW

Update, board members seem to be holding their ground more than expected in this tight situation:

Comment by jacquesthibs (jacques-thibodeau) on jacquesthibs's Shortform · 2023-11-20T01:32:39.745Z · LW · GW

My current speculation as to what is happening at OpenAI

How do we know this wasn't their best opportunity to strike if Sam was indeed not being totally honest with the board?

Let's say the rumours are true: Sam is building out external orgs (an NVIDIA competitor and an iPhone-like competitor) to escape the power of the board and is potentially going against the charter. Would this 'conflict of interest' be enough? If you take that story forward, it sounds more and more like he was setting up AGI to be run by external companies, using OpenAI as a fundraising bargaining chip, and holding a significant financial interest in plugging AGI into those outside orgs.

So, if we think about this strategically, how long should they wait as board members who are trying to uphold the charter?

On top of this, it seems (according to Sam) that OpenAI has recently made a significant transformer-level breakthrough, which implies a significant capability jump. Long-term reasoning? Basically, anything short of 'coming up with novel insights in physics' is on the table, given that Sam recently cited that as the bar we need to cross to get to AGI.

So, it could be a mix of, Ilya thinking they have achieved AGI while Sam places a higher bar (internal communication disagreements) + the board not being alerted (maybe more than once) about what Sam is doing, e.g. fundraising for both OpenAI and the orgs he wants to connect AGI to + new board members who are more willing to let Sam and GDB do what they want being added soon (another rumour I've heard) + ???. Basically, perhaps they saw this as their final opportunity to have any veto on actions like this.

Here's what I currently believe:

  • There is a GPT-5-like model that already exists. It could be GPT-4.5 or something else, but another significant capability jump. Potentially even a system that can coherently pursue goals for months, capable of continual learning, and effectively able to automate like 10% of the workforce (if they wanted to).
  • As of 5 PM Sunday PT, the board is in a terrible position: either they stay on the board and the company's employees all move to a new company, or they leave the board and bring Sam back. If they leave, they need to say that Sam did nothing wrong and sweep everything under the rug (and then potentially face legal action for having said he did something wrong); otherwise, Sam won't come back.
  • Sam is building companies externally; it is unclear if this goes against the charter. But he does now have a significant financial incentive to speed up AI development. Adam D'Angelo said that he would like to prevent OpenAI from becoming a big tech company as part of his time on the board because AGI was too important for humanity. They might have considered Sam's action going in this direction.
  • A few people left the board in the past year. It's possible that Sam and GDB planned to add new people (possibly even change current board members) to the board to dilute the voting power a bit or at least refill board seats. This meant that the current board had limited time until their voting power would become less important. They might have felt rushed.
  • The board is either not speaking publicly because 1) they can't share information about GPT-5, 2) there is some legal reason that I don't understand (more likely), or 3) they are incompetent (least likely by far IMO).
  • We will possibly never find out what happened, or it will become clearer by the month as new things come out (companies and models). However, it seems possible the board will never say or admit anything publicly at this point.
  • Lastly, we still don't know why the board decided to fire Sam. It could be any of the reasons above, a mix of them, or something we just don't know about.

Other possible things:

  • Ilya was mad that they wouldn't actually get enough compute for Superalignment as promised due to GPTs and other products using up all the GPUs.
  • Ilya is frustrated that Sam is focused on things like GPTs rather than the ultimate goal of AGI.
Comment by jacquesthibs (jacques-thibodeau) on jacquesthibs's Shortform · 2023-11-16T21:12:28.585Z · LW · GW

Project idea: GPT-4-Vision to help conceptual alignment researchers during whiteboard sessions and beyond

Thoughts?

  • Advice on how to get unstuck
  • Unclear what should be added on top of normal GPT-4-Vision capabilities to make it especially useful, maybe connect it to local notes + search + ???
  • How to make it super easy to use while also being hyper-effective at producing the best possible outputs
  • Some alignment researchers don't want their ideas passed through the OpenAI API, and some probably don't care
  • Could be used for inputting book pages, papers with figures, ???
Comment by jacquesthibs (jacques-thibodeau) on jacquesthibs's Shortform · 2023-11-16T18:47:49.414Z · LW · GW

Or perhaps as @Nora Belrose mentioned to me: "Perhaps we should queer the interpolation-extrapolation distinction."

Comment by jacquesthibs (jacques-thibodeau) on jacquesthibs's Shortform · 2023-11-16T16:26:16.333Z · LW · GW

Title: Is the alignment community over-updating on how scale impacts generalization?

So, apparently, there's a rebuttal to the recent Google generalization paper (also worth pointing out that it wasn't done with language models, just sinusoidal functions):

But then, the paper author responds:


This line of research makes me question one thing: "Is the alignment community over-updating on how scale impacts generalization?"

It remains to be seen how well models will generalize outside of their training distribution (interpolation vs extrapolation).

In other words, when people say that GPT-4 (and other LLMs) can generalize, I think they need to be more careful about what they really mean. Is it doing interpolation or extrapolation? Meaning, yes, GPT-4 can do things like write a completely new poem, but poems and related stuff were in its training distribution! So, you can say it is generalizing, but I think it's a much weaker form of generalization than what people really imply when they say generalization. A stronger form of generalization would be an AI that can do completely new tasks that are actually outside of its training distribution.
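
To make the interpolation/extrapolation distinction concrete in the same toy setting as the rebuttal (fitting a sinusoid with a small neural net; this is my own sketch, not the paper's code): the fit looks great inside the training range and falls apart outside it.

    # Toy interpolation-vs-extrapolation demo: fit sin(x) on [-pi, pi], test outside that range.
    import numpy as np
    from sklearn.neural_network import MLPRegressor
    from sklearn.metrics import mean_squared_error

    rng = np.random.default_rng(0)

    X_train = rng.uniform(-np.pi, np.pi, size=(2000, 1))
    y_train = np.sin(X_train).ravel()

    model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=5000, random_state=0)
    model.fit(X_train, y_train)

    X_interp = np.linspace(-np.pi, np.pi, 500).reshape(-1, 1)         # inside the training distribution
    X_extrap = np.linspace(2 * np.pi, 3 * np.pi, 500).reshape(-1, 1)  # well outside it

    mse_interp = mean_squared_error(np.sin(X_interp).ravel(), model.predict(X_interp))
    mse_extrap = mean_squared_error(np.sin(X_extrap).ravel(), model.predict(X_extrap))
    print(f"interpolation MSE: {mse_interp:.4f}")   # small: the net nails the region it saw
    print(f"extrapolation MSE: {mse_extrap:.4f}")   # large: it does not 'know' sine is periodic

Whether the same picture carries over to LLMs trained on internet-scale data is exactly the open question here.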

Now, at this point, you might say, "yes, but we know that LLMs learn functions and algorithms to do tasks, and as you scale up and compress more and more data, you will uncover more meta-algorithms that can do this kind of extrapolation/tasks outside of the training distribution."

Well, two things:

  1. It remains to be seen when or if this will happen in the current paradigm (no matter how much you scale up).
  2. It's not clear to me how well things like induction heads continue to work on things that are outside of their training distribution. If they don't adapt well, then it may be the same thing for other algorithms. What this would mean in practice, I'm not sure. I've been looking at relevant papers, but haven't found an answer yet.

This brings me to another point: it also remains to be seen how much this will matter in practice, given that models are trained on so much data and things like online learning are coming. Scaffolding, specialized AI models, and new innovations might make such a limitation (if there is one) not a big deal.

Also, perhaps most of the important capabilities come from interpolation. Perhaps intelligence is largely just interpolation? You just need to interpolate and push the boundaries of capability one step at a time, iteratively, like a scientist conducting experiments would. You just need to integrate knowledge as you interact with the world.

But what of brilliant insights from our greatest minds? Is it just recursive interpolation+small_external_interactions? Is there something else they are doing to get brilliant insights? Would AGI still ultimately be limited in the same way (even if it can run many of these genius patterns in parallel)?

Comment by jacquesthibs (jacques-thibodeau) on jacquesthibs's Shortform · 2023-11-15T17:34:20.514Z · LW · GW

More predictions/insights from Jimmy and crew. He's implying (as I have also been saying) that some people are far too focused on scale over training data and architectural improvements. IMO, the bitter lesson is a thing, but I think we've over-updated on it.

Relatedly, someone shared a new 13B model that is apparently comparable to GPT-4 in logical reasoning (based on benchmarks, which I don't usually trust too much). Note that the model is a solver-augmented LM.

Here's some context regarding the paper:

Comment by jacquesthibs (jacques-thibodeau) on jacquesthibs's Shortform · 2023-11-15T16:37:20.174Z · LW · GW

Oh, that’s great, thanks! Also reminded me of (the less official, more comedy-based) “Community Notes Violating People”. @Viliam 

Comment by jacquesthibs (jacques-thibodeau) on jacquesthibs's Shortform · 2023-11-15T13:16:28.093Z · LW · GW

I don’t think so, unfortunately.

Comment by jacquesthibs (jacques-thibodeau) on jacquesthibs's Shortform · 2023-11-15T04:09:15.811Z · LW · GW

I've also started working on a repo in order to make Community Notes more efficient by using LLMs.

Comment by jacquesthibs (jacques-thibodeau) on jacquesthibs's Shortform · 2023-11-14T20:49:20.543Z · LW · GW

Good points; I'll keep them all in mind. If money is the roadblock, we can put pressure on the companies to do this. Or, worst-case, maybe the government can enforce it (though that should be done with absolute care).

Comment by jacquesthibs (jacques-thibodeau) on jacquesthibs's Shortform · 2023-11-14T20:11:45.389Z · LW · GW

Sure, but sometimes it's just a PM and a couple of other people that lead to a feature being implemented. Also, keep in mind that Community Notes was a thing before Musk. Why was Twitter different than other social media websites?

Also, the Community Notes code was apparently completely revamped by a few people working on the open-source code, which got it to a point where it was easy to implement, and everyone liked the feature because it noticeably worked.

Either way, I'd rather push for making it happen and have it somehow fail on other websites than be pessimistic and not try at all. If it needs someone higher up the chain, let's make it happen.

Comment by jacquesthibs (jacques-thibodeau) on jacquesthibs's Shortform · 2023-11-14T19:52:25.042Z · LW · GW

Don't forget that we train language models on the internet! The more truthful your dataset is, the more truthful the models will be! Let's revamp the internet for truthfulness, and we'll subsequently improve truthfulness in our AI systems!!