Posts

The Best Essay (Paul Graham) 2024-03-11T19:25:42.176Z
Can we get an AI to do our alignment homework for us? 2024-02-26T07:56:22.320Z
What's the theory of impact for activation vectors? 2024-02-11T07:34:48.536Z
Notice When People Are Directionally Correct 2024-01-14T14:12:37.090Z
Are Metaculus AI Timelines Inconsistent? 2024-01-02T06:47:18.114Z
Random Musings on Theory of Impact for Activation Vectors 2023-12-07T13:07:08.215Z
Goodhart's Law Example: Training Verifiers to Solve Math Word Problems 2023-11-25T00:53:26.841Z
Upcoming Feedback Opportunity on Dual-Use Foundation Models 2023-11-02T04:28:11.586Z
On Having No Clue 2023-11-01T01:36:10.520Z
Is Yann LeCun strawmanning AI x-risks? 2023-10-19T11:35:08.167Z
Don't Dismiss Simple Alignment Approaches 2023-10-07T00:35:26.789Z
What evidence is there of LLMs containing world models? 2023-10-04T14:33:19.178Z
The Role of Groups in the Progression of Human Understanding 2023-09-27T15:09:45.445Z
The Flow-Through Fallacy 2023-09-13T04:28:28.390Z
Chariots of Philosophical Fire 2023-08-26T00:52:45.405Z
Call for Papers on Global AI Governance from the UN 2023-08-20T08:56:58.745Z
Yann LeCun on AGI and AI Safety 2023-08-06T21:56:52.644Z
A Naive Proposal for Constructing Interpretable AI 2023-08-05T10:32:05.446Z
What does the launch of x.ai mean for AI Safety? 2023-07-12T19:42:47.060Z
The Unexpected Clanging 2023-05-18T14:47:01.599Z
Possible AI “Fire Alarms” 2023-05-17T21:56:02.892Z
Google "We Have No Moat, And Neither Does OpenAI" 2023-05-04T18:23:09.121Z
Why do we care about agency for alignment? 2023-04-23T18:10:23.894Z
Metaculus Predicts Weak AGI in 2 Years and AGI in 10 2023-03-24T19:43:18.522Z
Wittgenstein's Language Games and the Critique of the Natural Abstraction Hypothesis 2023-03-16T07:56:18.169Z
The Law of Identity 2023-02-06T02:59:16.397Z
What is the risk of asking a counterfactual oracle a question that already had its answer erased? 2023-02-03T03:13:10.508Z
Two Issues with Playing Chicken with the Universe 2022-12-31T06:47:52.988Z
Decisions: Ontologically Shifting to Determinism 2022-12-21T12:41:30.884Z
Is Paul Christiano still as optimistic about Approval-Directed Agents as he was in 2018? 2022-12-14T23:28:06.941Z
How is the "sharp left turn" defined? 2022-12-09T00:04:33.662Z
What are the major underlying divisions in AI safety? 2022-12-06T03:28:02.694Z
AI Safety Microgrant Round 2022-11-14T04:25:17.510Z
Counterfactuals are Confusing because of an Ontological Shift 2022-08-05T19:03:46.925Z
Getting Unstuck on Counterfactuals 2022-07-20T05:31:15.045Z
Which AI Safety research agendas are the most promising? 2022-07-13T07:54:30.427Z
Is CIRL a promising agenda? 2022-06-23T17:12:51.213Z
Has there been any work on attempting to use Pascal's Mugging to make an AGI behave? 2022-06-15T08:33:20.188Z
Want to find out about our events? 2022-06-09T15:52:14.544Z
Want to find out about our events? 2022-06-09T15:48:37.789Z
AI Safety Melbourne - Launch 2022-05-29T16:37:52.538Z
The New Right appears to be on the rise for better or worse 2022-04-23T19:36:58.661Z
AI Alignment and Recognition 2022-04-08T05:39:36.015Z
Strategic Considerations Regarding Autistic/Literal AI 2022-04-06T14:57:11.494Z
Results: Circular Dependency of Counterfactuals Prize 2022-04-05T06:29:56.252Z
What are some ways in which we can die with more dignity? 2022-04-03T05:32:58.957Z
General Thoughts on Less Wrong 2022-04-03T04:09:35.771Z
Sydney AI Safety Fellowship Review 2022-04-02T07:11:45.130Z
Community Building: Micro vs. Macro 2022-04-02T07:10:57.216Z
Challenges with Breaking into MIRI-Style Research 2022-01-17T09:23:34.468Z

Comments

Comment by Chris_Leong on ejenner's Shortform · 2024-03-15T06:01:01.515Z · LW · GW

Doing stuff manually might provide helpful intuitions/experience for automating it?

Comment by Chris_Leong on Explaining the AI Alignment Problem to Tibetan Buddhist Monks · 2024-03-15T04:31:21.201Z · LW · GW

I would be very interested to know what the monks think about this.

Comment by Chris_Leong on How I turned doing therapy into object-level AI safety research · 2024-03-14T14:00:34.780Z · LW · GW

I think it's much easier to talk about boundaries than preferences because true boundaries don't really contradict between individuals


I'm quite curious about this. What if you're stuck on an island with multiple people and limited food?

Comment by Chris_Leong on 'Empiricism!' as Anti-Epistemology · 2024-03-14T09:33:40.364Z · LW · GW

Very Wittgensteinian:

“What is your aim in Philosophy?”

“To show the fly the way out of the fly-bottle” (Philosophical Investigations)

Comment by Chris_Leong on jeffreycaruso's Shortform · 2024-03-14T00:08:18.641Z · LW · GW

Oh, they're definitely valid questions. The problem is that the second question is rather vague. You need to either state what a good answer would look like or why existing answers aren't satisfying.

Comment by Chris_Leong on jeffreycaruso's Shortform · 2024-03-13T16:16:55.207Z · LW · GW

I downvoted this post, and I claim that's for the public good. You may find this strange, but let me explain my reasoning.

You've come to Less Wrong, a website that probably has more discussion of this than any other site on the internet. If you want to find arguments, they aren't hard to find. It's a bit like walking into a library and saying that you can't find a book to read.

The trouble isn't that you literally can't find any books/arguments; it's that you've got a bunch of unstated requirements that you want satisfied. Now that's perfectly fine, it's good to have standards. At the same time, you've asked the question in a maximally vague way. I don't expect you to be able to list all your requirements; that's probably impossible, and even when it is possible, it's often a lot of work. Even so, I do believe it's possible to do better than maximally vague.

The problem with maximally vague questions is that they almost guarantee that any attempt to provide an answer will be unsatisfying both for the person answering and the person receiving the answer. Worse, you've framed the question in such a way that some people will likely feel compelled to attempt to answer anyway, lest people who think that there is such a risk come off as unable to respond to critics.

If that's the case, downvoting seems logical. Why support a game where no-one wins?

Sorry if this comes off as harsh, that's not my intent. I'm simply attempting to prompt reflection.

Comment by Chris_Leong on Wei Dai's Shortform · 2024-03-11T02:20:53.174Z · LW · GW

I have access to Gemini 1.5 Pro. I'm willing to run experiments if you provide me with an exact experiment to run and cover what they charge me (I'm assuming it's paid; I haven't used it yet).

Comment by Chris_Leong on TurnTrout's shortform feed · 2024-03-05T03:25:50.750Z · LW · GW

“But also this person doesn't know about internal invariances in NN space or the compressivity of the parameter-function map (the latter in particular is crucial for reasoning about inductive biases), then I become extremely concerned”

Have you written about this anywhere?

Comment by Chris_Leong on Wei Dai's Shortform · 2024-03-02T00:02:01.561Z · LW · GW

Have you tried talking to professors about these ideas?

Comment by Chris_Leong on Bengio's Alignment Proposal: "Towards a Cautious Scientist AI with Convergent Safety Bounds" · 2024-03-01T11:23:28.154Z · LW · GW

Is there anyone who understands GFlowNets who can provide a high-level summary of how they work?

Comment by Chris_Leong on Counting arguments provide no evidence for AI doom · 2024-02-29T05:30:48.685Z · LW · GW

Nabgure senzr gung zvtug or hfrshy:

Gurer'f n qvssrerapr orgjrra gur ahzore bs zngurzngvpny shapgvbaf gung vzcyrzrag n frg bs erdhverzragf naq gur ahzore bs cebtenzf gung vzcyrzrag gur frg bs erdhverzragf.

Fvzcyvpvgl vf nobhg gur ynggre, abg gur sbezre.

Gur rkvfgrapr bs n ynetr ahzore bs cebtenzf gung cebqhpr gur rknpg fnzr zngurzngvpny shapgvba pbagevohgrf gbjneqf fvzcyvpvgl.

Comment by Chris_Leong on Counting arguments provide no evidence for AI doom · 2024-02-28T21:15:42.413Z · LW · GW

I wrote up my views on the principle of indifference here:

https://www.lesswrong.com/posts/3PXBK2an9dcRoNoid/on-having-no-clue

I agree that it has certain philosophical issues, but I don’t think this is as fatal to counting arguments as you believe.

Towards the end I write:

“The problem is that we are making an assumption, but rather than owning it, we're trying to deny that we're making any assumption at all, ie. "I'm not assuming a priori A and B have equal probability based on my subjective judgement, I'm using the principle of indifference". Roll to disbelieve.”

I feel less confident in my post than when I wrote it, but it still feels more credible than the position articulated in this post.

Otherwise: this was an interesting post. Well done on identifying some arguments that I need to digest.

Comment by Chris_Leong on Benito's Shortform Feed · 2024-02-28T05:32:08.463Z · LW · GW

Maybe just say that you're tracking the possibility?

Comment by Chris_Leong on New LessWrong review winner UI ("The LeastWrong" section and full-art post pages) · 2024-02-28T02:57:08.730Z · LW · GW

Is there going to be a link to this from somewhere to make it accessible?

Comment by Chris_Leong on Can we get an AI to do our alignment homework for us? · 2024-02-27T13:17:34.908Z · LW · GW

I think an important crux here is whether you think that we can build institutions which are reasonably good at checking the quality of AI safety work done by humans

 

Why is this an important crux? Is it necessarily the case that if we can reliably check AI safety work done by humans, then we can reliably check AI safety work done by AIs which may be optimising against us?

Comment by Chris_Leong on Can we get an AI to do our alignment homework for us? · 2024-02-26T23:54:40.452Z · LW · GW

Updated

Comment by Chris_Leong on Can we get an AI to do our alignment homework for us? · 2024-02-26T16:14:12.072Z · LW · GW

Second, it is also possible to robustly verify the outputs of a superhuman intelligence without superhuman intelligence.


Why do you believe that a superhuman intelligence wouldn't be able to deceive you by producing outputs that look correct instead of outputs that are correct?

Comment by Chris_Leong on My guess at Conjecture's vision: triggering a narrative bifurcation · 2024-02-26T07:52:05.744Z · LW · GW

I guess the main doubt I have with this strategy is that even if we shift the vast majority of people/companies towards more interpretable AI, there will still be some actors who pursue black-box AI. Wouldn't we just get screwed by those actors? I don't see how CoEm can be of equivalent power to purely black-box automation.

That said, there may be ways to integrate CoEms into the Super Alignment strategy.

Comment by Chris_Leong on Mapping the semantic void: Strange goings-on in GPT embedding spaces · 2024-02-25T22:33:01.398Z · LW · GW

GPT-J token embeddings inhabit a zone in their 4096-dimensional embedding space formed by the intersection of two hyperspherical shells


You may want to update the TLDR if you agree with the comments that indicate that this might not be accurate.

Comment by Chris_Leong on Mapping the semantic void: Strange goings-on in GPT embedding spaces · 2024-02-25T22:29:32.785Z · LW · GW

If there are 100 tokens for snow, that probably indicates it's a particularly important concept for that language.

Comment by Chris_Leong on How well do truth probes generalise? · 2024-02-25T14:51:35.412Z · LW · GW

For Linear Tomography and Principal Component Analysis, I'm assuming that by unsupervised you mean that you don't use the labels for finding the vector, but that you do use them for determining which sign is true and which is false. If so, this might be worth clarifying in the table.
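To illustrate what I mean by "unsupervised except for the sign", here's a rough sketch (my own toy code with hypothetical names, not the paper's implementation):

```python
import numpy as np

def pca_truth_direction(activations, labels):
    """activations: [n_statements, d_model]; labels: 1 for true, 0 for false."""
    centered = activations - activations.mean(axis=0)
    # The direction itself is found without labels: top principal component.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    direction = vt[0]
    # Labels are only used afterwards, to decide which sign corresponds to "true".
    scores = centered @ direction
    if scores[labels == 1].mean() < scores[labels == 0].mean():
        direction = -direction
    return direction
```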

Comment by Chris_Leong on Communication Requires Common Interests or Differential Signal Costs · 2024-02-25T09:14:33.424Z · LW · GW

Agreed. Good counter-example.

I'm very curious as to whether Zac has a way of reformulating his claim to save it.

Comment by Chris_Leong on the gears to ascenscion's Shortform · 2024-02-24T01:30:15.500Z · LW · GW

Well done for writing this up! Admissions like this are often hard to write.

Have you considered trying to use any credibility from helping to cofound Vast for public outreach purposes?

Comment by Chris_Leong on Do sparse autoencoders find "true features"? · 2024-02-23T22:56:21.636Z · LW · GW

Isn’t that just one batch?

Comment by Chris_Leong on Research Post: Tasks That Language Models Don’t Learn · 2024-02-23T14:56:05.178Z · LW · GW

Does GPT-4 directly handle the image input or is it converted to text by a separate model then fed into GPT-4?

Comment by Chris_Leong on Do sparse autoencoders find "true features"? · 2024-02-23T11:27:43.241Z · LW · GW

A potential approach to tackle this could be to aim to discover features in smaller batches. After each batch of discovered features finishes learning we could freeze them and only calculate the orthogonality regularisation within the next batch, as well as between the next batch and the frozen features. Importantly we wouldn’t need to apply the regularisation within the already discovered features.


Wouldn't this still be quadratic?
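To spell out why I'm asking, here's a rough sketch of how I'm reading the proposed regularisation (my own toy code, hypothetical names and shapes, not the authors' implementation):

```python
import torch
import torch.nn.functional as F

def orthogonality_penalty(new_feats, frozen_feats):
    # new_feats: [b, d] feature directions being learned in the current batch
    # frozen_feats: [f, d] directions discovered in earlier batches (frozen)
    new = F.normalize(new_feats, dim=-1)
    frozen = F.normalize(frozen_feats, dim=-1)
    within = (new @ new.T).abs()                     # b x b comparisons within the new batch
    within = within * (1 - torch.eye(new.shape[0]))  # drop self-similarity terms
    cross = (new @ frozen.T).abs()                   # b x f comparisons against frozen features
    return within.sum() + cross.sum()

# Across batches the cross term grows as roughly b*0 + b*b + b*2b + ..., which still
# sums to O(N^2) comparisons for N total features -- hence my question above.
```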

Comment by Chris_Leong on Research Post: Tasks That Language Models Don’t Learn · 2024-02-23T10:50:34.662Z · LW · GW

You state that GPT-4 is multi-modal, but my understanding was that it wasn't natively multi-modal. I thought that the extra features like images and voice input were hacked on - ie. instead of generating an image itself, it generates a query to be sent to DALL-E. Is my understanding here incorrect?

In any case, it could just be a matter of scale. Maybe these kinds of tasks are rare enough in internet data that being able to model them doesn't improve the loss very much? And perhaps the instruction fine-tuning focused on more practical tasks?

Comment by Chris_Leong on Task vectors & analogy making in LLMs · 2024-02-22T05:50:45.879Z · LW · GW

"Previous post" links to localhost.

Comment by Chris_Leong on O O's Shortform · 2024-02-18T04:01:34.017Z · LW · GW

I think it's helping people realise:

a) That change is happening crazily fast
b) That the change will have major societal consequences, even if it is just a period of adjustment
c) That the speed makes it tricky for society and governments to navigate these consequences

Comment by Chris_Leong on OpenAI's Sora is an agent · 2024-02-16T09:02:08.153Z · LW · GW

It's worth noting that there are media reports that OpenAI is developing agents that will use your phone or computer. I suppose it's not surprising that this would be their next step given how far a video generation model takes you towards this, although I do wonder how they expect these agents to operate with any reliability given the propensity of ChatGPT to hallucinate.

Comment by Chris_Leong on OpenAI's Sora is an agent · 2024-02-16T08:49:22.925Z · LW · GW

It seems like there should be a connection here with Karl Friston's active inference. After all, both your approach and his theory involve taking a predictive engine and using it to produce actions.

Comment by Chris_Leong on The case for more ambitious language model evals · 2024-02-15T02:50:56.488Z · LW · GW

IIRC, there was also evidence that Copilot was modulating code quality based on name ethnicity variations in code docs


You don't know where they heard that?

Comment by Chris_Leong on Dreams of AI alignment: The danger of suggestive names · 2024-02-11T07:39:27.961Z · LW · GW

I'm not saying that people can't ground it out. I'm saying that if you try to think or communicate using really verbose terms it'll reduce your available working memory which will limit your ability to think new thoughts.

Comment by Chris_Leong on Dreams of AI alignment: The danger of suggestive names · 2024-02-10T04:42:46.585Z · LW · GW

You can replace "optimal" with "artifact equilibrated under policy update operations"


I don't think most people can. If you don't like the connotations of existing terms, I think you need to come up with new terms and they can't be too verbose or people won't use them.

One thing that makes these discussions tricky is that the aptness of these names likely depends on your object-level position. If you hold the AI optimist position, then you likely feel these names are biasing people towards an incorrect conclusion. If you hold the AI pessimist position, you likely see many of these connotations as actually a positive, in terms of pointing people towards useful metaphors, even if people occasionally slip up and reify the terms.

Also, have you tried having a moderated conversation with someone who disagrees with you? Sometimes that can help resolve communication barriers.

Comment by Chris_Leong on Transfer learning and generalization-qua-capability in Babbage and Davinci (or, why division is better than Spanish) · 2024-02-09T10:00:11.207Z · LW · GW

It might be useful to produce a bidirectional measure of similarity by taking the geometric mean of the transfer from A to B and from B to A.
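Concretely, something like $\mathrm{sim}(A, B) = \sqrt{T(A \to B) \cdot T(B \to A)}$, where $T(A \to B)$ is the transfer from training on task A to performance on task B (this assumes both transfer values are non-negative).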

Really cool results!

Comment by Chris_Leong on Believing In · 2024-02-08T07:37:59.921Z · LW · GW

This ties in nicely with Wittgenstein’s notion of language games. TLDR: Look at the role the phrase serves, rather than the exact words.

Comment by Chris_Leong on Why I think it's net harmful to do technical safety research at AGI labs · 2024-02-07T05:51:09.425Z · LW · GW

I heard via via

 

How did you hear this?

Comment by Chris_Leong on Red-teaming language models via activation engineering · 2024-02-05T07:12:31.848Z · LW · GW

One of the main challenges I see here is how to calibrate this. In other words, if I can't break a model despite adding an activation vector of strength x, what does this mean in terms of how safe we should consider the model to be? ie. How much extra adversarial prompting effort is that equivalent to, or how should I modify my probability that the model is safe?

Comment by Chris_Leong on OpenAI report also finds no effect of current LLMs on viability of bioterrorism attacks · 2024-02-05T02:59:07.251Z · LW · GW

Gary Marcus has criticised the results here:

What [C] is referring to is a technique called Bonferroni correction, which statisticians have long used to guard against “fishing expeditions” in which a scientist tries out a zillion different post hoc correlations, with no clear a priori hypothesis, and reports the one random thing that sorta vaguely looks like it might be happening and makes a big deal of it, ignoring a whole bunch of other similar hypotheses that failed. (XKCD has a great cartoon about that sort of situation.)

But that’s not what is going on here, and as one recent review put it, Bonferroni should not be applied “routinely”. It makes sense to use it when there are many uncorrelated tests and no clear prior hypothesis, as in the XKCD cartoon. But here there is an obvious a priori test: does using an LLM make people more accurate? That’s what the whole paper is about. You don’t need a Bonferroni correction for that, and shouldn’t be using it. Deliberately or not (my guess is not), OpenAI has misanalyzed their data in a way which underreports the potential risk. As a statistician friend put it “if somebody was just doing stats robotically, they might do it this way, but it is the wrong test for what we actually care about”.

In fact, if you simply collapsed all the measurements of accuracy, and did the single most obvious test here, a simple t-test, the results would (as Footnote C implies) be significant. A more sophisticated test would be an ANCOVA, which as another knowledgeable academic friend with statistical expertise put it, having read a draft of this essay, “would almost certainly support your point that an omnibus measure of AI boost (from a weighted sum of the five dependent variables) would show a massively significant main effect, given that 9 out of the 10 pairwise comparisons were in the same direction.”

Also, there was likely an effect, but sample sizes were too small to detect this:

There were 50 experts; 25 with LLM access, 25 without. From the reprinted table we can see that 1 in 25 (4%) experts without LLMs succeeded in the formulation task, whereas 4 in 25 with LLM access succeeded (16%).
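As a quick sanity check of my own (not something from the report or from Marcus's post), a Fisher exact test on just this formulation-task outcome comes nowhere near significance, which is consistent with the study being underpowered rather than with there being no effect:

```python
from scipy.stats import fisher_exact

table = [[4, 21],   # experts with LLM access: succeeded, failed
         [1, 24]]   # experts without LLM access: succeeded, failed
odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(odds_ratio, p_value)  # p comes out well above 0.05 for this single comparison
```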

Comment by Chris_Leong on My thoughts on the Beff Jezos - Connor Leahy debate · 2024-02-04T06:52:41.273Z · LW · GW

If I'm being honest, I don't see Beff as worthy of debating Yoshua Bengio.

Comment by Chris_Leong on Because of LayerNorm, Directions in GPT-2 MLP Layers are Monosemantic · 2024-01-29T01:51:08.777Z · LW · GW

Also: It seems like there would be an easier way to arrive at the observation this post makes, ie. directly showing that kV and V get mapped to the same point by layer norm (excluding the epsilon).
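Spelling out what I mean, for the normalisation step only (before the learned gain and bias, and ignoring the epsilon), for any $k > 0$:

$$\mathrm{LN}(kV) = \frac{kV - \mu(kV)}{\sigma(kV)} = \frac{k\,(V - \mu(V))}{k\,\sigma(V)} = \mathrm{LN}(V)$$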

Don't get me wrong, the circle is cool, but seems like it's a bit of a detour.

Comment by Chris_Leong on Because of LayerNorm, Directions in GPT-2 MLP Layers are Monosemantic · 2024-01-28T07:28:49.568Z · LW · GW

Just to check I understand this correctly: from what I can gather it seems that this shows that LayerNorm is monosemantic if your residual stream activation is just that direction. It doesn't show that it is monosemantic for the purposes of doing vector addition where we want to stack multiple monosemantic directions at once. That is, if you want to represent other dimensions as well, these might push the LayerNormed vector into a different spline. Am I correct here?

That said, maybe we can model the other dimensions as random jostling in such a way that it all cancels out if a lot of dimensions are activated?

Comment by Chris_Leong on Don't sleep on Coordination Takeoffs · 2024-01-28T00:00:36.552Z · LW · GW
  1. What do you see as the low-hanging co-ordination fruit?
  2. Raising the counter-culture movement seems strange. I didn’t really see them as focused on co-ordination.
Comment by Chris_Leong on The case for ensuring that powerful AIs are controlled · 2024-01-27T06:49:04.089Z · LW · GW

Do you think it is likely that techniques like RLHF result in over-developed persuasiveness relative to other capabilities? If so, do you think we can modify the training to make this less of an issue or that it is otherwise manageable?

Comment by Chris_Leong on Matthew Barnett's Shortform · 2024-01-27T04:08:21.576Z · LW · GW

Also: How are funding and attention "arbitrary" factors?

Comment by Chris_Leong on LLMs can strategically deceive while doing gain-of-function research · 2024-01-26T22:51:57.088Z · LW · GW

You mean where they said that it was unlikely to succeed?

Comment by Chris_Leong on LLMs can strategically deceive while doing gain-of-function research · 2024-01-26T07:59:40.848Z · LW · GW

Good on you for doing this research, but to me it's a lot less interesting because you had the supervisor say: "In theory you can send them fake protocol, or lie about the biosecurity risk level, but it's a gamble, they might notice it or they might not." Okay, they didn't explicitly say to lie, but they explicitly told the AI to consider that possibility.

Comment by Chris_Leong on This might be the last AI Safety Camp · 2024-01-26T06:05:12.648Z · LW · GW

Regardless of whether or not it's AI Safety Camp, I think it's important to have at least one intro-level research program, particularly because programs like SERI MATS ask about previous research experience in their applications.

I can see merit both in Oliver's views about the importance of nudging people down useful research directions and in Linda's views on assuming that participants are adults. I'm still undecided on who I'll ultimately end up agreeing with, so I'd love to hear other people's opinions.

Comment by Chris_Leong on RAND report finds no effect of current LLMs on viability of bioterrorism attacks · 2024-01-26T02:56:07.068Z · LW · GW

Having just read through this, one key point that I haven't seen people mention is that the results are for LLMs that need to be jailbroken.

So these results are more relevant to the release of a model over an API than to an open-source release, where you'd just fine-tune away the safeguards or download a model without safeguards in the first place.

Comment by Chris_Leong on We need a Science of Evals · 2024-01-23T23:38:28.229Z · LW · GW

I think it’s worth also raising the possibility of a Kuhnian scenario where the “mature science” is actually missing something and a further breakthrough is required after that to move it into a new paradigm.