Posts

TheManxLoiner's Shortform 2024-12-20T10:30:14.908Z
How to make evals for the AISI evals bounty 2024-12-03T10:44:45.700Z
Scattered thoughts on what it means for an LLM to believe 2024-11-06T22:10:29.429Z
AI as a powerful meme, via CGP Grey 2024-10-30T18:31:58.544Z
Distillation of 'Do language models plan for future tokens' 2024-06-27T20:57:34.351Z
How to build a data center, by Construction Physics 2024-06-10T17:38:17.366Z
AI Safety Institute's Inspect hello world example for AI evals 2024-05-16T20:47:44.346Z
My experience at ML4Good AI Safety Bootcamp 2024-04-13T10:55:46.621Z

Comments

Comment by TheManxLoiner on TheManxLoiner's Shortform · 2024-12-20T10:30:15.026Z · LW · GW

Should LessWrong have an anonymous mode? When reading a post or comments, is it useful to have the username or does that introduce bias?

I had this thought after reading this review of LessWrong: https://nathanpmyoung.substack.com/p/lesswrong-expectations-vs-reality

Comment by TheManxLoiner on Visual demonstration of Optimizer's curse · 2024-12-14T00:24:26.234Z · LW · GW

Sounds sensible to me!

Comment by TheManxLoiner on Visual demonstration of Optimizer's curse · 2024-12-13T19:10:00.257Z · LW · GW

What do we mean by the error $\hat{V} - V$?

I think the setting is:

  • We have a true value function $V$.
  • We have a process to learn an estimate of $V$. We run this process once and we get $\hat{V}$.
  • We then ask an AI system to act so as to maximize $\hat{V}$ (its estimate of human values).

So in this context, the error $\hat{V} - V$ is just a fixed function measuring the error between the learnt values and the true values.

I think the confusion could be from using the same symbol to represent both a single instance and the random variable/process.
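
To make the setting concrete, here is a minimal numerical sketch (the names and numbers are mine, not from the post): the estimation error is drawn once, and the agent then maximizes its fixed estimate.

```python
import numpy as np

# Minimal sketch of the setting above (illustrative, not the post's code).
# True value function V over options, a single learnt estimate V_hat, and an
# agent that picks the option maximizing V_hat. The error V_hat - V is a
# fixed, already-realized function, not a fresh random draw at query time.

rng = np.random.default_rng(0)
n_options = 1000

V = rng.normal(0, 1, n_options)       # true values
noise = rng.normal(0, 1, n_options)   # estimation error, drawn once
V_hat = V + noise                     # the single learnt estimate

best = np.argmax(V_hat)               # the AI maximizes its estimate
print("estimated value of chosen option:", V_hat[best])
print("true value of chosen option:     ", V[best])
# Typically V_hat[best] > V[best]: the optimizer's curse.
```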

Comment by TheManxLoiner on Deep Forgetting & Unlearning for Safely-Scoped LLMs · 2024-11-21T10:45:02.609Z · LW · GW

Thanks for this post! Very clear and great reference.

- You appear to use the term 'scope' in a particular technical sense. Could you give a one-line definition?
- Do you know if this agenda has been picked up since you made this post?

Comment by TheManxLoiner on Scattered thoughts on what it means for an LLM to believe · 2024-11-17T19:20:38.568Z · LW · GW

But in this Eiffel Tower example, I’m not sure what is correlating with what

The physical object Eiffel Tower is correlated with itself.
 

However, I think the basic ability of an LLM to correctly complete the sentence “the Eiffel Tower is in the city of…” is not very strong evidence of having the relevant kinds of dispositions.

It is highly predictive of the ability of the LLM to book flights to Paris, when I create an LLM-agent out of it and ask it to book a trip to see the Eiffel Tower.
 

I think the question about whether current AI systems have real goals and beliefs does indeed matter

I don't think we disagree here. To clarify, my belief is that there are threat models / solutions that are not affected by whether the AI has 'real' beliefs, and there are other threats/solutions where it does matter.

I think CGP Grey perspective puts more weight on Definition 3.

I actually do not understand the distinction between Definition 2 and Definition 3. Don't need to resolve it here. I've edited the post to include my uncertainty on this.

Comment by TheManxLoiner on Are we dropping the ball on Recommendation AIs? · 2024-11-17T19:02:29.831Z · LW · GW

Zvi's latest newsletter has a section on this topic! https://thezvi.substack.com/i/151331494/good-advice

Comment by TheManxLoiner on Emergence, The Blind Spot of GenAI Interpretability? · 2024-11-12T21:05:36.825Z · LW · GW

+1 to you continuing with this series.

Comment by TheManxLoiner on Automation collapse · 2024-11-12T20:40:36.805Z · LW · GW

  1. Pedantic point. You say "Automating AI safety means developing some algorithm which takes in data and outputs safe, highly-capable AI systems." I do not think semi-automated interpretability fits into this, as the output of interpretability (currently) is not a model but an explanation of existing models.
  2. Unclear why Level (1) does not break down into the 'empirical' vs 'human checking' distinction. In particular, how would this belief be obtained: "The humans are confident the details provided by the AI systems don’t compromise the safety of the algorithm."
  3. Unclear (but good chance I just need to think more carefully through the concepts) why Level (3) does not collapse to Level (1) too, using the same reasoning. Might be related to Martin's alternative framing.

Comment by TheManxLoiner on Are we dropping the ball on Recommendation AIs? · 2024-11-11T19:44:33.368Z · LW · GW

Couple of thoughts:
1. I recently found out about this new-ish social media platform. https://www.heymaven.com/.  Good chance they are researching alternative recommendation algorithms.
2. What particular actions do you think the rationality/EA community could take that existing big efforts, e.g. the projects of Tristan Harris or Jaron Lanier, have not already done enough of?

Comment by TheManxLoiner on Scattered thoughts on what it means for an LLM to believe · 2024-11-10T17:50:45.606Z · LW · GW

Thanks for the feedback! Have edited the post to include your remarks.

Comment by TheManxLoiner on AI as a powerful meme, via CGP Grey · 2024-11-03T15:12:15.492Z · LW · GW

The 'evolutionary pressures' being discussed by CGP Grey are not the direct gradient descent used to train an individual model. Instead, he is referring to the whole set of incentives we as a society put on AI models. Similar to memes - there is no gradient descent on memes.

(Apologies if you already understood this, but it seems your post and Steven Byrnes's post focus on the training of individual models.)

Comment by TheManxLoiner on Which LessWrong/Alignment topics would you like to be tutored in? [Poll] · 2024-10-19T22:19:30.016Z · LW · GW

What is the status of this project? Are there any estimates of timelines?

Comment by TheManxLoiner on Distillation of 'Do language models plan for future tokens' · 2024-06-28T07:31:38.925Z · LW · GW

Totally agree! This is my big weakness right now - hopefully as I read more papers I'll start developing a taste and ability to critique.

Comment by TheManxLoiner on Some ML-Related Math I Now Understand Better · 2024-04-18T19:17:25.816Z · LW · GW

Huge thanks for writing this! Particularly liked the SVD intuition and how it can be used to understand properties of a matrix. One small correction, I think. You wrote:

  $I - vv^T$ is simply the projection along the vector $v$

I think $vv^T$ is the projection along the vector $v$, so $I - vv^T$ is the projection onto the hyperplane perpendicular to $v$.
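
A quick numerical check of this, assuming $v$ is a unit vector (the variable names below are just for illustration):

```python
import numpy as np

# Check that vv^T projects onto the line spanned by a unit vector v,
# and that I - vv^T projects onto the hyperplane perpendicular to v.

rng = np.random.default_rng(0)
v = rng.normal(size=4)
v = v / np.linalg.norm(v)                      # make v a unit vector

P_along = np.outer(v, v)                       # vv^T
P_perp = np.eye(4) - P_along                   # I - vv^T

x = rng.normal(size=4)
print(np.allclose(P_along @ x, (v @ x) * v))   # component of x along v
print(np.isclose(v @ (P_perp @ x), 0.0))       # perpendicular part has no v-component
print(np.allclose(P_along @ P_along, P_along)) # idempotent, as a projection should be
```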

Comment by TheManxLoiner on A Universal Emergent Decomposition of Retrieval Tasks in Language Models · 2023-12-30T16:44:48.895Z · LW · GW

Interesting ideas, and nicely explained! Some questions:

1) First, notation: request patching means replacing the vector at activation A for R2 on C2 with the vector at the same activation A for R1 on C1 (see the sketch after question 3). Then the question: did you do any analysis on the set of vectors at A as you vary R and C? Based on your results, I expect that the vector at A is similar if you keep R the same and vary C.

2) I found the success on the toy prompt injection surprising! My intuition up to that point was that R and C are independently represented to a large extent, and that you could go from computing R2(C2) to R1(C2) by patching in R1 from the computation of R1(C1). But the success in preventing prompt injection means that corrupting C somehow corrupts R too, meaning that C and R are actually coupled. What is your intuition here?

3) How robust do you think the results are if you make C and R more complex? E.g. C contains multiple characters who come from various countries but live in the same city, and R is 'Where does character Alice come from?'.
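
For concreteness, here is roughly what I mean by request patching in question 1, written as a TransformerLens-style sketch. The prompts, layer, and patched position are placeholders, and the paper's actual implementation may well differ.

```python
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
hook_name = "blocks.8.hook_resid_post"   # the activation "A" at some layer

# Source run: request R1 on context C1; cache the activation A.
_, cache = model.run_with_cache("C1 text ... R1 question ...")
source_act = cache[hook_name][:, -1, :]  # vector at the final token position

def patch_hook(resid, hook):
    # Overwrite the same activation in the target run with the cached vector.
    resid[:, -1, :] = source_act
    return resid

# Target run: request R2 on context C2, with A patched in from (R1, C1).
patched_logits = model.run_with_hooks(
    "C2 text ... R2 question ...",
    fwd_hooks=[(hook_name, patch_hook)],
)
```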

Comment by TheManxLoiner on The Bat and Ball Problem Revisited · 2021-09-02T21:35:03.865Z · LW · GW

No need to apologise! I missed your response by even more time...

My instinct is that it is because of the relative size of the numbers, not the absolute size.

It might be an interesting experiment to see how the intuition varies based on the ratio of the total amount to the difference in amounts: "You have two items whose total cost is £1100 and the difference in price is £X. What is the price of the more expensive item?", where X can be 10p or £1 or £10 or £100 or £500 or £1000.

With X=10p, one possible instinct is 'that means they are basically the same price, so the more expensive item is £550 + 10p = £550.10'.
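
A quick calculation for these variants, using the fact that the more expensive item costs (total + difference) / 2:

```python
# Correct vs "add the whole difference" answers for total T = £1100 and
# various differences X (the variants suggested above).
T = 1100.00
for X in [0.10, 1, 10, 100, 500, 1000]:
    correct = (T + X) / 2
    instinct = T / 2 + X   # "basically the same price, then add the difference"
    print(f"X = £{X}: correct £{correct:.2f}, instinct £{instinct:.2f}")
```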

 

Comment by TheManxLoiner on The Bat and Ball Problem Revisited · 2018-12-18T03:44:41.237Z · LW · GW

I have the same experience as you, drossbucket: my rapid answer to (1) was the common incorrect answer, but for (2) and (3) my intuition is well-honed.

A possible reason for this is that the intuitive but incorrect answer in (1) is a decent approximation to the correct answer, whereas the common incorrect answers in (2) and (3) are wildly off the correct answer. For (1) I have to explicitly do a calculation to verify the incorrectness of the rapid answer, whereas in (2) and (3) my understanding of the situation immediately rules out the incorrect answers.
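
To make the size of the gaps concrete (assuming (1)-(3) are the standard Cognitive Reflection Test questions):

```python
# Intuitive vs correct answers for the standard CRT questions (my assumption
# about what (1)-(3) refer to).
questions = [
    ("(1) bat and ball",    "10 cents",    "5 cents"),
    ("(2) widget machines", "100 minutes", "5 minutes"),
    ("(3) lily pads",       "24 days",     "47 days"),
]
for name, intuitive, correct in questions:
    print(f"{name}: intuitive {intuitive}, correct {correct}")
# Only in (1) is the intuitive answer anywhere near the correct one.
```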

Here are questions which might be similar to (1):

(4a) I booked seats J23 to J29 in a cinema. How many seats have I booked?

(4b) There is a 20m fence in which the fence posts are 2m apart. How many fence posts are there?

(4c) How many numbers are there in this list: 200, 201, 202, 203, 204, ..., 300?

(5) In 24 hours, how many times do the hour-hand and minute-hand of a standard clock overlap?

(6) You are in a race and you just overtake second place. What is your new position in the race?
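
For reference, the contrast I have in mind for these (intuitive vs correct answers, on my reading of the puzzles):

```python
# Intuitive vs correct answers for questions (4a)-(6), as I understand them.
puzzles = [
    ("(4a) seats J23 to J29",          "6 seats",     "7 seats (29 - 23 + 1)"),
    ("(4b) 20m fence, posts 2m apart", "10 posts",    "11 posts (both ends have a post)"),
    ("(4c) numbers 200 to 300",        "100 numbers", "101 numbers (300 - 200 + 1)"),
    ("(5) hand overlaps in 24 hours",  "24 times",    "22 times"),
    ("(6) overtaking second place",    "first",       "second"),
]
for name, intuitive, correct in puzzles:
    print(f"{name}: intuitive {intuitive}, correct {correct}")
```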