Posts

London rationalish meetup @ Arkhipov 2024-11-28T19:30:17.893Z
AI Safety Camp 10 2024-10-26T11:08:09.887Z
Invitation to lead a project at AI Safety Camp (Virtual Edition, 2025) 2024-08-23T14:18:24.327Z
AISC9 has ended and there will be an AISC10 2024-04-29T10:53:18.812Z
AI Safety Camp final presentations 2024-03-29T14:27:43.503Z
Virtual AI Safety Unconference 2024 2024-03-13T13:54:03.229Z
Some costs of superposition 2024-03-03T16:08:20.674Z
This might be the last AI Safety Camp 2024-01-24T09:33:29.438Z
Funding case: AI Safety Camp 2023-12-12T09:08:18.911Z
AI Safety Camp 2024 2023-11-18T10:37:02.183Z
Projects I would like to see (possibly at AI Safety Camp) 2023-09-27T21:27:29.539Z
Apply to lead a project during the next virtual AI Safety Camp 2023-09-13T13:29:09.198Z
How teams went about their research at AI Safety Camp edition 8 2023-09-09T16:34:05.801Z
Virtual AI Safety Unconference (VAISU) 2023-06-13T09:56:22.542Z
AISC end of program presentations 2023-06-06T15:45:04.873Z
Project Idea: Lots of Cause-area-specific Online Unconferences 2023-02-06T11:05:27.468Z
AI Safety Camp, Virtual Edition 2023 2023-01-06T11:09:07.302Z
Why don't we have self driving cars yet? 2022-11-14T12:19:09.808Z
How I think about alignment 2022-08-13T10:01:01.096Z
Infohazard Discussion with Anders Sandberg 2021-03-30T10:12:45.901Z
AI Safety Beginners Meetup (Pacific Time) 2021-03-04T01:44:33.856Z
AI Safety Beginners Meetup (European Time) 2021-02-20T13:20:42.748Z
AISU 2021 2021-01-30T17:40:38.292Z
Online AI Safety Discussion Day 2020-10-08T12:11:56.934Z
AI Safety Discussion Day 2020-09-15T14:40:18.777Z
Online LessWrong Community Weekend 2020-08-31T23:35:11.670Z
Online LessWrong Community Weekend, September 11th-13th 2020-08-01T14:55:38.986Z
AI Safety Discussion Days 2020-05-27T16:54:47.875Z
Announcing Web-TAISU, May 13-17 2020-04-04T11:48:14.128Z
Requesting examples of successful remote research collaborations, and information on what made it work? 2020-03-31T23:31:23.249Z
Coronavirus Tech Handbook 2020-03-21T23:27:48.134Z
[Meta] Do you want AIS Webinars? 2020-03-21T16:01:02.814Z
TAISU - Technical AI Safety Unconference 2020-01-29T13:31:36.431Z
Linda Linsefors's Shortform 2020-01-24T13:08:26.059Z
1st Athena Rationality Workshop - Retrospective 2019-07-17T16:51:36.754Z
Learning-by-doing AI Safety Research workshop 2019-05-24T09:42:49.996Z
TAISU - Technical AI Safety Unconference 2019-05-21T18:34:34.051Z
The Athena Rationality Workshop - June 7th-10th at EA Hotel 2019-05-11T01:01:01.973Z
The Athena Rationality Workshop - June 7th-10th at EA Hotel 2019-05-10T22:08:03.600Z
The Game Theory of Blackmail 2019-03-22T17:44:36.545Z
Optimization Regularization through Time Penalty 2019-01-01T13:05:33.131Z
Generalized Kelly betting 2018-07-19T01:38:21.311Z
Non-resolve as Resolve 2018-07-10T23:31:15.932Z
Repeated (and improved) Sleeping Beauty problem 2018-07-10T22:32:56.191Z
Probability is fake, frequency is real 2018-07-10T22:32:29.692Z
The Mad Scientist Decision Problem 2017-11-29T11:41:33.640Z
Extensive and Reflexive Personhood Definition 2017-09-29T21:50:35.324Z
Call for cognitive science in AI safety 2017-09-29T20:35:16.738Z
The Virtue of Numbering ALL your Equations 2017-09-28T18:41:35.631Z
Suggested solution to The Naturalized Induction Problem 2016-12-24T16:03:03.000Z

Comments

Comment by Linda Linsefors on QEDDEQ's Shortform · 2024-12-17T22:25:11.612Z · LW · GW

I'm not surprised by this observation. In my experience, rationalists also have higher-than-base-rate levels of all sorts of gender non-conformity, including non-binary and trans people. And the trends are even stronger in AI Safety.

I think the explanation is:

  • High tolerance for this type of non-conformity
  • High rates of autism, which correlates with these things

- Relative to the rest of the population, people in this community prioritize other things (writing, thinking about existential risk, working on cool projects perhaps) over routine chores (getting a haircut)

I think that this is almost the correct explanation. We prioritise other things (writing, thinking about existential risk, working on cool projects perhaps) over caring about whether someone else got a haircut.

Comment by Linda Linsefors on Linda Linsefors's Shortform · 2024-12-17T22:14:43.098Z · LW · GW

What it's like to organise AISC

About once or twice per week at this time of year, someone emails me to ask: 

Please let me break rule X

My response:

No, you're not allowed to break rule X. But here's a loophole that lets you do the thing you want without technically breaking the rule. Be warned that I think using the loophole is a bad idea, but if you still want to, we will not stop you.

Because closing the loophole would be too restrictive for other reasons, and I'm not going to keep people's options from them. 

The fact that this puts the responsibility back on them is a bonus feature I really like. Our participants are adults, and are allowed to make their own mistakes. But also, sometimes it's not a mistake, because there is no set of rules for all occasions, and I don't have all the context of their personal situation.

Comment by Linda Linsefors on Biological risk from the mirror world · 2024-12-17T21:47:52.799Z · LW · GW

Quote from the AI voiced podcast version of this post.

Such a lab, separated by more than 1 Australian Dollar from Earth, might provide sufficient protection for very dangerous experiments.

Comment by Linda Linsefors on Linda Linsefors's Shortform · 2024-11-19T19:23:28.910Z · LW · GW

Same data, but in chronological order

10th-11th
* 20 total applications
* 4 (20%) Stop/Pause AI 
* 8 (40%) Mech-Interp and Agent Foundations 

12th-13th
* 18 total applications
* 2 (11%) Stop/Pause AI 
* 7 (39%) Mech-Interp and Agent Foundations 

15th-16th
* 45 total applications
* 4 (9%) Stop/Pause AI
* 20 (44%) Mech-Interp and Agent Foundations 

Stop/Pause AI stays at 2-4 per window, while the others go from 7-8 to 20.

One may point out that 2 to 4 is a doubling, suggesting noisy data, and that going from 7-8 to 20 is also just (more than) a doubling and might not mean much. This could be the case. But we should expect more relative noise for lower numbers, i.e. a doubling from 2 is less surprising than a more-than-doubling from 7-8.
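
If I wanted a quick sanity check on the could-be-a-fluke worry, something like Fisher's exact test on a 2x2 table would do. A minimal sketch (assuming scipy is installed; counts taken from the windows above, pooled before vs. after the 14th):

```python
# Sketch: is the category mix before the 14th different from after?
# Counts are pooled from the application windows listed above.
from scipy.stats import fisher_exact

#                   [before 14th, after 14th]
stop_pause     = [4 + 2, 4]    # Stop/Pause AI
mech_interp_af = [8 + 7, 20]   # Mech-Interp and Agent Foundations

odds_ratio, p_value = fisher_exact([stop_pause, mech_interp_af])
print(odds_ratio, p_value)  # a large p-value would back up the "could be a fluke" read
```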

Comment by Linda Linsefors on Linda Linsefors's Shortform · 2024-11-19T19:19:27.170Z · LW · GW

12th-13th
* 18 total applications
* 2 (11%) Stop/Pause AI 
* 7 (39%) Mech-Interp and Agent Foundations 

15th-16th
* 45 total applications
* 4 (9%) Stop/Pause AI
* 20 (44%) Mech-Interp and Agent Foundations 

All applications
* 370 total
* 33 (12%) Stop/Pause AI
* 123 (46%) Mech-Interp and Agent Foundations 

The above data is directionally correct for your hypothesis, but it doesn't look statistically significant to me. The numbers are pretty small, so it could be a fluke.

So I decided to add some more data

 10th-11th
* 20 total applications
* 4 (20%) Stop/Pause AI 
* 8 (40%) Mech-Interp and Agent Foundations 

Looking at all of it, it looks like Stop/Pause AI are coming in at a stable rate, while Mech-Interp and Agent Foundations are going up a lot after the 14th.

 

Comment by Linda Linsefors on Linda Linsefors's Shortform · 2024-11-19T17:15:27.780Z · LW · GW

AI Safety interest is growing in Africa. 

AISC got 25 (out of 370) applicants from Africa, with 9 from Kenya and 8 from Nigeria.

Numbers for all countries (people with multiple locations not included)
AISC applicants per country - Google Sheets

The rest looks more or less in-line with what I would expect.

Comment by Linda Linsefors on Linda Linsefors's Shortform · 2024-11-19T17:06:17.915Z · LW · GW

Sounds plausible. 

> This would predict that the ratio of technical:less-technical applications would increase in the final few days.

If you want to operationalise this in terms of project first choice, I can check.
 

Comment by Linda Linsefors on Linda Linsefors's Shortform · 2024-11-18T12:40:11.642Z · LW · GW

Side note: 
If you don't say what time the application deadline is, lots of people will assume it's anywhere-on-Earth, i.e. noon the next day in GMT. 

When I was new to organising I did not think of this, and kind of forgot about time zones. I noticed that I got a steady stream of "late" applications that suddenly ended at 1pm (I was in GMT+1), and I didn't know why.

Comment by Linda Linsefors on Linda Linsefors's Shortform · 2024-11-18T12:35:52.162Z · LW · GW

Every time I have an application form for some event, the pattern is always the same: a steady trickle of applications, and then a doubling on the last day.

And for some reason it still surprises me how accurate this model is. The trickle can be a bit uneven, but the doubling on the last day is usually close to spot on.

This means that by the time I have a good estimate of the average number of applications per day, I can predict what the final number will be. This is very useful for knowing whether I need to advertise more or not.

For the upcoming AISC, the trickle was skewed late, which meant that an early estimate had me at around 200 applicants, but the final number of on-time applications is 356. I think this is because we were a bit slow at advertising early on, but Remmelt did a good job sending out reminders towards the end.

The application deadline was Nov 17. 
At midnight GMT before Nov 17 we had 172 applications. 
At noon GMT Nov 18 (end of Nov 17 anywhere-on-Earth) we had 356 applications.

The doubling rule predicted 344, which is only 3% off.
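
Spelling the rule out with this year's numbers (just the arithmetic from above, as a minimal sketch):

```python
# The doubling rule, with this year's AISC numbers from above.
before_last_day = 172                  # applications at midnight GMT before Nov 17
predicted_final = 2 * before_last_day  # the "last day" roughly doubles the total
actual_final = 356                     # applications at noon GMT Nov 18 (end of Nov 17 AoE)

relative_error = abs(actual_final - predicted_final) / actual_final
print(predicted_final, f"{relative_error:.0%}")  # 344, 3%
```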

Yes, I count the last 36 hours as "the last day". This is not cheating, since that's what I've always done (approximately[1]) since starting to observe this pattern. It's the natural thing to do when you live at or close to GMT, or at least if your brain works like mine. 

  1. ^

    I've always used my local midnight as the divider. Sometimes that has been Central European Time, and sometimes there is daylight saving time. But it's all pretty close.

Comment by Linda Linsefors on Seven lessons I didn't learn from election day · 2024-11-15T15:22:16.723Z · LW · GW

If people are ashamed to vote for Trump, why would they let their neighbours know?

Comment by Linda Linsefors on Notes on Resolve · 2024-11-08T13:18:56.077Z · LW · GW

Linda Linsefors of the Center for Applied Rationality


Hi, thanks for the mention.

But I'd like to point out that I never worked for CFAR in any capacity. I have attended two CFAR workshops. I think that calling me "of the Center for Applied Rationality" is very misleading, and I'd prefer it if you remove that part, or possibly re-phrase it.

Comment by Linda Linsefors on AI Safety Camp 10 · 2024-10-30T14:59:58.046Z · LW · GW

You can find their preferred contact info in each document, in the Team section.

Comment by Linda Linsefors on AI Safety Camp 10 · 2024-10-29T01:04:33.565Z · LW · GW

Yes there are, sort of...

You can apply to as many projects as you want, but you can only join one team. 

The reason for this is: when we've let people join more than one team in the past, they usually end up not having time for both and drop out of one of the projects.

What this actually means:

When you join a team you're making a promise to spend 10 or more hours per week on that project. When we say you're only allowed to join one team, what we're saying is that you're only allowed to make this promise to one project.

However, you are allowed to help out other teams with their projects, even if you're not officially on the team.

Comment by Linda Linsefors on AI Safety Camp 10 · 2024-10-27T11:39:57.508Z · LW · GW

@Samuel Nellessen 
Thanks for answering Gunnar's question.

But also, I'm a bit nervous that posting their email here directly in the comments is too public, i.e. easy for spam-bots to find. 

Comment by Linda Linsefors on AI Safety Camp 10 · 2024-10-27T11:27:59.037Z · LW · GW

If the research lead wants to be contactable, their contact info is in their project document, under the "Team" section. Most (or all, I'm not sure) research leads have some contact info there.

Comment by Linda Linsefors on [Intuitive self-models] 3. The Homunculus · 2024-10-04T08:15:14.817Z · LW · GW

The way I understand it, the homunculus is part of the self. So if you put the wanting in the homunculus, it's also inside the self. I don't know about you, but my self-concept has more than wanting in it. To be fair, the homunculus concept is also a bit richer than wanting (I think?) but less encompassing than the full self (I think?).

Comment by Linda Linsefors on [Intuitive self-models] 3. The Homunculus · 2024-10-04T00:23:40.694Z · LW · GW

Based on Steve's response to one of my comments, I'm now less sure.

Comment by Linda Linsefors on [Intuitive self-models] 3. The Homunculus · 2024-10-03T22:45:16.341Z · LW · GW

Reading this post is so strange. I've already read the draft, so it's not even new to me, but still very strange.

I do not recognise this homunculus concept you describe. 

Other people reading this, do you experience yourself like that? Do you resonate with the intuitive homunculus concept as described in the post?

I myself have a unified self (mostly). But that's more or less where the similarity ends.



For example when I read:

in my mind, I think of goals as somehow “inside” the homunculus. In some respects, my body feels like “a thing that the homunculus operates”, like that little alien-in-the-head picture at the top of the post, 

my intuitive reaction is astonishment. Like, no-one really thinks of themselves like that, right? It's obviously just a metaphor, right?

But that was just my first reaction. I know enough about human mind variety to absolutely believe that Steve has this experience, even though it's very strange to me.

Comment by Linda Linsefors on [Intuitive self-models] 3. The Homunculus · 2024-10-03T22:33:17.928Z · LW · GW

Similarly, as Johnstone points out above, for most of history, people didn’t know that the brain thinks thoughts! But they were forming homunculus concepts just like us.

 

Why do you assume they were forming homunculus concepts? Since it's not veridical, they might have had a very different self-model. 

I'm from the same culture as you, and I claim I don't have a homunculus concept, or at least not one that matches what you describe in this post.

 

Comment by Linda Linsefors on [Intuitive self-models] 3. The Homunculus · 2024-10-03T22:12:42.355Z · LW · GW

I don't think what Steve is calling "the homunculus" is the same as the self. 

Actually he says so:

The homunculus, as I use the term, is specifically the vitalistic-force-carrying part of a broader notion of “self”

It's part of the self-model, but not all of it.

Comment by Linda Linsefors on [Intuitive self-models] 3. The Homunculus · 2024-10-03T22:04:52.647Z · LW · GW

(Neuroscientists obviously don’t use the term “homunculus”, but when they talk about “top-down versus bottom-up”, I think they’re usually equating “top-down” with “caused by the homunculus” and “bottom-up” with “not caused by the homunculus”.)


I agree that the homunculus-theory is wrong and bad, but I still think there is something to top-down vs bottom-up. 

It's related to what you write later:

Another part of the answer is that positive-valence S(X) unlocks a far more powerful kind of brainstorming / planning, where attention-control is part of the strategy space. I’ll get into that more in Post 8.

I think conscious control (aka top-down) is related to conscious thoughts (in the global workspace theory sense), which is related to using working memory to unlock more serial compute.

Comment by Linda Linsefors on ... Wait, our models of semantics should inform fluid mechanics?!? · 2024-09-24T15:11:23.067Z · LW · GW

That said, if those sorts of concepts are natural in our world, then it’s kinda weird that human minds weren’t already evolved to leverage them.


A counter possibility to this that comes to mind:

There might be concepts that are natural in our world, but which are only useful for a mind with much more working memory, or other compute resources, than the human mind. 

If weather simulations use concepts that are confusing and unintuitive for most humans, this would be evidence for something like this. Weather is something that we encounter a lot, and it is important for humans, especially historically. If we haven't developed some natural weather concept, it's not for lack of exposure or lack of selection pressure, but for some other reason. That other reason could be that we're not smart enough to use the concept.

Comment by Linda Linsefors on Linda Linsefors's Shortform · 2024-09-23T16:17:41.776Z · LW · GW

Yesterday was the official application deadline for leading a project at the next AISC. This means that we just got a whole host of project proposals. 

If you're interested in giving feedback and advice to our new research leads, let me know. If I trust your judgment, I'll onboard you as an AISC advisor.

Also, it's still possible to send us a late AISC project proposal. However, we will prioritise people who applied in time when giving support and feedback. Furthermore, we'll prioritise less-late applications over more-late applications. 

Comment by Linda Linsefors on [Intuitive self-models] 1. Preliminaries · 2024-09-20T17:07:41.292Z · LW · GW

I tried it and it works for me too.

For me the dancer was spinning counterclockwise and would not change. With your screwing trick I could change the rotation, and was then stably stuck in the clockwise direction, until I screwed in the other direction. I've now done this back and forth a few times.

Comment by Linda Linsefors on Invitation to lead a project at AI Safety Camp (Virtual Edition, 2025) · 2024-08-23T14:21:47.948Z · LW · GW

At the time of writing, www.aisafety.camp goes to our new website while aisafety.camp goes to our old website. We're working on fixing this.

If you want to spread information about AISC, please make sure to link to our new webpage, and not the old one. 

Comment by Linda Linsefors on Linda Linsefors's Shortform · 2024-08-17T18:40:43.513Z · LW · GW

Thanks!

Comment by Linda Linsefors on Linda Linsefors's Shortform · 2024-08-15T13:08:40.234Z · LW · GW

I have two hypotheses for what is going on. I'm leaning towards 1, but am very unsure. 

1)

king - man + woman = queen

is true for word2vec embeddings but not for LLaMa2 7B embeddings, because word2vec has far fewer embedding dimensions. 

  • LLaMa2 7B has 4096 embedding dimensions.
  • This paper uses word2vec variants with 50, 150 and 300 embedding dimensions.

Possibly, when you have thousands of embedding dimensions, these dimensions will encode lots of different connotations of these words. These connotations will probably not line up with the simple relation [king - man + woman = queen], and therefore we get [king - man + woman ≠ queen] for high-dimensional embeddings.

2)

king - man + woman = queen

Isn't true for word2vec either. If you do it with word2vec embeddings you get more or less the same result I did with LLaMa2 7B. 

(As I'm writing this, I'm realising that just getting my hands on some word2vec embeddings and testing this for myself seems much easier than decoding what the papers I found are actually saying.)
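
For concreteness, the check I have in mind is something like this sketch (assuming the gensim package and its downloadable pretrained Google News word2vec vectors; I haven't run this exact snippet):

```python
# Sketch: test king - man + woman ≈ queen on pretrained word2vec vectors.
import gensim.downloader as api

# The original Google News word2vec model, 300 dimensions (~1.6 GB download).
model = api.load("word2vec-google-news-300")

# most_similar does the vector arithmetic and returns the nearest words.
# Caveat: it excludes the query words themselves from the results.
print(model.most_similar(positive=["king", "woman"], negative=["man"], topn=5))
```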

Comment by Linda Linsefors on Linda Linsefors's Shortform · 2024-08-15T12:47:37.603Z · LW · GW

"▁king" - "▁man" + "▁woman"  "▁queen" (for LLaMa2 7B token embeddings)

I tried to replicate the famous "king" - "man" + "woman" = "queen" result from word2vec using LLaMa2 token embeddings. To my surprise, it did not work. 

I.e. if I look for the token with the biggest cosine similarity to "▁king" - "▁man" + "▁woman", it is not "▁queen".

Top ten tokens by cosine similarity for:

  • "▁king" - "▁man" + "▁woman"
    is ['▁king', '▁woman', '▁King', '▁queen', '▁women', '▁Woman', '▁Queen', '▁rey', '▁roi', 'peror']
  • "▁king" + "▁woman"
    is ['▁king', '▁woman', '▁King', '▁Woman', '▁women', '▁queen', '▁man', '▁girl', '▁lady', '▁mother']
  • "▁king"
    is ['▁king', '▁King', '▁queen', '▁rey', 'peror', '▁roi', '▁prince', '▁Kings', '▁Queen', '▁König']
  • "▁woman"
    is ['▁woman', '▁Woman', '▁women', '▁man', '▁girl', '▁mujer', '▁lady', '▁Women', 'oman', '▁female']
  • projection of "▁queen" on span( "▁king", "▁man", "▁woman")
    is ['▁king', '▁King', '▁woman', '▁queen', '▁rey', '▁Queen', 'peror', '▁prince', '▁roi', '▁König']

"▁queen" is the closest match only if you exclude any version of king and woman. But this seems to be only because "▁queen" is already the 2:nd closes match for "▁king". Involving "▁man" and "▁woman" is only making things worse.

I then tried looking up exactly what the word2vec result is, and I'm still not sure.

Wikipedia cites Mikolov et al. (2013). This paper is about embeddings from RNN language models, not word2vec, which is ok for my purposes, because I'm also not using word2vec. More problematic is that I don't know how to interpret how strong their results are. I think the relevant result is this:

We see that the RNN vectors capture significantly more syntactic regularity than the LSA vectors, and do remarkably well in an absolute sense, answering more than one in three questions correctly.

which doesn't seem very strong. Also, I can't find any explanation of what LSA is. 

I also found this other paper, which is about word2vec embeddings and has this promising figure:

But the caption is just a citation to this third paper, which doesn't have that figure! 

I've not yet read the last two papers in detail, and I'm not sure if or when I'll get back to this investigation.

If someone knows more about exactly what the word2vec embedding results are, please tell me. 

Comment by Linda Linsefors on Self-Other Overlap: A Neglected Approach to AI Alignment · 2024-08-12T15:37:13.517Z · LW · GW

I don't think seeing it as a one-dimensional dial is a good picture here. 

The AI has lots and lots of sub-circuits, and many* can have more or less self-other-overlap. For “minimal self-other distinction while maintaining performance” to do anything, it's sufficient that you can increase self-other-overlap in some subset of these, without hurting performance.

* All the circuits that have to do with agent behaviour or beliefs.

Comment by Linda Linsefors on Linda Linsefors's Shortform · 2024-08-08T11:32:15.533Z · LW · GW

Cross posted comment from Hold Off On Proposing Solutions — LessWrong

. . . 

I think the main important lesson is to not get attached to early ideas. Instead of banning early ideas, if anything comes up, you can just write it down and set it aside. I find this easier than a full ban, because it's just an easier move for my brain to make. 

(I have a similar problem with rationalist taboo. Don't ban words; instead require people to locally define their terms for the duration of the conversation. It solves the same problem, and it isn't a ban on thought or speech.)

The other important lesson of the post is that, in the early discussion, you should focus on increasing your shared understanding of the problem, rather than generating ideas. I.e. it's ok for ideas to come up (and when they do, you save them for later), but generating ideas is not the goal in the beginning. 

Hm, thinking about it, I think the mechanism of classical brainstorming (where you think up front of as many ideas as you can) is to exhaust all the trivial, easy-to-think-of ideas as fast as you can, so that you're then forced to think deeper to come up with new ideas. I guess that's another way to do it. But I think this method is both ineffective and unreliable, since it only works through a secondary effect.

. . .

It is interesting to compare the advice in this post with the Game Tree of Alignment or the Builder/Breaker Methodology (also here). I've seen variants of this exercise popping up in lots of places in the AI Safety community. Some of them are probably inspired by each other, but I'm pretty sure (80%) that this method has been invented several times independently.

I think GTA/BBM works for the same reason the advice in the post works. It also solves the problem of not getting attached, and as you keep breaking your ideas and exploring new territory, you expand your understanding of the problem. I think an active ingredient in this method is that the people playing this game know that alignment is hard, and go in expecting their first several ideas to be terrible. You know the exercise is about noticing the flaws in your plans and learning from your mistakes. Without this attitude, I don't think it would work very well.

Comment by Linda Linsefors on Hold Off On Proposing Solutions · 2024-08-08T11:29:08.172Z · LW · GW

I think the main important lesson is to not get attached to early ideas. Instead of banning early ideas, if anything comes up, you can just write it down and set it aside. I find this easier than a full ban, because it's just an easier move for my brain to make. 

(I have a similar problem with rationalist taboo. Don't ban words; instead require people to locally define their terms for the duration of the conversation. It solves the same problem, and it isn't a ban on thought or speech.)

The other important lesson of the post is that, in the early discussion, you should focus on increasing your shared understanding of the problem, rather than generating ideas. I.e. it's ok for ideas to come up (and when they do, you save them for later), but generating ideas is not the goal in the beginning. 

Hm, thinking about it, I think the mechanism of classical brainstorming (where you think up front of as many ideas as you can) is to exhaust all the trivial, easy-to-think-of ideas as fast as you can, so that you're then forced to think deeper to come up with new ideas. I guess that's another way to do it. But I think this method is both ineffective and unreliable, since it only works through a secondary effect.

. . .

It is interesting to compare the advice in this post with the Game Tree of Alignment or the Builder/Breaker Methodology (also here). I've seen variants of this exercise popping up in lots of places in the AI Safety community. Some of them are probably inspired by each other, but I'm pretty sure (80%) that this method has been invented several times independently.

I think GTA/BBM works for the same reason the advice in the post works. It also solves the problem of not getting attached, and as you keep breaking your ideas and exploring new territory, you expand your understanding of the problem. I think an active ingredient in this method is that the people playing this game know that alignment is hard, and go in expecting their first several ideas to be terrible. You know the exercise is about noticing the flaws in your plans and learning from your mistakes. Without this attitude, I don't think it would work very well.

Comment by Linda Linsefors on Linda Linsefors's Shortform · 2024-08-07T09:56:53.703Z · LW · GW

I'm reading In-context Learning and Induction Heads (transformer-circuits.pub)

This already strongly suggests some connection between induction heads and in-context learning, but beyond just that, it appears this window is a pivotal point for the training process in general: whatever's occurring is visible as a bump on the training curve (figure below). It is in fact the only place in training where the loss is not convex (monotonically decreasing in slope).

I can see the bump, but it's not the only one. The two-layer graph has a second, similar bump, which also exists in the one-layer model, and I think I can also see it very faintly in the three-layer model. Did they ignore the second bump because it only exists in small models, while their bump continues to exist in bigger models?

Comment by Linda Linsefors on Linda Linsefors's Shortform · 2024-08-02T15:57:18.840Z · LW · GW

I've recently crossed into being considered senior enough as an organiser that people are asking me for advice on how to run their events. I'm enjoying giving out advice, and it also makes me reflect on event design in new ways.

I think there are two types of good events. 

  • Purpose-driven event design
  • Unconference-type events

I think there is a continuum between these two types, but also think that if you plot the best events along this continuum, you'll find a bimodal distribution. 

Purpose-driven event design

When you organise one of these, you plan a journey for your participants. Everything is woven into a specific goal that is achieved by the end of the event. Everything fits together.

The Art of Gathering is a great manual for this type of event.

Unconference-type events

These can definitely have a purpose (e.g. exchanging ideas), but the purpose will be less precise than for the previous type, and more importantly, the purpose does not strongly drive the event design.

There will be designed elements around the edges, e.g. the opening and ending. But most of the event design just goes into supporting the unconference structure, which is not very purpose-specific. For most of the event, the participants will not follow a shared journey curated by the organisers; instead everyone is free to pick their own adventure. 

Some advice from The Art of Gathering works for unconference-type events, e.g. the importance of pre-event communication, the opening and the ending. But a lot of the advice doesn't work, which is why I noticed this division in the first place.

Strengths and weaknesses of each type

  • Purpose-driven events are more work, because you actually have to figure out the event design, and then you probably also have to run the program. With unconferences, you can just run the standard unconference format, on whatever theme you like, and let your participants do most of the work of running the program.
  • An unconference doesn't require you to know what the specific purpose of the event is. You can just bring together an interesting group of people and see what happens. That's how you get Burning Man or LWCW.
  • However, if you have a specific purpose you want to achieve, you're much more likely to succeed if you actually design the event for that purpose.
  • There are lots of things that an unconference can't do at all. It's a very broadly applicable format, but not infinitely so.

Comment by Linda Linsefors on Linda Linsefors's Shortform · 2024-08-01T12:35:45.988Z · LW · GW

I feel a bit behind on everything going on in alignment, so for the next weeks (or more) I'll focus on catching up on whatever I find interesting. I'll be using my shortform to record my thoughts. 

I make no promises that reading this is worth anyone's time.

Linda's alignment reading adventures part 1

What to focus on?

I do have some opinions on which alignment directions are more or less promising. I'll probably venture in other directions too, but my main focus is going to be around what I expect an alignment solution to look like. 

  1. I think that to have an aligned AI it is necessary (but not sufficient) that we have shared abstractions/ontology/concepts (whatever you want to call it) with the AI. 
  2. I think the way to make progress on the above is to understand what ontology/concepts/abstractions our current AIs are using, and the process that shapes these abstractions. 
  3. I think the way to do this is through mech-interp, mixed with philosophising and theorising. Currently I think the mech-interp part (i.e. looking at what is actually going on in a network) is the bottleneck, since I think that philosophising without data (i.e. agent foundations) has not made much progress lately. 

Conclusion: 

  • I'll mainly focus on reading up on mech-interp and related areas such as dev-interp. I've started on the interp section of Lucius's alignment reading list.
  • I should also read some John Wentworth, since his plan is pretty close to the path I think is most promising.

Feel free to throw other recommendations at me.

Some thoughts on things I've read so far

I just read 

I really liked Understanding and controlling a maze-solving policy network. It's a good experiment and a good writeup. 

But also, how interesting is this, really? Basically they removed the cheese observation, and it made the agent act as if there were no cheese. This is not some sophisticated steering technique that we can use to align the AI's motivation.

I discussed this with Lucius, who pointed out that the interesting result is that the cheese location information is linearly separable from other information in the middle of the network, i.e. it's not scrambled in a completely opaque way.

Which brings me to Book Review: Design Principles of Biological Circuits

Alon’s book is the ideal counterargument to the idea that organisms are inherently human-opaque: it directly demonstrates the human-understandable structures which comprise real biological systems. 

Both these posts are evidence for the hypothesis that we should expect evolved networks to be modular, in a way that is possible for us to decode. 

By "evolved" I mean things in the same category as natural selection and gradient decent. 

Comment by Linda Linsefors on Linda Linsefors's Shortform · 2024-07-31T12:38:00.970Z · LW · GW

I agree that the reason EAs are usually not tracking favours is that we are (or assume we are) mission aligned. I picked the pay-it-forward framing because it fitted better with other situations where I expected people not to try to balance social ledgers. But you're right that there are mission-aligned situations that are not well described as paying it forward, but rather by some other shared goal.

Another situation where there is no social ledger is when someone is doing their job. (You talk about a company where people are passionate about the mission. But most employees are not passionate about the mission, and still don't end up owing each other favours for doing their job.)

I personally think that the main benefit of mental health professionals (e.g. psychologists, coaches, etc.) is that you get to have a one-sided relationship, where you get to talk about all your problems, and you don't owe them anything in return, because you're paying them instead. (Or sometimes the healthcare system is paying, or your employer. The important thing is that they get paid, and helping you is their job.)

(I'd much rather talk to a friend about my problems; it just works better, since they know me. But when I do this I owe them. I need to track what cost I'm imposing on them, and make sure it's not more than I have the time and opportunity to repay.)

Related: a citation from Sadly, Porn, from Scott Alexander's review

The desire to display gigawatt devotion with zero responsibility is the standard maneuver of our times, note the trend of celebrity soundbite social justice, or children’s fascination with doing the extra credit more than the regular credit, and as a personal observation this is exactly what’s wrong with medical students and nurses. They’ll spend hours talking with a patient about their lives and feelings while fluffing their pillow to cause it to be true that they are devoted - they chose to act, chose to love - while acts solely out of ordinary duty are devalued if not completely avoided.

(I'm pretty sure the way to read Sadly, Porn (or Scott's review of it) is not to treat any of the text as evidence for anything, but as suggestions of things that may be interesting to pay attention to.)

Comment by Linda Linsefors on Understanding and controlling a maze-solving policy network · 2024-07-31T12:18:48.441Z · LW · GW

In the real network, there are a lot more than two activations. Our results involve a 32,768-dimensional cheese vector, subtracted from about halfway through the network:


Did you try other locations in the network?

I would expect it to work pretty much anywhere, and I'm interested to know if my prediction is correct.  

I'm pretty sure that what happens is (as you also suggest) that the agent stops seeing the cheese. 

Imagine you did the cheese subtraction on the input layer (i.e. the pixel values of the maze). In this case it just trivially removes the cheese from the picture, resulting in behaviour identical to the no-cheese case. So I expect something similar to happen at later layers, as long as what the network is mostly doing is just decoding the image. At whatever layer this trick stops working, that should mean the agent has started planning its moves.
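
To make "other locations in the network" concrete: the kind of intervention I have in mind can be sketched with a PyTorch forward hook on whichever layer you pick (purely illustrative; the toy network, layer choice and random cheese_vector below are placeholders, not the post's actual setup):

```python
import torch
import torch.nn as nn

# Purely illustrative: subtract a fixed "cheese vector" from one layer's
# activations via a forward hook. Returning a value from a forward hook
# replaces that module's output.
def make_subtraction_hook(steering_vector: torch.Tensor):
    def hook(module, inputs, output):
        return output - steering_vector
    return hook

# Toy stand-in for the policy network; the real one is a conv net.
policy = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
layer = policy[0]  # pick where in the network to do the subtraction

# In the real experiment the vector would be
#   activations(maze with cheese) - activations(same maze without cheese)
# at the chosen layer; here it's just a random placeholder.
cheese_vector = torch.randn(32)

handle = layer.register_forward_hook(make_subtraction_hook(cheese_vector))
print(policy(torch.randn(1, 16)))  # forward pass with the edited activations
handle.remove()
```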

Comment by Linda Linsefors on Understanding and controlling a maze-solving policy network · 2024-07-31T11:16:44.706Z · LW · GW
Comment by Linda Linsefors on Understanding and controlling a maze-solving policy network · 2024-07-31T11:01:21.434Z · LW · GW
Comment by Linda Linsefors on Linda Linsefors's Shortform · 2024-07-30T15:10:26.571Z · LW · GW

Thoughts on social ledgers

 

Some data points:

1)

“someone complimented me out of the blue, and it was a really good compliment, and it was terrible, because maybe I secretly fished for it in some way I can’t entirely figure out, and also now I feel like I owe them one, and I never asked for this, and I’m so angry!”

From Book Review: Sadly, Porn - by Scott Alexander (astralcodexten.com)

2)

A blogpost I remember but can't find. The author talks about the benefits of the favour economy. E.g. his friend could help him move at a much lower total cost than the market price for moving. This is because the market has much higher transaction costs than favours among friends. The blogpost also talks about how you only get to participate in the favour economy (and get its benefits) if you understand that you're expected to return the favours, i.e. keep track of the social ledger and make sure to pay your debts. Actually, you should be overpaying when returning a favour, so that now they owe you, and then they overpay you back, resulting in a virtuous cycle of helping each other out. The author mentions being in a mutually beneficial "who can do the biggest favour" competition with his neighbour. 

3)

An EA friend who told me that they experienced, over and over in the EA community, that they helped people (e.g. letting them stay at their house for free) but were denied similar help in return.

4)

A youtuber (who was discussing this topic) mentioned their grandmother, who always gives everyone in the family lots of food to take home, and said that she would probably be offended if offered something in return for this help.

 

Today's thoughts:

I think there are pros and cons that come with internally tracking a social ledger for favours. I also think there are things that make this practice better or worse. 

On receiving compliments

In the past, whenever I got a compliment, I felt obligated to give one back, so instead of taking it in, I started thinking as fast as I could about what I could compliment back. At some point (not sure when) I decided this was dumb. Now if I get a compliment, I instead take a moment to take it in, and thank the other person. I wish others would do this more. If I give someone a compliment, I don't want them to feel obligated to compliment me back.

On paying forward instead of paying back

Some people experience the EA community as being super generous when it comes to helping each other out. But others have the opposite experience (see point 3 above). I've personally experienced both. Partly this is because EA is large and diverse, but partly I think it comes from throwing out the social ledger. I think the typical EA attitude is that you're not expected to pay back favours; you're expected to pay it forward, into whatever cause area you're working on. People are not even attempting to balance the ledger of interpersonal favours, and therefore there will be imbalances, with some people receiving more and some giving more.

I think there are lots of other pay-it-forward communities. For example, any system where the older "generation" is helping the younger "generation". This can be parents raising their kids, or a martial arts club where the more experienced help train the less experienced, etc. 

I think many of these are pretty good at paying the favour-givers in status points. In comparison, I expect EA to not be very good at giving status for helping out others in the community, because EA has lots of other things to allocate status points to. 

How to relate to the ledger

I think the social ledger definitely has its use, even among EA types. You need friends you can turn to for favours, which is only possible in the long term if you also return the favours; otherwise you'll end up draining your friends. 

In theory we could have grant-supported community builders act as support for everyone else, with no need to pay it back. But I'd rather not. I do not want to trust the stability of my social network to the whims of the EA funding system.

To have a healthy, positive relationship with the social ledger:

  • You have to be ok with being in debt. It's ok to owe someone a favour. Being the sort of person who pays back favours doesn't mean you have to do it instantly. You are allowed to wait for your friend to need help from you, and not worry about it in the meantime. You're also allowed to more or less actively look out for things to do for them; both are ok. Just don't stress over it.
  • Have some slack, at least sometimes. I have a very uneven amount of slack in my life, so I try to be helpful to others when I have more, and less so when I have less. But someone who never has slack will never be able to help others back.
  • You can't be too nit-picky about it. The social ledger is not an exact thing. As a rule of thumb, if both you and your friend individually get more out of your friendship than you put in, then things are good. If you can't balance it such that you both get more out than you put in, then you're not playing a positive-sum game, and you should either fix this, or just hang out less. It's supposed to be a positive-sum game.
  • You can't keep ledgers with everyone. Or maybe you can? I can't; that's too many people to remember. My brain came with some pre-installed software to keep track of the social ledger. This program also decides whom to track or not, and it works well enough. I think it turns on when someone has done a favour that reaches some minimum threshold? 

There are also obviously situations where keeping track of a ledger doesn't make sense, and you should not do it, e.g. one-off encounters. And sometimes paying forward instead of paying back is the correct social norm.

Give your favours freely when it's cheap for you to do so, and when you don't mind if the favour is not paid back. Sometimes someone will return the favour.

The pre-installed brain software

This probably varies a lot from person to person, and like all mental traits it will be due to some mix of genetics, culture, and random other stuff.

For me, I'm not choosing whether to keep a social ledger or not; my brain just does that. I'm choosing how to interact with it and tweak its function, by choosing what to pay attention to.

The way I experience this is that I feel gratitude when I'm in debt, and resentment when someone else has too much unacknowledged debt to me.

  • The gratitude makes me positively disposed to that person and more eager to help them.
  • The resentment flags to me that something is wrong. I don't need them to instantly repay me, but I need them to acknowledge the debt and that they intend to pay it back. If not, this is a sign that I should stop investing in that person.
    • If they disagree that I've helped them more than they helped me, then we're not playing a positive-sum game, and we should stop.
    • If they agree that I helped them more, but also they don't intend to return the favour... hm...? Does this ever happen? I could imagine this happening if the other person has zero slack. I think there are legit reasons to not be able to return favours, especially favours you did not ask for, but I also expect that I would not want to build a friendship in this situation.

       

...thanks for reading my ramblings. I'll end here, since I'm out of time, and I think I've hit diminishing returns on thinking about this...

Comment by Linda Linsefors on Towards more cooperative AI safety strategies · 2024-07-25T13:40:44.569Z · LW · GW

Lightcone is banned from receiving any kind of OpenPhil funding


Why?

The reason for the ban is pretty cruxy. Is Lightcone banned because OpenPhil dislikes you, or because you're too close so that it would be a conflict of interest, or something else?

Comment by Linda Linsefors on 80,000 hours should remove OpenAI from the Job Board (and similar EA orgs should do similarly) · 2024-07-24T10:05:53.482Z · LW · GW

A statement I could truthfully say:

"As a AI safety community member, I predict I and others will be uncomfortable with 80k if this is where things end up settling, because of disagreeing. I could be convinced otherwise, but it would take extraordinary evidence at this point. If my opinions stay the same and 80k's also are unchanged, I expect this make me hesitant to link to and recommend 80k, and I would be unsurprised to find others behaving similarly."

But you did not say it (other than as a response to me). Why not? 

I'd be happy for you to take the discussion with 80k and try to change their behaviour. This is not the first time I've told them that if they list a job, a lot of people will both take it as an endorsement and trust 80k that this is a good job to apply for. 

As far as I can tell, 80k is in complete denial about the large influence they have on many EAs, especially local EA community builders. They have a lot of trust, mainly from being around for so long. So whenever they screw up like this, it causes enormous harm. Also, since EA has such a large growth rate (at any given time most EAs are new EAs), the community is bad at tracking when 80k does screw up, so they don't even lose that much trust.

On my side, I've pretty much given up on them caring at all about what I have to say, which is why I'm putting so little effort into how I word things. I agree my comment could have been worded better (with more effort), and I have tried harder in the past. But I also have to say that I find the level of extreme politeness lots of EAs show towards high-status orgs to be very off-putting, so I've never been able to imitate that style. 

Again, if you can do better, please do so. I'm serious about this.

Someone (not me) had some success at getting 80k to listen, over at the EA forum version of this post. But more work is needed.

(FWIW, I'm not the one who downvoted you)

 

Comment by Linda Linsefors on 80,000 hours should remove OpenAI from the Job Board (and similar EA orgs should do similarly) · 2024-07-24T09:40:25.190Z · LW · GW

I also have limited capacity. 

Comment by Linda Linsefors on The Intense World Theory of Autism · 2024-07-17T10:33:17.600Z · LW · GW

Actually, it's probably true if you don't control for intelligence. 

Autism is negatively correlated with intelligence, and if you're not very smart, everything gets harder. But I think it's wrong to see low intelligence as part of Autism. And even if you disagree, it's weird to classify a general intelligence problem as a specific social deficit problem.

But if you compare high-functioning autists with neurotypicals in realistic enough settings, I'm convinced autists will be better at understanding autists than neurotypicals are at understanding autists. Although "realistic enough" might require giving the autists enough time to interact to spot each other as the same type of person.

I don't put a lot of weight here on academic studies over just all of my life experience. But in case you do: I did hear of a study where autists worked better with other autists than neurotypicals did with neurotypicals. I don't have the links, sorry. Just my memory of someone I trust telling me about it.

The reason I don't trust academic studies on this topic is that it is really, really hard to do them well, so most of them are not done well. 

Comment by Linda Linsefors on The Intense World Theory of Autism · 2024-07-17T10:18:38.077Z · LW · GW

only about equal to them at understanding fellow autists

I do not believe this.

Comment by Linda Linsefors on The Intense World Theory of Autism · 2024-07-15T14:36:54.515Z · LW · GW

Empathy is a useful tool; I use it too, to generate initial guesses about people. But I'm also aware that it's untrustworthy. In my experience it's common for neurotypicals to fail at this last step.

Comment by Linda Linsefors on The Intense World Theory of Autism · 2024-07-15T14:30:51.547Z · LW · GW

Neurotypicals are more accurate for other neurotypicals. Autists are more accurate for other autists. 

Since there are more neurotypicals (are there? or just more people pretending to be?*), neurotypicals are statistically more often correct**. But I would still not claim that that is a higher level of social skill. Having a higher skill level at a more common task is not the same thing as having higher overall skill. This detail is very important for understanding autism. 

* Not saying that autism is the majority. But there are more neurotypes out there, and probably lots that are better at masking than autists.

** In a statistically representative environment. When autists are free to self-segregate, we no longer have problems. This is the main point, actually. If it was just a one-dimensional skill issue, concentrating lots of autists would go terribly, since no one would have social skills, but instead it's great. All the people I get along with best are other autists (officially or self-diagnosed).

that's, IMO, the mechanism by which empathy works

Empathy is a very unreliable source of information, though. E.g. I feel empathy with my plushies.

Comment by Linda Linsefors on 80,000 hours should remove OpenAI from the Job Board (and similar EA orgs should do similarly) · 2024-07-05T08:59:50.061Z · LW · GW

Temporarily deleted since I misread Eli's comment. I might re-post.

Comment by Linda Linsefors on 80,000 hours should remove OpenAI from the Job Board (and similar EA orgs should do similarly) · 2024-07-05T08:53:52.967Z · LW · GW
Comment by Linda Linsefors on 80,000 hours should remove OpenAI from the Job Board (and similar EA orgs should do similarly) · 2024-07-04T21:25:56.633Z · LW · GW

However, we don’t conceptualize the board as endorsing organisations.

It doesn't matter how you conceptualize it. It matters how it looks, and it looks like an endorsement. This is not an optics concern; the problem is that people who trust you will see this and think OpenAI is a good place to work.

Non-infosec safety work

  • These still seem like potentially very strong roles with the opportunity to do very important work. We think it’s still good for the world if talented people work in roles like this! 

How can you still think this after the whole safety team quit? They clearly did not think these roles were any good for doing safety work.

Edit: I was wrong about the whole team quitting. But given everything, I still stand by the claim that these jobs should not be listed without at least a warning sign. 

 

As an AI safety community builder, I'm considering boycotting 80k (i.e. not linking to you and recommending that people not trust your advice) until you at least put warning labels on your job board. And I'll recommend other community builders do the same.

I do think 80k means well, but I just can't recommend any org with this level of lack of judgment. Sorry.

Comment by Linda Linsefors on SAE feature geometry is outside the superposition hypothesis · 2024-06-25T20:55:53.709Z · LW · GW

This post reminds me of the Word2vec algebra.

E.g. "kitten" - "cat" + "dog"  "puppy"

I expect that this will be true for LLM token embeddings too. Has anyone checked this? 

I also expect something similar to be true for internal LLM representations, but that might be harder to verify. However, maybe not, if you have interpretable SAE vectors?