Kabir Kumar's Shortform

kabir-kumar

Kabir Kumar's Shortform

post by Kabir Kumar (kabir-kumar) · 2024-11-03T17:03:01.824Z · LW · GW · 32 comments

32 comments

32 comments

Comments sorted by top scores.

comment by Kabir Kumar (kabir-kumar) · 2025-02-13T19:36:13.182Z · LW(p) · GW(p)

Trying to put together a better explainer for the hard part of alignment, while not having a good math background https://docs.google.com/document/d/1ePSNT1XR2qOpq8POSADKXtqxguK9hSx_uACR8l0tDGE/edit?usp=sharing
Please give feedback!

comment by Kabir Kumar (kabir-kumar) · 2025-04-14T14:27:20.891Z · LW(p) · GW(p)

prob not gonna be relatable for most folk, but i'm so fucking burnt out on how stupid it is to get funding in ai safety. the average 'ai safety funder' does more to accelerate funding for capabilities than safety, in huge part because what they look for is Credentials and In-Group Status, rather than actual merit.
And the worst fucking thing is how much they lie to themselves and pretend that the 3 things they funded that weren't completely in group, mean that they actually aren't biased in that way.

At least some VCs are more honest that they want to be leeches and make money off of you.

Replies from: Mitchell_Porter

↑ comment by Mitchell_Porter · 2025-04-14T18:05:54.270Z · LW(p) · GW(p)

Who or what is the "average AI safety funder"? Is it a private individual, a small specialized organization, a larger organization supporting many causes, an AI think tank for which safety is part of a capabilities program...?

Replies from: kabir-kumar

↑ comment by Kabir Kumar (kabir-kumar) · 2025-04-14T19:36:56.808Z · LW(p) · GW(p)

all of the above, then averaged :p

Replies from: Mitchell_Porter

↑ comment by Mitchell_Porter · 2025-04-20T07:49:17.975Z · LW(p) · GW(p)

I asked because I'm pretty sure that I'm being badly wasted (i.e. I could be making much more substantial contributions to AI safety), but I very rarely apply for support, so I thought I'd ask for information about the funding landscape from someone who has been exploring it.

And by the way, your brainchild AI-Plans is a pretty cool resource. I can see it being useful for e.g. a frontier AI organization which thinks they have an alignment plan, but wants to check the literature to know what other ideas are out there.

comment by Kabir Kumar (kabir-kumar) · 2025-03-10T13:19:55.521Z · LW(p) · GW(p)

it's so unnecessarily hard to get funding in alignment.

they say 'Don't Bullshit' but what that actually means is 'Only do our specific kind of bullshit'.

and they don't specify because they want to pretend that they don't have their own bullshit

Replies from: Dagon, evalu

↑ comment by Dagon · 2025-03-10T13:57:42.730Z · LW(p) · GW(p)

This seems generally applicable. Any significant money transaction includes expectations, both legible and il-, which some participants will classify as bullshit. Those holding the expectations may believe it to be legitimately useful, or semi-legitimately necessary due to lack of perfect alignment.

If you want to specify a bit, we can probably guess at why it's being required.

Replies from: kabir-kumar

↑ comment by Kabir Kumar (kabir-kumar) · 2025-03-10T16:02:26.338Z · LW(p) · GW(p)

What I liked about applying for VC funding was the specific questions.

"How is this going to make money?"

"What proof do you have this is going to make money"

and it being clear the bullshit that they wanted was numbers, testimonials from paying customers, unambiguous ways the product was actually better, etc. And then standard bs about progress, security, avoiding weird wibbly wobbly talk, 'woke', 'safety', etc.

With Alignment funders, they really obviously have language they're looking for as well, or language that makes them more and less willing to put more effort into understanding the proposal. Actually, they have it more than the VCs. But they act as if they don't.

↑ comment by evalu · 2025-03-10T22:01:16.953Z · LW(p) · GW(p)

Have you felt this from your own experience trying to get funding, or from others, or both? Also, I'm curious what you think is their specific kind of bullshit, and if there's things you think are real but others thought to be bullshit.

Replies from: kabir-kumar

↑ comment by Kabir Kumar (kabir-kumar) · 2025-03-11T12:54:27.715Z · LW(p) · GW(p)

Both. Not sure, its something like lesswrong/EA speak mixed with the VC speak.

Replies from: kabir-kumar

↑ comment by Kabir Kumar (kabir-kumar) · 2025-04-17T21:35:54.607Z · LW(p) · GW(p)

If I knew the specific bs, I'd be better at making successful applications and less intensely frustrated.

comment by Kabir Kumar (kabir-kumar) · 2025-01-20T13:57:40.966Z · LW(p) · GW(p)

this might basically be me, but I'm not sure how exactly to change for the better. theorizing seems to take time and money which i don't have.

comment by Kabir Kumar (kabir-kumar) · 2025-01-16T12:29:20.865Z · LW(p) · GW(p)

Thinking about judgement criteria for the coming ai safety evals hackathon (https://lu.ma/xjkxqcya )
These are the things that need to be judged:
1. Is the benchmark actually measuring alignment (the real, scale, if we dont get this fully right right we die, problem)
2. Is the way of Deceiving the benchmark to get high scores actually deception, or have they somehow done alignment?

Both of these things need:
- a strong deep learning & ml background (ideally, muliple influential papers where they're one of the main authors/co-authors, or doing ai research at a significant lab, or they have, in the last 4 years)
- a good understanding of what the real alignment problem actually means - can judge this by looking at their papers, activity on lesswrong, alignmentforum, blog, etc
- a good understanding of evals/benchmarks (1 great or two pretty good papers/repos/works on this, ideally for alignment)

Do these seem loose? Strict? Off base?

comment by Kabir Kumar (kabir-kumar) · 2025-04-14T14:16:24.157Z · LW(p) · GW(p)

the average ai safety funder does more to accelerate capabilities than they do safety, in part due to credentialism and looking for in group status.

comment by Kabir Kumar (kabir-kumar) · 2024-12-05T00:13:12.734Z · LW(p) · GW(p)

ok, options.
- Review of 108 ai alignment plans
- write-up of Beyond Distribution - planned benchmark for alignment evals beyond a models distribution, send to the quant who just joined the team who wants to make it
- get familiar with the TPUs I just got access to
- run hhh and it's variants, testing the idea behind Beyond Distribution, maybe make a guide on itr
- continue improving site design

- fill out the form i said i was going to fill out and send today
- make progress on cross coders - would prob need to get familiar with those tpus
- writeup of ai-plans, the goal, the team, what we're doing, what we've done, etc
- writeup of the karma/voting system
- the video on how to do backprop by hand
- tutorial on how to train an sae

think Beyond Distribution writeup. he's waiting and i feel bad.

comment by Kabir Kumar (kabir-kumar) · 2024-11-03T17:03:02.193Z · LW(p) · GW(p)

btw, thoughts on this for 'the alignment problem'?
"A robust, generalizable, scalable, method to make an AI model which will do set [A] of things as much as it can and not do set [B] of things as much as it can, where you can freely change [A] and [B]"

Replies from: Seth Herd

↑ comment by Seth Herd · 2024-11-04T13:55:09.460Z · LW(p) · GW(p)

Freely changing an AGIs goals is corrigibility, which is a huge advantage if you can get it. See Max Harms' corrigibility sequence and my "instruction-following AGI is easier...."

The question is how a reliably get such a thing. Goalcrafting is one part of the problem, and I agree that those are good goals; the other and larger part is technical alignment, getting those desired goals to really work that way in the particular first AGI we get.

Replies from: kabir-kumar

↑ comment by Kabir Kumar (kabir-kumar) · 2024-11-04T14:49:15.408Z · LW(p) · GW(p)

Yup, those are hard. Was just thinking of a definition for the alignment problem, since I've not really seen any good ones.

Replies from: Seth Herd

↑ comment by Seth Herd · 2024-11-04T16:13:13.970Z · LW(p) · GW(p)

I'd say you're addressing the question of goalcrafting or selecting alignment targets.

I think you've got the right answer for technical alignment goals; but the question remains of what human would control that AGI. See my "if we solve alignment, do we all die anyway" for the problems with that scenario.

Spoiler alert; we do all die anyway if really selfish people get control of AGIs. And selfish people tend to work harder at getting power.

But I do think your goal defintion is a good alignment target for the technical work. I don't think there's a better one. I do prefer instruction following or corriginlilty by the definitions in the posts I linked above because they're less rigid, but they're both very similar to your definition.

Replies from: kabir-kumar

↑ comment by Kabir Kumar (kabir-kumar) · 2024-11-05T01:27:59.334Z · LW(p) · GW(p)

I pretty much agree. I prefer rigid definitions because they're less ambiguous to test and more robust to deception. And this field has a lot of deception.

comment by Kabir Kumar (kabir-kumar) · 2025-02-19T22:03:42.838Z · LW(p) · GW(p)

in general, when it comes to things which are the 'hard part of alignment', is the crux
```
a flawless method of ensuring the AI system is pointed at and will always continue to be pointed at good things
```
?
the key part being flawless - and that seeming to need a mathematical proof?

comment by Kabir Kumar (kabir-kumar) · 2025-01-25T01:37:33.525Z · LW(p) · GW(p)

Thoughts on this?

### Limitations of HHH and other Static Dataset benchmarks

A Static Dataset is a dataset which will not grow or change - it will remain the same. Static dataset type benchmarks are inherently limited in what information they will tell us about a model. This is especially the case when we care about AI Alignment and want to measure how 'aligned' the AI is.

### Purpose of AI Alignment Benchmarks

When measuring AI Alignment, our aim is to find out exactly how close the model is to being the ultimate 'aligned' model that we're seeking - a model whose preferences are compatible with ours, in a way that will empower humanity, not harm or disempower it.

### Difficulties of Designing AI Alignment Benchmarks
What preferences those are, could be a significant part of the alignment problem. This means that we will need to frequently make sure we know what preferences we're trying to measure for and re-determine if these are the correct ones to be aiming for.

### Key Properties of Aligned Models

These preferences must be both robustly and faithfully held by the model:
Robustness:
- They will be preserved over unlimited iterations of the model, without deterioration or deprioritization.
- They will be robust to external attacks, manipulations, damage, etc of the model.
Faithfulness:
- The model 'believes in', 'values' or 'holds to be true and important' the preferences that we care about .
- It doesn't just store the preferences as information of equal priority to any other piece of information, e.g. how many cats are in Paris - but it holds them as its own, actual preferences.

Comment on the Google Doc here: https://docs.google.com/document/d/1PHUqFN9E62_mF2J5KjcfBK7-GwKT97iu2Cuc7B4Or2w/edit?usp=sharing

This is for the AI Alignment Evals Hackathon: https://lu.ma/xjkxqcya by AI-Plans

comment by Kabir Kumar (kabir-kumar) · 2025-01-05T13:55:38.433Z · LW(p) · GW(p)

I'm looking for feedback on the hackathon page
mind telling me what you think?
https://docs.google.com/document/d/1Wf9vju3TIEaqQwXzmPY--R0z41SMcRjAFyn9iq9r-ag/edit?usp=sharing

comment by Kabir Kumar (kabir-kumar) · 2025-01-04T01:00:32.555Z · LW(p) · GW(p)

https://kkumar97.blogspot.com/2025/01/pain-of-writing.html

comment by Kabir Kumar (kabir-kumar) · 2024-12-27T23:32:04.217Z · LW(p) · GW(p)

I'd like some feedback on my theory of impact for my currently chosen research path

**End goal**: Reduce x-risk from AI and risk of human disempowerment.
for x-risk:
- solving AI alignment - very important,
- knowing exactly how well we're doing in alignment, exactly how close we are to solving it, how much is left, etc seems important.
- how well different methods work,
- which companies are making progress in this, which aren't, which are acting like they're making progress vs actually making progress, etc
- put all on a graph, see who's actually making the line go up

- Also, a way that others can use to measure how good their alignment method/idea is, easily
so there's actually a target and a progress bar for alignment - seems like it'd make alignment research a lot easier and improve the funding space - and the space as a whole. Improving the quality and quantity of research.

- Currently, it's mostly a mixture of vibe checks, occasional benchmarks that test a few models, jailbreaks, etc
- all almost exclusively on the end models as a whole - which have many, many differences that could be contributing to the differences in the different 'alignment measurements'
by having a method that keeps things controlled as much as possible and just purely measures the different post training methods, this seems like a much better way to know how we're doing in alignment
and how to prioritize research, funding, governence, etc

On Goodharting the Line - will also make it modular, so that people can add their own benchmarks, and highlight people who redteam different alignment benchmarks.

Replies from: dtch1997

↑ comment by Daniel Tan (dtch1997) · 2024-12-28T09:05:48.853Z · LW(p) · GW(p)

What is the proposed research path and its theory of impact? It’s not clear from reading your note / generally seems too abstract to really offer any feedback

comment by Kabir Kumar (kabir-kumar) · 2024-12-10T03:20:42.414Z · LW(p) · GW(p)

I think this is a really good opportunity to work on a topic you might not normally work on, with people you might not normally work with, and have a big impact: https://lu.ma/sjd7r89v

I'm running the event because I think this is something really valuable and underdone.

comment by Kabir Kumar (kabir-kumar) · 2024-11-16T16:00:01.375Z · LW(p) · GW(p)

give better names to actual formal math things, jesus christ.

comment by Kabir Kumar (kabir-kumar) · 2024-12-23T16:41:18.038Z · LW(p) · GW(p)

I'm finally reading The Sequences and it screams midwittery to me, I'm sorry.

Compare this:
to Jaynes:

Jaynes is better organized, more respectful to the reader, more respectful to the work he's building on and more useful

Replies from: nathan-helm-burger

↑ comment by Nathan Helm-Burger (nathan-helm-burger) · 2024-12-23T18:15:58.105Z · LW(p) · GW(p)

The Sequences highly praise Jaynes and recommend reading his work directly.

The Sequences aren't trying to be a replacement, they're trying to be a pop sci intro to the style of thinking. An easier on-ramp. If Jaynes already seems exciting and comprehensible to you, read that instead of the Sequences on probability.

Replies from: kabir-kumar

↑ comment by Kabir Kumar (kabir-kumar) · 2024-12-23T19:55:38.682Z · LW(p) · GW(p)

Fair enough. Personally, so far, I've found Jaynes more comprehensible than The Sequences.

Replies from: nathan-helm-burger

↑ comment by Nathan Helm-Burger (nathan-helm-burger) · 2024-12-23T20:21:25.150Z · LW(p) · GW(p)

I think most people with a natural inclination towards math probably would feel likewise.

Kabir Kumar's Shortform

Contents

32 comments