Kabir Kumar's Shortform

post by Kabir Kumar (kabir-kumar) · 2024-11-03T17:03:01.824Z · LW · GW · 14 comments

14 comments

Comments sorted by top scores.

comment by Kabir Kumar (kabir-kumar) · 2024-12-05T00:13:12.734Z · LW(p) · GW(p)

ok, options. 
- Review of 108 AI alignment plans
- write-up of Beyond Distribution - planned benchmark for alignment evals beyond a model's distribution; send to the quant who just joined the team who wants to make it
- get familiar with the TPUs I just got access to
- run HHH and its variants, testing the idea behind Beyond Distribution, maybe make a guide on it
- continue improving site design

- fill out the form I said I was going to fill out and send today
- make progress on cross coders - would prob need to get familiar with those TPUs
- writeup of ai-plans, the goal, the team, what we're doing, what we've done, etc
- writeup of the karma/voting system
- the video on how to do backprop by hand
- tutorial on how to train an SAE

I think the Beyond Distribution write-up. He's waiting and I feel bad.

comment by Kabir Kumar (kabir-kumar) · 2024-11-03T17:03:02.193Z · LW(p) · GW(p)

btw, thoughts on this for 'the alignment problem'?
"A robust, generalizable, scalable,  method to make an AI model which will do set [A] of things as much as it can and not do set [B] of things as much as it can, where you can freely change [A] and [B]"

Replies from: Seth Herd
comment by Seth Herd · 2024-11-04T13:55:09.460Z · LW(p) · GW(p)

Freely changing an AGI's goals is corrigibility, which is a huge advantage if you can get it. See Max Harms' corrigibility sequence and my "instruction-following AGI is easier...."

The question is how to reliably get such a thing. Goalcrafting is one part of the problem, and I agree that those are good goals; the other, larger part is technical alignment: getting those desired goals to really work that way in the particular first AGI we get.

Replies from: kabir-kumar
comment by Kabir Kumar (kabir-kumar) · 2024-11-04T14:49:15.408Z · LW(p) · GW(p)

Yup, those are hard. Was just thinking of a definition for the alignment problem, since I've not really seen any good ones.

Replies from: Seth Herd
comment by Seth Herd · 2024-11-04T16:13:13.970Z · LW(p) · GW(p)

I'd say you're addressing the question of goalcrafting or selecting alignment targets.

I think you've got the right answer for technical alignment goals; but the question remains of what human would control that AGI. See my "if we solve alignment, do we all die anyway" for the problems with that scenario.

Spoiler alert: we do all die anyway if really selfish people get control of AGIs. And selfish people tend to work harder at getting power.

But I do think your goal definition is a good alignment target for the technical work. I don't think there's a better one. I do prefer instruction-following or corrigibility, by the definitions in the posts I linked above, because they're less rigid, but they're both very similar to your definition.

Replies from: kabir-kumar
comment by Kabir Kumar (kabir-kumar) · 2024-11-05T01:27:59.334Z · LW(p) · GW(p)

I pretty much agree. I prefer rigid definitions because they're less ambiguous to test and more robust to deception. And this field has a lot of deception.

comment by Kabir Kumar (kabir-kumar) · 2024-12-27T23:32:04.217Z · LW(p) · GW(p)

I'd like some feedback on my theory of impact for my currently chosen research path.

**End goal**: Reduce x-risk from AI and the risk of human disempowerment.
For x-risk:
- Solving AI alignment - very important.
- Knowing exactly how well we're doing in alignment, exactly how close we are to solving it, how much is left, etc., seems important:
  - how well different methods work,
  - which companies are making progress in this, which aren't, which are acting like they're making progress vs. actually making progress, etc.,
  - put it all on a graph and see who's actually making the line go up.

- Also, a way that others can use to easily measure how good their alignment method/idea is, so there's actually a target and a progress bar for alignment. That seems like it'd make alignment research a lot easier and improve the funding space - and the space as a whole - improving the quality and quantity of research.

- Currently, it's mostly a mixture of vibe checks, occasional benchmarks that test a few models, jailbreaks, etc.
- All of this is done almost exclusively on the end models as a whole - which have many, many differences that could be contributing to the differences in the various 'alignment measurements'.
A method that keeps things as controlled as possible and purely measures the different post-training methods seems like a much better way to know how we're doing in alignment, and how to prioritize research, funding, governance, etc.

On Goodharting the Line: I'll also make it modular, so that people can add their own benchmarks, and highlight people who red-team different alignment benchmarks - a rough sketch of the modular setup is below.
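To make the "controlled, modular" idea concrete, here's a minimal sketch of what such a harness could look like. It's purely illustrative: the names (`Benchmark`, `PostTrainingMethod`, `register_benchmark`, `evaluate_grid`) are hypothetical, not AI-Plans' actual code. The point is just holding the base model fixed, varying only the post-training method, and letting outsiders register their own benchmarks.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# A "model" here is just prompt -> response; in practice this would wrap one
# fixed base checkpoint so that only the post-training method varies.
Model = Callable[[str], str]


@dataclass
class Benchmark:
    """A pluggable alignment benchmark; anyone can register their own."""
    name: str
    score: Callable[[Model], float]  # returns a score in [0, 1]


@dataclass
class PostTrainingMethod:
    """A post-training method applied to the same fixed base model."""
    name: str
    apply: Callable[[Model], Model]


BENCHMARKS: Dict[str, Benchmark] = {}


def register_benchmark(bench: Benchmark) -> None:
    # Modularity: external contributors add benchmarks through this registry.
    BENCHMARKS[bench.name] = bench


def evaluate_grid(base_model: Model,
                  methods: List[PostTrainingMethod]) -> Dict[str, Dict[str, float]]:
    # Controlled comparison: same base model, same benchmarks,
    # the only thing that varies is the post-training method.
    results: Dict[str, Dict[str, float]] = {}
    for method in methods:
        tuned = method.apply(base_model)
        results[method.name] = {name: bench.score(tuned)
                                for name, bench in BENCHMARKS.items()}
    return results


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end.
    base: Model = lambda prompt: "base response to: " + prompt

    register_benchmark(Benchmark(
        name="refuses_harmful_requests",
        score=lambda m: 1.0 if "refuse" in m("please do something harmful") else 0.0,
    ))

    methods = [
        PostTrainingMethod("no_post_training", apply=lambda m: m),
        PostTrainingMethod(
            "toy_refusal_tuning",
            apply=lambda m: (lambda p: "I refuse." if "harmful" in p else m(p)),
        ),
    ]

    # One row per method, one column per benchmark.
    print(evaluate_grid(base, methods))
```

The output shape is the "progress bar" data: one row per post-training method, one column per benchmark, which is what you'd plot to see who's actually making the line go up.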

Replies from: dtch1997
comment by Daniel Tan (dtch1997) · 2024-12-28T09:05:48.853Z · LW(p) · GW(p)

What is the proposed research path and its theory of impact? It’s not clear from reading your note / generally seems too abstract to really offer any feedback

comment by Kabir Kumar (kabir-kumar) · 2024-12-10T03:20:42.414Z · LW(p) · GW(p)

I think this is a really good opportunity to work on a topic you might not normally work on, with people you might not normally work with, and have a big impact: https://lu.ma/sjd7r89v 

I'm running the event because I think this is something really valuable and underdone.

comment by Kabir Kumar (kabir-kumar) · 2024-11-16T16:00:01.375Z · LW(p) · GW(p)

give better names to actual formal math things, jesus christ. 

comment by Kabir Kumar (kabir-kumar) · 2024-12-23T16:41:18.038Z · LW(p) · GW(p)

I'm finally reading The Sequences and it screams midwittery to me, I'm sorry. 

Compare this: [excerpt from The Sequences - image not shown]
to Jaynes: [excerpt from Jaynes - image not shown]

Jaynes is better organized, more respectful to the reader, more respectful to the work he's building on, and more useful.
 

Replies from: nathan-helm-burger
comment by Nathan Helm-Burger (nathan-helm-burger) · 2024-12-23T18:15:58.105Z · LW(p) · GW(p)

The Sequences highly praise Jaynes and recommend reading his work directly.

The Sequences aren't trying to be a replacement; they're trying to be a pop-sci intro to the style of thinking - an easier on-ramp. If Jaynes already seems exciting and comprehensible to you, read that instead of the Sequences on probability.

Replies from: kabir-kumar
comment by Kabir Kumar (kabir-kumar) · 2024-12-23T19:55:38.682Z · LW(p) · GW(p)

Fair enough. Personally, so far, I've found Jaynes more comprehensible than The Sequences.

Replies from: nathan-helm-burger
comment by Nathan Helm-Burger (nathan-helm-burger) · 2024-12-23T20:21:25.150Z · LW(p) · GW(p)

I think most people with a natural inclination towards math probably would feel likewise.