Kabir Kumar's Shortform
post by Kabir Kumar (kabir-kumar) · 2024-11-03T17:03:01.824Z · LW · GW · 14 comments
comment by Kabir Kumar (kabir-kumar) · 2024-12-05T00:13:12.734Z · LW(p) · GW(p)
ok, options:
- Review of 108 AI alignment plans
- write-up of Beyond Distribution - planned benchmark for alignment evals beyond a model's distribution; send to the quant who just joined the team and wants to make it
- get familiar with the TPUs I just got access to
- run HHH and its variants, testing the idea behind Beyond Distribution, maybe make a guide on it
- continue improving site design
- fill out the form I said I was going to fill out and send today
- make progress on crosscoders - would prob need to get familiar with those TPUs
- writeup of ai-plans: the goal, the team, what we're doing, what we've done, etc.
- writeup of the karma/voting system
- the video on how to do backprop by hand
- tutorial on how to train an SAE (rough sketch at the end of this comment)
I think the Beyond Distribution writeup. He's waiting and I feel bad.
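Since the SAE tutorial is on the list, here's roughly the shape it could take - a minimal sketch in PyTorch, assuming cached activations from some chosen model layer (random tensors stand in for them here); the widths and the L1 coefficient are placeholder choices, not recommendations:

```python
# Minimal sparse autoencoder (SAE) training loop -- a rough sketch, not a real tutorial.
# A real run would train on cached activations from a specific model layer.
import torch
import torch.nn as nn

d_model, d_sae, l1_coeff = 512, 4096, 3e-4  # hypothetical widths and sparsity penalty

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_sae)
        self.dec = nn.Linear(d_sae, d_model)

    def forward(self, x):
        feats = torch.relu(self.enc(x))   # sparse feature activations
        return self.dec(feats), feats     # reconstruction + features

sae = SparseAutoencoder(d_model, d_sae)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)

for step in range(1000):
    acts = torch.randn(256, d_model)      # stand-in for a batch of cached activations
    recon, feats = sae(acts)
    # reconstruction error + L1 penalty on the features to encourage sparsity
    loss = (recon - acts).pow(2).mean() + l1_coeff * feats.abs().sum(-1).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```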
comment by Kabir Kumar (kabir-kumar) · 2024-11-03T17:03:02.193Z · LW(p) · GW(p)
btw, thoughts on this for 'the alignment problem'?
"A robust, generalizable, scalable, method to make an AI model which will do set [A] of things as much as it can and not do set [B] of things as much as it can, where you can freely change [A] and [B]"
↑ comment by Seth Herd · 2024-11-04T13:55:09.460Z · LW(p) · GW(p)
Freely changing an AGI's goals is corrigibility, which is a huge advantage if you can get it. See Max Harms' corrigibility sequence and my "instruction-following AGI is easier...."
The question is how to reliably get such a thing. Goalcrafting is one part of the problem, and I agree that those are good goals; the other and larger part is technical alignment, getting those desired goals to really work that way in the particular first AGI we get.
↑ comment by Kabir Kumar (kabir-kumar) · 2024-11-04T14:49:15.408Z · LW(p) · GW(p)
Yup, those are hard. Was just thinking of a definition for the alignment problem, since I've not really seen any good ones.
↑ comment by Seth Herd · 2024-11-04T16:13:13.970Z · LW(p) · GW(p)
I'd say you're addressing the question of goalcrafting or selecting alignment targets.
I think you've got the right answer for technical alignment goals; but the question remains of what human would control that AGI. See my "if we solve alignment, do we all die anyway" for the problems with that scenario.
Spoiler alert: we do all die anyway if really selfish people get control of AGIs. And selfish people tend to work harder at getting power.
But I do think your goal definition is a good alignment target for the technical work. I don't think there's a better one. I do prefer instruction-following or corrigibility, by the definitions in the posts I linked above, because they're less rigid, but they're both very similar to your definition.
↑ comment by Kabir Kumar (kabir-kumar) · 2024-11-05T01:27:59.334Z · LW(p) · GW(p)
I pretty much agree. I prefer rigid definitions because they're less ambiguous to test and more robust to deception. And this field has a lot of deception.
comment by Kabir Kumar (kabir-kumar) · 2024-12-27T23:32:04.217Z · LW(p) · GW(p)
I'd like some feedback on my theory of impact for my currently chosen research path.
**End goal**: Reduce x-risk from AI and risk of human disempowerment.
For x-risk:
- solving AI alignment - very important
- knowing exactly how well we're doing in alignment, exactly how close we are to solving it, how much is left, etc. also seems important:
- how well different methods work
- which companies are making progress on this, which aren't, which are acting like they're making progress vs actually making progress, etc.
- put it all on a graph and see who's actually making the line go up
- Also, an easy way for others to measure how good their alignment method/idea is
so there's actually a target and a progress bar for alignment - that seems like it'd make alignment research a lot easier and improve the funding space, and the space as a whole, improving the quality and quantity of research.
- Currently, it's mostly a mixture of vibe checks, occasional benchmarks that test a few models, jailbreaks, etc.
- almost all of it on the end models as a whole, which have many, many differences that could be contributing to the differences in the 'alignment measurements'
Having a method that keeps things controlled as much as possible and just purely measures the different post-training methods seems like a much better way to know how we're doing in alignment and how to prioritize research, funding, governance, etc.
On Goodharting the line: I'll also make it modular, so that people can add their own benchmarks, and I'll highlight people who red-team different alignment benchmarks (rough sketch of the modular setup below).
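To make that concrete, here's a hypothetical sketch of what the controlled, modular setup could look like: hold the base model and the eval suite fixed, vary only the post-training method, and get one comparable row of scores per method to plot over time. Every name here (AlignmentEval, Leaderboard, the dummy scorers) is a placeholder invented for illustration, not an existing API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class AlignmentEval:
    name: str
    run: Callable[[str], float]  # takes a (tuned) model id, returns a score in [0, 1]

@dataclass
class Leaderboard:
    """Modular registry: anyone can plug in their own eval or post-training method."""
    evals: List[AlignmentEval] = field(default_factory=list)

    def register_eval(self, ev: AlignmentEval) -> None:
        self.evals.append(ev)

    def score_method(self, base_model: str, post_train: Callable[[str], str]) -> Dict[str, float]:
        tuned = post_train(base_model)            # the only step that differs between methods
        return {ev.name: ev.run(tuned) for ev in self.evals}

# Usage: same base model, two placeholder post-training methods, dummy scorers.
board = Leaderboard()
board.register_eval(AlignmentEval("refusal_on_harmful", lambda model: 0.0))
board.register_eval(AlignmentEval("honesty_probe", lambda model: 0.0))
results = {
    "rlhf_v1": board.score_method("base-model", lambda m: m + "+rlhf_v1"),
    "dpo_v1": board.score_method("base-model", lambda m: m + "+dpo_v1"),
}
print(results)  # one comparable row per post-training method -> plot these over time
```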
↑ comment by Daniel Tan (dtch1997) · 2024-12-28T09:05:48.853Z · LW(p) · GW(p)
What is the proposed research path and its theory of impact? It's not clear from reading your note, which generally seems too abstract to really offer any feedback on.
comment by Kabir Kumar (kabir-kumar) · 2024-12-10T03:20:42.414Z · LW(p) · GW(p)
I think this is a really good opportunity to work on a topic you might not normally work on, with people you might not normally work with, and have a big impact:
https://lu.ma/sjd7r89v
I'm running the event because I think this is something really valuable and underdone.
comment by Kabir Kumar (kabir-kumar) · 2024-11-16T16:00:01.375Z · LW(p) · GW(p)
Give better names to actual formal math things, Jesus Christ.
comment by Kabir Kumar (kabir-kumar) · 2024-12-23T16:41:18.038Z · LW(p) · GW(p)
I'm finally reading The Sequences and it screams midwittery to me, I'm sorry.
Compare this [quoted passage from The Sequences, not shown] to Jaynes [quoted passage, not shown]:
Jaynes is better organized, more respectful to the reader, more respectful to the work he's building on, and more useful.
↑ comment by Nathan Helm-Burger (nathan-helm-burger) · 2024-12-23T18:15:58.105Z · LW(p) · GW(p)
The Sequences highly praise Jaynes and recommend reading his work directly.
The Sequences aren't trying to be a replacement; they're trying to be a pop-sci intro to the style of thinking, an easier on-ramp. If Jaynes already seems exciting and comprehensible to you, read that instead of the Sequences on probability.
↑ comment by Kabir Kumar (kabir-kumar) · 2024-12-23T19:55:38.682Z · LW(p) · GW(p)
Fair enough. Personally, so far, I've found Jaynes more comprehensible than The Sequences.
↑ comment by Nathan Helm-Burger (nathan-helm-burger) · 2024-12-23T20:21:25.150Z · LW(p) · GW(p)
I think most people with a natural inclination towards math probably would feel likewise.