Posts

Making progress bars for Alignment 2025-01-03T21:25:58.292Z
AI & Liability Ideathon 2024-11-26T13:54:01.820Z
Kabir Kumar's Shortform 2024-11-03T17:03:01.824Z

Comments

Comment by Kabir Kumar (kabir-kumar) on Steering Gemini with BiDPO · 2025-01-31T02:52:22.725Z · LW · GW

Thank you for sharing negative results!! 

Comment by Kabir Kumar (kabir-kumar) on The Gentle Romance · 2025-01-31T02:34:36.070Z · LW · GW

Sure? I agree this is less bad than 'literally everyone dying and that's it', assuming there's humans around, living, still empowered, etc in the background. 

I was saying overall, as a story, I find it horrifying, especially contrasting with how some seem to see it utopic. 

Comment by Kabir Kumar (kabir-kumar) on The Gentle Romance · 2025-01-31T00:23:28.837Z · LW · GW
  1. Sure, but it seems like everyone died at some point anyway, and some collective copies of them went on? 

     

  2. I don't think so. I think they seem to be extremely lonely and sad and the AIs are the only way for them to get any form of empowerment. And each time they try to inch further with empowering themselves with the AIs, it leads to the AI actually getting more powerful and themselves only getting a brief moment of more power, but ultimately degrading in mental capacity. And needing to empower the AI more and more, like an addict needing an ever greater high. Until there is nothing left for them to do, but Die and let the AI become the ultimate power. 

     

  3.  I don't particularly care if some non human semisentients manage to be kind of moral/good at coordinating, if it came at what seems to be the cost of all human life. 

 

Even if offscreen all of humanity didn't die, these people dying, killing themselves and never realizing what's actually happening is still insanely horrific and tragic. 

Comment by Kabir Kumar (kabir-kumar) on The Gentle Romance · 2025-01-30T19:53:49.175Z · LW · GW

How is this optimistic. 

Comment by Kabir Kumar (kabir-kumar) on The Gentle Romance · 2025-01-30T19:53:16.263Z · LW · GW

Oh yes. It's extremely dystopian. And extremely lonely, too. Rather than having a person, actual people around him to help, his only help comes from tech. It's horrifyingly lonely and isolated. There is no community, only tech. 

Also, when they died together, it was horrible. They literally offloaded more and more of themselves into their tech until they were powerless to do anything but die. I don't buy the whole 'the thoughts were basically them' thing at all. It was at best, some copy of them. 

There can be made an argument for it qualitatively being them, but quantitatively, obviously not. 

Comment by Kabir Kumar (kabir-kumar) on The Gentle Romance · 2025-01-30T19:48:41.163Z · LW · GW

A few months later, he and Elena decide to make the jump to full virtuality. He lies next to Elena in the hospital, holding her hand, as their physical bodies drift into a final sleep. He barely feels the transition

this is horrifying. Was it intentionally made that way?

Comment by Kabir Kumar (kabir-kumar) on Kabir Kumar's Shortform · 2025-01-25T01:37:33.525Z · LW · GW

Thoughts on this?


### Limitations of HHH and other Static Dataset benchmarks

A Static Dataset is a dataset which will not grow or change - it will remain the same. Static dataset type benchmarks are inherently limited in what information they will tell us about a model. This is especially the case when we care about AI Alignment and want to measure how 'aligned' the AI is.

### Purpose of AI Alignment Benchmarks

When measuring AI Alignment, our aim is to find out exactly how close the model is to being the ultimate 'aligned' model that we're seeking - a model whose preferences are compatible with ours, in a way that will empower humanity, not harm or disempower it.

### Difficulties of Designing AI Alignment Benchmarks
What preferences those are, could be a significant part of the alignment problem. This means that we will need to frequently make sure we know what preferences we're trying to measure for and re-determine if these are the correct ones to be aiming for.

### Key Properties of Aligned Models

These preferences must be both robustly and faithfully held by the model:
Robustness: 
- They will be preserved over unlimited iterations of the model, without deterioration or deprioritization. 
- They will be robust to external attacks, manipulations, damage, etc of the model.
Faithfulness: 
- The model 'believes in', 'values' or 'holds to be true and important' the preferences that we care about .
- It doesn't just store the preferences as information of equal priority to any other piece of information, e.g. how many cats are in Paris - but it holds them as its own, actual preferences.

Comment on the Google Doc here: https://docs.google.com/document/d/1PHUqFN9E62_mF2J5KjcfBK7-GwKT97iu2Cuc7B4Or2w/edit?usp=sharing

This is for the AI Alignment Evals Hackathon: https://lu.ma/xjkxqcya by AI-Plans

Comment by Kabir Kumar (kabir-kumar) on Kabir Kumar's Shortform · 2025-01-20T13:57:40.966Z · LW · GW

this might basically be me, but I'm not sure how exactly to change for the better. theorizing seems to take time and money which i don't have. 

Comment by Kabir Kumar (kabir-kumar) on Kabir Kumar's Shortform · 2025-01-16T12:29:20.865Z · LW · GW

Thinking about judgement criteria for the coming ai safety evals hackathon (https://lu.ma/xjkxqcya )
These are the things that need to be judged: 
1. Is the benchmark actually measuring alignment (the real, scale, if we dont get this fully right right we die, problem) 
2. Is the way of Deceiving the benchmark to get high scores actually deception, or have they somehow done alignment?

Both of these things need: 
- a strong deep learning & ml background (ideally, muliple influential papers where they're one of the main authors/co-authors, or doing ai research at a significant lab, or they have, in the last 4 years)
- a good understanding of what the real alignment problem actually means - can judge this by looking at their papers, activity on lesswrong, alignmentforum, blog, etc
- a good understanding of evals/benchmarks (1 great or two pretty good papers/repos/works on this, ideally for alignment)

Do these seem loose? Strict? Off base?

Comment by Kabir Kumar (kabir-kumar) on Kabir Kumar's Shortform · 2025-01-05T13:55:38.433Z · LW · GW

 I'm looking for feedback on the hackathon page
mind telling me what you think?
https://docs.google.com/document/d/1Wf9vju3TIEaqQwXzmPY--R0z41SMcRjAFyn9iq9r-ag/edit?usp=sharing

Comment by Kabir Kumar (kabir-kumar) on Why I'm Moving from Mechanistic to Prosaic Interpretability · 2025-01-05T02:03:38.037Z · LW · GW

Intelligence is computation. It's measure is success. General intelligence is more generally successful. 

Comment by Kabir Kumar (kabir-kumar) on Kabir Kumar's Shortform · 2025-01-04T01:00:32.555Z · LW · GW

https://kkumar97.blogspot.com/2025/01/pain-of-writing.html 

Comment by Kabir Kumar (kabir-kumar) on Shallow review of live agendas in alignment & safety · 2024-12-30T15:47:58.631Z · LW · GW

We're doing this on https://ai-plans.com !

Comment by Kabir Kumar (kabir-kumar) on johnswentworth's Shortform · 2024-12-27T23:40:06.335Z · LW · GW

Personally, I think o1 is uniquely trash, I think o1-preview was actually better. Getting on average, better things from deepseek and sonnet 3.5 atm. 

Comment by Kabir Kumar (kabir-kumar) on Oliver Daniels-Koch's Shortform · 2024-12-27T23:37:54.903Z · LW · GW

I like bluesky for this atm

Comment by Kabir Kumar (kabir-kumar) on Kabir Kumar's Shortform · 2024-12-27T23:32:04.217Z · LW · GW

I'd like some feedback on my theory of impact for my currently chosen research path

**End goal**: Reduce x-risk from AI and risk of human disempowerment. 
for x-risk: 
- solving AI alignment - very important, 
- knowing exactly how well we're doing in alignment, exactly how close we are to solving it, how much is left, etc seems important.
 - how well different methods work, 
 - which companies are making progress in this, which aren't, which are acting like they're making progress vs actually making progress, etc
 - put all on a graph, see who's actually making the line go up

- Also, a way that others can use to measure how good their alignment method/idea is, easily 
so there's actually a target and a progress bar for alignment - seems like it'd make alignment research a lot easier and improve the funding space - and the space as a whole. Improving the quality and quantity of research.

- Currently, it's mostly a mixture of vibe checks, occasional benchmarks that test a few models, jailbreaks, etc
- all almost exclusively on the end models as a whole - which have many, many differences that could be contributing to the differences in the different 'alignment measurements'
by having a method that keeps things controlled as much as possible and just purely measures the different post training methods, this seems like a much better way to know how we're doing in alignment
and how to prioritize research, funding, governence, etc

On Goodharting the Line - will also make it modular, so that people can add their own benchmarks, and highlight people who redteam different alignment benchmarks.

Comment by Kabir Kumar (kabir-kumar) on Kabir Kumar's Shortform · 2024-12-23T19:55:38.682Z · LW · GW

Fair enough. Personally, so far, I've found Jaynes more comprehensible than The Sequences.

Comment by Kabir Kumar (kabir-kumar) on Kabir Kumar's Shortform · 2024-12-23T16:41:18.038Z · LW · GW

I'm finally reading The Sequences and it screams midwittery to me, I'm sorry. 

Compare this:
to Jaynes:


Jaynes is better organized, more respectful to the reader, more respectful to the work he's building on and more useful
 

Comment by Kabir Kumar (kabir-kumar) on Kabir Kumar's Shortform · 2024-12-10T03:20:42.414Z · LW · GW

I think this is a really good opportunity to work on a topic you might not normally work on, with people you might not normally work with, and have a big impact: https://lu.ma/sjd7r89v 

I'm running the event because I think this is something really valuable and underdone.

Comment by Kabir Kumar (kabir-kumar) on Stupid Question: Why am I getting consistently downvoted? · 2024-12-09T02:24:14.157Z · LW · GW

Pretty much drove me away from wanting to post non alignment stuff here.

Comment by Kabir Kumar (kabir-kumar) on Stupid Question: Why am I getting consistently downvoted? · 2024-12-09T02:20:56.678Z · LW · GW

That seems unhelpful then? Probably best to express that frustration to a friend or someone who'd sympathize.

Comment by Kabir Kumar (kabir-kumar) on Seeking Collaborators · 2024-12-08T20:44:16.712Z · LW · GW

Thank you for continuing this very important work.

Comment by Kabir Kumar (kabir-kumar) on Kabir Kumar's Shortform · 2024-12-05T00:13:12.734Z · LW · GW

ok, options. 
- Review of 108 ai alignment plans
- write-up of Beyond Distribution - planned benchmark for alignment evals beyond a models distribution, send to the quant who just joined the team who wants to make it
- get familiar with the TPUs I just got access to
- run hhh and it's variants, testing the idea behind Beyond Distribution, maybe make a guide on itr 
- continue improving site design

- fill out the form i said i was going to fill out and send today
- make progress on cross coders - would prob need to get familiar with those tpus
- writeup of ai-plans, the goal, the team, what we're doing, what we've done, etc
- writeup of the karma/voting system
- the video on how to do backprop by hand
- tutorial on how to train an sae

think Beyond Distribution writeup. he's waiting and i feel bad. 

Comment by Kabir Kumar (kabir-kumar) on [New Feature] Your Subscribed Feed · 2024-12-04T23:59:43.656Z · LW · GW

LessWrong is basically becoming twitter, huh?

Comment by Kabir Kumar (kabir-kumar) on Automatically finding feature vectors in the OV circuits of Transformers without using probing · 2024-11-27T01:36:04.904Z · LW · GW

I think the Conclusion could serve well as an abstract

Comment by Kabir Kumar (kabir-kumar) on Automatically finding feature vectors in the OV circuits of Transformers without using probing · 2024-11-27T01:35:37.024Z · LW · GW

An abstract which is easier to understand and a couple sentences at each section that explain their general meaning and significance would make this much more accessible

Comment by Kabir Kumar (kabir-kumar) on AI & Liability Ideathon · 2024-11-27T00:30:50.636Z · LW · GW

I plan to send the winning proposals from this to as many governing bodies/places that are enacting laws as possible - one country is lined up atm. 

Comment by Kabir Kumar (kabir-kumar) on AI & Liability Ideathon · 2024-11-27T00:29:48.870Z · LW · GW

Let me know if you have any questions!

Comment by Kabir Kumar (kabir-kumar) on Yonatan Cale's Shortform · 2024-11-26T14:13:18.255Z · LW · GW

options to vary rules/environment/language as well, to see how the alignment generalizes ood. will try this today

Comment by Kabir Kumar (kabir-kumar) on Yonatan Cale's Shortform · 2024-11-26T14:12:33.736Z · LW · GW

it would basically be DnD like. 

Comment by Kabir Kumar (kabir-kumar) on Yonatan Cale's Shortform · 2024-11-26T14:10:59.408Z · LW · GW

Making a thing like Papers Please, but as a text adventure, popping an ai agent into that. 
Also, could literally just put the ai agent into a text rpg adventure - something like the equivalent of Skyrim, where there are a number of ways to achieve the endgame, level up, etc, both more and less morally. Maybe something like https://www.choiceofgames.com/werewolves-3-evolutions-end/ 
Will bring it up at the alignment eval hackathon

Comment by Kabir Kumar (kabir-kumar) on DeepSeek beats o1-preview on math, ties on coding; will release weights · 2024-11-25T18:42:24.277Z · LW · GW

I see them in o1-preview all the time as well. Also, french occasionally

Comment by Kabir Kumar (kabir-kumar) on DeepSeek beats o1-preview on math, ties on coding; will release weights · 2024-11-25T18:41:28.609Z · LW · GW

If developments like this continue, could open weights models be made into a case for not racing? E.g. if everyone's getting access to the weights, what's the point in spending billions to get there 2 weeks earlier?

Comment by Kabir Kumar (kabir-kumar) on Yonatan Cale's Shortform · 2024-11-25T18:31:46.181Z · LW · GW

this can be done more scalably in a text game, no? 

Comment by Kabir Kumar (kabir-kumar) on The Online Sports Gambling Experiment Has Failed · 2024-11-16T20:18:25.904Z · LW · GW

People Cannot Handle Gambling on Smartphones

this seems a very strange way to say "Smartphone Gambling is Unhealthy"
It's like saying "People's Lungs Cannot Handle Cigarettes"

Comment by Kabir Kumar (kabir-kumar) on The hostile telepaths problem · 2024-11-16T20:05:41.585Z · LW · GW

To be a bit less useless - I think this fundamentally misses the problem of respect and actually being able to communicate with yourself and fully do things, if you've done so - and that you can do these when you have full faith and respect in yourself (meaning all of yourself - may include love as well, not sure how necessary that is for this). Could maybe be done in other ways as well, but I find those less beautiful, personally. 

Comment by Kabir Kumar (kabir-kumar) on The hostile telepaths problem · 2024-11-16T20:01:33.286Z · LW · GW

I think this is really along the wrong path and misunderstanding a lot of things, but so far along the incorrect path of thought and misunderstanding so much, that it's hard to untangle

Comment by Kabir Kumar (kabir-kumar) on The hostile telepaths problem · 2024-11-16T19:57:39.771Z · LW · GW

I thought this was going to be an allegory for interpretability.

Comment by Kabir Kumar (kabir-kumar) on Kabir Kumar's Shortform · 2024-11-16T16:00:01.375Z · LW · GW

give better names to actual formal math things, jesus christ. 

Comment by kabir-kumar on [deleted post] 2024-11-13T13:35:39.766Z

I think posts like this are net harmful, by discouraging people from joining those doing good things without providing an alternative and so wasting energy on meaningless ruminating that doesn't culminate in any useful action.

Comment by Kabir Kumar (kabir-kumar) on Lorec's Shortform · 2024-11-08T20:36:39.749Z · LW · GW

oh, sorry, I thought slatestar codex wrote something about it and you were saying that's where it comes from

Comment by Kabir Kumar (kabir-kumar) on Kabir Kumar's Shortform · 2024-11-05T01:27:59.334Z · LW · GW

I pretty much agree. I prefer rigid definitions because they're less ambiguous to test and more robust to deception. And this field has a lot of deception.

Comment by Kabir Kumar (kabir-kumar) on Kabir Kumar's Shortform · 2024-11-04T14:49:15.408Z · LW · GW

Yup, those are hard. Was just thinking of a definition for the alignment problem, since I've not really seen any good ones.

Comment by Kabir Kumar (kabir-kumar) on Shortform · 2024-11-03T17:19:50.054Z · LW · GW

what do you think of replit agent, stack blitz, etc?

Comment by Kabir Kumar (kabir-kumar) on Ricki Heicklen's Shortform · 2024-11-03T17:19:10.935Z · LW · GW

damn, those prices are wild

Comment by Kabir Kumar (kabir-kumar) on Lorec's Shortform · 2024-11-03T17:11:48.303Z · LW · GW

used before, e.g. Feynman: https://calteches.library.caltech.edu/51/2/CargoCult.htm

Comment by Kabir Kumar (kabir-kumar) on Kabir Kumar's Shortform · 2024-11-03T17:03:02.193Z · LW · GW

btw, thoughts on this for 'the alignment problem'?
"A robust, generalizable, scalable,  method to make an AI model which will do set [A] of things as much as it can and not do set [B] of things as much as it can, where you can freely change [A] and [B]"

Comment by Kabir Kumar (kabir-kumar) on Are we dropping the ball on Recommendation AIs? · 2024-11-03T17:02:47.630Z · LW · GW

Unfortunately this is a fundamental problem of Media, imo. 

Comment by Kabir Kumar (kabir-kumar) on Are we dropping the ball on Recommendation AIs? · 2024-11-03T17:01:35.540Z · LW · GW

Yes, this would be very very good. I might hold a hackathon/ideathon for this in January. 

Comment by Kabir Kumar (kabir-kumar) on The Rocket Alignment Problem · 2024-10-28T09:11:41.233Z · LW · GW

I didn't get the premise, no. I got that it was before a lot of physics was known, didn't know they didn't know calculus either. 
Just stating it plainly and clearly at the start would have been good. Even with that premise, I still find it very annoying. I despise the refusal to speak clearly, the way it's constantly dancing around the bush, not saying the actual point, to me this is pretty obviously because the actual point is a nothing burger(because the analogy is bad) and by dancing around it, the text is trying to distract me and convince me of the point before I realize how dumb it is. 

Why the analogy is bad: rocket flights can be tested and simulated much more easily than a superintelligence, with a lot less risk

Analogies are by nature lossy, this one is especially so.