Posts

hmys's Shortform 2024-12-08T21:37:32.955Z
Practical advice for secure virtual communication post easy AI voice-cloning? 2024-08-09T17:32:33.458Z
Plausibility of Getting Early Warning Shots because AIs can't coordinate? 2024-04-27T08:02:10.792Z

Comments

Comment by hmys (the-cactus) on hmys's Shortform · 2024-12-08T21:37:33.118Z · LW · GW

I think for the fundraiser, Lightcone should sell (overpriced) LW hoodies. LessWrong has a very nice aesthetic now, and while this is probably a byproduct of a part of my mind I shouldn't encourage, I find it quite appealing to buy a $450 LW hoodie, even though I don't have that much money. I'd probably not donate to the fundraiser otherwise, and if I did, I'd donate less than the margin on such a hoodie would be.

Comment by hmys (the-cactus) on Reducing x-risk might be actively harmful · 2024-11-20T22:57:05.975Z · LW · GW

People seem to disagree with this comment. There are two statements and one argument in it:

  1. Humanity's current existence and its history are net negatives.
  2. The future, assuming humans survive, will have massive positive utility.
    1. The argument for why this is the case, based on something something optimization.

What are people disagreeing with? Is it mostly the former? I think the latter is rather clear; I'm very confident it is true, both the argument and the conclusion. The former I'm quite confident is true as well (~90%?), but only for my set of values.

Comment by hmys (the-cactus) on Trying Bluesky · 2024-11-19T20:00:17.619Z · LW · GW

https://bsky.app/profile/hmys.bsky.social/post/3lbd7wacakn25

I made one. A lot of people are not there, but many are.

Comment by hmys (the-cactus) on Reducing x-risk might be actively harmful · 2024-11-18T18:25:50.211Z · LW · GW

Seems unlikely to me. I mean, I think, in large part due to factory farming, that humanity's current immediate existence, and also its history, are net negatives. The reason I'm not a full-blown antinatalist is that these issues are likely to be remedied in the future, and the goodness of the future will astronomically dwarf the negativity humanity has brought and is bringing about (assuming we survive and realize a non-negligible fraction of our cosmic endowment).

The reason I think this is that, the way I view it, it's an immediate corollary of the standard Yudkowsky/Bostrom AI arguments. Animals existing and suffering is an extremely specific state of affairs, just like humans existing and being happy is an extremely specific state of affairs. This means that if you optimize hard enough for anything that isn't exactly that (happy humans, or suffering animals), you're not going to get it.

And maybe this is me being too optimistic (but I really hope not, and I really don't think so), but I don't think many humans want animals to suffer for its own sake. They'd eat lab-grown meat if it were cheaper and better-tasting than animal-grown meat. Lab-grown meat is a good example of the general principle I'm talking about. Suffering in sentient minds is a complex thing. If you have a powerful optimizer going about its way optimizing the universe, you're virtually never going to get suffering sentient minds unless that is what the optimizer is deliberately aiming for.

Comment by hmys (the-cactus) on o1 is a bad idea · 2024-11-12T08:06:54.200Z · LW · GW

I agree with this analysis. I mean, I'm not certain further optimization will erode the interpretability of the generated CoT; it's possible that the fact it's pretrained to use human natural language pushes it into a stable equilibrium. But I don't think so: there are ways the CoT can become less interpretable in a step-wise fashion.

But this is the way it's going, and it seems inevitable to me. Just scaling up models and training them on English-language internet text is clearly less efficient (from a "build AGI" perspective, and from a profit perspective) than training them to do the specific tasks that the users of the technology want. So that's the way it's going.

And once you're training the models this way, the tether between human-understandable concepts and the CoT will be completely destroyed. If they stay together, it will just be because it happens to be a stable initial condition.

Comment by hmys (the-cactus) on Human Biodiversity (Part 4: Astral Codex Ten) · 2024-11-03T19:02:15.781Z · LW · GW

I just meant not primarily motivated by truth.

Comment by hmys (the-cactus) on Human Biodiversity (Part 4: Astral Codex Ten) · 2024-11-03T13:30:54.263Z · LW · GW

I think this is a really bad article. So bad that I can't see it not being written with ulterior motives.

1. Too many things are taken out of context, like the "feminists are literally Voldemort" quote.

2. Too many things are paraphrased in dishonest and ridiculously over-the-top ways, like saying Harris has "longstanding plans to sterilize people of color" before a quote that just says she wants to give birth control to people in Haiti.

3. Offering negative-infinity charity in every single area. In the HBD email, Scott says he thinks neoreactionaries create endless streams of garbage, but with some tiny nuggets of gold, and that he can take the nuggets of gold and just tune out the rest. The article then goes on to list everything bad about neoreactionaries, as if Scott's email is evidence he endorses all of neoreaction? What?

4. Overall, no clear direct argument. The article spends half its words justifying the connection between Scott and EA, which I don't think anyone would deny. Then it puts up the email and instantly infers the worst possible intent behind it with little justification. Then it lists every single racist person Scott has ever said anything even lightly good about.

Overall, the article updates me in the direction of thinking Scott is less racist and less sympathetic to neoreactionary thinking. The author has clearly put in effort and is clearly trying their very best to paint Scott in a bad light, and Scott has literally 20 years of constant blogging put out openly on the internet. But the article is not very convincing.

Comment by hmys (the-cactus) on BIG-Bench Canary Contamination in GPT-4 · 2024-10-23T12:52:44.712Z · LW · GW

But the probability? :O

Comment by hmys (the-cactus) on BIG-Bench Canary Contamination in GPT-4 · 2024-10-23T11:01:50.881Z · LW · GW

What is the probability that they intentionally fine-tuned to hide canary contamination?

Seems like an obviously very silly thing to do. But with things like the NDA, my prior on OpenAI being deceptive to its own detriment is not that low.

I'm pretty sure it wouldn't forget the string.

Comment by hmys (the-cactus) on Bitter lessons about lucid dreaming · 2024-10-17T13:51:13.599Z · LW · GW

In my experience, the results come quite quickly, and it's interesting to remember your dreams. The time it takes is ~10 minutes a day.

I'm not gonna say it doesn't take any effort. It can be hard to do it if you are tired in the morning, but I disagree with the characterization that it takes "a lot" of effort.

Outside of studying/work, I exercise every day, do Anki cards every day, and try to make a reasonably healthy dinner every day. Each of those activities individually takes ~10x the cognitive effort and willpower that dream journaling does (for me).

Comment by hmys (the-cactus) on Bitter lessons about lucid dreaming · 2024-10-17T08:11:35.663Z · LW · GW

Maybe I'm a unique example, but none of this matches my experience at all.

I was able to have lucid dreams relatively consistently just by dream journaling and doing reality checks. WILD was quite difficult to do, because you have to walk a fine line, keeping yourself in a half-asleep state while carrying out instructions that require a fair bit of metacognitive awareness. But once you get the hang of it, you can do that pretty consistently as well, without much time commitment.

That lucid dreams don't offer much more than traditional entertainment also seems (obviously?) false to me. People use VR to make traditional entertainment more immersive, and LDs are far more immersive than that, and less limited than video games are.

They're also just a really interesting psychological phenomenon. The process is fun. If you find yourself in a lucid dream, it's a strange situation. Testing things out, like checking how well your internal physics simulation engine works, is really fun. Or just walking around and seeing what your subconscious generates is very fun, and very different from just imagining random stuff. Trying to meditate, and observing how your mind works differently in a dream compared with waking reality, is interesting. Seeing how extreme/vivid a sensation you can generate in a dream is fun, like trying to see if you can get yourself to feel pain, or how loud a sound you can make.

Galantamine and various supplements all did nothing for me. 

The only thing I agree with is the habituation effect. But like, that's how many things work: you eventually get bored of stuff / feel you've exhausted all the low-hanging fruit.

Comment by hmys (the-cactus) on Bitter lessons about lucid dreaming · 2024-10-17T07:50:20.607Z · LW · GW

Can't you just keep a dream journal? I find if I do that consistently right upon waking up, I'm able to remember dreams quite well.

Comment by hmys (the-cactus) on My 10-year retrospective on trying SSRIs · 2024-09-23T06:57:50.847Z · LW · GW

I've used SSRIs for maybe 5 years, and I think they've been really useful, with more or less unwavering efficacy and essentially no negative effects. The only exception is that they've non-negligibly lowered my libido, but to be honest, I don't mind it that much.

Also, the few times I've had to go without them for a while (travelling and being very stupid about not bringing enough), the withdrawal effects were quite strange and somewhat scary.

I also feel they had some very strange positive effects. Like, I think they made my reaction time improve by quite a bit, although it could be something random coinciding with starting SSRIs, or just me being confused; I haven't tested it. On humanbenchmark I score around the same now as I did in high school, but I feel like I can catch falling things with much better regularity, and this was an almost immediate effect after starting.

Comment by hmys (the-cactus) on A Longlist of Theories of Impact for Interpretability · 2024-05-06T21:03:43.549Z · LW · GW

I feel like the biggest issue with aligning powerful AI systems is that nearly all the features we'd like these systems to have, like being corrigible, not being deceptive, having values aligned with ours, etc., are properties we are currently unable to state formally. They are clearly real properties: humans can agree on examples of non-corrigibility, misalignment, and dishonesty when shown examples of actions AIs could take. But we can't put them in code or a program specification, and consequently can't reason about them very precisely, test whether systems have them or not, etc.

One reason I'm very bullish on mechinterp is that it seems like the only natural pathway towards making progress on this. Transformers trained with RLHF do have "tendencies" and proto-values in a sense. Figuring out how those proto-desires are represented internally, really understanding it, will, I believe, shed a lot of light on how values form in transformers, will necessarily entail getting a solid formal framework for reasoning about these processes, and will put the notions of alignment on much firmer ground. The same goes for the other features. Models already show deceptive tendencies. In the process of developing a deep mechinterp understanding of that, I believe we'd gain a better understanding of how deception in a neural net can be modeled formally, which would allow us to reason about it infinitely better.
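
As a toy illustration of the kind of experiment I have in mind (everything here is made up: the "activations" are synthetic, and the layer width and labels are placeholders), the simplest version is a linear probe trained to test whether a deception-like property is linearly represented in a model's hidden states:

```python
# Minimal sketch, not a real experiment: assumes you already have hidden
# activations for prompts labeled deceptive vs. honest. Synthetic data stands
# in for them here, with a "deception" feature injected along one direction.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_model = 512                                   # hypothetical residual-stream width
n_samples = 2000

true_direction = rng.normal(size=d_model)       # pretend feature direction
labels = rng.integers(0, 2, size=n_samples)     # 1 = "deceptive"
activations = rng.normal(size=(n_samples, d_model))
activations += np.outer(labels, true_direction)  # inject the feature

# A linear probe: if it cleanly separates the classes, the property is
# (at least roughly) linearly represented in these activations.
probe = LogisticRegression(max_iter=1000).fit(activations, labels)
print("probe accuracy:", probe.score(activations, labels))
```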

(I mean, someone with a 300 IQ might come along and just galaxy-brain all this from first principles, but quite galaxy-brained people have tried already. The point is that if mechinterp were developed to a sophisticated enough level, then in addition to all the good things listed already, it would shed a lot of conceptual clarity on many of the key notions, which we are currently stuck reasoning about on an informal level. And I think we will get there through incremental progress, without having to hope someone just figures it out by thinking really hard and having an Einstein-tier insight.)

Comment by hmys (the-cactus) on ACX Covid Origins Post convinced readers · 2024-05-03T11:49:55.916Z · LW · GW

https://www.richardhanania.com/p/if-scott-alexander-told-me-to-jump

Comment by hmys (the-cactus) on "Deep Learning" Is Function Approximation · 2024-03-23T17:22:37.219Z · LW · GW

Other people were commending your tabooing of words, but I feel using terms like "multi-layer parameterized graphical function approximator" fails to do that, and makes matters worse because it leads to non-central-fallacy-ing. It would have been more appropriate to use a term like "magic" or "blipblop". Calling something a function approximator leads readers to carry a lot of associations into their interpretation that probably don't apply to deep learning, since deep learning is a very specific example of function approximation that deviates from the prototypical examples in many respects. (I think when you say "function approximator", the image that pops into most people's heads is fitting a polynomial to a set of datapoints in R^2.)
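
For concreteness, here is a minimal sketch of that prototypical picture: fitting a low-degree polynomial to some made-up noisy points in R^2 with numpy.

```python
# Toy example of the "prototypical" function approximator: least-squares
# polynomial fitting on fabricated 2D data.
import numpy as np

rng = np.random.default_rng(0)
xs = np.linspace(-1.0, 1.0, 20)
ys = np.sin(3 * xs) + 0.1 * rng.normal(size=xs.shape)  # noisy samples of a target function

coeffs = np.polyfit(xs, ys, deg=5)   # fit a degree-5 polynomial by least squares
approx = np.poly1d(coeffs)           # callable polynomial approximation

print(approx(0.5))                   # evaluate the fitted approximation at a new point
```

This is the sense of "function approximation" most readers default to, which is exactly why the term smuggles in associations that don't carry over to deep learning.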

Calling something a function approximator is only meaningful if you make a strong argument for why a function approximator can't (or at least is systematically unlikely to) give rise to specific dangerous behaviors or capabilities. But I don't see you giving such arguments in this post; maybe I did not understand it. In either case, you can read posts like Gwern's "Tools want to be agents" or Yudkowsky's writings explaining why goal-directed behavior is a reasonable thing to expect to arise from current ML, and you can replace every instance of "neural network" / "AI" with "multi-layer parameterized graphical function approximator", and I think you'll find that all the arguments make just as much sense as they did before (modulo some associations seeming strange, but like I said, I think that's because there is some non-central-fallacy-ing going on).