Comments
Can we bet karma?
Edit: sarcasm
Hmm, mixed agree/disagree. Scale probably won't work, algorithms probably would, but I don't think it's going to be that quick.
Namely, I think that if a company struggling with fixed capital costs could accomplish much more, much quicker, on just the salary expenses of the top researchers they already have, they'd have done it, or at least given it a good try.
I'm at 5 percent that a serious switch to algorithms would result in AGI within 2 years. You might be better read than me on this, so I'm not quite taking side bets right now!
I think algorithmic progress is doing some heavy lifting in this model. If we had a future textbook on AGI we could probably build one, but AI is kinda famous for minor, simple things just not being implemented despite all the parts being there.
See how long it took ReLU activations to displace sigmoid activations.
If we're bottlenecked on algorithms alone, is there a reason that isn't a really bad bottleneck?
I haven't had warm receptions when critiquing points, which has frustratingly left me with bad detection for when I'm being nasty, so if I sound thorny it's not my intent.
One place I think you might have misstepped is the FrontierMath questions: the quotes you've heard are almost certainly about tier 3 questions, the hardest ones, meant for math researchers in training. The mid tier is grad-student-level problems and tier 1 is bright-high-schooler-to-undergrad-level problems:
Tier 1: 25% of the test
Tier 2: 50% of the test
Tier 3: 25% of the test
o3 got 25%, probably answering none of the hard questions and suspiciously matching almost exactly the proportion of easy questions. From what I've seen, there is some disagreement about whether tier 2 questions are consistently harder than tier 1 questions.
Regardless, some (especially easier) problems are of the sort that can be verified and have explicitly been said to have instantaneously recognizable solutions. This is not an incredibly flattering picture of o3.
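To spell out the arithmetic I'm gesturing at (my own back-of-envelope framing, not anything from the benchmark authors):

```latex
% Overall score as a function of per-tier solve rates p1, p2, p3,
% weighted by the 25/50/25 tier split (my framing, not FrontierMath's):
\[
\text{score} = 0.25\,p_{1} + 0.50\,p_{2} + 0.25\,p_{3}
\]
% p1 = 1, p2 = p3 = 0 gives exactly 25%, which is the reading I'm suspicious of,
% though e.g. p1 = 0.6, p2 = 0.2, p3 = 0 also gives 25%, so the total alone can't settle it.
```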
THIS IS THE END OF WHERE I THINK YOU WERE MISTAKEN, TEXT PAST THIS IS MORE CONJECTURE AND/OR NOT DIRECTLY REFUTING YOU
The ARC test looks like it was taken by an overfit model. If the test creators are right, then the 85 percent came from a tuned model, probably spamming candidate answers it could verify, and from what I understand it was trained on 75 percent of the questions, so part of that score looks like memorization plus a mildly okay score on the 25 percent that was held out as test data.
And this part is damning: the ARC-AGI-2 test, the successor to the first one, made by the same people, gets a 95 percent pass rate from humans (so it's easier for humans than the first test's 85 percent), but o3's score dropped to around 30%: a 55-point drop, and now 65 points below human, on a similar test made by the same people.
Let me be clear: if that isn't VERY inaccurate, then this is irrefutably a cooked test and o3 is overfit to the point of invalidating the results for any kind of generalizability.
There are other problems, like the fact that this pruning search method is really, really bad for some problems, and that it seems to ride on validation being somewhat easy in order to work at all, but that's not material to the benchmarks.
I can cite sources if these points turn out to be important, not obviously incorrect, etc.; I might write my first post about it if I'm digging that much!
Ah wait, I was reading it wrong. I thought each tick was an order of magnitude; that looks to be standard notation for a log scale. Mischief managed.
Maybe a dumb question, but those log-scale graphs have uneven ticks on the x-axis; is there a reason they structured it like that beyond trying to draw a straight line? I suspect there is a good reason and it's not dishonesty, but this does look like something one would do if one wanted to exaggerate the slope.
I do not have a gauge for how much I'm actually bringing to this convo, so you should weigh my opinion lightly, however:
I believe your third point kinda nails it. There are models for gains from collective intelligence (groups of agents collaborating), and the benefits of collaboration bottleneck hard on your ability to verify which outputs from the collective are the best, and even then the dropoff happens pretty quickly as more agents collaborate.
10 people collaborating with no communication issues and accurate discrimination between good and bad ideas do better than a lone person on some tasks, and 100 do better still.
You do not see jumps like that moving from 1,000 to 1,000,000 unless you set unrealistic variables.
I think inference-time scaling probably works in a similar way: dependent on discrimination between right and wrong answers, and falling off steeply as inference time increases.
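If it helps, here's a tiny toy simulation of that intuition (entirely my own toy model with made-up distributions and noise levels, not from any collective-intelligence paper): best-of-n gains flatten quickly, and a noisy judge caps them well before n gets large.

```python
# Toy model: n agents each propose a solution of random quality;
# a judge (perfect or noisy) picks the apparent best, and we track
# the true quality of what gets picked.
import random

def best_of_n(n, judge_noise, trials=2000):
    """Average true quality of the candidate the judge selects out of n proposals."""
    total = 0.0
    for _ in range(trials):
        qualities = [random.gauss(0.0, 1.0) for _ in range(n)]
        # The judge sees quality plus noise and keeps the apparent best.
        pick = max(range(n), key=lambda i: qualities[i] + random.gauss(0.0, judge_noise))
        total += qualities[pick]
    return total / trials

for n in (1, 10, 100, 1000):
    print(f"n={n:4d}  perfect judge: {best_of_n(n, 0.0):.2f}   noisy judge: {best_of_n(n, 2.0):.2f}")
```

Going from 1 to 10 agents buys a lot; going from 100 to 1000 buys very little, and with a noisy judge even less.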
My understanding is that o3 is similar to o1, but probably with some specialization to make long chains of thought stay coherent? The cost per token, from leaks I've seen, is the same as o1's; it came out very quickly after o1, and o1 was bizarrely better at math and coding than 4o.
Apologies if this was no help, responding with the best intentions
I sense that my quality of communication diminishes past this point; I should get my thoughts together before speaking too confidently.
I believe you're right that we do something similar to LLMs (loosely, analogously); see
https://www.lesswrong.com/posts/i42Dfoh4HtsCAfXxL/babble
(I need to learn markdown)
My intuition is still LLM-pessimistic, but I'd be excited to see good practical uses; this seems like tool AI, and that makes my existential dread easier to manage!
Thank you for the warm reply; it's nice, and it's also good feedback that I didn't do anything explicitly wrong with my post.
It will be VERY funny if this ends up being essentially the o1 model with some tinkering to help it cycle through a question multiple times and verify the best answer, or something banal like that. Wish they didn't make us wait so long to test that :/
Small nudge: the questions have difficulty tiers of 25% easy, 50% medium, and 25% hard, with easy being undergrad/IMO difficulty and hard being the sort you would give to a researcher in training.
The 25% accuracy gives me STRONG indications that it just got the easy ones, and the starkness of this cutoff makes me think there is something categorically different about the easy ones that makes them MUCH easier to solve: being more close-ended, easier to verify, or just leaked into the dataset in some form.
First post; feel free to meta-critique my rigor here, as I am not sure what is mandatory, expected, or superfluous for a comment under a post. I'm studying computer science but have no degree yet. I can pull specific citations if necessary, but...
these benchmarks don't feel genuine.
Chollet indicated in his piece:
Furthermore, early data points suggest that the upcoming ARC-AGI-2 benchmark will still pose a significant challenge to o3, potentially reducing its score to under 30% even at high compute (while a smart human would still be able to score over 95% with no training)
The model was tuned to the ARC-AGI test and got a great score, but then faceplanted on a reasoning test apparently easier for humans than the first one, a test that couldn't have been adversarially designed to stump o3. I would have expected this to immediately expose it as horrifically overfit, but few people seem to be homing in on that, so maybe I don't understand something?
Second, the FrontierMath problems appear to be structured so that 25% of them are the sort that clever high schoolers with bright futures in mathematics could answer if they studied: International Math Olympiad or undergrad-level questions. We don't know EXACTLY which questions it answered correctly, but the score is almost exactly 25%, and I suspect those are almost entirely the lower-tier questions; I wouldn't be surprised to hear that the questions (or similar ones) were in the training set. Perhaps the lowest-tier questions were formulated with an operational philosophy that didn't prioritize guarding against leaked data?
Third, the Codeforces Elo doesn't mean anything to me. I just can't take it seriously, unless someone thinks that existing models are already mid-tier competitive SWEs? Similar benchmark, similarly meaningless; dismissed until someone shows me that these can actually deliver on what other models have demonstrably exaggerated.
Fourth, the cost is obscene: thousands of dollars per task, with indications (going by cost per token) that this model is very similar to o1. This looks less like a strong model with way more parameters and more like the same model doing a mind-boggling amount of thinking. It's probably something like FunSearch: a tool bag of programs that an LLM combines and tries to gauge the effectiveness of, brute-forced until it gets an answer it can verify. That seems useful, but it would only work on close-ended questions with answers that are easy to verify; either way, this wouldn't really be intelligence of the kind I had imagined looking for.
This would PERFECTLY explain the failure on the ARC-AGI-2 benchmark: the bag of tools it would need is different; it wasn't tuned to the new test and came with the wrong tool bag. Maybe that could be fixed, but if my model of how this AI works is right, then the complexity of tasks would grow by something like O(n!), with n being the number of "tools" it needs. I'm probably wrong on the exact form, but something LIKE that is probably true.
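To make the "wrong tool bag" picture concrete, here's a hedged sketch of the generate-and-verify loop I'm imagining; this is my guess at the shape of the thing, not OpenAI's actual method, and the tool bag and verifier below are toy stand-ins. The number of length-k compositions from n tools is n!/(n-k)!, which is where an O(n!)-ish blowup intuition comes from.

```python
# Sketch of a FunSearch-flavored generate-and-verify loop (my guess at the
# shape of the approach, NOT OpenAI's actual method). The "tools" here are
# toy arithmetic lambdas standing in for generated program fragments.
import itertools

def verify(value, target):
    """Stand-in verifier -- the whole approach leans on checks like this being cheap."""
    return value == target

def solve(tools, target, max_depth=3):
    """Brute-force compositions of tools; the search space explodes with len(tools)."""
    attempts = 0
    for depth in range(1, max_depth + 1):
        for combo in itertools.permutations(tools, depth):
            attempts += 1
            value = 0
            for tool in combo:  # apply the chosen tools in sequence, starting from 0
                value = tool(value)
            if verify(value, target):
                return combo, attempts
    return None, attempts

# Toy tool bag; a real system would be proposing code, not arithmetic.
tools = [lambda x: x + 3, lambda x: x * 2, lambda x: x - 1, lambda x: x ** 2]
combo, attempts = solve(tools, target=36)
print("solved:", combo is not None, "after", attempts, "attempts")
```

The loop only terminates because verify() is cheap and unambiguous; hand it a task where checking an answer is as hard as producing one and this whole strategy stops working.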
LeCun also seems confident, on Threads, that this is NOT an LLM: it uses an LLM, but something else is going on. That perfectly matched my "oh, this is FunSearch" intuition. My caveat is that this might all be handled "in house" inside an LLM, but the restrictions on what it could do seem very real.
Am I critically wrong on enough points here that I should seriously rethink my intuition?