Posts

At 87, Pearl is still able to change his mind 2023-10-18T04:46:29.339Z
Contra LeCun on "Autoregressive LLMs are doomed" 2023-04-10T04:05:10.267Z
Bayesian optimization to find molecules that bind to proteins 2023-03-13T18:17:44.812Z

Comments

Comment by rotatingpaguro on Changes in College Admissions · 2024-04-24T23:39:56.175Z · LW · GW

After the events of April 2024, I cannot say that for Columbia or Yale. No just no.

What are these events?

Comment by rotatingpaguro on Anthropic AI made the right call · 2024-04-15T01:52:48.775Z · LW · GW

Your argument would imply that competition begets worse products?

Comment by rotatingpaguro on Metascience of the Vesuvius Challenge · 2024-03-31T06:08:06.983Z · LW · GW

One big prize, or many small prizes like here?

Comment by rotatingpaguro on What is the best argument that LLMs are shoggoths? · 2024-03-18T01:00:15.778Z · LW · GW

First thoughts:

  • Context length is insanely long
  • Very good at predicting the next token
  • Knows many more abstract facts

These three things are all instances of being OOM better at something specific. If you consider the LLM somewhat human-level at the thing it does, this suggests that it's doing it in a way which is very different from what a human does.

That said, I'm not confident about this; I can sense there could be an argument that this counts as human but ramped up on some stats, and not an alien shoggoth.

Comment by rotatingpaguro on More people getting into AI safety should do a PhD · 2024-03-15T01:56:18.798Z · LW · GW

If I had to give only one line of advice to a randomly sampled prospective grad student: you don't actually have to do what the professor says.

Comment by rotatingpaguro on Richard Ngo's Shortform · 2024-03-11T18:52:01.497Z · LW · GW

Ok. Then I'll say that randomly assigned utilities over full trajectories are beyond wild!

The basin of attraction just needs to be large enough. AIs will intentionally be created with more structure than that.

Comment by rotatingpaguro on Richard Ngo's Shortform · 2024-03-11T01:28:52.940Z · LW · GW

I read the section you linked, but I can't follow it. Anyway, here is its concluding paragraph:

Conclusion: Optimal policies for u-AOH will tend to look like random twitching. For example, if you generate a u-AOH by uniformly randomly assigning each AOH utility from the unit interval [0, 1], there's no predictable regularity to the optimal actions for this utility function. In this setting and under our assumptions, there is no instrumental convergence without further structural assumptions.

From this alone, I get the impression that he hasn't proved that "there isn't instrumental convergence", but that "there isn't a totally general instrumental convergence that applies even to very wild utility functions".

Comment by rotatingpaguro on Shortform · 2024-03-11T01:12:37.675Z · LW · GW

It's AI-based, so my guess is that it uses a lot of somewhat superficial correlates that could be gamed. I expect that if it went mainstream it would be Goodharted.

I expect Goodhart would hit particularly bad if you were doing the kind of usage I guess you are implying, which is searching for a few very well selected people. A selective search is a strong optimization, and so Goodharts more.

A more concrete example I have in mind, which maybe applies to the technology right now: there are people who are good at lying to themselves.

Comment by rotatingpaguro on Why correlation, though? · 2024-03-06T21:40:49.077Z · LW · GW

Yes, in general the state of the art is more advanced than looking at correlations.

You just need to learn when using correlations makes sense. Don't assume that everyone is using correlations blindly; Statistics PhDs most likely decide whether to use them or not based on context and know the limited ways in which what they say applies.

Correlations make total sense when the distribution of the variables is close to multivariate Normal. The covariance matrix, which can be written as a combination of variances + correlation matrix, completely determines the shape of a multivariate Normal.

If the variables are not Normal, you can try to transform them to make them more Normal, using both univariate and multivariate transformations. This is a very common Statistics tool. Basic example: Quantile normalization.
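A minimal sketch of what I mean, assuming numpy/scipy (the lognormal toy data is my own illustration): Pearson correlation on raw heavy-tailed data understates the dependence, while the same correlation after a univariate quantile normalization of each margin behaves as expected.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 10_000
x = rng.normal(size=n)
y = np.exp(x + 0.5 * rng.normal(size=n))  # monotone in x, but heavily non-Normal

def to_normal_scores(v):
    """Univariate quantile normalization: map empirical ranks to standard Normal quantiles."""
    ranks = stats.rankdata(v)
    return stats.norm.ppf(ranks / (len(v) + 1))

# The heavy tail of y distorts the raw Pearson correlation...
print("correlation on raw data:     ", round(np.corrcoef(x, y)[0, 1], 3))
# ...but after transforming both margins toward Normality it is meaningful again.
print("correlation on Normal scores:", round(np.corrcoef(to_normal_scores(x), to_normal_scores(y))[0, 1], 3))
```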

Comment by rotatingpaguro on Some costs of superposition · 2024-03-03T17:12:20.518Z · LW · GW

As we get closer to maxing out 

This is , right? (Feel free to delete this comment.)

Comment by rotatingpaguro on Counting arguments provide no evidence for AI doom · 2024-02-28T05:10:43.886Z · LW · GW

There is also a hazy counting argument for overfitting:

  1. It seems like there are “lots of ways” that a model could end up massively overfitting and still get high training performance.
  2. So absent some additional story about why training won’t select an overfitter, it feels like the possibility should be getting substantive weight.

While many machine learning researchers have felt the intuitive pull of this hazy overfitting argument over the years, we now have a mountain of empirical evidence that its conclusion is false. Deep learning is strongly biased toward networks that generalize the way humans want— otherwise, it wouldn’t be economically useful.

I don't know NN history well, but I have the impression good NN training is not trivial. I expect that the first attempts at NN training went bad in some way, including overfitting. So, without already knowing how to train an NN without overfitting, you'd get some overfitting in your experiments. The fact that now, after someone has already poured their brain juice over finding techniques that avoid the problem, you don't get overfitting, is not evidence that you shouldn't have expected overfitting before.

The analogy with AI scheming is: you don't already know the techniques to avoid scheming. You can't use as counterargument a case in which a problem has already deliberately been solved. If you take that same case, and put yourself in the shoes of someone who doesn't already have the solution, you see you'll get the problem in your face a few times before solving it.

Then, it is a matter of whether it works like Yudkowsky says, that you may only get one chance to solve it.


The title says "no evidence for AI doom in counting arguments", but the article mostly talks about neural networks (not AI in general), and the conclusion is

In this essay, we surveyed the main arguments that have been put forward for thinking that future AIs will scheme against humans by default. We find all of them seriously lacking. We therefore conclude that we should assign very low credence to the spontaneous emergence of scheming in future AI systems— perhaps 0.1% or less.

"main arguments": I don't think counting arguments completely fill up this category. Example: the concept of scheming originates from observing it in humans.

Overall, I have the impression of some overstatement. It can also be that I'm missing some previous discussion context/assumptions, so other background theory from you may say "humans don't matter as examples", and also "AI will be NNs and not other things".

Comment by rotatingpaguro on China-AI forecasts · 2024-02-26T19:51:20.517Z · LW · GW

It’s quite successfully managed to urbanize it’s population and now seems to have reached the Lewis turning point where young people who try to leave their villages to find work cities often don’t find it and have to stay in their villages, in the much lower productivity jobs.

I can't follow this. Wikipedia says that

The Lewis turning point is a situation in economic development where surplus rural labor is fully absorbed into the manufacturing sector. This typically causes agricultural and unskilled industrial real wages to rise.

So it looks like at the Lewis point there's over-demand for workers, so they can find jobs. Instead you describe it as if there's over-supply: the manufacturing sector does not need any more workers, so they can't find jobs.

Comment by rotatingpaguro on Can we get an AI to do our alignment homework for us? · 2024-02-26T18:58:40.748Z · LW · GW

There's only one way to know!

</joking> <=========

Comment by rotatingpaguro on Jailbreaking GPT-4 with the tool API · 2024-02-21T21:26:23.521Z · LW · GW

My guess is that things which are forbidden but somewhat complex (like murder instructions) have not really been hammered out from the base model as much as more easily identifiable things like racial slurs.

It should be easier to train the model to just never say the Latin word for black than to recognize instances of sequences of actions that lead to a bad outcome.

The latter requires more contextual evaluation, so maybe that's why the safety training has not generalized well to the tool usage behaviors; is "I'm using a tool" a different enough context that "murder instructions" + "tool mode" should count as a case different from "murder instructions" alone?

Comment by rotatingpaguro on Ten Modes of Culture War Discourse · 2024-02-10T18:05:04.281Z · LW · GW

IF/IE (Yandere/Tsundere): Alice (the Yandere) pretends to like Bob but in fact is trying to manipulate him into doing what she wants, while Bob (the Tsundere) pretends to hate Alice but in fact is totally on-board with her agenda.

  • This description is a bit of a joke - I can't even imagine what this mode would look like, let alone think of any real-world examples.

Maybe love things? Or female things?

Comment by rotatingpaguro on AI #49: Bioweapon Testing Begins · 2024-02-05T06:37:06.287Z · LW · GW

I want to give that conclusion a Bad Use of Statistical Significance Testing. Looking at the experts, we see a quite obviously significant difference. There is improvement here across the board, this is quite obviously not a coincidence. Also, ‘my sample size was not big enough’ does not get you out of the fact that the improvement is there – if your study lacked sufficient power, and you get a result that is in the range of ‘this would matter if we had a higher power study’ then the play is to redo the study with increased power, I would think?

My immediate take on seeing the thing as you report it:

  • Please say whether the bars are 68%, 90%, or 95% intervals.
  • Totally agree on sidestepping significance and instead asking "does the posterior distribution over possible effect sizes include an effect size I consider relevant?" (a toy sketch of that question follows this list).
  • "There is improvement here across the board, this is quite obviously not a coincidence.": I would need more details to feel that confident. The first thing I'd look at is the correlation structure of those scores; could it be that they are just repeating mostly the same information over and over?

Paper argues that transformers are a good fit for language but terrible for time series forecasting, as the attention mechanisms inevitably discard such information. If true, then there would be major gains to a hybrid system, I would think, rather than this being a reason to think we will soon hit limits. It does raise the question of how much understanding a system can have if it cannot preserve a time series.

That paper got a reply one year later: "Yes, Transformers are Effective for Time Series Forecasting (+ Autoformer)" (haven't read either one).

Comment by rotatingpaguro on "Genlangs" and Zipf's Law: Do languages generated by ChatGPT statistically look human? · 2024-01-31T21:42:33.622Z · LW · GW

I think that instead of considering random words as a baseline reference (Fig. 2), you should take the alphabet plus the space symbol, generate a random i.i.d. sequence of them, and then index words in that text. This won't give a uniform distribution over words. It is total gibberish, but I expect it would follow Zipf's law all the same, based on these references I found on Wikipedia:

Wentian Li (1992), "Random Texts Exhibit Zipfs-Law-Like Word Frequency Distribution"

V. Belevitch (1959), "On the statistical laws of linguistic distributions"
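A minimal sketch of the baseline I have in mind, assuming numpy (the text length and alphabet are arbitrary choices of mine):

```python
# i.i.d. uniform letters plus space, segmented into "words"; check the
# rank-frequency relationship in log-log space (Zipf-like means roughly a straight line).
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)
alphabet = list("abcdefghijklmnopqrstuvwxyz ")  # 26 letters + space
text = "".join(rng.choice(alphabet, size=2_000_000))
words = text.split()

counts = np.array(sorted(Counter(words).values(), reverse=True))
ranks = np.arange(1, len(counts) + 1)

# Crude power-law fit; Li (1992) predicts a Zipf-like slope for exactly this setup.
slope, intercept = np.polyfit(np.log(ranks), np.log(counts), 1)
print(f"{len(words):,} words, {len(counts):,} distinct types, log-log slope ≈ {slope:.2f}")
```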

I'd also show an example of the "ChatGPT gibberish" produced.

Comment by rotatingpaguro on Making every researcher seek grants is a broken model · 2024-01-26T21:49:16.070Z · LW · GW

Do you think CERN is an example of what you want?

Comment by rotatingpaguro on Monthly Roundup #14: January 2024 · 2024-01-24T23:20:28.400Z · LW · GW

And Italy also definitely uses the green socket, and a larger version of the three-dots-in-line socket. Many sockets on the market just accommodate everything at once.

Comment by rotatingpaguro on Optimisation Measures: Desiderata, Impossibility, Proposals · 2024-01-22T02:54:21.732Z · LW · GW

I remembered this when I read the following excerpt in Meaning and Agency:

In Belief in Intelligence, Eliezer sketches the peculiar mental state which regards something else as intelligent:

Imagine that I'm visiting a distant city, and a local friend volunteers to drive me to the airport.  I don't know the neighborhood. Each time my friend approaches a street intersection, I don't know whether my friend will turn left, turn right, or continue straight ahead.  I can't predict my friend's move even as we approach each individual intersection - let alone, predict the whole sequence of moves in advance.

Yet I can predict the result of my friend's unpredictable actions: we will arrive at the airport. 
[...]
I can predict the outcome of a process, without being able to predict any of the intermediate steps of the process.

In Measuring Optimization Power, he formalizes this idea by taking a preference ordering and a baseline probability distribution over the possible outcomes. In the airport example, the preference ordering might be how fast they arrive at the airport. The baseline probability distribution might be Eliezer's probability distribution over which turns to take -- so we imagine the friend turning randomly at each intersection. The optimization power of the friend is measured by how well they do relative to this baseline. 

I think this can be a useful notion of agency, but constructing this baseline model does strike me as rather artificial. We're not just sampling from Eliezer's world-model. If we sampled from Eliezer's world-model, the friend would turn randomly at each intersection, but they'd also arrive at the airport in a timely manner no matter which route they took -- because Eliezer's actual world-model believes the friend is capably pursuing that goal.

So to construct the baseline model, it is necessary to forget the existence of the agency we're trying to measure while holding other aspects of our world-model steady. While it may be clear how to do this in many cases, it isn't clear in general. I suspect if we tried to write down the algorithm for doing it, it would involve an "agency detector" at some point; you have to be able to draw a circle around the agent in order to selectively forget it. So this is more of an after-the-fact sanity check for locating agents, rather than a method of locating agents in the first place.
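For concreteness, a hedged sketch of the "Measuring Optimization Power" recipe as I recall it: score the achieved outcome, take the baseline probability mass of outcomes at least that good, and take the negative log base 2. The driving numbers below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Baseline model: the friend turns randomly at each intersection, so travel time
# is drawn from some "random twitching" distribution (a toy geometric model here).
baseline_minutes = 3 * rng.geometric(p=0.05, size=100_000)

achieved_minutes = 25  # what actually happens: we reach the airport quickly

# Preference ordering: shorter is better, so "at least as good" means <= achieved.
mass_at_least_as_good = np.mean(baseline_minutes <= achieved_minutes)
bits = -np.log2(mass_at_least_as_good)
print(f"≈ {bits:.1f} bits of optimization relative to the random-turning baseline")
```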

Comment by rotatingpaguro on AI #48: Exponentials in Geometry · 2024-01-18T17:03:50.206Z · LW · GW

I guess the statement on the front page "Announcement: Hi, I think my best content is on Twitter/X right now" means that they do not keep the front page very up to date.

https://nitter.ktachibana.party/base_rate_times

Comment by rotatingpaguro on AI #48: Exponentials in Geometry · 2024-01-18T17:02:02.727Z · LW · GW

I also want to note that, while I consider such behavior to indeed be technically criminal under existing law, I do not actually want anyone to be thrown in jail for it, on reflection I remember that does not actually lead good places, and regret saying that. I stand by the rest of my statement.

Indeed I was surprised you wanted people jailed. It even came to my mind while I was waking up today.

Comment by rotatingpaguro on Even if we lose, we win · 2024-01-16T02:45:24.730Z · LW · GW

Therefore, it might be a good idea for all (or at least a large group of) alignment researchers to coordinate around pursuing the same specific alignment plan based on the result of a quantum RNG, or something like that.

From this I infer that you think the set of alignment strategies we would pick between by quantum dice covers much more of the space than the single strategy that seems best by general consensus.

My intuition tells me that if Clippy thought we had a  chance, this trick does not really move the needle.


Epistemic status: Updating on this comment and taking into account uncertainty about my own values, my credence in this post is around 50%.

Is this conditional on the "Assumptions" section, or marginal?

Comment by rotatingpaguro on Quick takes on "AI is easy to control" · 2024-01-14T22:37:01.484Z · LW · GW

I think the term is clear because it references the name and rule of the world-famous board game, where you can't use words from a list during your turn.

Comment by rotatingpaguro on Eliminating Cookie Banners is Hard · 2024-01-14T21:44:49.390Z · LW · GW

I think this is unlikely, because it is not in a website's interest to annoy its users, and they are not otherwise getting anything from bigger banners.

Comment by rotatingpaguro on AI Alignment Metastrategy · 2024-01-13T05:14:50.455Z · LW · GW

All of those fields use math but don't heavily rely on rigorously provable formulations of their problems.

Chicken and egg: is this evidence they are not mature enough to make friendly AI, or evidence that friendly AI can be made with that current level of rigor?

Comment by rotatingpaguro on AI Alignment Metastrategy · 2024-01-13T05:09:05.803Z · LW · GW

I heard that Ashkenazi Jews are 1 SD up on IQ, which is about the kind of improvement we are talking about with embryo selection. I do not have the impression they are bad towards other humans. Do you think otherwise?

To be clear, I am not trying to gotcha you with antisemitism, and I totally understand if you want to avoid discussion because this is a politically charged topic.

Comment by rotatingpaguro on [Request]: Use "Epilogenics" instead of "Eugenics" in most circumstances · 2024-01-13T05:03:01.284Z · LW · GW

I do think though, that we can agree that the amount of people who abort when they are warned of it is much higher than the percentage of people who are unhappy they got their surprise down syndrome kid?

Is this because people predict accurately whether they'd like a Down kid, or because everyone thinks it's bad but it's actually pleasant?

Or some middle ground. Taking all percentages at face value, my first guess is that there is some social pressure to avoid birthing Downs, so people over-abort, and those who do not are defiant because they are quite sure, and rightly so, that they are totally fine with a Down kid, so they end up with a high rate of happiness. This is compatible with aborting being the right choice for most people.

Rephrase:

  • Most people are better off aborting the Down.
  • A few people are not.
  • The majority sets the social pressure.
  • People in the grey area default to aborting due to such social pressure.
  • The non-aborters are thus only those who are sure enough of what they want to not follow social pressure.
  • Thus the non-aborters are a very well selected group of people actually better off with their Down kid.

Comment by rotatingpaguro on Eliminating Cookie Banners is Hard · 2024-01-13T04:21:37.835Z · LW · GW

I think this is a good demonstration of why companies generally do choose to stick with cookie banners even though they're annoying.

Why are they annoying?

Some websites—rare, delicate lotus flowers—bother me with a small, horizontal banner on the bottom of the page. When I click "accept", it actually goes away forever on that browser for that website.

Many others, instead, slap a bulky framed message in the middle of the page, possibly 2-4 seconds after most of the loading has finished, just to interrupt my initial interactions with the page in the most annoying way possible.

Is there a reason for that? Is it out-of-control, overconservative legal worry?

Comment by rotatingpaguro on Bayesians Commit the Gambler's Fallacy · 2024-01-10T22:15:20.135Z · LW · GW

But the point of the post is to use that as a simplified model of a more general phenomenon, one that should hook into your notions connected to "gambler's fallacy".

A title like yours is more technically defensible and closer to the math, but it gives up an important part. The bolder claim is actually there and intentional.

It reminds me of a lot of academic papers where it's very difficult to see what all that math is there for.

To be clear, I second making the title less confident. I think your suggestion goes too far in the other direction: it omits content.

Comment by rotatingpaguro on Reflective consistency, randomized decisions, and the dangers of unrealistic thought experiments · 2024-01-10T21:49:45.773Z · LW · GW

I don't see any inconsistency in beliefs. Initially, everyone thinks that the probability that the urn with 18 green balls is chosen is 1/2. After someone picks a green ball, they revise this probability to 9/10, which is not an inconsistency, since they have new evidence, so of course they may change their belief. This revision of belief should be totally uncontroversial. If you think a person who picks a green ball shouldn't revise their probability in this way then you are abandoning the whole apparatus of probability theory developed over the last 250 years. The correct probability is 9/10. Really. It is.

I don't like this way of arguing by authority and sheer repetition.

That said, I feel totally confused about the matter so I can't say whether I agree or not.

Comment by rotatingpaguro on A Land Tax For Britain · 2024-01-10T21:31:17.656Z · LW · GW

Ok, I agree that you have to normalize the number of vacant homes. The total number of homes is the largest denominator that makes sense. My doubt was if the denominator could be something smaller than the number of total homes.

In different words, my knowledge of the housing market is not sufficient to say if 2.7% counts as small or large. Why does it "seem really low"?

Analogous example that comes to my mind: if I am a male searching for a female mate, I prefer cities with higher female/male ratio. Say town A has 49-51, and town B has 51-49. Is a 2% difference large or small? I argue it could be large, for what matters for finding a mate: if most couples in a town are already locked, i.e., people in long term relationships, then the "free market" of dating is much more gender-skewed than a 2% difference.

To be concrete: say 90 people are paired, and 10 are single. Then removing the paired 45:45, the gender ratio within singles remains 4:6 in town A, and 6:4 in town B, i.e., in town A there are 2 single females for each 3 single males.

Thus the set that makes the ratio more intuitively "large" or "small" is the set of singles rather than the set of all people.
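The same arithmetic as a throwaway check (numbers as above; the code is my own):

```python
# 90 of 100 people paired into 45 mixed couples; the rest are single.
def singles(females, males, paired_people):
    couples = paired_people // 2
    return females - couples, males - couples

print("town A singles (F, M):", singles(49, 51, 90))  # (4, 6)
print("town B singles (F, M):", singles(51, 49, 90))  # (6, 4)
```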

Getting back to housing: maybe there is a smaller set containing all vacant homes, or even a more restrictive set to consider that contains only some vacant homes, that is more appropriate. I don't know though.

Comment by rotatingpaguro on Bayesians Commit the Gambler's Fallacy · 2024-01-07T17:40:38.032Z · LW · GW

I have the impression the OP is using "gambler's fallacy" as a conventional term for a strategy, while you are taking "fallacy" to mean "something's wrong". The OP does write about this contrast, e.g., in the conclusion:

Maybe the gambler’s fallacy doesn’t reveal statistical incompetence at all. After all, it’s exactly what we’d expect from a rational sensitivity to both causal uncertainty and subtle statistical cues.

So I think the adversative backbone of your comment is misdirected.

Comment by rotatingpaguro on A Land Tax For Britain · 2024-01-07T10:48:37.447Z · LW · GW

As I try to look into this more, I'm also finding that the vacancy rate seems really low in England. 676,304 vacant homes / 24.9 million total homes gives a vacancy rate of 2.7%

Why is the ratio (vacant homes/total homes) the right thing to look at, if a single metric is to be considered for argument?

Comment by rotatingpaguro on Game Theory without Argmax [Part 1] · 2023-12-28T19:45:41.924Z · LW · GW

I'd like to read your solution to exercise 6, could you add math formatting? I have a hard time reading latex code directly.

You can do that with the visual editor mode by selecting the math and using the contextual menu that appears automatically, or with $ in the Markdown editor.

There are $ in your comment, so I guess you inadvertently typed in visual mode using the Markdown syntax.

Comment by rotatingpaguro on Game Theory without Argmax [Part 1] · 2023-12-28T19:39:53.456Z · LW · GW

Ok, take 2.

If I understand correctly, what you want must be more like "restrict the domain of the task before plugging it into the optimiser," and less like "restrict the output of the optimiser."

I don't know how to do that agnostically, however, because optimisers in general have the domain of the task baked in. Indeed the expression for a class of optimisers is , with  in it.

Considering better-than-average optimisers from your example, they are a class with a natural notion of "domain of the task" to tweak, so I can naturally map any initial optimiser to a new one with a restricted task domain: , by taking the mean over .

But given an otherwise unspecified , I don't see a natural way to define a .

Assuming there's no more elegant answer than filtering for that (), then the question must be: is there another minimally restrictive class of optimisers with such a natural notion, which is not the one with the "detested element"  already proposed by the OP?

Try 1: consequentialist optimisers, plus the assumption , i.e., the legal moves do not restrict the possible payoffs. Then, since the optimiser picks actions only through , for each r I can delete illegal actions from the preimage, without creating new broken outputs. However, this turns out to be just filtering, so it's not an interesting case.

Try 2: the minimal distill of try 1 is that the output either is empty or contains legal moves already, and then I filter, so yeah not an interesting idea.

Try 3: invariance under permutation of something? A task invariant under permutation of  is just a constant task. An optimiser "invariant under permutation of " does not even mean anything.

Try 4: consider a generic map . This does not say anything, it's the baseline.

Try 5: analyse the structure of a specific example. The better-than-average class of optimisers is . It is consequentialist and context-independent. I can't see how to generalize something mesospecific here.

Time out.

Comment by rotatingpaguro on Why does expected utility matter? · 2023-12-27T22:52:52.093Z · LW · GW

if the language is not saying anything that is meaningful to my intuition.

When you learn a new language, you eventually form new intuitions. If you stick to existing intuitions, you do not grow. Current intuition does not generalize to the utmost of your potential ability.

When I was a toddler, I never proceeded to grow new concepts by rigorous construction; yet I ended up mostly knowing what was around me. Then, to go further, I employed abstract thought, and had to mold and hew my past intuitions. Some things I intuitively perceived turned out likely false; hallucinations.

Later, when I was learning Serious Math, I forgot that learning does not work by a straight stream of logic and proofs, and instead demanded that what I was reading both match my intuitions, and be properly formal and justified. Quite the ask!

The problem with this view of utility "just as a language"

My opinion is that if you think the problem lies in seeing it as a language, a new lens on the world, specifically because the new language does not match your present intuition, then you are pointing at the wrong problem.

If instead you meant to prosaically plead for object-level explanations that would clarify, oh uhm sorry I don't actually know, I'm an improvised teacher, I actually have no clue, byeeeeee

Comment by rotatingpaguro on Why does expected utility matter? · 2023-12-27T18:56:35.030Z · LW · GW

This is not a complete answer, it's just a way of thinking about the matter that was helpful to me in the past, and so might be to you too:

Saying that you ought to maximise the expected value of a real valued function of everything still leaves a huge amount of freedom; you can encode what you want by picking the right function over the right things.

So you can think of it as a language: a conventional way of expressing decision strategies. If you can write a decision strategy as $a^* = \operatorname{argmax}_a \mathbb{E}[U(\text{outcome}) \mid a]$ for some real-valued $U$, then you have written the problem in the language of utility.

Like any generic language, this won't stop you from expressing anything in general, but it will make certain things easier to express than others. If you know at least two languages, you'll have sometimes encountered short words that can't be efficaciously translated to a single word in the other language.
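To make the "expressing anything" point concrete, here is one standard construction (my own example; $h$ is a full history and $\pi_0$ any fixed policy): an indicator utility over histories recovers $\pi_0$ as an expected-utility maximiser.

```latex
\[
  U(h) =
  \begin{cases}
    1 & \text{if every action in } h \text{ agrees with } \pi_0,\\
    0 & \text{otherwise,}
  \end{cases}
  \qquad\Longrightarrow\qquad
  \pi_0 \in \operatorname*{arg\,max}_{\pi} \, \mathbb{E}_{\pi}\!\left[U(h)\right],
\]
since $\pi_0$ attains $\mathbb{E}_{\pi_0}[U(h)] = 1$ and no policy can do better.
```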

Similarly, thinking that you ought to maximise expected utility, and then asking "what is my utility then?", naturally suggests to your mind certain kinds of strategies rather than others.

Some decisions may need many epicycles to be cast as utility maximisation. That this indicates a problem with utility maximisation, with the specific decision, or with the utility function, is left to your judgement.

There is currently not a theory of decision that just works for everything, so there is not a totally definitive argument for maximum expected utility. You'll have to learn when and how you can not apply it with further experience.

Comment by rotatingpaguro on Generalizing Koopman-Pitman-Darmois · 2023-12-24T22:05:23.015Z · LW · GW

I already understood that because you explain it in the text; the further doubt I have concerns only the "only if" part: given that a -dimensional sufficient statistic exists by assumption, is  also a -dimensional sufficient statistic or not?

I think not, because it should not be able to capture what goes on with the  variables, that's hidden in the completely arbitrary  term.

This annoys me because I can't see the form of the sufficient statistic like in the i.i.d. case.

Comment by rotatingpaguro on Generalizing Koopman-Pitman-Darmois · 2023-12-24T20:29:53.374Z · LW · GW

In the "Independent But Non-Identical KPD" version, the term  is not a sufficient statistic in general, right?

(I could probably figure it out getting into the weeds of the proof, but I did not get it by reading once.)

Comment by rotatingpaguro on OpenAI: Preparedness framework · 2023-12-18T22:54:42.260Z · LW · GW

OpenAI's basic framework:

  1. Do dangerous capability evals at least every 2x increase in effective training compute. This involves fine-tuning for dangerous capabilities, then doing evals on pre-mitigation and post-mitigation versions of the fine-tuned model. Score the models as Low, Medium, High, or Critical in each of several categories.
    1. Initial categories: cybersecurity, CBRN (chemical, biological, radiological, nuclear threats), persuasion, and model autonomy.
  2. If the post-mitigation model scores High in any category, don't deploy it until implementing mitigations such that it drops to Medium.
  3. If the post-mitigation model scores Critical in any category, stop developing it until implementing mitigations such that it drops to High.
  4. If the pre-mitigation model scores High in any category, harden security to prevent exfiltration of model weights. (Details basically unspecified for now.)
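To check my reading, here is the gating logic of that summary as I understand it (the category names come from the framework; the data structures and function are my own illustration, not OpenAI's):

```python
LEVELS = {"Low": 0, "Medium": 1, "High": 2, "Critical": 3}

def preparedness_actions(pre_mitigation: dict, post_mitigation: dict) -> list:
    """Scores are per-category risk levels, e.g. {"cybersecurity": "High", ...}."""
    worst_pre = max(LEVELS[v] for v in pre_mitigation.values())
    worst_post = max(LEVELS[v] for v in post_mitigation.values())
    actions = []
    if worst_post >= LEVELS["Critical"]:
        actions.append("pause further development until mitigations bring it down to High")
    if worst_post >= LEVELS["High"]:
        actions.append("do not deploy until mitigations bring it down to Medium")
    if worst_pre >= LEVELS["High"]:
        actions.append("harden security against exfiltration of model weights")
    return actions or ["proceed; re-run evals at the next 2x increase in effective training compute"]

print(preparedness_actions(
    pre_mitigation={"cybersecurity": "High", "CBRN": "Medium", "persuasion": "Low", "model autonomy": "Low"},
    post_mitigation={"cybersecurity": "Medium", "CBRN": "Medium", "persuasion": "Low", "model autonomy": "Low"},
))
```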

This outlines a very iterative procedure. If models started hitting the thresholds, and this logic was applied repeatedly, the effect could be pushing the problems under the rug. At levels of ability sufficient to trigger those thresholds, I would be worried about staying on the bleeding edge of danger, tweaking the model until problems don't show up in evals.

I guess this strategy is intended to be complemented with the superalignment effort, and not to be pushed on indefinitely as the main alignment strategy.

Comment by rotatingpaguro on Some Rules for an Algebra of Bayes Nets · 2023-12-14T13:14:31.263Z · LW · GW

Aaaand about this?

I have never encountered this topic, and I couldn't find it skimming the Handbook of Graphical Models (2019), did you invent it? If not, could you list references?

Comment by rotatingpaguro on How do you feel about LessWrong these days? [Open feedback thread] · 2023-12-14T12:59:09.628Z · LW · GW

LessWrong is mostly ok. Specific problems/new things I'd like:

NEW REACTION EMOJIS

  • A reaction emoji to say "be quantitative here". Example: someone says "too much", I can't infer from context how much is too much, I believe it's the case I need to know that to carry their reasoning through, and I want them to stick their neck out and say a number. Possible icons: stock chart, magi-stream of numbers, ruler,  atom, a number with digits and a question mark (like "1.23?"), dial.
  • A reaction emoji to say "give a calibration/reference level for this". Example: interlocutor says "A is intelligent", I can't infer from context who A is intelligent compared to, and I want them to say "compared to B", "relative to average in group ABCD", or similar. Possible icons: double-plate scale with an object on one side and a question mark on the other (too complex maybe?), ruler, caliper, °C/°F, bell curve, gauge stick in water or snow.

TECHNICAL PROBLEMS

  • In the last month loading any LessWrong page hangs indefinitely for me. I have to hit reload about 5 times in close sequence to unjam it.

"HIDE USER NAMES" PROBLEMS

  • The "hide user names" option is not honored everywhere. The names unconditionally appear in the comments thread structure on the left, and in dialogue-related features (can't remember the exact places).
  • The "hide user names" feature would be more usable if names were consistently replaced with auto-generated human-friendly nicknames (e.g., aba-cpu-veo, DarkMacacha, whatev simple scheme of random words/syllables), re-generated and assigned on page load. With the current "(hidden)" placeholder it's quite difficult to follow discussions. After this modification, the anonymized user names should have different formatting or decoration to avoid being confused with true usernames.
  • When the actual user name appears by hovering over the fake one, it annoyingly flickers between the two states if the actual name is shorter than the placeholder. I guess the simplest solution is bounding the width to remain at least that of the placeholder.
  • I often involuntarily reveal a name by running across it with the pointer, or can't resist the temptation. It should be somewhat expensive to uncover a single name. Maybe a timer like the one for strong votes.
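A toy sketch of the kind of nickname scheme I mean (purely illustrative; nothing here reflects the actual LessWrong codebase):

```python
import random

CONSONANTS = "bdfgklmnprstvz"
VOWELS = "aeiou"

def _syllable(rng: random.Random) -> str:
    return rng.choice(CONSONANTS) + rng.choice(VOWELS)

def nickname(rng: random.Random) -> str:
    # Three pronounceable chunks, e.g. "kavo-mise-tapu".
    return "-".join(_syllable(rng) + _syllable(rng) for _ in range(3))

# Re-generated on every page load, but stable within the page:
page_rng = random.Random()
aliases: dict[str, str] = {}

def alias_for(user_id: str) -> str:
    return aliases.setdefault(user_id, nickname(page_rng))

print(alias_for("user_123"), alias_for("user_456"), alias_for("user_123"))
```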

Comment by rotatingpaguro on Significantly Enhancing Adult Intelligence With Gene Editing May Be Possible · 2023-12-14T00:19:51.857Z · LW · GW

Couldn't there be genetic effects on things that can improve the brain even once its NN structure is mostly fixed? Maybe it's possible to have neurons work faster, or for the brain to wear less with abstract thinking, or to need less sleep.

This kind of thing is not a full intelligence improvement because it does not allow you to notice more patterns or to think with new schemes.

So maybe yes, it won't make a difference for AI timelines, though it would still be a very big deal.

Comment by rotatingpaguro on Are There Examples of Overhang for Other Technologies? · 2023-12-13T23:49:48.520Z · LW · GW

TL;DR: No.

I'm curious why the summary is "No" when afterwards you provide the example of land speed records. Are you not counting it because of your argument that jet engines were pushed on by a larger independent market, while large-scale AI hardware would not be if large-scale AI were banned?

I would have preferred a less concise summary such as "we looked at the most-like-this N things, 0 matched, 1 barely".

Disclaimer: did not read the thing in full, skimmed many parts. May have missed an explanation.

Comment by rotatingpaguro on AI #41: Bring in the Other Gemini · 2023-12-13T07:18:31.679Z · LW · GW

Yudkowsky AFAICT has at most engaged via a couple tweets (again which don’t seem to engage with the points).

If you mean literally two, it's more, although I won't take the time to dig up the tweets. I remember seeing them discuss at non-trivial length at least once on twitter. (If "a couple" encompassed that... Well once someone asked me "a couple of spaghetti" and when I gave him 2 spaghetti he got quite upset. Uhm. Don't get upset at me, please?)

I've thought a bit about this because I too on first sight perceived a lack of serious engagement. I've not yet come to a confident conclusion; on reflection I'm not so sure anymore there was an unfair lack of engagement.

First I tried to understand Pope & co's arguments at the object level. Within the allotted time, I failed. I expected to fare better, so I think there's some mixture of (Pope's framework being less simplifiable) & (Pope's current communication being worse), where the comparisons refer to the state of Yudkowsky & co's framework when I first encountered it.

So I turned to proxies; in cases where I thought I understood the exchange, what could I say about it? Did it seem fair?

From this I got the impression that sometimes Pope makes blunders at understanding simple things Yudkowsky means (not cruxes or anything really important, just trivial misunderstandings), which throw a shadow over his reading comprehension, so that one is then less inclined to spend the time to take him seriously when he makes complicated arguments that are not clear at once.

On the other hand, Yudkowsky seems to not take the time to understand when Pope's prose is a bit approximative or not totally rigorous, which is difficult to avoid when compressing technical arguments.

So my current take is: a mixture of (Pope is not good at communicating) & (does not invest in communication). This does not bear significantly on whether he's right, but it's a major time investment to understand him, so inevitably someone with many options on who to talk to is gonna deprioritize him.

To look at a more specific point, Vaniver replied at length to Quintin's post on Eliezer's podcast, and Eliezer said those answers were already "pretty decent", so although he did not take the time to answer personally, he bothered to check that someone was replying more thoroughly.

P.S. to try to be actionable: I think Pope's viewpoint would greatly benefit from having someone who understands it well but is good at, and dedicated to, communication. Although they are faring quite well on fame, so maybe they don't actually need anything more?

P.P.S. they now have a website, optimists.ai, so indeed they do think they should ramp up communication efforts, instead of resting on their current level of fame.

Comment by rotatingpaguro on Why No Automated Plagerism Detection For Past Papers? · 2023-12-12T17:57:38.005Z · LW · GW

Is plagiarism considered bad everywhere in the world, or is it an American foible? I vaguely recall reading years ago that in China it was not considered bad per se, and this occasionally gave Chinese people some problems with American academic institutions. However I did not check the sources at the time nor quantify the effect; I was a naive newspaper-reader.

Comment by rotatingpaguro on What do you do to remember and reference the LessWrong posts that were most personally significant to you, in terms of intellectual development or general usefulness? · 2023-12-10T21:07:25.236Z · LW · GW

I copy them as PDFs using my browser's "Export as PDF..." menu item, and add the date in the file name. I keep them all in a directory. PDFs have the advantage that I can mark notes on them, search by file content from the file browser, or input them into Claude.

I've started doing this recently, after I hit the same problem as you; I'm liking it so far.

When I really care about remembering, I make Anki flashcards.

Comment by rotatingpaguro on Some Rules for an Algebra of Bayes Nets · 2023-12-10T21:02:00.065Z · LW · GW
  • Re-Rooting Rule for Markov Chains
  • Joint Independence Rule
  • Frankenstein Rule
  • Factorization Transfer Rule
  • Stitching Rule for A Shared Markov Blanket
  • Swap Rule  <=====
  • Bookkeeping Rules

Can't find the "Swap rule" in the post, what was that?


The more complex approximation rule requires some additional machinery. For any diagram, we can decompose its  into one term for each variable via the chain rule:

  <=====

If we know how our upper bound  on the diagram’s  decomposes across variables, then we can use that for more fine-grained bounds. Suppose that, for each diagram  and variable , we have an upper bound

  <=====

I guess here there are missing expectations (indicated in red), otherwise  would remain a free RV. The second expectation is not necessary, although omitting it makes the bound unnecessarily strict.


I have never encountered this topic, and I couldn't find it skimming the Handbook of Graphical Models (2019), did you invent it? If not, could you list references?


Do you perchance have something for conditioning or marginalization? I know that graphical marginalization requires using ADMGs (graphs with double-headed arrows) instead of DAGs, I don't know about conditioning.


Exercise

Extend the above proofs to re-rooting of arbitrary trees (i.e. the diagram is a tree). We recommend thinking about your notation first; better choices for notation make working with trees much easier.

In a tree each node has a single parent. Re-rooting means flipping the arrow from the root to one of its children.

In the factorization I do P(old_root)P(new_root|old_root) = P(old_root|new_root)P(new_root), so the proof is the same.
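Spelled out, with my own notation ($R$ is the old root, $C$ the child that becomes the new root; every other factor is untouched):

```latex
\[
  P(X) \;=\; P(R)\,P(C \mid R) \prod_{i \notin \{R,C\}} P(X_i \mid \mathrm{pa}(X_i))
        \;=\; P(C)\,P(R \mid C) \prod_{i \notin \{R,C\}} P(X_i \mid \mathrm{pa}(X_i)),
\]
using Bayes' rule on the single flipped edge; every other node keeps the same parents.
```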

The fact that notation is not involved makes me suspect I may be missing something. Maybe they mean it's needed to write out the factorization expression in full? I guess the nice notation is using ch(X_i) instead of pa(X_i).


Joint Independence Rule: Exercise

The Joint Independence Rule can be proven using the Frankenstein Rule. This is left as an exercise. (And we mean that unironically, it is actually a good simple exercise which will highlight one or two subtle points, not a long slog of tedium as the phrase “left as an exercise” often indicates.)

Bonus exercise: also prove the conditional version of the Joint Independence Rule using the Frankenstein Rule.

Joint independence starts from the graphs

  X_i        {X_not i}

for all i. The Frankenstein rule can not be applied right away because the graphs are not on the same variables, since all but one appear grouped as a single tuple node.

To change that, I replace the tuple nodes with fully connected subgraphs. This is valid because cliques do not constrain the distribution, as any distribution can be factorized in any order if all variables are kept in the conditionals.

I choose the connection order of the cliques such that they respect a global ordering.

Then I apply Frankenstein, picking, for each node, the graph where that node is isolated.

Is there a way to obtain a narrower bound? Yes: I can pick only  from the i-th graph.

Conditional version: basically the same proof, with the common parent as first node in the global order. Needs more care in justifying replacing a tuple node with a clique: Conditional on the parent, the argument above still goes. The only additional independence property of the graph with the tuple is that the parent d-separates the tuple from the singlet, and that is preserved as well.


Frankenstein Rule

We’ll prove the approximate version, then the exact version follows trivially.

Without loss of generality, assume the order of variables which satisfies all original diagrams is . Let  be the factorization expressed by diagram , and let  be the diagram from which the parents of  are taken to form the Frankenstein diagram. (The factorization expressed by the Frankenstein diagram is then .)

The proof starts by applying the chain rule to the  of the Frankenstein diagram:

   <=====

Then, we add a few more expected KL-divergences (i.e. add some non-negative numbers) to get:

Thus, we have 

   <=====

I guess you meant the green thing (first margin arrow) to appear in place of the red thing (second margin arrow)? The math is right, but I hypothesize the final line was intended to express at once the narrow and the loose bounds.

Comment by rotatingpaguro on Game Theory without Argmax [Part 1] · 2023-11-29T21:13:04.229Z · LW · GW

Ah guess it's a typo then, and your use is a nonstandard one.