It can be both, of course. Start with process supervision but combine it with... something else. It's hard to learn how to reason from scratch, but it's also clearly not doing pure strict imitation learning, because the transcripts & summaries are just way too weird to be any kind of straightforward imitation learning of expert transcripts (or even ones collected from users or the wild).
Also worth noting Dustin Moskovitz was a prominent enough donor this election cycle, for Harris, to get highlighted in news coverage of her donors: https://www.washingtonexaminer.com/news/campaigns/presidential/3179215/kamala-harris-influential-megadonors/ https://www.nytimes.com/2024/10/09/us/politics/harris-billion-dollar-fundraising.html
There's no way I can meaningfully pick from like 100 covers. Pick 5 or 10, max, if you expect meaningful votes from people.
The extensive effort they make to integrate into legacy systems & languages shows how important that is.
codyz is doubling down on the UFO claims, but as far as I can see, the case has fallen apart so completely no one even wants to discuss it and even Tyler Cowen & Robin Hanson have stopped nudge-nudge-wink-winking it for now.
So I hereby double my unilateral bet to $2,000.
Show me a field where replication crises tear through, exposing fraud and rot and an emperor that never had any clothes, a field where replications fail so badly that they result in firings and polemics in the New York Times and destroyed careers - and then I will show you a field that is a little confused but has the spirit and will get there sooner or later.
So... parapsychology? How'd that work out? Did they have the (ahem) spirit and get there sooner or later?
Considering that one of the primary barriers to civilian nuclear power plants was, and remains, nuclear bomb proliferation risk, I'm not sure how telling this analogy is. There's a big reason that nuclear power plants right now are associated with either avowed nuclear powers or powers that want to be nuclear powers at some point (eg. France, South Korea, North Korea, Japan, Iran, Pakistan...) or countries closely aligned with said nuclear powers. Or rather, it seems to me that the analogy goes the opposite of how you wanted: if someone had 'solved' nuclear reactor design by coming up with a type of nuclear reactor which was provably impossible to abuse for nukes, that would have been a lot more useful for nuclear reactor design than fiddling with details about how exactly to boil water to drive a turbine. If you solve the latter, you have not solved the former at all; if you solve the former, someone will solve the latter. And if you don't, 'nuclear power plant' suddenly becomes a design problem which includes things like 'resistant to Israeli jet strikes' or 'enables manipulation of international inspectors' or 'relies on a trustworthy closely-allied superpower rather than untrustworthy one for support like spent fuel reprocessing and refueling'.
That is what someone might claim, yes, to avoid losing face by too visibly caring about losing face or attempting to manipulate it.
Can't you do this as polls in a single comment?
The fact that Bob has this policy in the first place is more likely when he's being self-deceptive.
A fun fictional example here is Bester's The Demolished Man: how do you plan & carry out an assassination when telepaths are routinely eavesdropping on your mind? The protagonist visits a company musician, requesting a musical earworm for a company song to help the workers' health or something; alas! the earworm gets stuck in his head, and so all any telepath hears is the earworm. And you can't blame a man for having an earworm stuck in his head, now can you? He has an entirely legitimate reason for that to be there, which 'explains away' the evidence of the deception hypothesis that telepathic-immunity would otherwise support.
Hm. Does that imply that a pack of dogs hunting a human is a stag hunt game?
I am also a little dubious that this is defining a concept which doesn't just mostly overlap with "face", which is substantially older, already well-known, and infinitely easier to remember & write.
Most of these examples seem like substituting in 'face' or 'lose face' would work just fine. "Senator, may I cause you to lose face by criticizing you publicly?" "He didn't like the advice I gave him about his errors because he lost face." "She felt infantilized, and like she was losing face, when her boyfriend told her how to solve something instead of commiserating with her."
They may, but I think the AI code generators would have to be quite good. As long as the LLMs are merely complementing programming languages, I expect them to remain human-readable & writable; only once they are replacing existing programming languages do I expect serious inscrutability. Programming language development can be surprisingly antiquated and old-fashioned: there are many ways to design a language or encode it where it could be infeasible to 'write' it without a specialized program, and yet, in practice, pretty much every language you'll use which is not a domain-specific (usually proprietary) tool will let you write source code in a plain text editor like Notepad or nano.
The use of syntax highlighting goes back to at least the ALGOL report, and yet, something like 50 years later, there are not many languages which can't be read without syntax highlighting. In fact, there's very few which can't be programmed just fine with solely ASCII characters in an 80-col teletype terminal, still. (APL famously failed to ever break out of a niche and all spiritual successors have generally found it wiser to at least provide a 'plain text' encoding; Fortress likewise never became more than a R&D project.) Like this website - HTML, CSS, JS, maybe some languages which compile to JS, SVG... all writable on a 1970s Unix minicomputer printing out to physical paper.
Or consider IDEs which operate at 'project' level or have 'tags' or otherwise parse the code in order to allow lookups of names, like methods on an object - you could imagine programming languages where these are not able to be written out normally because they are actually opaque UUIDs/blobs/capabilities, and you use a structural editor (similar to spreadsheets) to modify everything, instead of typing out names letter by letter like a barbarian. (And 'visual' programming languages often do such a thing.) The Smalltalk systems where you did everything by iteratively interacting with GUI objects come to mind as systems where it's not even clear what the 'plain text' version is, after you've used the systems dynamically as they were intended to be used, and rewritten enough objects or overridden enough methods... But again, few languages in widespread use will do that.
Also of relevance is the wave of resignations from the DC newspaper The Washington Post in the past few days over Jeff Bezos suddenly exerting control.
I had no idea ABBYY was so big. I thought it was just some minor OCR or PDF software developer. Interesting to hear about their historical arc. (I am also amused to see my Sutton meme used.)
Our strategy is for variants to preserve well-defined behavior in the application but introduce diversity in the effect of undefined behavior (such as out-of-bounds accesses).
This Galois work is a lot narrower and targeted at low-level details irrelevant to most code, which thankfully is now written in non-C languages - where out-of-bounds accesses don't pwn your machine and undefined behavior does not summon nasal demons and stuff like ASLR is largely irrelevant.
So AI is wholly necessary for most of the value of such a metamorphic code idea.
And yeah, I think it's a pretty decent idea: with cheap enough LLMs, you can harden applications by sampling possible implementations which pass all unit-tests, and whose final combination passes all end-to-end or integration tests. You can already do a bit of this kind of cross-checking, with LLMs being so cheap. (Last night, Achmiz asked a Markov chain question and I was too lazy to try to figure it out myself, so I had ChatGPT solve it 3 ways in R: Monte Carlo, solving the matrix, and proving an exact closed-form probability. The answer could be wrong but that seems unlikely when they all seem to agree. If I wanted to write it up, I'd also have Claude solve it independently in Python so I could cross-check all 6 versions...)
This would help avoid a decent number of logic bugs and oversights, and it would also have some benefits in terms of software engineering: you are getting a lot of automated 'chaos engineering' and unit-test generation and performance benchmarking for free, by distributing a combinatorial number of implementations. It's almost like a mass fuzzing exercise, where the users provide the fuzz.
You might think this would run into issues with tracking the combinatorial number of binaries, which could take up petabytes if you are distributing, say, a 1GB package to 1 million users, but this has plenty of possible fixes: if you are using reproducible builds, as you ought to, then you only need to track a list of the variants for each function and store that per user, and then you can rebuild the exact binary for a given user on-demand.* I think a bigger issue is that forcing diversity out of tuned LLMs is quite hard, and so you would run into the systematic error problem at a higher level: all the tuned LLMs, feeding on each other's outputs & mode-collapsed, will turn in code with the same implicit assumptions & algorithms & bugs, which would mostly defeat the point.
* Similarly, the LLMs are, or should be, deterministic and fixable with a seed. So the overhead here might be something like, if you have a codebase with 10,000 functions, each time you push out a release - which might happen daily or weekly - you store the RNG seed for the LLM snapshot ID (maybe a kilobyte total), generate 2 versions of each function and randomize per user, and track 10,000 bits or ~1kb per user, so if you have a million users that's just a gigabyte. Whenever you need to investigate a specific binary because it triggered a crash or something, you just fetch the LLM ID & RNG, decode the specific 10,000 function variants they used, and compile. For anyone with millions of users who is serious about security or reliability, a gigabyte of overhead per release is nothing. You already waste that much with random Docker images and crap.
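To make that bookkeeping concrete, here is a minimal sketch (my own illustration, assuming reproducible builds and a deterministic LLM sampler; all names and numbers are made up): rather than even storing ~10,000 bits per user, one could derive each user's variant assignment from a hash of the release and user IDs, and recompute it whenever their exact binary needs to be rebuilt.

```python
import hashlib

N_FUNCTIONS = 10_000   # functions in the codebase
N_VARIANTS = 2         # LLM-generated variants per function

def variant_assignment(release_id: str, user_id: str) -> list[int]:
    """Deterministically pick one variant per function for a given user by
    hashing (release, user, function index); storing just the two IDs is
    enough to later reconstruct and recompile this user's exact binary."""
    picks = []
    for i in range(N_FUNCTIONS):
        digest = hashlib.sha256(f"{release_id}:{user_id}:{i}".encode()).digest()
        picks.append(digest[0] % N_VARIANTS)
    return picks

# e.g. crash triage: recover which variant of each function a given user was
# shipped in a given release, then rebuild that binary reproducibly.
bits = variant_assignment("2024-10-31", "user-42")
print(bits[:10], "...", sum(bits), "functions on variant #1")
```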
Maybe a better framing would be the economic perspective from Hanson's growth paper: "is AI a complement or is it a substitute?" Does AI assist a human worker (or a human organization), making them more productive, functioning as simply a kind of tool (or 'capital') which multiplies their labor; or does it replace that human worker/organization? When it's the former, it may indeed take a very long time; but the latter can happen instantly.
No one can force a freelance artist to learn to use Photoshop or how to best use some snazzy new feature, and artists will be learning the ins-and-outs of their new technologies and workflows for many decades to come, slowly becoming more productive thanks to being complemented by digital illustration tools. Whereas on the other hand, their employers can replace them potentially in minutes after the next big Midjourney upgrade.*
More historically, in colonization, a group of settlers may simply arrive literally overnight in their wagons and set up a new town (eg. a gold rush boomtown), and begin replacing the local indigenous peoples, without any sort of centuries-long gradual '+2% local per capita GDP growth per year until convergence' using only the original local indigenous people's descendants.
* A personal example: when I wanted more fancy dropcaps for Gwern.net, I was contacting human artists and trying to figure out how much it would cost and what the workflow was, and how many thousands of dollars & months of back-and-forth a good dropcap set might cost, and if I would have to settle for instead something like 1 custom dropcap per essay. When Midjourney became reasonably adequate at v5 & DALL-E at 3, I didn't spend decades working with artists to integrate AI into their workflow and complement their labor... I substituted AI for artists: stopped my attempt to use them that night, and never looked back. When I made 10 dropcaps for this year's Halloween theme (the 'purple cats' got particularly good feedback because they're adorable), this is something I could never do with humans because it would be colossally expensive and also enormously time-consuming to do all that just for a special holiday mode which is visible a few hours out of the year. At this point, I'm not sure how many artists or font designers I would want to use even if they were free, because it means I don't have to deal with folks like Dave or have one of my projects delayed or killed by artists, or the hassle of all the paperwork and payments, and I get other benefits like extremely rapid iteration & exploration of hundreds of possibilities without wearing out their patience etc.
This is similar to the answer I got from o1-preview in ChatGPT when I originally asked with OP's post as the text, so that's pleasant to see. (I didn't post anything here because I was unsure and wasn't checking it in enough detail to repost, and so didn't believe in publishing it without being able to improve it.)
I thought there might be some relationship at first with an appropriate transformation, but when I recalled how Kelly requires both edge and net worth, and the problem of frequency of payoffs, I lost my confidence that there would be any simple elegant relationship beyond a simple 'more information = more returns'. Why indeed would you expect 1 bit of information to be equally valuable for maximizing expected log growth in eg. both a 50:50 shot and a 1,000,000,000:1 shot? Or for a billionaire vs a bankrupt? (Another way to think of it: suppose you have 1 bit of information on both over the market and you earn the same amount. How many trades would it take before your more informed trade ever made a difference? In the first case, you quickly start earning a return and can compound that immediately; in the second case, you might live a hundred lives without ever once seeing a payoff.)
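A toy calculation of that intuition (my own illustration, loosely treating 'more information' as a fixed relative edge over the market-implied probability rather than as literal bits): the same edge produces wildly different expected log-growth per bet on an even-money shot versus a long shot.

```python
import math

def kelly_log_growth(p, b):
    """Expected log-growth per bet when staking the Kelly fraction.
    p = true win probability, b = net fractional odds (win b per 1 staked)."""
    f = (p * b - (1 - p)) / b          # Kelly fraction of net worth to stake
    if f <= 0:
        return 0.0                     # no edge: the optimal bet is nothing
    return p * math.log(1 + f * b) + (1 - p) * math.log(1 - f)

# A 10% relative edge over the market-implied probability:
print(kelly_log_growth(p=0.55,   b=1))     # even-money shot: ~0.005 log-growth per bet
print(kelly_log_growth(p=0.0011, b=999))   # 1000:1 shot:     ~0.000005 per bet
```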
There the main bottleneck is the iteration of selection, or making synthetic genomes. Going for the most typical genome with the least amount of originality is not a technical challenge in itself, right?
Right.
If you are doing genome synthesis, you aren't frustrated by the rare variant problems as much because you just aren't putting them in in the first place; therefore, there is no need to either identify the specific ones you need to remove from a 'wild' genome nor make highly challenging edits. (This is the 'modal genome' baseline. I believe it has still not been statistically modeled at all.)
While if you are doing iterated embryo selection, you can similarly rely mostly on maximizing the common SNPs, which provide many SDs of possible improvement, and where you have poor statistical guidance on a variant, simply default to trying to select out against them and move towards a quasi-modal genome. (Essentially using rare-variant count as a tiebreaker and slowly washing out all of the rare variants from your embryo-line population. You will probably wind up with a lot in the final ones anyway, but oh well.)
Which would be a good thing as nominally they claim to let everyone opt out of scraping already by using robots.txt and other methods, and so the canary shouldn't do anything there that people couldn't already do.
No, that's what I think too: they were turning down investors, even, excluding them from the upround. The conditionality was probably not necessary at all. But it does serve a valuable purpose for the inevitable lawsuit.
With SNPs, there's tens of thousands of different SNPs which would each need to be targeted differently. With high copy sequences, there's a relatively small set of different sequences.
No, rare variants are no silver bullet here. There's not a small set, there's a larger set - there would probably be combinatorially more rare variants because there are so many ways to screw up genomes beyond the limited set of ways defined by a single-nucleotide polymorphism, which is why it's hard to either select on or edit rare variants: they have larger (harmful) effects due to being rare, yes, and account for a large chunk of heritability, yes, but there are so many possible rare mutations that each one has only a few instances worldwide which makes them hard to estimate correctly via pure GWAS-style approaches. And they tend to be large or structural and so extremely difficult to edit safely compared to editing a single base-pair. (If it's hard to even sequence a CNV, how are you going to edit it?)
They definitely contribute a lot of the missing heritability (see GREML-KIN), but that doesn't mean you can feasibly do much about them. If there are tens of millions of possible rare variants, across the entire population, but they are present in only a handful of individuals a piece (as estimated by the GREML-KIN variance components where the family-level accounts for a lot of variance), it's difficult to estimate their effect to know if you want to select against or edit them in the first place. (Their larger effect sizes don't help you nearly as much as their rarity hurts you.)
So this is why if you read the CNV studies and you look at the hits they identify, and how many subjects are covered by the identified hits, you find that like, maybe 2% of the cohort will have one of those specific identified hits and lose 2 IQ points or gain 2 kg of fat etc. So you can see how that would work out in embryo selection: you'd be able to avoid that loss, which is meaningful! ...in a tiny fraction of all embryos. On average, you'd just sequence them all, find no known pathogenic variant, and shrug, and use the SNP PGS like usual, having gained nothing.
Also, of course, WGS is substantially more expensive than SNP genotyping and more difficult to do on embryos.
If the genetic architecture had worked out otherwise, if there had instead been a lot of rare mutations which increased intelligence, then life would be a lot more convenient. Instead, it's a lot of 'sand in the gears', and once you move past the easy specks of sand, they all become their own special little snowflakes.
This is why rare variants are not too promising, although they are the logical place to go after you start to exhaust common SNPs. You probably have to find an alternative approach like directly modeling or predicting the pathogenicity of a rare variant from trying to understand its biological effects, which is hard to do and hard to quantify or predict progress in. (You can straightforwardly model GWAS on common SNPs and how many samples you need and what variance your PGS will get, but predicting progress of pathogenicity predictors has no convenient approach.) Similarly, you can try very broad crude approaches like 'select embryos with the fewest de novo mutations'... but then you lose most of the possible variance and it'll add little.
Or is it, 'OpenAI the for-profit is doing good in the world, and they can do much more good if they can raise more money, and there's certainly no way they could raise more money without us giving up control'?
Basically, yes, that is what the argument will be. The conditionality of the current investment round is also an example of that: "we can only raise more capital on the condition that we turn ourselves into a normal (B-corp) company, unencumbered by our weird hybrid structure (designed when we thought we would need OOMs less capital than it turns out we do), and free of the exorbitant Board control provisions currently governing PPUs etc. And if we can't raise capital, we will go bust soon and will become worthless and definitely lose the AGI race, and the Board achieves none of its fiduciary goals at all. Better a quarter of a live OA than 100% of a dead one."
Well, what's the alternative? If you think there is something weird enough and suboptimal about essay formats that you are reaching for 'random chance' or 'monkey see monkey do' level explanations, that implies you think there is some much superior format they ought to be using instead. But I can't see what. I think it might be helpful to try to make the case for doing these things via some of the alternatives:
- a peer-reviewed Nature paper which would be published 2 years from now, maybe, behind a paywall
- a published book, published 3 years from starting the first draft now, which some people might get around to reading a year or two after that, and dropping halfway through (assuming you finish and didn't burn out writing it)
- a 1 minute Tiktok video by an AI person with non-supermodel looks
- a 5-minute heavily-excerpted interview on CNN
- a 750-word WSJ or NYT op-ed
- a 10-page Arxiv paper in the standard LaTeX template
- a Twitter thread of 500 tweets (which can only be read by logged-in users)
- a Medium post (which can't be read because it is written in a light gray font illegible to anyone over the age of 20. Also, it's paywalled 90% of the time.)
- a 6 hour Lex Fridman podcast interview, about 4 hours in after Lex has finished his obligatory throatclearing questions (like asking you if aliens exist or the universe is made out of love)
- interpretive dance in front of the Lincoln Memorial livestreamed on Twitch
- ...
(I'd also add in Karnofsky's blog post series.)
Sunglasses can be too cool for most people to be able to wear in the absence of a good reason. Tom Cruise can go around wearing sunglasses any time he wants, and it'll look cool on him, because he's Tom Cruise. If we tried that, we would look like dorks because we're not cool enough to pull it off and it would backfire on us. (Maybe our mothers would think we looked cool.) This could be said of many things: Tom Cruise or Kanye West or fashionable celebrities like them can go around wearing a fedora and trench coat and it'll look cool and he'll pull it off; but if anyone else tries it...
"Gwern, why don't you just buy an AI-themed ETF and 'buy the whole sector' if investing in individual stonks is so hard but you're optimistic about its long-term value?"
"How to Lose Money on the World’s Most Popular Investment Theme: Pity the investors in the three artificial-intelligence-themed ETFs that managed to lose money this year" (mirror):
There are lots of embarrassing ways to lose money, but it is particularly galling to lose when you correctly identify the theme that will dominate the market and manage to buy into it at a good moment.
Pity the investors in the three artificial-intelligence-themed exchange-traded funds that managed to lose money this year. Every other AI-flavored ETF I can find has trailed both the S&P 500 and MSCI World. That is before the AI theme itself was seriously questioned last week, when investor doubts about the price of leading AI stocks Nvidia and Super Micro Computer became obvious.
The AI fund disaster should be a cautionary tale for buyers of thematic ETFs, which now cover virtually anything you can think of, including Californian carbon permits (down 15% this year), Chinese cloud computing (down 21%) and pet care (up 10%). Put simply: You probably won’t get what you want, you’ll likely buy at the wrong time and it will be hard to hold for the long term.
Ironically enough, Nvidia’s success has made it harder for some of the AI funds to beat the wider market. Part of the point of using a fund is to diversify, so many funds weight their holdings equally or cap the maximum size of any one stock. With Nvidia making up more than 6% of the S&P 500, that led some AI funds to have less exposure to the biggest AI stock than you would get in a broad index fund. This problem hit the three losers of the year. First Trust’s $457 million AI-and-robotics fund has only 0.8% in Nvidia, a bit over half what it holds in cybersecurity firm BlackBerry. WisdomTree’s $213 million AI-and-innovation fund holds the same amount of each stock, giving it only 3% in Nvidia. BlackRock’s $610 million iShares Future AI & Tech fund was also equal weighted until three weeks ago, when it altered its purpose from being a robotics-and-AI fund, changed ticker and switched to a market-value-based index that gives it a larger exposure to Nvidia.
The result has been a 20-percentage-point gap between the best and worst AI ETFs this year. There is a more than 60-point gap since the launch of ChatGPT in November 2022 lit a rocket under AI stocks—although the ETFs are at least all up since then.
...Dire timing is common across themes: According to a paper last year by Prof. Itzhak Ben-David of Ohio State University and three fellow academics, what they call “specialized” ETFs lose 6% a year on average over their first five years due to poor launch timing.
...But mostly, look at the fees: They will be many times higher than a broad market index fund, and the dismal history of poor timing suggests that for most people they aren’t worth paying.
Also, note that the 'Blue LED' was not originally my example at all, someone else brought it up as an example.
Then maybe you shouldn't be trying to defend it (or your other two examples of engines and programming languages, for that matter), especially given that you still have not explained how 'the LED' could have been given a Nobel ever inasmuch as everyone involved was dead.
One of the problems with the Nobel Prize as a measurement or criteria is that it is not really suited for that by nature, especially given criteria like no posthumous awards. This means that it is easy to critique awarding a Nobel Prize, but it is harder to critique not awarding one. You can't give a Nobel Prize to the inventor of the engine, because they probably died a long time ago; you could have for a recent kind of engine. Similarly, you could give a Turing Award to the inventors of C (and they probably did) but the first person who created a mnemonic shorthand over raw machine opcodes during WWII or whatever was probably dead before the Turing Award was even created.
Let's take your 'inventing the LED' for example. You seem keen on interpreting the absence of a Nobel Prize here as a relative judgment about 'inventing LEDs' vs 'inventing blue LEDs'. But you don't establish that there is any reason to think this is one of the cases where the lack of an award can be validly interpreted as a snub & a judgment by the relevant committee. Is it?
Well, let's take 5 seconds to check some of the historical context here, like who would you award a prize to? I open up Wikipedia and I check the first three names. (Why three? Because Nobel Prizes are arbitrarily limited to 3 awardees.)
All 3 of them, including Oleg Losev who is described as physically creating the first bona fide LED and so seems to be the closest to "the inventor of the LED", died before or around the first commercial LED being announced (October 1962). For about a decade, early weak expensive red LEDs "had little practical use", until finally they began to replace nixie tubes. Only then did they start to take off, and only then did they start to become a revolution. (And reading this WP history, it seems like blue LEDs have wound up being more important than the original red ones anyway.)
Oleg Losev in particular died in 1942, in obscurity, and given the year, you won't be too surprised why:
Losev died of starvation in 1942, at the age of 38, along with many other civilians, during the Siege of Leningrad by the Germans during World War 2.
You can't award Nobel Prizes to the dead - and by the time it was clear LEDs were a major revolution, many of the key players were well and thoroughly dead. That is, the committee could not have awarded a Nobel Prize for 'inventing the LED', without either being prescient or awarding it to later researchers, who were lucky enough to be long-lived but did not actually invent the LED, and that would be a travesty on its own and also crowd out meritorious alternative physics breakthroughs (of which there were many in the 20th century that they are still working their way through).
So, this is one reason to not put too much stress on the absence of a Nobel Prize. Not having a Nobel Prize for work in the early-to-mid 20th century means in considerable part things like "was not killed by Hitler or Stalin", things which are not particularly related to the quality or value of your scientific research but are related to whether you can survive for the 20 or 40 years it may take for your Nobel Prize to show up.
I guess LLMs are model-free, so that's relevant
FWIW, I strongly disagree with this claim. I believe they are model-based, with the usual datasets & training approaches, even before RLHF/RLAIF.
Is "arithmetic" here simply a synonym for "Fermi estimates"?
I semi-agree with #2: if you use mostly old and highly-curated data as a "seed" dataset for generating synthetic data from, you do bound the extent to which self-replicating memes and personas and Sydneys can infect the model. If there is a Sydney-2 in later data, it obviously cannot exist in some snapshot taken at an earlier date. And if the model is fully trained with a strong personality, like a "Claude", and only then exposed to data infected by a Sydney or Sydney-2, you might reasonably expect there to be much less infection: the same way that you or I could roleplay as "Sydney" but we are in no danger of being hijacked by those outputs into being an 'actual' Sydney because we're not base models, so too a fully-trained safety-tuned LLM might have been mode-collapsed down so heavily onto 1 persona that it is far more difficult for any alternate persona to be instantiated meaningfully.
I don't think this is water-tight. It's not obvious that any known data-quality screening suffices here, or indeed, even in principle what screening you would do to block Sydney but not God or Obama or the Easter Rabbit. (How would you, for example, define a data-quality screening procedure for creating a seed dataset which could be efficiently implemented at scale on petabytes of candidate data, which you did not hand-engineer specifically to block Sydney, but which successfully blocks automatically, say, the New York Times articles containing extensive quotes from & descriptions of Sydney?) Sydney-infected data will still be there from the start of training because NYT data is so high quality, etc. But yes, the more you cut the causal loops from the datasets to other agents, the more you guard yourself from malign influences from those.
I am not sure I buy your #1. A bootstrapping synthetic data LLM seems more prone to iteratively developing steganography by training on its own datasets, because it's easier to amplify its own protocols and coordinate with itself through the many generations of bootstrapping. (This is why self-play tends to lead to such 'neuralese'.) And to the extent that they are smarter (because that approach works better & is why it's being used), they are going to be that much better at developing or learning steganography sample-efficiently.
What the hermetic seal of bootstrapping means is that the LLM/Internet pollution effect of #2 is less likely to happen... as long as, of course, you manage to cut the data off early enough that there are not enough examples to let steganography emerge somewhere in the generations of bootstraps anyway. It's far from obvious to me that even cutting off the data right now would be adequate. (There's so many kinds of covert communication or discussion already out there, after all. And current LLMs are already so smart that you don't even need to provide examples of many things, just descriptions.)
And you further have the problem that now the post-training phase is not going to be robust to this - unlike viral persona - because there's no reason to not learn steganography if it's useful. You might succeed in bootstrapping a frontier LLM which is robust to any persona viruses, sure, but when you refresh it with recent scraped data or user inputs, it notices steganography from other models in 1 odd sample, and realizes what a nifty idea that is and how it saves a lot of tokens etc, and boom! now you have a steganographic model. The model doesn't even have to be finetuned, necessarily - information might be getting smuggled around in "plain text" (like some of the more horrifying corners of Unicode) as a prefix trigger. (The longer context windows/prompts are, the more prompt prefixes can "pay their way", I'd note.) We've seen some early experiments in trying to make self-replicating prompts or texts...
Given his history, I think it is perfectly reasonable to conclude that childhood experiences directly caused BPD.
Everything you describe is equally consistent with BPD causing childhood experiences, or BPD causing memories of childhood experiences.
So to summarize: this studies turning the original meaningful inner-monologue, with meaningful varying tokens, into a 1:1 constant-token version, so those constant tokens cannot be conveying any information, but do cause the model to increase computation. It then shows that the LLM trained with constant-tokens is effectively doing the original varying-token inner-monologue, and successfully doing that while emitting the constant-token prediction by merely erasing the original token in favor of the constant-token in the final layers right before the output.
This shows that the constant-trained LLM is not amortizing compute across steps or breaking up the full compute into multiple steps, but being taught how to internally do the extended serial computation, and how to spread that out over itself as more of the LLM gets brought into action by the padding. And since it's the same computation, you can easily read it off with interpretability tools.
The fact that an LLM can do the computation internally is not too surprising. Most of these problems are intrinsically very simple: adding or multiplying a few short numbers is not an impressive feat when the LLM is using up literally trillions of multiplications of long FP16 numbers, after all. So it is not surprising if a NN is expressive enough to, say, multiply two entire numbers using hundreds of layers, billions of parameters, and trillions of multiplications. (And you can do quite a lot with very small neural nets, sometimes with just a few neurons.)
The relevance here is to steganography: what sorts of computations can be distilled into the forward pass and hidden from interpretability? What happens when selective pressure is placed on LLMs to learn how to store intermediate state in tokens in hidden ways or encrypt forward passes to hide reward-enhancing-but-forbidden thoughts from overseers bulk-deploying interpretability methods over masses of LLMs? Or to just use as few tokens as possible?
But there are no such pressures for steganography here. The 1:1 token swap encourages the LLM to stay the same. There's no real need for it to change internally, much less to change in any difficult-to-understand way. (You might as well freeze most of the weights to save yourself backprop at that point.) So, I don't learn anything about steganography or AI safety from the results as presented.
What I'd be more interested in seeing would be variants which did encourage steganography, and so where the interpretability evals were more relevant (a few of these are sketched in code after this list):

- you could randomize the padding tokens, instead of just using an easy constant token
- you could shorten the padding tokens, like 2:1. How much does that compression distort the inner-monologues?
- you could allow a subset of padding tokens, like maybe just 0/1, and it has to learn some useful binary encoding.
- you could introduce an adversary, like a LLM co-trained on the visible "padding" tokens, and penalize the LLM if the co-LLM can figure out the answer from reading the inner-monologue's "padding" tokens. (This could be hard to implement. One simple approach would be rejection sampling: just drop all samples where the co-LLM predicts the right answer, so the original LLM only trains on samples which pass the adversary.)
- you could randomly vary the padding ratio: sometimes train on a very small ratio, but other times train on very large ones like 1:100
- one can think of several ways a LLM might cope with expanding ratios like 1:100, in order to keep computing the right thing without screwing up by accidentally predicting the answer "too soon". Can interpretability methods distinguish between approaches like "try to keep count of exactly how many tokens and on the 100th padding token, start generating the real answer" and "internally hobble the computation so it only winds up finishing exactly on the 100th token"?
- Switching to another task like "counting" might be helpful for implementing variants.
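A toy sketch of how a few of those variants could be generated as training data (my own illustration; the filler vocabulary and ratios are arbitrary):

```python
import random

FILLER = "<pad>"

def pad_cot(cot_tokens, mode="constant", ratio=1.0, rng=random):
    """Replace an inner-monologue with filler under a few of the variant
    schemes above. ratio = filler tokens per original token
    (0.5 = 2:1 compression, 100 = 1:100 expansion)."""
    n = max(1, round(len(cot_tokens) * ratio))
    if mode == "constant":   # the paper's setting: one fixed filler token
        return [FILLER] * n
    if mode == "random":     # randomized filler: no fixed channel to hide state in
        vocab = [FILLER, ".", ",", "the", "of"]
        return [rng.choice(vocab) for _ in range(n)]
    if mode == "binary":     # restricted 0/1 alphabet the model could encode into
        return [rng.choice("01") for _ in range(n)]
    raise ValueError(mode)

cot = "9 * 7 = 63 ; 63 + 5 = 68".split()
print(pad_cot(cot, mode="constant"))             # 1:1 constant padding
print(pad_cot(cot, mode="random", ratio=0.5))    # 2:1 compressed, randomized filler
print(len(pad_cot(cot, mode="binary", ratio=100)), "filler tokens for a 1:100 expansion")
```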
It sounds like SC2 might just be a bad testbed here. You should not have to be dealing with issues like "but can I get a computer fast enough to run it at a fast enough speedup" - that's just silly and a big waste of your effort. Before you sink any more costs into shaving those and other yaks, it's time to look for POMDPs which at least can be paused & resumed appropriately and have sane tooling, or better yet, have continuous actions/time so you can examine arbitrary ratios.
Also, I should have probably pointed out that one issue with using LLMs you aren't training from scratch is that you have to deal with the changing action ratios pushing the agents increasingly off-policy. The fact that they are not trained or drawing from other tasks with similarly varying time ratios means that the worsening performance with worsening ratio is partially illusory: the slower player could play better than it does, it just doesn't know how, because it was trained on other ratios. The kind of play one would engage in at 1:1 is different from the kind of play one would do at 10:1, or 1:10; eg a faster agent will micro the heck out of SC, while a slow agent will probably try to rely much more on automated base defenses which attack in realtime without orders and emphasize economy & grand strategy, that sort of thing. (This was also an issue with the chess hobbling experiments: Stockfish is going to do very badly when hobbled enough, like removing its queen, because it was never trained on such bizarre impossible scenarios / alternate rulesets.) Which is bad if you are using this as some sort of AI safety argument, because it will systematically deceive you, based on the hobbled off-policy agents, into thinking slowed-down agents are less capable (ie. safer) in general than they really are. This is another reason to not use SC2 or try to rely on transfer from a pre-existing model, convenient as the latter may be.
Given both these issues, you should probably think about instead more Jones-like training an agent from scratch, simultaneously at all ratios to meta-learn competency at all ratios while sharing training in a fair fashion, on a much simpler environment. Maybe not even a POMDP, MDPs might be adequate for most of it. Something like a large tic-tac-toe board, or perhaps a continuous Pong, would be simple enough that you could afford to train very competent unhobbled agents at widely-varying ratios, and fit various scaling laws, with few GPUs.
It wouldn't collide with normal Markdown syntax use. (I can't think of any natural examples, aside from bracket use inside links, like [[editorial comment]](URL), which could be special-cased by looking for the parentheses required for the URL part of a Markdown link.) But it would be ambiguous where the wiki links point to (Sarah's Roam wiki? English Wikipedia?), and if it pointed to somewhere other than LW2 wiki entries, then it would also be ambiguous with that (because the syntax is copied from Mediawiki and so the same as the old LW wiki's links).
And it seems like an overloading special case you would regret in the long run, compared to something which rewrote them into regular links. Adds in a lot of complexity for a handful of uses.
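For what it's worth, the special-casing itself is simple; a sketch of the kind of regex involved (purely illustrative, not anything LW2 actually uses):

```python
import re

# `[[...]]` not followed by `(` is a wiki-style link; `[[...]](url)` is just an
# ordinary Markdown link whose visible text happens to be wrapped in brackets.
WIKI_LINK = re.compile(r"\[\[([^\]]+)\]\](?!\()")

text = "See [[Kelly criterion]] and also [[editorial comment]](https://example.com)."
print(WIKI_LINK.findall(text))   # ['Kelly criterion'] -- the Markdown link is skipped
```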
Methodologically, I think it would make more sense to frame it in terms of action granularity ratio, rather than using units like seconds or %s. The use of seconds here seems to make the numbers much more awkward. It'd be more natural to talk about scaling trends for Elo vs action-temporal granularity. For example, "a 1:2 action ratio translates to a 1:3 win ratio advantage (+500 Elo)" or whatever. This lets you investigate arbitrary ratios like 3:2 and fill out the curves. (You'd wind up doing a transform like this anyway.)
Then you can start easily going through various scaling laws, like additional finetuning samples or parameter scaling vs Elo, and bring in the relevant DRL scaling literature like Jones and temporal scaling laws for horizons/duration. (For example, you could look at horizon scaling in terms of training samples: break up each full Starcraft episode to train on increasingly truncated samples.) The thresholds you talk about might be related to the irreducible loss of the horizon RL scaling law: if there is something that happens "too quick" each action-timestep, and there is no way to take actions which affect too-quick state changes, then those too-quick events will be irreducible by agents.
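For reference, the win-ratio-to-Elo transform is just the standard logistic Elo formula (nothing specific to this setup); a quick sketch:

```python
import math

def elo_diff_from_win_ratio(wins, losses):
    """Elo difference implied by an observed win:loss ratio (logistic Elo)."""
    return 400 * math.log10(wins / losses)

def expected_score_from_elo_diff(d):
    """Expected score of the stronger player given an Elo advantage of d points."""
    return 1 / (1 + 10 ** (-d / 400))

print(elo_diff_from_win_ratio(3, 1))      # ~191 Elo for a 3:1 win ratio
print(expected_score_from_elo_diff(500))  # ~0.95 expected score at +500 Elo
```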
I don't think LLMs do the equivalent of that. It's more like, learning Chinese from a Chinese/Chinese dictionary stapled to a Chinese encyclopedia.
It is not obvious to me that using a Chinese/Chinese dictionary, purged of example sentences, would let you learn, even in theory, the things that even a simple n-gram or word2vec model trained on a non-dictionary corpus learns and encodes into embeddings. For example, would a Chinese/Chinese dictionary let you plot cities by longitude & latitude? (Most dictionaries do not try to list all names, leaving that to things like atlases or gazetteers, because they are about the language, and not a specific place like China, after all.)
Note that the various examples from machine translation you might think of, such as learning translation while having zero parallel sentences/translations, are usually using corpuses much richer than just an intra-language dictionary.
everybody want to test rats in mazes, ain't nobody want to test this janky-ass maze!
One of the interesting things I found when I finally tracked down the source is that one of the improved mazes before that was a 3D maze where mice had to choose vertically, keeping them in the same position horizontally, because otherwise they apparently were hearing some sort of subtle sound whose volume/direction let them gauge their position and memorize the choice. So Hunter created a stack of T-junctions, so each time they were another foot upwards/downwards, but at the same point in the room and so the same distance away from the sound source.
Perhaps the norm should be to use some sort of LLM-based survey service like https://news.ycombinator.com/item?id=36865625 in order to try to get a more representative population sample of LLM outputs?
This seems like it could be a useful service in general: do the legwork to take base models (not tuned models), and prompt in many ways and reformulate in many ways to get the most robust distribution of outputs possible. (For example, ask a LLM to rewrite a question at various levels of details or languages, or switch between logically equivalent formulations to avoid acquiescence bias; or if it needs k shots, shuffle/drop out the shots a bunch of times.)
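A toy sketch of the kind of prompt perturbation such a service would automate (my own illustration; in practice the reformulations themselves would be generated by an LLM rather than by fixed templates):

```python
import random

def perturb_prompts(question, shots, n=20, seed=0):
    """Generate many surface variants of the same prompt -- shuffling the
    few-shot examples, dropping some out, and varying the framing -- so the
    model's answer distribution can be averaged over phrasings rather than
    trusted from a single arbitrary prompt."""
    rng = random.Random(seed)
    framings = [question,
                f"Answer briefly: {question}",
                f"Question: {question}\nAnswer:"]
    variants = []
    for _ in range(n):
        kept = [s for s in shots if rng.random() > 0.3]   # shot dropout
        rng.shuffle(kept)                                  # shot reordering
        variants.append("\n".join(kept + [rng.choice(framings)]))
    return variants

shots = ["Q: Is water wet? A: Yes.", "Q: Is fire cold? A: No."]
print(*perturb_prompts("Is the sky blue?", shots, n=3), sep="\n---\n")
```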
It is worth noting that the Pros made more extreme forecasts than the bots. The Pros were not afraid to forecast less than 2% or more than 90%, while the bots stayed closer to 50% with their forecasts.
This sounds like an example of 'flattened logits' or loss of calibration in tuned models. I take it that all of the models involved were the usual RLHF/instruction-tuned models, and no efforts were made to use base models like the original davincis or llama-3-405b-base, which ought to have better calibration?
Yes, people have been pulling this sort of semantic knowledge out of word embeddings since the start. Here is a long list from like 5 years ago, going far beyond just geographic locations: https://gwern.net/gpt-2#fn11
This is one of the reasons that people have rejected the claims that LLMs are doing anything special: because after all, just a word2vec, which barely even counts as a neural net, or n-grams, seems able to 'learn' a lot of the same things as a LLM does, even though it's "obviously" not a world model. (It's a modus ponens/tollens thing.)
One of the coolest demonstrations of extracting world models (and demonstrating the flaws in the learned world models due to a lack of inductive priors) is a paper on inferring the exact street connectivity & geography of New York City from training on taxi cab trajectories: https://x.com/keyonV/status/1803838591371555252 https://arxiv.org/abs/2406.03689
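The geographic examples mentioned above are typically just a linear probe from off-the-shelf embeddings onto coordinates; a minimal sketch, assuming gensim's pretrained GloVe download is available (with only a handful of hand-entered cities this is a demo, not a replication):

```python
import gensim.downloader as api                 # assumed available
import numpy as np
from sklearn.linear_model import Ridge

# Approximate (lat, lon) for a handful of training cities; a real probe would
# use hundreds of cities and a proper train/test split.
coords = {"london": (51.5, -0.1), "paris": (48.9, 2.4), "madrid": (40.4, -3.7),
          "moscow": (55.8, 37.6), "cairo": (30.0, 31.2), "tokyo": (35.7, 139.7),
          "sydney": (-33.9, 151.2), "toronto": (43.7, -79.4)}

vecs = api.load("glove-wiki-gigaword-100")      # plain static word embeddings
X = np.array([vecs[city] for city in coords])
y = np.array(list(coords.values()))

probe = Ridge(alpha=1.0).fit(X, y)              # linear probe: embedding -> lat/lon
print(probe.predict([vecs["berlin"]]))          # rough guess for a held-out city
```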
This is certainly the more convoluted explanation, but it certainly matches with my observations of SBF's psychology, from well-before the FTX blowup.
I disagree. I think Altman is, in many respects, the exact opposite of SBF, and your read of his personality is wrong. This is why you can't predict things like Sutskever & Murati leaving OA, without being pushed (and in fact Altman going to lengths to keep them), while I could. I encourage you to go back and reread things like the New Yorker profile or discussions of his highschool career or his abortive political run or UBI experiment with that in mind.
This thought experiment is unrealistic
many such cases
Unsurprisingly, black and white top the list, along with some other neutrals; red, a perennial favorite, is the top non-neutral color.
The ABRSM is in X days. It too does not care how efficient you were time-wise in getting to grade-8 competency. There are no bonus points for sample-efficiency.
(And of course, it's not like Asian parents are doing their kids much good in the first place with that music stuff, so there's even less of an issue there.)
Well, it would certainly be nice if that were true, but all the interpretability research thus far has pointed out the opposite of what you seem to be taking it to. The only cases where the neural nets turn out to learn a crisp, clear, extrapolable-out-many-orders-of-magnitude-correctly algorithm, verified by interpretability or formal methods to date, are not deep nets. They are tiny, tiny nets either constructed by hand or trained by grokking (which appears to not describe at all any GPT-4 model, and it's not looking good for their successors either). The bigger deeper nets certainly get much more powerful and more intelligent, but they appear to be doing so by, well, slapping on ever more bags of heuristics at scale. Which is all well and good if you simply want raw intelligence and capability, but not good if anything morally important hinges on them reasoning correctly for the right reasons, rather than heuristics which can be broken when extrapolated far enough or manipulated by adversarial processes.
If you distrust OA's selection, it seems like o1 is occasionally leaking the chains of thought: https://www.reddit.com/r/OpenAI/comments/1fxa6d6/two_purported_instances_of_o1preview_and_o1mini/ So you can cross-reference those to see if OA's choices seem censored somehow, and also just look at those as additional data.
It's also noteworthy that people are reporting that there seem to be other blatant confabulations in the o1 chains, much more so than simply making up a plausible URL, based on the summaries: https://www.reddit.com/r/PromptEngineering/comments/1fj6h13/hallucinations_in_o1preview_reasoning/ Stuff which makes no sense in context and just comes out of nowhere. (And since confabulation seems to be pretty minimal in summarization tasks these days - when I find issues in summaries, it's usually omitting important stuff rather than making up wildly spurious stuff - I expect those confabulations were not introduced by the summarizer, but were indeed present in the original chain as summarized.)
the human value is not complex point is frankly aging very well with the rise of LLMs
You just pointed out that what a LLM learned for even a very simple game with extensive clean data turned out to be "a bag of heuristics": https://www.lesswrong.com/posts/LNA8mubrByG7SFacm/against-almost-every-theory-of-impact-of-interpretability-1?commentId=ykmKgL8GofebKfkCv
Spaced repetition is the most efficient way in terms of time spent per item. That doesn't make it the most efficient way to achieve a competitive goal. For this reason, SRS systems often include a 'cramming mode', where review efficiency is ignored in favor of maximizing memorization probability within X hours. And as far as musicians go - orchestras don't select musicians based on who spent the fewest total hours practicing but still manage to sound mostly-kinda-OK, they select based on who sounds the best; and if you sold your soul to the Devil or spent 16 hours a day practicing for the last 30 years to sound the best, then so be it. If you don't want to do it, someone else will.
That said, the spaced repetition research literature on things like sports does suggest you still want to do a limited form of spacing in the form of blocking or rotating regularly between each kind of practice/activity.
My yakshaving essay seems relevant here. Especially relevant is the vicious cycle of overload/working-hard/not-yakshaving: