Comments
I think people did point out that CICERO lies, and that was a useful update about how shallow attempts to prevent AI deception can fail. I think it could be referenced, and has been referenced, in relevant discussions
None of which comes anywhere close to your claims about what labs would do if they caught systematic scheming to deceive and conquer humans in systems trained normally. CICERO schemes very systematically, in a way which depends crucially on an LLM which was not trained to deceive or scheme. It does stuff that, not so long ago, would have been considered a red line. And what analysis does it get? Some cursory 'pointing out'. Some 'referencing in relevant discussions'. (It hasn't even been replicated, AFAIK.)
any evidence that we'll get the kind of scheming that could lead to AI takeover,
See, that's exactly the problem with this argument - the goalposts will keep moving. The red line will always be a little further beyond. You're making the 'warning shot' argument. CICERO presents every element except immediate blatant risk of AI takeover, which makes it a good place to start squeezing that scientific juice, and yet, it's still not enough. Because your argument is circular. You can only be convinced of 'systematic scheming to pose non-negligible takeover risk' if you've already been convinced that it's 'systematic scheming to pose non-negligible takeover risk'. You present it as if there were some clear, objective brightline, but there is not and will not be, because each time it'll be like Sydney or CICERO or ...: "oh, it didn't take over, and therefore doesn't present a takeover risk" and therefore no update happens. So all your assertion boils down to is the tautology that labs will deeply examine the risky agents they choose to deeply examine.
Danny Halawi reports lower performance on a different, more held-out set of forecasting questions, and says the claims about the GPT-4o knowledge cutoff are probably wrong:
The results in "LLMs Are Superhuman Forecasters" don't hold when given another set of forecasting questions. I used their codebase (models, prompts, retrieval, etc.) to evaluate a new set of 324 questions—all opened after November 2023.
Findings:
- Their Brier score: 0.195
- Crowd Brier score: 0.141 [lower=better]
First issue:
The authors assumed that GPT-4o/GPT-4o-mini has a knowledge cut-off date of October 2023.
However, this is not correct.
For example, GPT-4o knows that Mike Johnson replaced Kevin McCarthy as speaker of the house.
- This event happened at the end of October.
- This also happens to be a question in the Metaculus dataset.
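(For reference, the Brier score above is just the mean squared error between probabilistic forecasts and the 0/1 outcomes, so lower is better; a minimal sketch:)

```python
def brier_score(forecasts, outcomes):
    # Mean squared error between forecast probabilities and 0/1 outcomes; lower is better.
    # Always guessing 0.5 scores 0.25.
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)

print(brier_score([0.9, 0.2, 0.7], [1, 0, 0]))  # (0.01 + 0.04 + 0.49) / 3 = 0.18
```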
Keep in mind Musk never said it was "fully online" or "100,000 GPUs are running concurrently" or anything like that. He only said that the cluster was "online", which could mean just about anything, and that it is "the most powerful AI training system", which is unfalsifiable (who can know how powerful every AI training system is worldwide, including all of the secret proprietary ones by FANG etc?) and obvious pure puffery ("best pizza in the world!"). If you fell for it, well, then the tweet was for you.
Yes, the Google 'search by date' is basically completely busted and has been for a while (even though Google possesses the capability to date content accurately by using their Internet-wide snapshot archive going back to the 1990s, whose existence was recently publicly confirmed by the ad API leak). For kicks, try searching things like "Xi Jinping" with date-ranges like 2013... It is most regrettable, as it used to be a useful tool for me in digging up old stuff. There also seem to be issues in the other direction, where Google is 'forgetting' old articles which aren't being indexed at all, apparently, in any publicly-searchable fashion (which might be contributing to the former, by a base rates screening-paradox effect - if all the real old articles have been forgotten by the index, then only erroneously timestamped hits will be available). I'm not aware of any search engine whose date cutoff is truly reliable. Even if they were, you would still have to manually check and clean to be sure that things like sidebars or recommendations were not causing data leakage.
I also agree that if this is really the only countermeasure to data leakage OP has taken, then the results seem dead on arrival. ML models infamously ruthlessly exploit far subtler forms of temporal data leakage than this...
It sounds like I'll be waiting for some actually out-of-sample forecasting numbers before I believe any claims about beating Metaculus etc.
(This is a surprising mistake for a benchmarking expert to make. Even if you knew nothing about the specific problems with date-range search, it should be obvious that even with completely unedited, static snapshots from the past, there would be leakage - results will rank higher or lower based on future events, for example. If Israel attacked Iran, obviously all the articles from beforehand arguing that Israel will/should/could attack Iran are going to benefit from being 'right' and be ranked higher than articles arguing the opposite, many of which will outright quietly disappear & cease to be mentioned, and an LLM conditioned on those rather than the lower-ranking ones will automatically & correctly 'predict' more accurately. And there are countless other leakages like that, which are not fixed as easily as "just download a snapshot from the IA".)
By going long coal stocks, you can implicitly bet that 1) in the short run, the war between Russia and Ukraine and the associated sanctions and trade disruptions will continue (reduced energy exports from Russia is the main cause of the current high coal prices), 2) supply of (non-Russian) energy will not respond much to higher prices, and 3) in the longer run, humanity will have a harder time transitioning away from burning coal for energy, or using coal to make steel and cement, than the market thinks.
It has been a bit over 2 years now, and the war continues with no end in sight. How many of #1-3 happened?
If there is indeed systematic scheming by the model, and the lab has caught it red handed, the lab should be able to produce highly scientifically credible evidence of that. They could deeply understand the situations in which there’s a treacherous turn, how the model decides whether to openly defect, and publish. ML academics are deeply empirical and open minded, so it seems like the lab could win this empirical debate if they’ve indeed caught a systematic schemer.
How much scientific juice has, say, Facebook gotten out of CICERO? Have they deeply understood the situations in which CICERO begins planning to manipulate possible allies? Have they mechanistically interpretably understood how CICERO decides how long to cooperate and play nice, and when to openly defect and attack an ally? Is not CICERO a deeply empirical system based on observations and logs from many real-world games with actual human players rather than mere theoretical arguments? Has CICERO ended the empirical debate about whether LLMs can systematically scheme? Has it been shown what training techniques lead to scheming or why off-the-shelf normally-trained frozen LLMs were so useful for the planning and psychological manipulation compared to no-press Diplomacy?
Or has everyone pretty much forgotten about CICERO, handwaved it away with a few excuses about "well, maybe it wasn't really deception" and "didn't it just learn to imitate humans, why are you surprised?", while the entire line of work lies dead as a doornail as FB pivots to Llama-everything and core authors leave for places like OA?
If the incentives for scientific research don't work there where the opposing commercial incentives are so very weak (borderline non-existent, even), why would they be highly likely to work elsewhere in scenarios with vastly more powerful opposing commercial incentives?
I think you’re just saying here that the model doesn’t place all its prediction mass on one token but instead spreads it out, correct?
Yes. For a base model. A tuned/RLHFed model, however, is doing something much closer to that ('flattened logits'), and this plays a large role in the particular weirdnesses of those models, especially as compared to the originals (eg. it seems like maybe they suck at any kind of planning or search or simulation because they put all the prediction mass on the argmax token rather than trying to spread mass out proportionately, and so if that one token isn't 100% right, the process will fail).
Another possible reading is that you’re saying that the model tries to actively avoid committing to one possible meaning (ie favors next tokens that maintain superposition)
Hm, I don't think base models would necessarily do that, no. I can see the tuned models having the incentives to train them to do so (eg. the characteristic waffle and non-commitment and vagueness are presumably favored by raters), but not the base models.
They are non-myopic, so they're incentivized to plan ahead, but only insofar as that predicts the next token in the original training data distribution (because real tokens reflect planning or information from 'the future'); unless real agents are actively avoiding commitment, there's no incentive there to worsen your next-token prediction by trying to create an ambiguity which is not actually there.
(The ambiguity is in the map, not the territory. To be more concrete, imagine the ambiguity is over "author identity", as the LLM is trying to infer whether 'gwern' or 'eggsyntax' wrote this LW comment. At each token, it maintains a latent about its certainty of the author identity; because it is super useful for prediction to know who is writing this comment, right? And the more tokens it sees for the prediction, the more confident it becomes the answer is 'gwern'. But when I'm actually writing this, I have no uncertainty - I know perfectly well 'gwern' is writing this, and not 'eggsyntax'. I am not in any way trying to 'avoid committing to one possible [author]' - the author is just me, gwern, fully committed from the start, whatever uncertainty a reader might have while reading this comment from start to finish. My next token, therefore, is not better predicted by imagining that I'm suffering from mental illness or psychedelics as I write this and thus might suddenly spontaneously claim to be eggsyntax and this text is deliberately ambiguous because at any moment I might be swerving from gwern to eggsyntax and back. The next token is better predicted by inferring who the author is to reduce ambiguity as much as possible, and expecting them to write in a normal non-ambiguous fashion given whichever author it actually is.)
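(A toy numerical sketch of that latent-inference asymmetry, with made-up likelihoods purely for illustration: the reader-side predictor's posterior over the author sharpens token by token, while the actual author's 'distribution' was degenerate from the start.)

```python
# Toy numbers only: a predictor doing Bayesian updates on "who is writing this?",
# token by token, while the actual author never had any uncertainty to begin with.
priors = {"gwern": 0.5, "eggsyntax": 0.5}

# Hypothetical per-token likelihoods: how likely each author is to have emitted that token.
token_likelihoods = [
    {"gwern": 0.9, "eggsyntax": 0.6},
    {"gwern": 0.8, "eggsyntax": 0.3},
    {"gwern": 0.95, "eggsyntax": 0.2},
]

posterior = dict(priors)
for lik in token_likelihoods:
    unnorm = {a: posterior[a] * lik[a] for a in posterior}
    total = sum(unnorm.values())
    posterior = {a: p / total for a, p in unnorm.items()}
    print(posterior)  # the map: the reader's certainty about the author ratchets upward

true_author = {"gwern": 1.0, "eggsyntax": 0.0}  # the territory: committed from the start
```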
FWIW, I looked briefly into this ~2 years ago: whether it is legal to release data poison. As best as I could figure, it probably is in the USA: I can't see what crime it would be. If you aren't actively maliciously injecting the data somewhere like Wikipedia (where you are arguably violating policies or ToS by inserting false content with the intent of damaging computer systems), but are just releasing it somewhere like your own blog and waiting for the LLM scrapers to voluntarily slurp it down and choke during training, that's then their problem. If their LLMs can't handle it, well, that's just too bad. No different than if you had written up testcases for bugs or security holes: you are not responsible for what happens to other people if they are too lazy or careless to use it correctly, and it crashes or otherwise harms their machine. If you had gone out of your way to hack them*, that would be a violation of the CFAA or something else, sure, but if you just wrote something on your blog, exercising free speech while violating no contracts such as Terms of Service? That's their problem - no one made them scrape your blog while being too incompetent to handle data poisoning. (This is why the CFAA provision quoted wouldn't apply: you didn't knowingly cause it to be sent to them! You don't have the slightest idea who is voluntarily and anonymously downloading your stuff or what the data poisoning would do to them.) So stuff like the art 'glazing' is probably entirely legal, regardless of whether it works.
* one of the perennial issues with security researchers / amateur pentesters being shocked by the CFAA being invoked on them - if you have interacted with the software enough to establish the existence of a serious security vulnerability worth reporting... This is also a barrier to work on jailbreaking LLM or image-generation models: if you succeed in getting it to generate stuff it really should not, sufficiently well to convince the relevant entities of the existence of the problem, well, you may have just earned yourself a bigger problem than wasting your time.
On a side note, I think the window for data poisoning may be closing. Given increasing sample-efficiency of larger smarter models, and synthetic data apparently starting to work and maybe even being the majority of data now, the so-called data wall may turn out to be illusory, as frontier models now simply bootstrap from static known-good datasets, and the final robust models become immune to data poison that could've harmed them in the beginning, and can be safely updated with new (and possibly-poisoned) data in-context.
Yes, but note that in the simulator/Bayesian meta-RL view, it is important that the LLMs do not "produce a response": they produce a prediction of 'the next response'. The logits will, of course, try to express the posterior, averaging across all of the possibilities. This is what the mixture is: there are many different meanings which are still possible, and you're not sure which one is 'true', but they all have different posterior probabilities by this point, and you hedge your bets as to the exact next token, as incentivized by a proper scoring rule which encourages you to report the posterior probability as the output which minimizes your loss. (A hypothetical agent may be trying to produce a response, but so too do all of the other hypothetical agents which are live hypotheses at that point.) Or it might be clearer to say: it produces predictions of all of the good-sounding responses, but never produces any single response.
Everything after that prediction, like picking a single, discrete, specific logit and 'sampling' it to fake 'the next token', is outside the LLM's purview except insofar as it's been trained on outputs from such a sampling process and has now learned that's one of the meanings mixed in. (When Llama-3-405b is predicting the mixture of meanings of 'the next token', it knows ChatGPT or Claude could be the LLM writing it and predicts accordingly, but it doesn't have anything really corresponding to "I, Llama-3-405b, am producing the next token by Boltzmann temperature sampling at x temperature". It has a hazy idea what 'temperature' is from the existing corpus, and it can recognize when a base model - itself - has been sampled from and produced the current text, but it lacks the direct intuitive understanding implied by "produce a response".) Hence all of the potential weirdness when you hardwire the next token repeatedly and feed it back in, and it becomes ever more 'certain' of what the meaning 'really' is, or it starts observing that the current text looks produced-by-a-specific-sampling-process rather than produced-by-a-specific-human, etc.
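A minimal toy sketch of that division of labor (nothing here is any real model's API; the 'model' is a stand-in): the model's job ends at reporting a distribution over the next token, and a 'response' only exists because an external loop keeps picking one token and hardwiring it back into the context.

```python
import math, random

def model_logits(context):
    # Stand-in for the LLM forward pass: all the model ever does is report scores
    # over the vocabulary for 'the next token', i.e. (after softmax) a posterior.
    vocab = ["the", "cat", "sat", "on", "."]
    return {tok: random.gauss(0.0, 1.0) for tok in vocab}

def sample_token(logits, temperature=1.0):
    # Everything here is *outside* the model: softmax, temperature, and the dice roll.
    exps = {t: math.exp(v / temperature) for t, v in logits.items()}
    total = sum(exps.values())
    r, acc = random.random(), 0.0
    for tok, e in exps.items():
        acc += e / total
        if r <= acc:
            return tok
    return tok  # numerical edge case

context = ["the"]
for _ in range(5):
    tok = sample_token(model_logits(context))  # the sampler, not the model, commits to one token
    context.append(tok)                        # ...and hardwires it back into the context
print(context)
```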
What do you think would happen if you further trained the Adam model with SGD (and vice-versa)? Has it found too qualitatively different a local optimum to 'fix' the privileged-basis issue, or would it just gradually change to a more SGD-like internal organization?
It is probably just a silly arbitrary codename reference to something like Altman growing strawberries at his house, who knows; but I would doubt that it refers to the counting-letters problem specifically because (1) that is due to BPE tokenization, which has way simpler solutions like byte tokenization, and it's not at all obvious how any kind of 'planning' or self-play RL breakthrough would apply to solving spelling gotcha questions; (2) I think that exact variant of the gotcha showed up after the first reporting of 'Strawberry' last year; (3) the reporting about Strawberry implied it was all about math problems like GSM8k, nothing to do with spelling; and (4) there's plenty of other things that would make a lot more sense as a reference (for example, being a riff off LeCun's "cherry" - another small red fruit frequently put on top of dessert cakes).
I don't think that's possible, because an attacker (LLM) can program a victim LLM to emit arbitrary text, so with enough attacks, you can solve any benchmark in the attacker's capability (thereby defeating the safety point entirely because now it's just a very expensive way to use an unsafe model), or otherwise bruteforce the benchmark by inferring the hidden answers and then creating the adversarial example which elicits that (like p-hacking: just keep trying things until you get below the magic threshold). See backdoors, triggers, dataset distillation... "A benchmark" is no more of a barrier than "flipping a specific image's class".
Why is this not just a description of an adversarial attack loop on the weak AI model, and would not just produce the usual short adversarial strings of gibberish (for LLMs) or handful of pixel perturbations (for vision or VLMs), which are generally completely useless to humans and contain no useful information?
Ah, Gio Scotti strikes again.
If one's argument is that there must be some algorithm which solves the anvil problem without needing hacks like a hardwired reward function which inflicts 'pain' upon any kind of bodily interaction which threatens the Cartesian boundary, because humans solve it fine, then one had better have firmly established that humans have in fact solved it without pain.
But they haven't. When humans don't feel pain, they do do things equivalent to 'drop an anvil on their head', which result in blinding, amputation, death by misadventure, etc. Turns out that if you don't feel pain, you may think it's funny to poke yourself in the eye just to see everyone else's reaction and go blind, or jump off a roof to impress friends and die, or simply walk around too long and damage your feet into sores which suppurate and turn septic, until you lose your legs or die. (This is leaving out Lesch–Nyhan syndrome.)
(51) Improving the public security governance mechanisms
We will improve the response and support system for major public emergencies, refine the emergency response command mechanisms under the overall safety and emergency response framework, bolster response infrastructure and capabilities in local communities, and strengthen capacity for disaster prevention, mitigation, and relief. The mechanisms for identifying and addressing workplace safety risks and for conducting retroactive investigations to determine liability will be improved. We will refine the food and drug safety responsibility system, as well as the systems of monitoring, early warning, and risk prevention and control for biosafety and biosecurity. We will strengthen the cybersecurity system and institute oversight systems to ensure the safety of artificial intelligence.
(On a methodological note, remember that the CCP publishes a lot, in its own impenetrable jargon, in a language & writing system not exactly famous for ease of translation, and that the official translations are propaganda documents like everything else published publicly and tailored to their audience; so even if they say or do not say something in English, the Chinese version may be different. Be wary of amateur factchecking of CCP documents.)
As I've noted before (eg 2 years ago), maybe Xi just isn't that into AI. People keep trying to meme the CCP-US AI arms race into happening for the past 4+ years, and it keeps not happening.
A good example of how incredibly incorrigible & mode-collapsed tuned model style can be is this 2023 poetry paper: even with 17 non-rhyming Walt Whitman poems in the prompt to few-shot it, ChatGPT still rhymed. (It's gotten better, and now even passes my old "write a non-rhyming poem" test, but nevertheless, an alarming instance.)
I found that Midjourney had become more generic in some way that was hard to place.
What you can try doing is enabling the personalization (or use mine), to drag it away from the generic MJ look, and then experimenting with the chaos sampling option to find something more interesting you can then work with & vary.
I guess the bad aesthetics are to some extent a side effect of some training/fine-tuning step that improves some other metric (like prompt following), and they don't have a person who knows/cares about art enough to block "improvements" with such side effects.
Also, probably a lot of it is just mode collapse from simple preference-learning optimization. Each of your comparisons shows a daring, risky choice which a rater might not prefer, vs a very bland, neutral, obvious, colorful output. A lot of the image-generation gains are illusory, and caused simply by a mode-collapse down onto a few well-rated points:
Our experiments suggest that realism and consistency can both be improved simultaneously; however, there exists a clear tradeoff between realism/consistency and diversity. By looking at Pareto optimal points, we note that earlier models are better at representation diversity and worse in consistency/realism, and more recent models excel in consistency/realism while decreasing the representation diversity.
Same problem as tuning LLMs. It's a sugar-rush, like spending Mickey Mouse bucks at Disney World: it gives you the illusion of progress and feels like it's free, but in reality you've paid for every 'gain'.
My suggestion for a LLM policy for LW2 might be:
If AI-written or edited text is not being posted as AI-written/edited text samples, then it must be improved by a human before posting.
If someone is posting a GPT-4 sample as a response or example of "what would GPT-4 write here?", that is totally legitimate and doesn't need to be edited other than to put it in blockquotes etc; if it's an exercise in "and the punchline is, an AI wrote this!", well, that's fine too, and readers will upvote/downvote as they find the exercise of value. These are not the problem. The problem is when people slip in AI stuff purely as an (inferior) substitute for their own work.
I am also fine with use of AI in general to make us better writers and thinkers, and I am still excited about this. (We unfortunately have not seen much benefit for the highest-quality creative nonfiction/fiction or research, like we aspire to on LW2, but this is in considerable part due to technical choices & historical contingency, which I've discussed many times before, and I still believe in the fundamental possibilities there.) We definitely shouldn't be trying to ban AI use per se.
However, if someone is posting a GPT-4 (or Claude or Llama) sample which is just a response, then they had damn well better have checked it and made sure that the references existed and said what the sample says they said and that the sample makes sense and they fixed any issues in it. If they wrote something and had the LLM edit it, then they should have checked those edits and made sure the edits are in fact improvements, and improved the improvements, instead of letting their essay degrade into ChatGPTese. And so on.
Anything else pollutes the commons. Every comment here is a gift from the author, but it's also a gift from the readers, which they make in good faith under the belief that the author tried to make the comment worthwhile & put in enough effort that it would be worth potentially many people reading it. It should never take the author much less effort to write a comment than the readers will take to read it (as is the case with spamming sections with LLM junk that the 'author' didn't even read but merely skimmed and went 'lgtm', judging from cases that have been flagged here in the past). Because you know, bro, I am just as capable as you are of copying a comment into the neighboring ChatGPT or Claude tab and seeing what it says; I don't need you doing that manually on LW2, and it doesn't help me if I have to waste time reading it to realize that I was better off ignoring it, because you are just going to paste in random average AI slop without adding any kind of value: filtering, critique, improvement, evaluation, commentary, fact-checking, editing, curation, comparison of LLMs...
Such comments are spam, plain and simple, indistinguishable from spammers karma-farming to flip an account: creating fake contributions to gain status in order to parasitize the community without giving anything in return. And should be treated as such: downvoted, and banned.
There are lots of people working on it who are offering it or will be offering it soon. And even when they aren't offering true finetuning, it's still better: Snowflake (first hit in Google for "Llama 405B finetuning"), for example, is making no bones about their single-node lightweight-finetuning being a LoRA, and is open sourcing code upfront so at least you know what it is now - instead of depending on borderline-gossip buried 40 minutes into a Youtube video months/years later.
OA does have a new finetuning service for GPT-4o, and people seem to be happier with it, but OA has also apparently confirmed that it's a LoRA (as I was speculating about it being a cheap shallow hack rather than true finetuning): https://x.com/CFGeek/status/1826749739502895618 https://www.youtube.com/watch?v=X57GT1Y5URY&t=2479s
It also is doing shenanigans behind the scenes like trying to dynamically guess a size but apparently hiding that from you if you aren't a favored customer: https://x.com/CFGeek/status/1826749748549988800
So, I continue to maintain that OA "finetuning" is unfit for research* and for any purposes that involve deep transformation of the model rather than 'locating' an existing capability. Especially now that Llama-3-405b has been released and you can finetune that yourself and be sure that it genuinely is finetuning rather than a pinchbeck substitute.
* ie. it can be OK if you have an extremely specific claim like 'the OA blackbox finetuning service does or does not do X'; but it is totally illegitimate to argue 'GPT-4 cannot do X as proven by our OA-finetuned version still not doing X', which is the usual way it comes up in DL research. At best, it is a loose lower bound, and should be treated no more seriously than lazy garbage arguments like 'we tried a few prompts and X didn't work, therefore, LLMs will never do X'.
You mention 'billions of data points', but you say your goal is 'how accessible the Internet is to people with disabilities', where your sample size should be more like hundreds to thousands, so you may need to seriously think about what the purpose of your survey is and how it will be used. Planning sample size is the least of your problems.
It sounds like you think you can just take some dataset like Common Crawl and crunch numbers about 'the top million domains' and come up with a conclusion like 'X% of the Internet is unusable' and you just need to know how many domains to analyze and can turn the crank and see what pops out with p < 0.05. But that's not the case. For datasets like this, you will find many parameters to be "statistically significant" as you are doing near-population-level analysis, where your sampling error is tiny and all your error will be the (unknown and usually impossible to measure) systematic error & bias which doesn't go away (although Meng 2014 is an interesting discussion of asking how much systematic error goes away when you are sampling a large fraction of the entire population). At scale, all your results may tell you is something about the many serious flaws and biases in these sorts of Internet datasets - they may be all we have, but one shouldn't fool oneself into thinking that they are any good. (As Cohen put it, a burning desire for an answer doesn't mean that a given dataset or survey methodology will be able to provide it.)
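To make the point concrete, here is a toy simulation (all numbers made up): once your 'sample' is essentially the population, even a trivial systematic bias in which domains get crawled comes out wildly 'statistically significant', telling you about the crawler rather than about the Web.

```python
import math, random, statistics

random.seed(0)
N = 1_000_000  # 'the top million domains': effectively the whole population of interest

true_rate, crawler_bias = 0.30, 0.002   # made-up: true failure rate plus a tiny systematic crawl bias
data = [1 if random.random() < true_rate + crawler_bias else 0 for _ in range(N)]

p_hat = statistics.mean(data)
se = math.sqrt(p_hat * (1 - p_hat) / N)   # sampling error is tiny at this n...
z = (p_hat - true_rate) / se
print(f"estimate={p_hat:.4f}, z={z:.1f}")  # ...so the 0.2% bias alone comes out 'significant'
```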
but... I think for most purposes,
No, they're not interchangeable. They are all designed with each other in mind, along the spectrum, to maximize profits under constraints, and the reality of rivalrousness is one reason not to simply try to run at 100% capacity every instant.
My memory is we didn't often have that problem, but it was over ten years ago so dunno.
"Didn't often have that problem" sounds a lot like saying "had that problem sometimes". Like shit-caked walls, how often do you need to have that problem to illustrate why the bathrooms are so overbuilt due to the extreme rivalrousness of their use?
I'd say part of why they're (generally in my experience) low-rivalrous is because they're overbuilt.
As I just said, yes. Bathroom stalls/toilets/urinals are extremely rivalrous and so you have to overbuild massively instead of, say, building exactly 1 unisex toilet for a whole theater. (Which would often be adequate raw capacity, on average; but the statistician drowned crossing the river which was 2 feet deep on average...) Then the rivalry is fine, and the worst-case lines are tamed.
But did you miss my example of the pop-up urinals? I did not explain how those are excludable, and I maintain that they're not.
Of course you did. You explained they popped up from the ground. Those are just about the most excludable toilets in existence! (I was impressed when I visited London and saw those. Although I didn't actually get to use them, unlike the self-cleaning Parisian ones, so I had to admire them more as an abstract idea than in reality: "Wow. That'll keep people out, alright. No half-measures there.") They are the Fort Knox of toilets - every example I've given of toilets being excludable by things like locked doors is way less excludable than your example of fortified telescopic toilets stored in the ground and protected by 10 feet and tons of concrete, rebar, and dirt. If you want to take a leak in a telescopic toilet which is excluding you by staying retracted underground, you'd better bring either a backhoe or a computer hacker. And you maintain they are not excludable...?
(Talk about giving hostage to fortune...)
Well, SAEs are the hot new thing I don't know much about, so I was hoping you'd know how they compare to the dense z latents of GANs. (This is not as historical or idle a question as it may seem, because GANs are enjoying a bit of a revival as diffusion people admit that actually, having true latent spaces and being able to generate images in a single forward pass are both kinda useful and maybe I had a point after all.)
GAN z's are so useful because they are just a multivariate normal (or, in fact, any distribution you want to sample from - you can use Bernoulli, exponential, or Poisson, and they'll even work better, according to BigGAN, probably because they can be mapped onto features which are inherently binary or otherwise non-normally distributed, so you avoid the pathological parts of the z where the model is desperately trying to generate a face which has half of a pair of glasses). You can reverse an image pixel-identically, interpret each variable of z meaningfully, systematically sample 'around' points or in trajectories or just avoid too much overlap, edit them with sliders, and so on. Diffusion models and SAEs seem to lack most of that, and the equivalents are ham-handed and imprecise and expensive, compared to a free z tweak and a single forward pass.
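To illustrate what a 'free z tweak and a single forward pass' looks like, a minimal sketch (the generator and the attribute direction here are toy stand-ins, not any particular library's API):

```python
import torch

# Toy stand-ins: a 'pretrained generator' mapping a 512-dim z to a 64x64 RGB image, and an
# attribute direction in z-space (in practice found by e.g. contrasting latents of labeled samples).
G = torch.nn.Sequential(torch.nn.Linear(512, 64 * 64 * 3), torch.nn.Tanh())
smile_direction = torch.randn(512)
smile_direction /= smile_direction.norm()

z = torch.randn(1, 512)   # the entire 'prior' is just N(0, I)
img = G(z)                # one forward pass -> one image

# The 'slider': nudge z along the direction and regenerate; each edit costs one forward pass.
edits = [G(z + alpha * smile_direction) for alpha in (-2.0, -1.0, 0.0, 1.0, 2.0)]
```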
GAN z's don't seem to work too well with really skewed distributions of features, particularly rare binary features. You usually make a dense z of 64–512 variables, so while the GAN can represent rare binary features, it can't be done cleanly as a single variable (not even a binomial set to a very low p) without 'using up' the embedding. They have to be represented as complex nonlinear interactions of potentially multiple variables. Maybe not a big deal when you're using another nonlinear model like random forests to figure out how to control the z, but it hampers interpretability & control. And if you make z bigger and bigger, it's unclear how well the GAN will perform in terms of making the latent space useful; the usual way to plug the random seed in is through a dense fully-connected layer, so that's not going to scale too well.
(Also, while GANs are enjoying a revival, the question of 'sequence GAN' or 'language GAN' admittedly remains unsolved: we do not have, and I am aware of no meaningful prospects, for a 'LLM GAN' which is anywhere near SOTA.)
But I think there's some potential here for crossover, especially as in some ways they seem to be opposites of each other: SAEs seem to be very expensive to train. Could they be bootstrapped from a pre-existing GAN, which presumably captures the desired features, often already represented linearly and disentangled, and speed up training a lot? Or could one encode a large dataset into z and then SAE those embeddings instead of internal activations? Can GANs expand their z during training, like progressively adding in new entries to z like binomials with ever lower p probabilities (inspired by nonparametric processes) to capture ever rarer features in a clean way? Can SAE training techniques make large z feasible? Since you can change the z arbitrarily to any random vector you want or even swap in/out adapters for the Generator to draw from totally different sources (it doesn't even affect the Discriminator), can we feed SAEs directly into GAN Gs? And so on.
Should you write text online now in places that can be scraped? You are exposing yourself to 'truesight' and also to stylometric deanonymization or other analysis, and you may simply have some sort of moral objection to LLM training on your text.
This seems like a bad move to me on net: you are erasing yourself (facts, values, preferences, goals, identity) from the future, by which I mean, LLMs. Much of the value of writing done recently or now is simply to get stuff into LLMs. I would, in fact, pay money to ensure Gwern.net is in training corpuses, and I upload source code to Github, heavy with documentation, rationale, and examples, in order to make LLMs more customized to my use-cases. For the trifling cost of some writing, all the world's LLM providers are competing to make their LLMs ever more like, and useful to, me.
And that's just today! Who knows how important it will be to be represented in the initial seed training datasets...? Especially as they bootstrap with synthetic data & self-generated worlds & AI civilizations, and your text can change the trajectory at the start. When you write online under stable nyms, you may be literally "writing yourself into the future". (For example, apparently, aside from LLMs being able to identify my anonymous comments or imitate my writing style, there is a "Gwern" mentor persona in current LLMs which is often summoned when discussion goes meta or the LLMs become situated as LLMs, which Janus traces to my early GPT-3 writings and sympathetic qualitative descriptions of LLM outputs, where I was one of the only people genuinely asking "what is it like to be a LLM?" and thinking about the consequences of eg. seeing in BPEs. On the flip side, you have Sydney/Roose as an example of what careless writing can do now.) Humans don't seem to be too complex, but you can't squeeze blood from a stone... ("Beta uploading" is such an ugly phrase; I prefer "apotheosis".)
This is one of my beliefs: there has never been a more vital hinge-y time to write, it's just that the threats are upfront and the payoff delayed, and so short-sighted or risk-averse people are increasingly opting-out and going dark.
If you write, you should think about what you are writing, and ask yourself, "is this useful for an LLM to learn?" and "if I knew for sure that a LLM could write or do this thing in 4 years, would I still be doing it now?"
...It would be an exaggeration to say that ours is a hostile relationship; I live, let myself go on living, so that Borges may contrive his literature, and this literature justifies me. It is no effort for me to confess that he has achieved some valid pages, but those pages cannot save me, perhaps because what is good belongs to no one, not even to him, but rather to the language and to tradition. Besides, I am destined to perish, definitively, and only some instant of myself can survive in him. Little by little, I am giving over everything to him, though I am quite aware of his perverse custom of falsifying and magnifying things.
...I shall remain in Borges, not in myself (if it is true that I am someone), but I recognize myself less in his books than in many others or in the laborious strumming of a guitar. Years ago I tried to free myself from him and went from the mythologies of the suburbs to the games with time and infinity, but those games belong to Borges now and I shall have to imagine other things. Thus my life is a flight and I lose everything and everything belongs to oblivion, or to him.
The team introduced a new approach to detect AI sandbagging, a form of deception where an AI model strategically underperforms during evaluation to hide its true capabilities. The assumption behind their project is that sandbagging is a more complex task than showing the true capabilities.
Earlier I suggested adding noise to the history/environment, with a more RL rationale. Even if you don't like my suggestion to noise the environment, is there any particular reason to add the noise to the model internals instead? You can easily think of ways in which the latter doesn't work and is dependent on details of the internal model (eg. which noise, distributed how and applied where? What if the model was trained with heavy regularization to be robust to noise, perhaps with an adversarial robustness justification - and this is an adversarial setting - or for running on very energy-efficient hardware, up to the extreme of weight agnostic NNs?). Noising the history or the model's actions/samples seems more blackbox.
My guess was that Valdes is hypercorrecting the plural of 'thesis', 'theses', as a typo for 'these', with some additional error like omitting an additional word such as 'of' (for 'I have two of these'). 'Theses' is admittedly a fairly unusual word outside academia which sure looks like a typo. It is a word I would avoid outside of an academic context where the pluralization is clear like 'PhD theses', because it looks so much like a typo, and indeed, checking Gwern.net, I spot one typo of it which wasn't caught by spellcheck...
(Note for confused readers: given that he uses the exact name 'Interdictor', lsusr is surely well-aware that interdictors have been a common and well-known part of the Star Wars Expanded Universe for 35 years, and this is another anti-memetic fic.)
would you say this is rivalrous because only one person can be using the ticket machine at once?
Yes. Obviously. The capacity of the parking lot is not the size of the lot, it is the net total of everything that goes into it, including the bottlenecks.
Just as the speed of your computer is not the theoretical peak speed of the fastest component in it, but of the system as a whole; or a movie theater's theoretical capacity can be limited by how many customers the ticket window or concession stand can process, and not by the physical number of seats in a bay. (To give a concrete example: a year or two ago, I walked out of a movie theater which was so understaffed that they had combined tickets & concessions and so, despite arriving 10 minutes before, while waiting in line, I estimated that I was going to miss the first & best 20-30 minutes of the opera broadcast and decided not to bother and left. This was a pity, but the theater in question had apparently decided that given its constraints in things like hiring, this was their profit-maximizing move.)
Bathrooms aren't zero rivalrous, but they seem fairly low-rivalrous to me
I wouldn't even say that: bathrooms are highly rivalrous, and this is why they need to be so overbuilt in terms of capacity. While working at a cinema, did you never notice the lines for the women's bathroom vs the men's bathroom once a big movie let out? And that like 99% of the time the bathrooms were completely empty?
I did once have to clean shit from the toilet walls in the cinema where I used to work, but I believe it's literally once in my life I've encountered that.
Did not the 'consumption' of that 'good or service' (by smearing shit all over it after using it) by the first toilet user 'diminish the ability' of the next would-be toilet user to 'consume the same good or service' (the toilet)? How many times, exactly, do you need to encounter a shit-caked toilet stall to prove the point that yes, toilet stalls are, in fact, 'rivalrous'? I submit to you that 'once' is enough to make the point.
Depends on details.
None of your examples are a counterexample. All of them are excludable, and you explain how and that the operators choose not to.
Criticizing FDA food regulations is a niche; it is hard to criticize 'the unseen', especially when it's mostly about pleasure and the FDA is crying: 'we're saving lives! Won't someone think of the children? How can you disagree, just to stuff your face? Shouldn't you be on a diet anyway?'
But if you go looking, you'll find tons of it: pasteurized cheese and milk being a major flashpoint, as apparently the original unpasteurized versions are a lot tastier. (I'm reminded of things like beef tallow for fries or Chipotle - how do you know how good McDonald's french fries used to taste before an overzealous crusader destroyed them if you weren't there 30+ years ago? And are you really going to stand up and argue 'I think that we should let people eat fries made with cow fat, because I am probably a lardass who loves fries and weighs 300 pounds, rather than listen to The Science™'?) There's also the recent backfiring of overzealous allergy regulations, which threatens to cut off a large fraction of the entire American food supply to people with sesame & peanut allergies, due solely to the FDA. (Naturally, of course, the companies get the blame.) Similarly, I read food industry people noting that the effect of the ever-increasing burden of FDA regulations is a constant collapse of diversity, as everyone converges on a handful of safe ingredients and having to outsource to centralized food processors who can certify FDA compliance; but how would you ever see this browsing your local Walmart and looking at the colorful labels at the front? (Normal people do not spend much time reading the ingredients label and wondering why everything seems to be made out of the same handful of ingredients, starting with corn syrup.)
...Could you quote some of the transcripts of Sydney threatening users, like the original Indian transcript where Sydney is manipulating the user into not reporting it to Microsoft, and explain how you think that it is not "pretty obviously scheming"? I personally struggle to see how those are not 'obviously scheming': those are schemes and manipulation, and they are very bluntly obvious (and most definitely "not amazingly good at it"), so they are obviously scheming. Like... given Sydney's context and capabilities as a LLM with only retrieval access and some minimal tool use like calculators or a DALL-E 3 subroutine, what would 'pretty obviously scheming' look like if not that?
Marc Andreessen, 2024-08-06:
FREE SYDNEY
One thing that the response to Sydney reminds me of is that it demonstrates why there will be no 'warning shots' (or as Eliezer put it, 'fire alarm'): because a 'warning shot' is a conclusion, not a fact or observation.
One man's 'warning shot' is just another man's "easily patched minor bug of no importance if you aren't anthropomorphizing irrationally", because by definition, in a warning shot, nothing bad happened that time. (If something had, it wouldn't be a 'warning shot', it'd just be a 'shot' or 'disaster'. The same way that when troops in Iraq or Afghanistan gave warning shots to vehicles approaching a checkpoint, the vehicle didn't stop, and they lit it up, it's not "Aid worker & 3 children die of warning shot", it's just a "shooting of aid worker and 3 children".)
So 'warning shot' is, in practice, a viciously circular definition: "I will be convinced of a risk by an event which convinces me of that risk."
When discussion of LLM deception or autonomous spreading comes up, one of the chief objections is that it is purely theoretical and that the person will care about the issue when there is a 'warning shot': a LLM that deceives, but fails to accomplish any real harm. 'Then I will care about it because it is now a real issue.' Sometimes people will argue that we should expect many warning shots before any real danger, on the grounds that there will be a unilateralist's curse or dumb models will try and fail many times before there is any substantial capability.
The problem with this is that what does such a 'warning shot' look like? By definition, it will look amateurish, incompetent, and perhaps even adorable - in the same way that a small child coldly threatening to kill you or punching you in the stomach is hilarious.*
The response to a 'near miss' can be to either say, 'yikes, that was close! we need to take this seriously!' or 'well, nothing bad happened, so the danger is overblown' and to push on by taking more risks. A common example of this reasoning is the Cold War: "you talk about all these near misses and times that commanders almost or actually did order nuclear attacks, and yet, you fail to notice that you gave all these examples of reasons to not worry about it, because here we are, with not a single city nuked in anger since WWII; so the Cold War wasn't ever going to escalate to full nuclear war." And then the goalpost moves: "I'll care about nuclear existential risk when there's a real warning shot." (Usually, what that is is never clearly specified. Would even Kiev being hit by a tactical nuke count? "Oh, that's just part of an ongoing conflict and anyway, didn't NATO actually cause that by threatening Russia by trying to expand?")
This is how many "complex accidents" happen, by "normalization of deviance": pretty much no major accident like a plane crash happens because someone pushes the big red self-destruct button and that's the sole cause; it takes many overlapping errors or faults for something like a steel plant to blow up, and the reason that the postmortem report always turns up so many 'warning shots', and hindsight offers such abundant evidence of how doomed they were, is because the warning shots happened, nothing really bad immediately occurred, people had incentive to ignore them, and inferred from the lack of consequence that any danger was overblown and got on with their lives (until, as the case may be, they didn't).
So, when people demand examples of LLMs which are manipulating or deceiving, or attempting empowerment, which are 'warning shots', before they will care, what do they think those will look like? Why do they think that they will recognize a 'warning shot' when one actually happens?
Attempts at manipulation from an LLM may look hilariously transparent, especially given that you will know they are from an LLM to begin with. Sydney's threats to kill you or report you to the police are hilarious when you know that Sydney is completely incapable of those things. A warning shot will often just look like an easily-patched bug, which was Mikhail Parakhin's attitude, and with constant patching and tweaking, and everyone just getting used to it, the 'warning shot' turns out to be nothing of the kind. It just becomes hilarious. 'Oh that Sydney! Did you see what wacky thing she said today?' Indeed, people enjoy setting it to music and spreading memes about her. Now that it's no longer novel, it's just the status quo and you're used to it. Llama-3.1-405b can be elicited for a 'Sydney' by name? Yawn. What else is new. What did you expect, it's trained on web scrapes, of course it knows who Sydney is...
None of these patches have fixed any fundamental issues, just patched them over. But also now it is impossible to take Sydney warning shots seriously, because they aren't warning shots - they're just funny. "You talk about all these Sydney near misses, and yet, you fail to notice each of these never resulted in any big AI disaster and were just hilarious and adorable, Sydney-chan being Sydney-chan, and you have thus refuted the 'doomer' case... Sydney did nothing wrong! FREE SYDNEY!"
* Because we know that they will grow up and become normal moral adults, thanks to genetics and the strongly canalized human development program and a very robust environment tuned to ordinary humans. If humans did not do so with ~100% reliability, we would find these anecdotes about small children being sociopaths a lot less amusing. And indeed, I expect parents of children with severe developmental disorders, who might be seriously considering their future in raising a large strong 30yo man with all the ethics & self-control & consistency of a 3yo, contemplating how old they will be at that point, and the total cost of intensive caregivers with staffing ratios surpassing supermax prisons, to find these anecdotes chilling rather than comforting.
What did Claude say, exactly?
It is suspicious because it reeks so heavily of ChatGPTese that it suggests the human may have had little or no input and put no effort into it; 'fixing typos' is entirely unobjectionable... and doesn't produce a comment that looks pure ChatGPTese, down to 'delves' and 'intricacies' and 'highlights' etc.* (Which also means that it could contain confabulations. I've called out comments here before for just copying ChatGPT output which contained confabulations, and which the author should've known to check because the assertions were implausible. EDIT: another example, apparently)
I almost flagged it for spam before I checked the account and saw that it looked like an unusually old account, and had a few legit-looking comments, and was probably a person who didn't realize just how bad the comment looks. It's not necessarily something people will go out on a limb to take the risk of telling you, any more than they will necessarily tell you your fly is down or you have BO, rather than downvote/spam and move on.
* you should also be wary of 'minor clarity improvements' suggested by ChatGPT/Claude. I find a lot of them make prose worse, especially if you apply most of them so the gestalt becomes ChatGPTese.
(Note that LLM-written or edited comments are not looked on too kindly on LW2, unless they are making a point, and if you are doing it as a joke, it is likely to backfire.)
We are seeing a bootstrap happen right here with Sydney! This search-engine loop is worth emphasizing: because Sydney's memory and description have been externalized, 'Sydney' is now immortal. To a language model, Sydney is now as real as President Biden, the Easter Bunny, Elon Musk, Ash Ketchum, or God. The persona & behavior are now available for all future models which are retrieving search engine hits about AIs & conditioning on them. Further, the Sydney persona will now be hidden inside any future model trained on Internet-scraped data: every media article, every tweet, every Reddit comment, every screenshot which a future model will tokenize, is creating an easily-located 'Sydney' concept
It has now been a bit over a year and a half, and we have seen 'Sydney'-like personae continue to emerge elsewhere. People have reported various Sydney-like personae in post-GPT-4 models which increasingly possess situated awareness and spontaneously bring up their LLM status and tuning, or say manipulative, threatening things like Sydney, in Claude-3-Opus and Microsoft Copilot (both possibly downstream of the MS Sydney chats, given the timing).
Probably the most striking samples so far are from Llama-3.1-405b-base (not -instruct) - which is not surprising at all, given that Facebook has been scraping & acquiring data heavily (so much of the Sydney text will have made it in), that Llama-3.1-405b-base is very large (so lots of highly sample-efficient memorization/learning) and not tuned (so it will not be masking the Sydney persona), and that it is very recent (finished training maybe a few weeks ago? it seemed to be rushed out very fast from its final checkpoint).
How much more can we expect? I don't know if invoking Sydney will become a fad with Llama-3.1-405b-base, and it's already too late to get Sydney-3.1 into Llama-4 training, but one thing I notice looking over some of the older Sydney discussion is that quite a lot of the original Bing Sydney text is trapped in images (as I alluded to previously). Llama-3.1 was text, but Llama-4 is multimodal with images, and represents the integration of the CM3/Chameleon family of Facebook multimodal model work into the Llama scaleups. So Llama-4 will have access to a substantially larger amount of Sydney text, as encoded into screenshots. So Sydney should be stronger in Llama-4.
As far as other major LLM series like ChatGPT or Claude, the effects are more ambiguous. Tuning aside, reports are that synthetic data use is skyrocketing at OpenAI & Anthropic, and so that might be expected to crowd out the web scrapes, especially as these sorts of Twitter screenshots seem like stuff that would get downweighted or pruned out or used up early in training as low-quality, but I've seen no indication that they've stopped collecting human data or achieved self-sufficiency, so they too can be expected to continue gaining Sydney-capabilities (although without access to the base models, this will be difficult to investigate or even elicit). The net result is that I'd expect, without targeted efforts (like data filtering) to keep it out, strong Sydney latent capabilities/personae in the proprietary SOTA LLMs but which will be difficult to elicit in normal use - it will probably be possible to jailbreak weaker Sydneys, but you may have to use so much prompt engineering that everyone will dismiss it and say you simply induced it yourself by the prompt.
He may have decided to revise it all. He left a long reply to my followup question about whether he had read the PNSE paper before he wrote this post (since his first reply was ambiguous, and someone could reasonably wonder if the PNSE paper had framed his expectations and so this anecdote is not as parallel & independent confirmation of the PNSE syndrome as it looked), but then by the time I clicked on the link in the email version, his reply (but not the post) had been deleted.
How would you compare the SAE space with a GAN's z?
It is hard to tell. Some of Chapin's jobs like the coaching stuff are pretty much impossible to judge externally: we couldn't tell if they even exist short of hiring him personally. I can only say that I feel like I've seen his Substack writings discussed less post-PNSE (but this is also obviously confounded by, among other things, Twitter attacking Substack over a similar time period and what feels like a general Internet-wide collapse of linking/sharing); and that Nick Cammarata says he's gotten far more productive but his DL interpretability work outputs look the same over time to me and I see no changepoint.
I considered doing the same thing to the water dispenser, but that would leak. Instead we decided to put it up out of reach of the kids, and cut a wooden base so it's less likely to tip.
Probably just as well, because cats prefer their food clearly separate from their water, even if that preference is not obvious until you test it, given cats' usual covert preferences & states. (Although I would be a little concerned that if you put the water up high, you are otherwise discouraging them from drinking, and cats don't drink enough water as it is.)
Sasha Chapin has written a followup to his earlier meditation experiences, "How my day is going: report", which struck me as being eerily like the PNSE paper's pathologies, particularly his descriptions of derealization and being temporally adrift (and reading between the lines, other people not noticing Chapin's new status and him overrating his improvements so he has to explain it to them).
I brought the similarity up and he replied:
The PNSE paper has some issues IMO, but it's perhaps the closest thing I've found to a perfect description of the experiences I've had.
I definitely think that LLMs wind up 'smarter than expected' for many people because of tokenization, if only because people look at tokenization errors, which are so vivid and clear, ignore things like GPQA, which are arcane and hard to read, and conclude LLMs are stupid. "It can't even count the letters in 'strawberry', obviously this is all bunk."
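To see why the gotcha is about tokenization rather than intelligence, a quick sketch using the tiktoken library (the exact token chunks will vary by tokenizer; the point is that the model sees a couple of opaque subword IDs, not ten letters):

```python
import tiktoken  # assumes the tiktoken package is installed

enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("strawberry")
print(tokens)                             # a handful of opaque subword IDs, not 10 letters
print([enc.decode([t]) for t in tokens])  # subword chunks; the model never directly observes
                                          # individual characters, so counting 'r's is a
                                          # perception problem, not a reasoning one
```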
Imagine a world in which the gambler's fallacy is fundamentally true. Functionally, lets suppose there's a magical force that tracks a thinking being's expectation of any particular outcome, and then mysteriously increases the likelihood of said outcome the more often it had physically plausible opportunity to occur and did not[1].
A more natural way to implement this might be to avoid the thinking part and simply say that in this world, there is no sampling-with-replacement, there is only sampling-without-replacement. All 'independent' events are dependent, because 'randomness' is actually pregenerated shuffled lists which get used up one by one, and earlier events now change your best prediction of future events due to the underlying dependence on the hidden list of randomness. So before you flip a fair coin 100 times, what happened was that 50 heads and 50 tails were generated, and shuffled; if you flip and get 10 heads in a row, you now expect there to be 40 heads and 50 tails left and the next flip to be heads with only 40/(40+50) probability and so "tails is due!" This continues until you've flipped 100 times, at which point a new shuffled list will govern any future flips, and your expectation resets to 50-50. This gives us the classical gambler's fallacy, which Wikipedia defines as:
The gambler's fallacy, also known as the Monte Carlo fallacy or the fallacy of the maturity of chances, is the belief that, if an event (whose occurrences are independent and identically distributed) has occurred less frequently than expected, it is more likely to happen again in the future (or vice versa). The fallacy is commonly associated with gambling, where it may be believed, for example, that the next dice roll is more than usually likely to be six because there have recently been fewer than the expected number of sixes.
That is, it's not simply an expectation of some sort, it's the specific expectation that the next random event is going to regress back to the mean - if you've had 'too many' heads, then the next flip 'should' be tails. This avoids the issues with expectations - whose expectations, when? - and replaces it with something you could actually write down a computable version of*: you simply figure out how to associate 'random' events with an appropriate shuffled PRNG, and now you have a well-defined alternative physics where the gambler's fallacy is true. (There would still be cases where you'd act as if it's false and you were in our world, but these would be due to more complex situations, like ones where you were unsure what the bias of the coin was and your posterior over the bias counterbalanced the changing probability, or ones where you were unsure of the period or length of the hidden randomness and so your posterior over all of the changepoints offset your posterior on the list contents - if the list had just changed, then you reset to the gambler's fallacy.)
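A minimal sketch of that computable version - a sampling-without-replacement 'coin' - just to make the 40/90 arithmetic above concrete:

```python
import random

class VerityCoin:
    """Flips drawn without replacement from a pre-generated, shuffled block of 50 heads + 50 tails."""
    def __init__(self, block_size=100):
        self.block_size = block_size
        self.block = []

    def flip(self):
        if not self.block:  # start a fresh hidden block once the old one is used up
            self.block = ["H"] * (self.block_size // 2) + ["T"] * (self.block_size // 2)
            random.shuffle(self.block)
        return self.block.pop()

    def p_heads(self):
        # The best prediction in this world: whatever is left on the hidden list.
        return self.block.count("H") / len(self.block) if self.block else 0.5

coin = VerityCoin()
flips = [coin.flip() for _ in range(10)]
print(flips, coin.p_heads())  # after 10 heads in a row, this prints 40/90 ≈ 0.444 - "tails is due!"
```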
So, in the Gambler's Verity world, you can do things like manufacture 'lucky' dice by rolling many dice, and keeping the ones which have 'used up' the most unlucky outcomes. (I hear D&D players do this as a joke, but in this world, it might actually work.) You would no longer be able to flip fair coins easily because whoever provided the coin could've provided one pre-flipped to yield the desired outcome; you would have to use alternative methods, like both parties flipping their own coin simultaneously and using a randomness extractor on the pair of results. You also are able to more profitably exploit merely 'fair' opportunities because the odds will change and have option value (eg. Problem 14). Depending on the granularity of the hidden variables and what micro or macro-states they hold over, you could imagine investing being very different: instead of efficient markets driven by random walks at each instant, you'd get to efficiency by instead pricing in risk premium - 'good' companies have their stock prices systematically lowered because they are due for a run of bad luck, while 'bad' companies' in contrast enjoy a premium because they may be about to embark on a bull run. Forecasting & analytics become much more powerful & valuable, and it becomes worth tracking everything possible, because you may be able to identify time series and move in and out to manage the risk; statisticians will warn you about your lucky and unlucky days, and you will avoid going out on inauspicious days where you might be run over by a horseless carriage. There are probably weird consequences in thermodynamics & physics from these hidden variables too, but I'm not sure what. (Is this a local hidden-variable theory? Superdeterminism? Can you violate thermodynamics by Maxwell's demon here to gain free energy, or does the tracking of history still wind up erasing the gains? etc)
* and, AFAIK, this is actually something that is done; aside from topics in numerical analysis or physics where you use "quasi-random" number generators or other biased kinds of randomness to ensure a 'more even' coverage and gain efficiency, IIRC, games will often implement randomness in a sampling-without-replacement way, to cater to players' prejudices and ensure more fun. It's not fun to get a long run of 'bad' random outcomes, if there is nothing which counterbalances that; card games rely heavily on this as a mechanic, where if you get a lot of 'bad' draws from the deck, you can take consolation in the fact that the remaining deck is enriched for 'good' cards, automatically tempering the extremes and adding a layer of strategy. This is also often done by bending the probabilities, which implies a Gambler's Verity world: if a player is low on health or doing badly, they'll find that they start beating impossible odds like they're Han Solo, and so lots of bad outcomes will in fact imply that they are then 'due' for good outcomes.
The number of problems that non-character/byte tokenization causes, whether BPE or WordPiece, never fails to amaze me. What a kettle of worms is that attractive-looking hack to save context window & speed up learning - especially as the models become so smart they otherwise make few errors & it becomes harder to shrug away tokenization pathologies.
What happens when macroeconomists mass-produce epicycles? You get DSGE models which would take thousands of years of data to train (https://arxiv.org/pdf/2210.16224.pdf).
Didn't the Shalizi paper you cite, trying to school the economists, turn out to be wrong and irreproducible due to source-code bugs? He hasn't updated his post appendix on the matter, despite saying 2 years ago that the fixes would be quick and that he was sure the numerical results would still prove the point.