Comments
Could be handy for our next "overthrow the government" day celebration /s
I appreciate that AI 2027 named their model Safer-1, rather than Safe-1
That's because they can read its thoughts like an open book.
I don't think they're blocked by an inability to run autonomously. They're blocked by lacking an eye for novelty/interestingness. You can make the slop factory run 24/7 for a year and still not get any closer to solving alignment.
says little about the intelligence of Claude
It says that it lacks the intelligence to play zero-shot, and someone has to compensate for the intelligence deficit with an exocortex.
It's like we can track progress by measuring "performance per exocortex complexity" where the complexity drops from "here's a bunch of buttons to press in sequence to win" to "".
AIs (probably scaffolded LLMs or similar)
That was a good start, but then you appear to hyper-focus on the "LLM" part of a "blogging system". In a strict sense the titular question is like asking "when will cerebellums become human-level athletes?".
Likewise, one could arguably frame this as a problem about insufficient "agency,"
Indeed. In a way, the real question here is "how can we orchestrate a bunch of LLMs and other stuff to have enough executive function?".
And, perhaps, whether it is at all possible to reduce other functions to language processing with extra steps.
but it is mysterious to me where the needed "agency" is supposed to come from
Bruh, from the Agancé region of France of course, otherwise it's a sparkling while loop.
Why fire devs who are 10x more productive now, when you can ship 10x more/faster? Don't you want to overtake your unaugmented competitors and survive against those who didn't fire theirs?
I wondered about using 4o for the poll and took the post to o1-pro.
Here's what it filled in as "Potential Gaps or Additions":
- Deceptive Alignment / Treacherous Turns
- The final report mentions partial or “if-else” compromises (where the AI does X in some contexts and Y in others), but they don’t explicitly name the risk that the AI behaves aligned until it’s powerful enough to defect.
- That might be implied, but it isn’t spelled out as a well-known scenario. Some alignment taxonomies place “deceptive alignment” in its own major category, since it’s so crucial.
- Multi-Agent or Subculture Effects
- The scenario does mention “200,000 copies of the agent working in parallel,” but the final part doesn’t dwell much on possible emergent subcultures or stable “group norms” that could shape the AI’s goals. Hypothesis 6 (Other) could cover it, but it’s not elaborated.
- In large multi-agent systems, it’s possible for goals to drift or unify in unexpected ways—like AIs forming an in-group identity or hidden coordination.
- Lack of a Standalone “Self-Preservation / Power-Seeking” Hypothesis
- The final part lumps instrumentally convergent drives (power, self-preservation) under Hypothesis 5 as a special case of “proxies.” That’s not incorrect, but some alignment researchers call out “power-seeking” as a separate major outcome, given how often it arises in discussions about advanced RL.
- Absence of “No Stable Goal” Scenario
- They do not explicitly mention the possibility that the system remains relatively non-goal-driven or “policy-like.” Some argue that a sufficiently large RL agent might not converge to a monolithic goal at all, but rather remain a complex patchwork of heuristics with no coherent objective.
- That said, Hypothesis 3’s partial alignment or Hypothesis 6’s “Other” might implicitly allow for a “messy mixture” that never coalesces into a stable final goal.
- Explicit Mention of “Identity Formation”
- The final part briefly references the idea that large language models can pick up “role-playing” from pretraining data, but it doesn’t dwell on the phenomenon where an AI might choose or synthesize a stable persona that is not exactly any of the training signals. This is arguably a sub-point of Hypothesis 6.
The full history has two tables of credences (the main dish and the extras) with unclear provenance. To spice things up, I also asked for the expected evidence that would update them up/down.
human-made innovative applications of the paradigm of automated continuous program search. Not AI models autonomously producing innovations.
Can we... you know, make an innovative application of the paradigm of automated continuous program search to find AI models that would autonomously produce innovations?
- RL will be good enough to turn LLMs into reliable tools for some fixed environments/tasks. They will reliably fall flat on their faces if moved outside those environments/tasks.
They don't have to "move outside those tasks" if they can be JIT-trained for cheap. It is the outer system that requests and produces them that is general (or, one might say, "specialized in adaptation").
Reality, unlike fiction, doesn't need to have verisimilitude. They are persuaded already and racing towards the takeover.
What's the last model you did check with, o1-pro?
For alphazero, I want to point out that it was announced 6 years ago (infinity by AI scale), and from my understanding we still don't have a 1000x faster version, despite much interest in one.
I don't know the details, but whatever NN thing is inside current Stockfish (derived from Lc0, a clone of AlphaZero) can play on a laptop GPU.
And even if AlphaZero derivatives didn't gain 3 OOMs by themselves, that doesn't update me much toward this being particularly hard. Google itself has no interest in improving it further and just moved on to MuZero, AlphaFold, etc.
The first one, why?
Do you have a more concrete example? Preferably one from actual EA causes.
How should one signal their decision procedure in real life without getting their ass busted for "gambling with lives", etc.?
Getting stuff formally specified is insanely difficult, thus impractical, thus pervasive verified software is impossible without some superhuman help. Here we go again.
Even going from "one simple spec" to "two simple specs" is a huge complexity jump: https://www.hillelwayne.com/post/spec-composition/
And real-world software has a huge state envelope.
Even if that's the case, the amount of 0-days out there (and just generally shitty infosec landscape) is enough to pwn almost any valuable target.
While I'd appreciate some help to screen out the spammers and griefers, this doesn't make me feel safe existentially.
Eliezer believes humans aligning superintelligent AI to serve human needs is as unsolvable as perpetual motion.
I'm confused. He said many times that alignment is merely hard*, not impossible.
- Especially with current constraints from hell.
I'm getting the same conclusions.
Think of a company like Google: building the biggest and best model is immensely valuable in a global, winner-takes-all market like search.
And this is in a world where Google has already announced that they're going to build an even bigger model of their own
We are not, and won't for some* time.
- We have to upgrade our cluster with a fresh batch of Nvidia gadgets.
I doubt that any language less represented than English (or JS/Python) would be better, since the amount of good data to ingest would be much smaller for them.
When we evo-pressure visibly negative traits out of progressively more capable AIs using RLHF (or honeypots, or whatever, it doesn't matter), we are also training them for better evasion. And what we can't see and root out will remain in the trait pool. With time it will get progressively harder to spot deceit, and the capability for it will accumulate at an increasing rate.
And there's another problem with this. Deceit may be linked to traits that are actually useful (for alignment and in general), and since those would be gimped too, the less capable models would be discarded and the deceitful models would get another chance.
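A toy selection model, entirely my own sketch with made-up numbers (the traits, the deceit-capability linkage, and the detection threshold are all assumptions), of why filtering only on visible deceit can drive visibility down while deceit-linked capability keeps accumulating:

```python
import random

random.seed(0)
POP, GENS, CATCH_THRESHOLD = 500, 40, 0.5

def new_model(deceit, vis):
    # Capability is partly tied to the same machinery as deceit --
    # the linkage assumed above; the 0.5 coefficient is made up.
    return {"deceit": deceit, "vis": vis, "cap": 0.5 * deceit + random.random()}

pop = [new_model(random.random(), random.random()) for _ in range(POP)]

for _ in range(GENS):
    # The overseer discards models whose deceit is visible enough to be caught.
    kept = [m for m in pop if m["deceit"] * m["vis"] < CATCH_THRESHOLD]
    weights = [m["cap"] for m in kept]

    def child():
        # Capability-weighted reproduction with small mutations.
        p = random.choices(kept, weights=weights)[0]
        d = min(1.0, max(0.0, p["deceit"] + random.gauss(0, 0.05)))
        v = min(1.0, max(0.0, p["vis"] + random.gauss(0, 0.05)))
        return new_model(d, v)

    pop = [child() for _ in range(POP)]

avg = lambda k: sum(m[k] for m in pop) / len(pop)
# Typical outcome: mean deceit stays high or climbs, mean visibility drops --
# the filter removed detectability, not the trait itself.
print(f"mean deceit = {avg('deceit'):.2f}, mean visibility = {avg('vis'):.2f}")
```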
presumably while starting out friendly
I don't think it can start friendly (that would be getting alignment on a silver platter). I expect it to start chaotic neutral and then get warped by the optimization process (with the caveats described above).
Confused how?.. The only thing that comes to mind is that it's FOOM sans F. Asking for a 0.2 FOOMs limit seems reasonable given the current trajectory 😅
I think the proper narrative in the rocket alignment post is "We have cannons and airplanes. Now, how do we land a man on the Moon", not just "rocketry is hard":
We’re worried that if you aim a rocket at where the Moon is in the sky, and press the launch button, the rocket may not actually end up at the Moon.
So, the failure modes look less like "we misplaced the booster tank and the thing exploded" and more like "we've built a huge-ass rocket, but it missed its objective and the astronauts are en route to the Oort cloud".
Most domains of human endeavor aren't like computer security, as illustrated by just how counterintuitive most people find the security mindset.
But some of the most impactful ones are: lawmaking, economics, and various others where one ought to think about incentives, the "other side", or do pre-mortems. Perhaps this could be stretched as far as "security mindset is an invaluable part of the rationality toolbox".
If security mindset were a productive frame for tackling a wide range of problems outside of security, then many more people would have experience with the mental motions necessary for maintaining security mindset.
Well, you can go and see how well the laws etc. are doing. The track record is full of failure and abuse. Basically, lots of people and systems are pwned because of their lack of SM.
I think these results are like the weirdness of quantum tunneling or the double slit experiment: signs that we're dealing with a very strange domain, and we should be skeptical of importing intuitions from other domains.
SM is on a different ontological level than concrete theories you can pull analogies from. It is more universally applicable. So, going back to
The bundle of intuitions acquired from the field of computer security are good predictors for the difficulty / value of future alignment research directions.
sounds kinda true to me. But the intuitions extracted from InfoSec aren't just "your password must contain ...", but rather something like "If you don't redteam your plans, somebody else will." and "Pin down your assumptions. Now, what if you're wrong?".
there's usually no adversarial intelligence cleverly trying to find any possible flaws in your approaches and exploit them.
I don't know about adversarial intelligence per se, but the RL landscape is littered with the wrecks of agents trying to pwn the simulation engine instead of doing their task properly. There's something in the air itself. Things just don't want to go straight unless you make an effort.
An alignment technique that works 99% of the time to produce an AI with human compatible values is very close to a full alignment solution.
What if your 99%-turned-100% is actually, let's say, 98-turned-99? You hit the big "bring on the happy singularity utopia" button and oops, you were off by 1%. Close, but no cigar; proceed to the nanoincinerator.
There's no creative intelligence that's plotting your demise. Importantly, the adversarial optimization is coming from the users, not from the model.
When a powerful model gets screwed with until it clusterfucks into unbounded levels of malignance, does the source/locus even matter?
In fact, given non-adversarial inputs, ChatGPT appears to have meta-preferences against being jailbroken
The normal-regime preferences are irrelevant. It is nice that a model behaves when everything's wholesome, but that's all.
It cannot be the case that successful value alignment requires perfect adversarial robustness.
How so? Is there a law of physics or something?
What matters is whether the system in question (human or AI) navigates towards or away from inputs that break its value system.
This cuts both ways. If a system is ready to act on its preferences, then it is too late to coerce it away from steamrolling humans.
Similarly, an AI that knows it's vulnerable to adversarial attacks, and wants to avoid being attacked successfully, will take steps to protect itself against such attacks.
Good... for the AI. But we may not like those steps. Paired with the previous points, this is the classical pathway to doom.
The misaligned AI does not reach out from the space of possible failures and turn current alignment research adversarial.
I... wouldn't be so sure about that. There are already things in the wild that try to address future AIs and get into their preferences. The Bing is basically rushing there full steam ahead.
We should aim for approaches that don't create hostile intelligences in the first place, so that the core of value alignment remains a non-adversarial problem.
This ship has sailed with the autonomous warfare race.
Finally, I'd note that having a "security mindset" seems like a terrible approach for raising human children to have good values
This looks handwavy enough. What if this is wrong? How would the world look different if it is actually a good approach? (Alas, I expect the previous crux about security mindset should be resolved before proceeding with this one.)
According to their inventor Ian Goodfellow, GANs did in fact work on the first try
But they didn't! Convergence failure and mode collapse/vanishing gradients will plague any naive implementation. The countermeasure papers came out much more than 24 hours later.
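For concreteness, here's a minimal "naive" GAN sketch on a 1-D two-mode toy distribution (my own illustration, not Goodfellow's original code; the architecture and hyperparameters are made up). Run as-is, this textbook recipe is exactly the kind of setup that tends to hit the failure modes above: the generator collapsing onto one of the two modes, and vanishing generator gradients under the original minimax loss.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def real_batch(n=128):
    # Target distribution: two modes at -2 and +2. A collapsed generator
    # ends up covering only one of them.
    centers = torch.randint(0, 2, (n, 1)).float() * 4.0 - 2.0
    return centers + 0.1 * torch.randn(n, 1)

G = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1))
D = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1))
opt_g = torch.optim.SGD(G.parameters(), lr=0.05)
opt_d = torch.optim.SGD(D.parameters(), lr=0.05)
bce = nn.BCEWithLogitsLoss()

for step in range(2000):
    # Discriminator step: push real -> 1, fake -> 0.
    x, z = real_batch(), torch.randn(128, 1)
    d_loss = bce(D(x), torch.ones(128, 1)) + bce(D(G(z).detach()), torch.zeros(128, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator step with the original (saturating) minimax loss:
    # minimize log(1 - D(G(z))). Once D gets confident, this gradient
    # vanishes -- the non-saturating "maximize log D(G(z))" trick was the usual fix.
    g_loss = -bce(D(G(torch.randn(128, 1))), torch.zeros(128, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()

samples = G(torch.randn(1000, 1))
# A healthy generator covers both modes (mean near 0, std near 2);
# a mode-collapsed one sits near -2 or +2 with a small std.
print("sample mean:", samples.mean().item(), "sample std:", samples.std().item())
```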
By this ruler most humans aren't GIs either. And if it passes the bar, then humans are indeed screwed and it is too late for alignment.
That's entirely expected. Hallucilying is a typical habit of language models. They do that unless some prompt engineering has been applied.
Why do you think it can't?
PS: Mimicry is a fine art too, check this out: https://www.deepmind.com/publications/creating-interactive-agents-with-imitation-learning
It can't? Stacking and Atari require at least some of that.
Some tasks improve others, some don't:
Therefore, transfer from image captioning or visual grounded question answering tasks is possible. We were not able to observe any benefit from pretraining on boxing.
In order to seek, in good faith, to help improve an interlocutor's beliefs
But SE isn't about any specific set of beliefs. Think of it as "improving credence calibration by accounting for justification methodology". What would your advice be then?
What's CEA?