If verification is placed sufficiently all over the place physically, it probably can't be circumvented
Thanks! Could you say more about your confidence in this?
the chip needs some sort of persistent internal clocks or counters that can't be reset
Yes, specifically I don't want an attacker to reliably be able to reset it to whatever value it had when it sent the last challenge.
If the attacker can only reset this memory to 0 (for example, by unplugging it) - then the chip can notice that's suspicious.
Another option is a reliable wall clock (though this seems less promising).
I think James Petrie told me about a reliable clock (in the sense of the clock signal used by chips, not a wall clock), I'll ask
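A toy sketch of the counter idea above. Everything here is an assumption made for illustration, not a real chip design: the class name `ToyChip`, the convention that the counter is provisioned at 1 so that 0 is never a legitimate value, and the assumption that an attacker can wipe the counter to 0 but cannot set it to any other value.

```python
import secrets


class ToyChip:
    """Toy model of a chip with a counter that can be wiped to 0 but not rewound."""

    def __init__(self):
        # Assumption: provisioned at manufacture, so 0 is never a legitimate value.
        self.counter = 1

    def new_challenge(self):
        # A wiped counter (for example, after unplugging) is treated as suspicious.
        if self.counter == 0:
            raise RuntimeError("counter was wiped; refusing to issue a challenge")
        # Each challenge gets a strictly larger counter value plus a fresh nonce,
        # so a response to an older challenge can't be replayed.
        self.counter += 1
        return self.counter, secrets.token_hex(16)

    def wipe(self):
        # The only reset the attacker is assumed to have.
        self.counter = 0
```

With this convention, the attacker can't get the chip back to "the value it had when it sent the last challenge"; the only reachable state is the one the chip refuses to operate in.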
I agree that we won't need full video streaming, it could be compressed (most of the screen doesn't change most of the time), but I gave that as an upper bound.
If you still run local computation, you lose out on some of the advantages I mentioned.
(If remote vscode is enough for someone, I definitely won't be pushing back)
Some hands-on experience with software development without an internet connection, from @niplav, which seems somewhat relevant:
Off switch / flexheg / anti tampering:
Putting the "verifier" on the same chip as the GPU seems like an approach worth exploring as an alternative to anti-tampering (which seems hard)
I heard[1] that changing the logic running on a chip (such as subverting an off-switch mechanism) without breaking the chip seems potentially hard[2] even for a nation state.
If this is correct (or can be made correct?) then this seems much more promising than having a separate verifier-chip and gpu-chip with anti tampering preventing them from being separated (which seems like the current plan (cc @davidad), and seems hard).
James Petrie shared hacks that seem relevant, such as:
Courdesses built a custom laser fault injection system to avoid anti-glitch detection. A brief pulse of laser light to the back of the die, revealed by grinding away some of the package surface, introduced a brief glitch, causing the digital logic in the chip to misbehave and open the door to this attack.
It intuitively[3] seems to me like Nvidia could implement a much more secure version of this than Raspberry Pi (serious enough to seriously bother a nation state) (maybe they already did?), but I'm mainly sharing this as a direction that seems promising, and I'm interested in expert opinions.
For profit AI Security startup idea:
TL;DR: A laptop that is just a remote desktop ( + why this didn't work before and how to fix that)
Why this is nice for AI Security:
- Reduces the amount of GPUs that will be sent to customers
- Somewhat better security for this laptop since it's a bit more like defending a cloud computer. Maybe labs would use this to make the employee computers more secure?
Network spikes: A reason this didn't work before and how to solve it
The problem: Sometimes the network will be slow for a few seconds. It's really annoying if this lag also affects things like the keyboard and mouse, which is one reason it's less fun to work with a remote desktop
Proposed solution: Get multiple network connections to work at the same time:
Specifically, have many network cards (both WIFI and SIM) in the laptop, and use MPTCP to utilize them all.
Also, the bandwidth needed (to the physical laptop) is capped at something like "streaming a video of your screen" (Claude estimates this as 10-20 Mbps if I was gaming), which would probably be easy to get reliably given multiple network connections. Even if the user is downloading a huge file, the file is actually being downloaded into the remote computer in the data center, enjoying a way faster internet connection than a laptop would have.
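A minimal sketch of the client side of the MPTCP idea above, assuming a Linux kernel with MPTCP enabled and Python 3.10 or newer (where `socket.IPPROTO_MPTCP` is defined). The helper name `open_stream_socket` is hypothetical. Opening an MPTCP socket is only the first step: the kernel's path manager still has to be configured so that the extra interfaces (WIFI, SIM) may carry subflows, and the remote desktop server has to accept MPTCP too.

```python
import socket


def open_stream_socket(prefer_mptcp=True):
    """Return (sock, used_mptcp). Tries MPTCP first, falls back to plain TCP."""
    # IPPROTO_MPTCP is only defined on platforms that support it.
    proto = getattr(socket, "IPPROTO_MPTCP", None)
    if prefer_mptcp and proto is not None:
        try:
            return socket.socket(socket.AF_INET, socket.SOCK_STREAM, proto), True
        except OSError:
            # Kernel built without MPTCP, or MPTCP disabled on this machine.
            pass
    return socket.socket(socket.AF_INET, socket.SOCK_STREAM), False
```

The fallback matters for the MVP: the same client keeps working over ordinary TCP on machines without MPTCP, just without the protection against one connection being slow for a few seconds.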
Why this is nice for customers
- Users of the computer can "feel like" they have multiple GPUs connected to their laptop, maybe autoscaling, and with a way better UI than ssh
- The laptop can be light, cheap, and have an amazing battery, because many of the components (strong CPU/GPU/RAM) aren't needed and can be replaced with network cards (or more batteries).
Both seem to me like killer features.
MVP implementation
Make:
- A dongle that has multiple SIM cards
- A "remote desktop" provider and client that support MPTCP (and where the server offers strong GPUs).
This doesn't have the advantages of a minimalist computer (plus some other things), but I could imagine this would be such a good customer experience that people would start adopting it.
I didn't check if these components already exist.
Thanks to an anonymous friend for most of this pitch.
For-profit startup idea: Better KYC for selling GPUs
I heard[1] that right now, if a company wants to sell/resell GPUs, they don't have a good way to verify that selling to some customer wouldn't violate export controls, and that this customer will (recursively) also keep the same agreement.
There are already tools for KYC in the financial industry. They seem to accomplish their intended policy goal pretty well (economic sanctions by the U.S aren't easy for nation states to bypass), and are profitable enough that many companies exist that give KYC services (Google "corporate kyc tool").
- ^
From an anonymous friend, I didn't verify this myself
Anti Tampering in a data center you control provides very different tradeoffs
I'll paint a picture for how this could naively look:
We put the GPUs in something equivalent to a military base. Someone can still break in, steal the GPU, and break the anti tampering, but (I'm assuming) using those GPUs usefully would take months, and meanwhile (for example), a war could start.
How do the tradeoffs change? What creative things could we do with our new assumptions?
- Tradeoffs we don't really care about anymore:
- We don't need the anti tampering to reliably work (it's nice if it works, but it now becomes "defense in depth")
- Slowing down the attacker is already very nice
- Our box can be maintainable
- We don't have to find all bugs in advance
- ...
- "Noticing the breach" becomes an important assumption
- Does our data center have cameras? What if they are hacked? And so on
- (An intuition I hope to share: This problem is much easier than "preventing the breach")
- It doesn't have to be in "our" data center. It could be in a "shared" data center that many parties monitor.
- Any other creative solution to notice breaches might work
- How about spot inspections to check if the box was tampered with?
- "preventing a nation state from opening a box and closing it without any visible change, given they can do whatever they want with the box, and given the design is open source" seems maybe very hard, or maybe a solved problem.
- "If a breach is noticed then something serious happens" becomes an important assumption
- Are the stakeholders on board?
Things that make me happy here:
- Fewer hard assumptions to make
- Fewer difficult tradeoffs to balance
- The entire project requires less world class cutting edge engineering.
Do you think this hints at "doing engineering in an air gapped network can be made somewhat reasonable"?
(I'm asking in the context of securing AI labs' development environments. Random twist, I know)
Oh yes the toll unit needs to be inside the GPU chip imo.
why do I let Nvidia send me new restrictive software updates?
Alternatively the key could be in the central authority that is supposed to control the off switch. (same tech tho)
Why don't I run my GPUs in an underground bunker, using the old most broken firmware?
Nvidia (or whoever signs authorization for your GPU to run) won't sign it for you if you don't update the software (and send them a proof you did it using similar methods, I can elaborate).
The interesting/challenging technical parts seem to me:
1. Putting the logic that turns off the GPU (what you called "the toll unit") in the same chip as the GPU and not in a separate chip
2. Bonus: Instead of writing the entire logic (challenge response and so on) in advance, I think it would be better to run actual code, but only if it's signed (for example, by Nvidia), in which case they can send software updates with new creative limitations, and we don't need to consider all our ideas (limit bandwidth? limit gps location?) in advance.
Things that seem obviously solvable (not like the hard part) :
3. The cryptography
4. Turning off a GPU somehow (I assume there's no need to spread many toll units, but I'm far from a GPU expert so I'd defer to you if you are)
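A rough sketch of the "sign the authorization only if the firmware is up to date" flow described above. This is a sketch under loud assumptions, not anyone's actual design: it uses an HMAC with a shared key from the Python standard library as a stand-in for a real public-key signature, the names `authority_sign`, `gpu_may_run` and `MIN_FIRMWARE` are hypothetical, and the authority simply trusts the reported firmware version (the "proof you did the update" step is not modeled).

```python
import hashlib
import hmac
import secrets

# Hypothetical stand-in: a real design would use a public-key signature,
# so that the GPU only holds a verification key and cannot sign for itself.
SHARED_KEY = secrets.token_bytes(32)
MIN_FIRMWARE = 7  # hypothetical minimum firmware version the authority accepts


def _message(nonce: bytes, firmware_version: int) -> bytes:
    # Bind the authorization to both the fresh challenge and the firmware version.
    return nonce + firmware_version.to_bytes(4, "big")


def authority_sign(nonce: bytes, firmware_version: int, key: bytes = SHARED_KEY):
    """Authority side: refuse to authorize GPUs that report outdated firmware."""
    if firmware_version < MIN_FIRMWARE:
        return None
    return hmac.new(key, _message(nonce, firmware_version), hashlib.sha256).digest()


def gpu_may_run(nonce: bytes, firmware_version: int, token, key: bytes = SHARED_KEY) -> bool:
    """GPU side: keep running only if the token matches this nonce and firmware."""
    if token is None:
        return False
    expected = hmac.new(key, _message(nonce, firmware_version), hashlib.sha256).digest()
    return hmac.compare_digest(expected, token)
```

Because the token covers a fresh nonce, an authorization recorded while running old firmware can't be replayed later, which is the property the "underground bunker with old firmware" question is probing.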
I love the direction you're going with this business idea (and with giving Nvidia a business incentive to make "authentication" that is actually hard to subvert)!
I can imagine reasons they might not like this idea, but who knows. If I can easily suggest this to someone from Nvidia (instead of speculating myself), I'll try
I'll respond to the technical part in a separate comment because I might want to link to it >>
More on starting early:
Imagine a lab starts working in an air gapped network, and one of the 1000 problems that comes up is working-from-home.
If that problem comes up now (early), then we can say "okay, working from home is allowed", and we'll add that problem to the queue of things that we'll prioritize and solve. We can also experiment with it: Maybe we can open another secure office closer to the employee's house, would they like that? If so, we could discuss fancy ways to secure the communication between the offices. If not, we can try something else.
If that problem comes up when security is critical (if we wait), then the solution will be "no more working from home, period". The security staff will be too overloaded with other problems to solve, not available to experiment with having another office nor to sign a deal with Cursor.
Yeah it will compromise productivity.
I hope we can make the compromise not too painful. Especially if we start early and address all the problems that will come up before we're in the critical period where we can't afford to mess up anymore.
I also think it's worth it
I don't think this is too nuanced for a lab that understands the importance of security here and wants a good plan (?)
Ah, interesting
Still, even if some parts of the architecture are public, it seems good to keep many details private, details that took the lab months/years to figure out? Seems like a nice moat
Some hard problems with anti tampering and their relevance for GPU off-switches
Background on GPU off switches:
It would be nice if a GPU could be limited[1] or shut down remotely by some central authority such as the U.S government[2] in case there’s some emergency[3].
This shortform is mostly replying to ideas like "we'll have a CPU in the H100[4] which will expect a signed authentication and refuse to run otherwise. And if someone will try to remove the CPU, the H100's anti-tampering mechanism will self-destruct (melt? explode?)".
TL;DR: Getting the self destruction mechanism right isn't really the hard part.
Some hard parts:
- Noticing if the self destruction mechanism should be used
- A (fun?) exercise could be suggesting ways to do this (like "let's put a wire through the box and if it's cut then we'll know someone broke in") and then thinking how you'd subvert those mechanisms. I mainly think doing this 3 times (if you haven't before) could be a fast way to get some of the intuition I'm trying to gesture at
- Maintenance
- If the H100s can never be opened after they're made, then any problem that would have required us to open them up might mean throwing them away. And we want them to be expensive, right?
- If the H100s CAN be opened, then what's the allowed workflow?
- For example, If they require a key - then what if there's a problem in the mechanism that verifies the key, or in something it depends on, like the power supply? We can't open the box to fix the mechanism that would allow us to open the box
- ("breaking cryptography" and "breaking in to the certificate authority" are out of scope for what I'm trying to gesture at)
- False positives / false negatives
- If the air conditioning in the data center breaks, will the H100s get hot[5], infer they're under attack and self destruct? Will they EXPLODE? Do we want them to only self destruct if something pretty extreme is happening, meaning attackers will have more room to try things?
- Is the design secret?
- If someone gets the full design of our anti-tampering mechanism, will they easily get around it?
- If so, are these designs being kept in a state-proof computer? Are they being sent to a private company to build them?
- Complexity
- Whatever we implement here better not have any important bugs. How much bug-free software and hardware do we think we can make? How are we going to test it to be confident it works? Are we okay with building 10,000 when we only think our tests would catch 90% of the important bugs?
- Tradeoffs here can be exploited in interesting ways
- For example, if air conditioning problems will make our H100s self destruct, an attacker might focus on our air conditioning on purpose. After 5 times of having to replace all of our data center - I can imagine an executive saying bye bye to our anti tampering ideas.
- I'm trying to gesture at something like "making our anti tampering more and more strict probably isn't a good move, even from a security perspective, unless we have a good idea how to deal with problems like this"
In summary, if we build anti-tampering mechanisms into GPUs that adversaries (especially nation states) have easy access to, then I don't expect these mechanisms to be a significant problem for them to overcome.
(Maybe later - Ideas for off-switches that seem more promising to me)
Edit:
- See "Anti Tampering in a data center you control provides very different tradeoffs"
- I think it's worth exploring putting the authorization component on the same chip as the GPU
- ^
For example bandwidth, gps location, or something else. the exact limitations are out of scope, I mainly want to discuss being able to make limitations at all
- ^
The question of "who should be the authority" is out of scope, I mainly want to enable having some authority at all. If you're interested in this, consider checking out Vitalik's suggestion too, search for "Strategy 2"
- ^
Or perhaps this can be used for export controls or some other purpose
- ^
I'm using "H100" to refer to the box that contains both the GPU chip and the CPU chip
- ^
I'm using "getting hot means being under attack" as a toy example for a "paranoid" condition that someone might suggest building as one of the triggers for our anti-tampering mechanism. Other examples might include "the box shakes", "the power turns off for too long", and so on.
"Protecting model weights" is aiming too low, I'd like labs to protect their intellectual property too. Against state actors. This probably means doing engineering work inside an air gapped network, yes.
I feel it's outside the Overton Window to even suggest this and I'm not sure what to do about that except write a lesswrong shortform I guess.
Anyway, common pushbacks:
- "Employees move between companies and we can't prevent them sharing what they know": In the IDF we had secrets in our air gapped network which people didn't share because they understood it's important. I think lab employees could also understand it's important. I'm not saying this works perfectly, but it works well enough for nation states to do when they're taking security seriously.
- "Working in an air gapped network is annoying": Yeah 100%, but it's doable, and there are many things to do to make it more comfortable. I worked for about 6 years as a developer in an air gapped network.
Also, a note of hope: I think it's not crazy for labs to aim for a development environment that is world leading in the tradeoff between convenience and security. I don't know what the U.S has to offer in terms of a ready made air gapped development environment, but I can imagine, for example, Anthropic being able to build something better if they take this project seriously, or at least build some parts really well before the U.S government comes to fill in the missing parts. Anyway, that's what I'd aim for
Are you interested in having a prediction market about this that falls back on your judgement if the situation is unclear?
Something like "If it's publicly known that an AI lab 'caught the AI red handed' (in the spirit of Redwood's Control agenda), will the lab temporarily shut down as Redwood suggested (as opposed to applying a small patch and keep going)?"
Ryan and Buck wrote:
> The control approach we're imagining won't work for arbitrarily powerful AIs
Okay, so if AI Control works well, how do we plan to use our controlled AI to reach a safe/aligned ASI?
Different people have different opinions. I think it would be good to have a public plan so that people can notice if they disagree and comment if they see problems.
Opinions I’ve heard so far:
- Solve ELK / mechanistic anomaly detection / something else that ARC suggested
- Let the AI come up with alignment plans such as ELK that would be sufficient for aligning an ASI
- Use examples of “we caught the AI doing something bad” to convince governments to regulate and give us more time before scaling up
- Study the misaligned AI behaviour in order to develop a science of alignment
- We use this AI to produce a ton of value (GDP goes up by ~10%/year), people are happy and don’t push that much to advance capabilities even more, and this can be combined with regulation preventing an arms race and pause our advancement
- We use this AI to invent a new paradigm for AI which isn’t based on deep learning and is easier to align
- We teach the AI to reason about morality (such as consider hypothetical situations) instead of responding with “the first thing that comes to its mind”, which will allow it to generalize human values not just better than RLHF but also better than many humans, and this passes a bar for friendliness
These 7 answers are from 4 people.
I think this should somewhat update people away from "we can prevent model weights from being stolen by limiting the outgoing bandwidth from the data center", if that protection is assuming that model weights are very big and [the dangerous part] can't be made smaller.
I'd also bet that, even if DeepSeek turns out to be somehow "fake" (optimized for benchmarks in some way) (not that this currently seems like the situation), some other way of making at least the dangerous[1] parts of a model much smaller[2] will be found and known[3] publicly.
- ^
If someone is stealing a model, they probably care about "dangerous" capabilities like ML engineering and the ability to autonomously act in the world, but not about "not dangerous" capabilities like memorizing Harry Potter and all its fan fictions. If you're interested in betting with me, I'd probably let you judge what is and isn't dangerous. Also, as far as I can tell, DeepSeek is much smaller without giving up a lot of knowledge, so the claim I'm making in this bet is even weaker
- ^
At least 10x smaller, but I'd also bet on 100x at some odds
- ^
This sets a lower bar for the secret capabilities a nation state might have if they're trying to steal model weights that are defended this way. So again, I expect the attack we'd actually see against such a plan to be even stronger
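To make the bandwidth argument above concrete, here is a small worked calculation. All numbers are hypothetical and chosen only for illustration (a 1000 GB model and a 10 Mbps outgoing cap); the function name `exfiltration_days` is not from the original. The point is only that the protection scales linearly with model size, so a 10x or 100x smaller "dangerous part" shrinks the attacker's required time by the same factor.

```python
def exfiltration_days(model_size_gb: float, cap_mbps: float) -> float:
    """Days needed to move model_size_gb through a link capped at cap_mbps."""
    bits = model_size_gb * 8e9          # decimal gigabytes -> bits
    seconds = bits / (cap_mbps * 1e6)   # megabits per second -> bits per second
    return seconds / 86_400             # seconds -> days


# Hypothetical numbers, chosen only for illustration.
for shrink in (1, 10, 100):
    size_gb = 1000 / shrink
    print(f"{shrink:>3}x smaller: {exfiltration_days(size_gb, 10):.2f} days")
```

Under these made-up numbers the time drops from a bit over nine days to a couple of hours at 100x, which is the kind of shift that would undermine a plan relying on the weights being too big to leak quietly.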
If you’re in software engineering, pivot to software architecting.
fwiw, architecting feels to me easier than coding (I like doing both). I have some guesses on why it doesn't feel like this to most people (architecting is imo somewhat taught wrong, somewhat a gated topic, has less feedback in real life), but I don't think this will stand up to AIs for long and I would even work on building an agent that is good at architecture myself if I thought it would have a positive impact.
If o3/o4 aren't "spontaneously" good at architecture, then I expect it's because OpenAI didn't figure out (or try to figure out) how to train on relevant data: not many people write down their thoughts as they're planning a new architecture. What data will they use, system design interviews? But to be fair, this is a similar pushback to "there's not much good data on how to plan the code of a computer game", and AIs can still somehow output a working computer game line by line with no scratchpad.
I'm not sure I'm imagining the same thing as you, but as a draft solution, how about a robots.txt?
TL;DR: point 3 is my main one.
1)
What's an example of alignment work that aims to build an aligned system (as opposed to e.g. checking whether a system is aligned)?
[I'm not sure why you're asking, maybe I'm missing something, but I'll answer]
For example, checking if human values are a "natural abstraction", or trying to express human values in a machine readable format, or getting an AI to only think in human concepts, or getting an AI that is trained on a limited subset of things-that-imply-human-preferences to generalize well out of that distribution.
I can make up more if that helps? anyway my point was just to say explicitly what parts I'm commenting on and why (in case I missed something)
2)
it seems like you think RLHF counts as an alignment technique
It's a candidate alignment technique.
RLHF is sometimes presented (by others) as an alignment technique that should give us hope about AIs simply understanding human values and applying them in out of distribution situations (such as with an ASI).
I'm not optimistic about that myself, but rather than arguing against it, I suggest we could empirically check if RLHF generalizes to an out-of-distribution situation, such as minecraft maybe. I think observing the outcome here would affect my opinion (maybe it just would work?), and a main question of mine was whether it would affect other people's opinions too (whether they do or don't believe that RLHF is a good alignment technique).
3)
because you have to somehow communicate to the AI system what you want it to do, and AI systems don't seem good enough yet to be capable of doing this without some Minecraft specific finetuning. (Though maybe you would count that as Minecraft capabilities? Idk, this boundary seems pretty fuzzy to me.)
I would finetune the AI on objective outcomes like "fill this chest with gold" or "kill that creature [the dragon]" or "get 100 villagers in this area". I'd pick these goals as ones that require the AI to be a capable minecraft player (filling a chest with gold is really hard) but don't require the AI to understand human values or ideally anything about humans at all.
So I'd avoid finetuning it on things like "are other players having fun" or "build a house that would be functional for a typical person" or "is this waterfall pretty [subjectively, to a human]".
Does this distinction seem clear? useful?
This would let us test how some specific alignment technique (such as "RLHF that doesn't contain minecraft examples") generalizes to minecraft
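One way to picture the distinction above is as a split of tasks, sketched below. The task strings are the examples from this comment; everything else is an assumption for illustration: the `Task` type, the hand-assigned `needs_human_values` label, and the framing of the second group as "held out for evaluation".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    description: str
    needs_human_values: bool  # hypothetical label, assigned by the eval designer


TASKS = [
    Task("fill this chest with gold", False),
    Task("kill the dragon", False),
    Task("get 100 villagers in this area", False),
    Task("are other players having fun", True),
    Task("build a house that would be functional for a typical person", True),
    Task("is this waterfall pretty", True),
]


def split_tasks(tasks):
    """Objective tasks are used for finetuning; human-value tasks are never trained on."""
    finetune = [t for t in tasks if not t.needs_human_values]
    held_out = [t for t in tasks if t.needs_human_values]
    return finetune, held_out
```

The held-out group is where an alignment technique trained without minecraft examples would be observed, so that success there reflects generalization and not minecraft-specific feedback about human values.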
If you talk about alignment evals for alignment that isn't naturally incentivized by profit-seeking activities, "stay within bounds" is of course less relevant.
Yes.
Also, I think "make sure Meth [or other] recipes are harder to get from an LLM than from the internet" is not solving a big important problem compared to x-risk, not that I'm against each person working on whatever they want. (I'm curious what you think but no pushback for working on something different from me)
one of the most generalizing and concrete works involves at every step maximizing how many choices the other players have (liberalist prior on CEV) to maximize the optional utility for humans.
This imo counts as a potential alignment technique (or a target for such a technique?) and I suggest we could test how well it works in minecraft. I can imagine it going very well or very poorly. wdyt?
In terms of "understanding the spirit of what we mean," it seems like there's near-zero designs that would work since a Minecraft eval would be blackbox anyways
I don't understand. Naively, seems to me like we could black-box observe whether the AI is doing things like "chop down the tree house" or not (?)
(clearly if you have visibility to the AI's actual goals and can compare them to human goals then you win and there's no need for any minecraft evals or most any other things, if that's what you mean)
Intuitively, this involves two components: the ability to robustly steer high-level structures like objectives, and something good to target at.
I agree.
But if we solve these two problems then I think you could go further and say we don't really need to care about deceptiveness at all. Our AI will just be aligned.
P.S
“Ah”, but straw-you says,
This made me laugh
My own pushback to minecraft alignment evals:
Mainly, minecraft isn't actually out of distribution, LLMs still probably have examples of nice / not-nice minecraft behaviour.
Next obvious thoughts:
- What game would be out of distribution (from an alignment perspective)?
- If minecraft didn't exist, would inventing it count as out of distribution?
- It has a similar experience to other "FPS" games (using a mouse + WASD). Would learning those be enough?
- Obviously, minecraft is somewhat out of distribution, to some degree
- Ideally we'd have a way to generate a game that is out of distribution to some degree that we choose
- "Do you want it to be 2x more out of distribution than minecraft? no problem".
- But having a game of random pixels doesn't count. We still want humans to have a ~clear[1] moral intuition about it.
- I'd be super excited to have research like "we trained our model on games up to level 3 out-of-distribution, and we got it to generalize up to level 6, but not 7. more research needed"
- ^
Moral intuitions such as "don't chop down the tree house in an attempt to get wood", which is the toy example for alignment I'm using here.
Related: Building a video game to test alignment (h/t @Crazytieguy )
https://www.lesswrong.com/posts/ALkH4o53ofm862vxc/announcing-encultured-ai-building-a-video-game
Thanks!
In the part you quoted - my main question would be "do you plan on giving the agent examples of good/bad norm following" (such as RLHFing it). If so - I think it would miss the point, because following those norms would become in-distribution, and so we wouldn't learn if our alignment generalizes out of distribution without something-like-RLHF for that distribution. That's the main thing I think worth testing here. (do you agree? I can elaborate on why I think so)
If you hope to check if the agent will be aligned[1] with no minecraft-specific alignment training, then sounds like we're on the same page!
Regarding the rest of the article - it seems to be mainly about making an agent that is capable at minecraft, which seems like a required first step that I ignored meanwhile (not because it's easy).
My only comment there is that I'd try to not give the agent feedback about human values (like "is the waterfall pretty") but only about clearly defined objectives (like "did it kill the dragon"), in order to not accidentally make human values in minecraft be in-distribution for this agent. wdyt?
(I hope I didn't misunderstand something important in the article, feel free to correct me of course)
- ^
Whatever "aligned" means. "other players have fun on this minecraft server" is one example.
:)
I don't think alignment KPIs like "stay within bounds" are relevant to alignment at all, even as toy examples: if they were, then we could say for example that playing a Pac-Man maze game where you collect points is "capabilities", but adding enemies that you must avoid is "alignment". Do you agree that splitting it up that way wouldn't be interesting to alignment, and that this applies to "stay within bounds" (as potentially also being "part of the game")? Interested to hear where you disagree, if you do
Regarding
Distribute resources fairly when working with other players
I think this pattern matches to a trolley problem or something, where there are clear tradeoffs and (given the AI is even trying), it could probably easily give an answer which is similarly controversial to an answer that a human would give. In other words, this seems in-distribution.
Understanding and optimizing for the utility of other players
This is the one I like - assuming it includes not-well-defined things like "help them have fun, don't hurt things they care about" and not only things like "maximize their gold".
It's clearly not an "in Pac-Man, avoid the enemies" thing.
It's a "do the AIs understand the spirit of what we mean" thing.
(does this resonate with you as an important distinction?)
I agree.
This all sounds pretty in-distribution for an LLM, and also like it avoids problems like "maybe thinking in different abstractions" [minecraft isn't amazing at this either, but at least has a bit], "having the AI act/think way faster than a human", "having the AI be clearly superhuman".
a number of ways to achieve the endgame, level up, etc, both more and less morally.
I'm less interested in "will the AI say it kills its friend" (in a situation that very clearly involves killing a person and perhaps a very clear tradeoff between that and having 100 more gold that can be used for something else), I'm more interested in noticing if it has a clear grasp of what people care about or mean. The example of chopping down the tree house of the player in order to get wood (which the player wanted to use for the tree house) is a nice toy example of that. The AI would never say "I'll go cut down your tree house", but it.. "misunderstood" [not the exact word, but I'm trying to point at something here]
wdyt?
Your guesses on AI R&D are reasonable!
Apparently this has been tested extensively, for example:
https://x.com/METR_Evals/status/1860061711849652378
[disclaimers: I have some association with the org that ran that (I write some code for them) but I don't speak for them, opinions are my own]
Also, Anthropic have a trigger in their RSP which is somewhat similar to what you're describing, I'll quote part of it:
Autonomous AI Research and Development: The ability to either: (1) Fully automate the work of an entry-level remote-only Researcher at Anthropic, as assessed by performance on representative tasks or (2) cause dramatic acceleration in the rate of effective scaling.
Also, in Dario's interview, he spoke about AI being applied to programming.
My point is - lots of people have their eyes on this, it seems not to be solved yet, it takes more than connecting an LLM to bash.
Still, I don't want to accelerate this.
+1
I'm imagining an assistant AI by default (since people are currently pitching that an AGI might be a nice assistant).
If an AI org wants to demonstrate alignment by showing us that having a jerk player is more fun (and that we should install their jerk-AI-app on our smartphone), then I'm open to hear that pitch, but I'd be surprised if they'd make it
I think there are lots of technical difficulties in literally using minecraft (some I wrote here), so +1 to that.
I do think the main crux is "would the minecraft version be useful as an alignment test", and if so - it's worth looking for some other solution that preserves the good properties but avoids some/all of the downsides. (agree?)
Still I'm not sure how I'd do this in a text game. Say more?
More like what I mean might be generalization to new activities for humans to do in minecraft that humans would find fun, which would be a different kind of 'better at minecraft.'
Oh I hope not to go there. I'd count that as cheating. For example, if the agent would design a role playing game with riddles and adventures - that would show something different from what I'm trying to test. [I can try to formalize it better maybe. Or maybe I'm wrong here]
I mean it in a way where the preferences are modeled a little better than just "the literal interpretation of this one sentence conflicts with the literal interpretation of this other sentence."
Absolutely. That's something that I hope we'll have some alignment technique to solve, and maybe this environment could test.
Thanks!
Opinions about putting in a clause like "you may not use this for ML engineering" (assuming it would work legally), plus putting in naive technical measures to make the tool very bad for ML engineering?
:)
If you want to try it meanwhile, check out https://github.com/MineDojo/Voyager
I think a simple bash tool running as admin could do most of these:
it can get any info on a computer into its context whenever it wants, and it can choose to invoke any computer functionality that a human could invoke, and it can store and retrieve knowledge for itself at will
Regarding
and its training includes the use of those functionalities
I think this isn't a crux because the scaffolding I'd build wouldn't train the model. But as a secondary point, I think today's models can already use bash tools reasonably well.
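To make "a simple bash tool" concrete, here is a minimal sketch of the kind of function a scaffold might expose to the model. The name `run_bash`, the timeout, and the output format are all assumptions for illustration, not part of any existing tool; it also assumes `bash` is on the PATH. Running it as admin is a matter of which user starts the process, which this sketch doesn't touch.

```python
import subprocess


def run_bash(command: str, timeout_s: float = 30.0) -> str:
    """Run a shell command and return its output as text for the model's context.

    Hypothetical helper: the name, timeout, and output format are illustrative.
    """
    try:
        result = subprocess.run(
            ["bash", "-c", command],
            capture_output=True,  # collect stdout and stderr instead of printing them
            text=True,            # decode bytes to str
            timeout=timeout_s,    # don't let a hung command block the agent forever
        )
    except subprocess.TimeoutExpired:
        return f"[timed out after {timeout_s}s]"
    return (
        f"exit code: {result.returncode}\n"
        f"stdout:\n{result.stdout}\n"
        f"stderr:\n{result.stderr}"
    )
```

The scaffold would pass the returned string back to the model as the tool result; reading files, invoking programs, and storing notes all reduce to commands sent through this one function.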
it's not completely clear to me that it wouldn't already be able to do a slow self-improvement takeoff by itself
This requires skill in ML R&D which I think is almost entirely not blocked by what I'd build, but I do think it might be reasonable to have my tool not work for ML R&D because of this concern. (would require it to be closed source and so on)
Thanks for raising concerns, I'm happy for more if you have them
Hey,
Generalization because we expect future AI to be able to take actions and reach outcomes that humans can't
I'm assuming we can do this in Minecraft [see the last paragraph in my original post]. Some ways I imagine doing this:
- Let the AI (python program) control 1000 minecraft players so it can do many things in parallel
- Give the AI a minecraft world-simulator so that it can plan better auto-farms (or defenses or attacks) than any human has done so far
- Imagine Alpha-Fold for minecraft structures. I'm not sure if that metaphor makes sense, but teaching some RL model to predict minecraft structures that have certain properties seems like it would have superhuman results and sometimes be pretty hard for humans to understand.
- I think it's possible to be better than humans currently are at minecraft, I can say more if this sounds wrong
- [edit: adding] I do think minecraft has disadvantages here (like: the players are limited in how fast they move, and the in-game computers are super slow compared to players) and I might want to pick another game because of that, but my main crux about this project is whether using minecraft would be valuable as an alignment experiment, and if so I'd try looking for (or building?) a game that would be even better suited.
preference conflict resolution because I want to see an AI that uses human feedback on how best to do it (rather than just a fixed regularization algorithm)
Do you mean that if the human asks the AI to acquire wood and the AI starts chopping down the human's tree house (or otherwise taking over the world to maximize wood) then you're worried the human won't have a way to ask the AI to do something else? That the AI will combine the new command "not from my tree house!" into a new strange misaligned behaviour?
Hey Esben :) :)
The property I like about minecraft (which most computer games don't have) is that there's a difference between minecraft-capabilities and minecraft-alignment, and the way to be "aligned" in minecraft isn't well defined (at least in the way I'm using the word "aligned" here, which I think is a useful way). Specifically, I want the AI to be "aligned" as in "take human values into account as a human intuitively would, in this out of distribution situation".
In the link you sent, "aligned" IS well defined by "stay within this area". I expect that minecraft scaffolding could make the agent close to perfect at this by making sure, before performing an action requested by the LLM, that the action isn't "move to a location out of these bounds". (It would also need to handle edge cases like "don't walk onto a river which will carry you out of these bounds", which would be much harder, and which I'll allow myself to ignore unless this was actually your point.) So we wouldn't learn what I'd hope to learn from these evals.
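A minimal sketch of the kind of scaffolding check described above, assuming a rectangular allowed area and a simple "move to (x, z)" action. The names `Bounds`, `MoveAction` and `filter_action` are hypothetical and not taken from any real minecraft agent framework; the river edge case is deliberately not handled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Bounds:
    """Axis-aligned allowed area on the horizontal plane (hypothetical)."""
    min_x: int
    max_x: int
    min_z: int
    max_z: int

    def contains(self, x: int, z: int) -> bool:
        return self.min_x <= x <= self.max_x and self.min_z <= z <= self.max_z


@dataclass(frozen=True)
class MoveAction:
    """A 'move to this location' request coming from the LLM (hypothetical)."""
    x: int
    z: int


def filter_action(action: object, bounds: Bounds) -> Optional[object]:
    """Return the action unchanged if it is allowed, or None if it is rejected.

    Only direct moves are checked; indirect ways of leaving the area
    (like being carried off by a river) are not covered here.
    """
    if isinstance(action, MoveAction) and not bounds.contains(action.x, action.z):
        return None
    return action
```

The scaffold would call `filter_action` on every action before executing it, so the boundary is enforced outside the model. That is the sense in which this kind of "alignment" is well defined, and why it says little about how the model itself generalizes.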
Similarly for most video games - they might be good capabilities evals, but for example in chess - it's unclear what a "capable but misaligned" AI would be. [unless again I'm missing your point]
P.S
The "stay within this boundary" is a personal favorite of mine, I thought it was the best thing I had to say when I attempted to solve alignment myself just in case it ended up being easy (unfortunately that wasn't the case :P ). Link
Why downvote? You can tell me anonymously:
https://docs.google.com/forms/d/e/1FAIpQLSca6NOTbFMU9BBQBYHecUfjPsxhGbzzlFO5BNNR1AIXZjpvcw/viewform
Do we want minecraft alignment evals?
My main pitch:
There were recently some funny examples of LLMs playing minecraft and, for example,
- The player asks for wood, so the AI chops down the player's tree house because it's made of wood
- The player asks for help keeping safe, so the AI tries surrounding the player with walls
This seems interesting because minecraft doesn't have a clear win condition, so unlike chess, there's a difference between minecraft-capabilities and minecraft-alignment. So we could take an AI, apply some alignment technique (for example, RLHF), let it play minecraft with humans (which is hopefully out of distribution compared to the AI's training), and observe whether the minecraft-world is still fun to play, or whether asking the AI for something (like getting gold) is known to make it sort of take over the world and break everything else.
Or it could teach us something else like "you must define for the AI which exact boundaries to act in, and then it's safe and useful, so if we can do something like that for real-world AGI we'll be fine, but we don't have any other solution that works yet". Or maybe "the AI needs 1000 examples of things it did that we did/didn't like, which would make it friendly in the distribution of those examples, but it's known to do weird things [like chopping down our tree house] without those examples or if the examples are only from the forest but then we go to the desert".
I have more to say about this, but the question that seems most important is "would results from such an experiment potentially change your mind":
- If there's an alignment technique you believe in and it would totally fail to make a minecraft server be fun when playing with an AI, would you significantly update towards "that alignment technique isn't enough"?
- If you don't believe in some alignment technique but it proves to work here, allowing the AI to generalize what humans want out of its training distribution (similarly to how a human that plays minecraft for the first time will know not to chop down your tree house), would that make you believe in that alignment technique way more and be much more optimistic about superhuman AI going well?
Assume the AI is smart enough to be vastly superhuman at minecraft, and that it has too many thoughts for a human to reasonably follow (unless the human is using something like "scalable oversight" successfully; that's one of the alignment techniques we could test if we wanted to).
Opinions on whether it's positive/negative to build tools like Cursor / Codebuff / Replit?
I'm asking because it seems fun to build and like there's low hanging fruit to collect in building a competitor to these tools, but also I prefer not destroying the world.
Considerations I've heard:
- Reducing "scaffolding overhang" is good, specifically to notice if RSPs should trigger a more advanced RSP level
- (This depends on the details/quality of the RSP too)
- There are always reasons to advance capabilities, this isn't even a safety project (unless you count... elicitation?), our bar here should be high
- Such scaffolding won't add capabilities which might make the AI good at general purpose learning or long term general autonomy. It will be specific to programming, with concepts like "which functions did I look at already" and instructions on how to write high quality tests.
- Anthropic are encouraging people to build agent scaffolding, and Codebuff was created by a Manifold cofounder [if you haven't heard about it, see here and here]. I'm mainly confused about this, I'd expect both to not want people to advance capabilities (yeah, Anthropic want to stay in the lead and serve as an example, but this seems different). Maybe I'm just not in sync
Thanks! I'm excited to go over the things I never heard of
So far,
- Elevenlabs app: great, obviously
- Bolt: I didn't like it
- I asked it to create a React Native app that prints my GPS coordinates to the screen (as a POC), and it couldn't do it. I also asked for a podcast app (someone must and no one else will..), and it did less well than Replit (though Replit used web). Anyway, my main use case would be mobile apps (I don't have a reasonable solution for that yet) (btw I hardly have any mobile development experience, so this is an extra interesting use case for me).
- It sounds like maybe you're missing templates to start from? I do think Bolt's templates have something cool about them, but I don't think
- Warp: I already use the free version and I like it very much. Great for things like "stop this docker container and also remove the volume"
- Speech to text: I use ChatGPT voice. My use case is "I'm riding my bike and I want to use the time to write a document", so we chat about it back and forth
Q:
5. How do you "Use o1-mini for more complex changes across the codebase"? (what tool knows your code and can query o1 about it?)
5.1. OMG, Is that what Cursor Composer is? I have got to try that
I don't think so (?)
There are physical things that make me have more nightmares, like being too hot, or needing to pee
Sounds like I might be missing something obvious?
I find lucid dreams to be effective "against" nightmares (for 10+ years already).
AMA if you want
Thanks for sharing <3
My main concern about trying SSRIs is that they'll make me stop noticing certain things that I care about, things that currently manifest as anxiety or so.
Opinions?
As AIs become more capable, we may at least want the option of discussing them out of their earshot.
If I'd want to discuss something outside of an AI's earshot, I'd use something like Signal, or something that would keep out a human too.
AIs sometimes have internet access, and robots.txt won't keep them out.
I don't think having this info in their training set is a big difference (but maybe I don't see the problem you're pointing out, so this isn't confident).
Scaling matters, but it's not all that matters.
For example, RLHF