Comments
Doing stuff manually might provide helpful intuitions/experience for automating it?
I would be very interested to know what the monks think about this.
I think it's much easier to talk about boundaries than preferences, because true boundaries don't really conflict between individuals.
I'm quite curious about this. What if you're stuck on an island with multiple people and limited food?
Very Wittgensteinian:
“What is your aim in Philosophy?”
“To show the fly the way out of the fly-bottle” (Philosophical Investigations)
Oh, they're definitely valid questions. The problem is that the second question is rather vague. You need to either state what a good answer would look like or why existing answers aren't satisfying.
I downvoted this post. I claim it's for the public good; maybe you find this strange, but let me explain my reasoning.
You've come on Less Wrong, a website that probably has more discussion of this than any other website on the internet. If you want to find arguments, they aren't hard to find. It's a bit like walking into a library and saying that you can't find a book to read.
The trouble isn't that you literally can't find any books/arguments, it's that you've got a bunch of unstated requirements that you want satisfied. Now that's perfectly fine, it's good to have standards. At the same time, you've asked the question in a maximally vague way. I don't expect you to be able to list all your requirements. That's probably impossible, and when it is possible, it's often a lot of work. At the same time, I do believe that it's possible to do better than maximally vague.
The problem with maximally vague questions is that they almost guarantee that any attempt to provide an answer will be unsatisfying both for the person answering and the person receiving the answer. Worse, you've framed the question in such a way that some people will likely feel compelled to attempt to answer anyway, lest people who think that there is such a risk come off as unable to respond to critics.
If that's the case, downvoting seems logical. Why support a game where no-one wins?
Sorry if this comes off as harsh, that's not my intent. I'm simply attempting to prompt reflection.
I have access to Gemini 1.5 Pro. I'm willing to run experiments if you provide me with an exact experiment to run and cover what they charge me (I'm assuming it's paid; I haven't used it yet).
“But also this person doesn't know about internal invariances in NN space or the compressivity of the parameter-function map (the latter in particular is crucial for reasoning about inductive biases), then I become extremely concerned”
Have you written about this anywhere?
Have you tried talking to professors about these ideas?
Is there anyone who understands GFlowNets who can provide a high-level summary of how they work?
Another frame that might be useful:
There's a difference between the number of mathematical functions that implement a set of requirements and the number of programs that implement the set of requirements.
Simplicity is about the latter, not the former.
The existence of a large number of programs that produce the exact same mathematical function contributes towards simplicity.
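To make the distinction concrete, here's a toy sketch (my own illustration; the network and values are hypothetical): in a tiny ReLU network, rescaling adjacent weights changes the parameters (the "program") without changing the input-output behaviour (the "function"), so many programs collapse onto a single function.

```python
import numpy as np

def relu_net(x, w1, w2):
    # A minimal two-weight "program": scale, ReLU, scale again.
    return w2 * np.maximum(0.0, w1 * x)

x = np.linspace(-2, 2, 5)
base = relu_net(x, w1=1.0, w2=1.0)

# Rescaling one layer up and the next down leaves the implemented
# mathematical function unchanged: many parameter settings, one function.
for c in [0.5, 2.0, 10.0]:
    assert np.allclose(relu_net(x, w1=c, w2=1.0 / c), base)
```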
I wrote up my views on the principle of indifference here:
https://www.lesswrong.com/posts/3PXBK2an9dcRoNoid/on-having-no-clue
I agree that it has certain philosophical issues, but I don’t believe that this is as fatal to counting arguments as you believe.
Towards the end I write:
“The problem is that we are making an assumption, but rather than owning it, we're trying to deny that we're making any assumption at all, i.e. 'I'm not assuming a priori A and B have equal probability based on my subjective judgement, I'm using the principle of indifference'. Roll to disbelieve.”
I feel less confident in my post than when I wrote it, but it still feels more credible than the position articulated in this post.
Otherwise: this was an interesting post. Well done on identifying some arguments that I need to digest.
Maybe just say that you're tracking the possibility?
Is there going to be a link to this from somewhere to make it accessible?
I think an important crux here is whether you think that we can build institutions which are reasonably good at checking the quality of AI safety work done by humans
Why is this an important crux? Is it necessarily the case that if we can reliably check AI safety work done by humans, we can reliably check AI safety work done by AIs, which may be optimising against us?
Updated
Second, it is also possible to robustly verify the outputs of a superhuman intelligence without superhuman intelligence.
Why do you believe that a superhuman intelligence wouldn't be able to deceive you by producing outputs that look correct instead of outputs that are correct?
I guess the main doubt I have with this strategy is that even if we shift the vast majority of people/companies towards more interpretable AI, there will still be some actors who pursue black-box AI. Wouldn't we just get screwed by those actors? I don't see how CoEm can be of equivalent power to purely black-box automation.
That said, there may be ways to integrate CoEms into the Super Alignment strategy.
GPT-J token embeddings inhabit a zone in their 4096-dimensional embedding space formed by the intersection of two hyperspherical shells
You may want to update the TLDR if you agree with the comments that indicate that this might not be accurate.
If there are 100 tokens for snow, that probably indicates it's a particularly important concept for that language.
For Linear Tomography and Principal Component Analysis, I'm assuming that by unsupervised you mean that you don't use the labels for finding the vector, but that you do use them for determining which sign is true and which is false. If so, this might be worth clarifying in the table.
Agreed. Good counter-example.
I'm very curious as to whether Zac has a way of reformulating his claim to save it.
Well done for writing this up! Admissions like this are often hard to write.
Have you considered trying to use any credibility from helping to co-found Vast for public outreach purposes?
Isn’t that just one batch?
Does GPT-4 directly handle the image input or is it converted to text by a separate model then fed into GPT-4?
A potential approach to tackle this could be to aim to discover features in smaller batches. After each batch of discovered features finishes learning we could freeze them and only calculate the orthogonality regularisation within the next batch, as well as between the next batch and the frozen features. Importantly we wouldn’t need to apply the regularisation within the already discovered features.
Wouldn't this still be quadratic?
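To make the worry concrete, here's a rough sketch of how I'd read the proposed regulariser (all names and shapes here are my own assumptions, not from the quoted proposal):

```python
import torch

def batch_orthogonality_penalty(W_new, W_frozen):
    # W_new:    (b, d) features currently being learned
    # W_frozen: (f, d) previously discovered, frozen features
    # Within-batch off-diagonal similarities: O(b^2) terms.
    G = W_new @ W_new.T
    within = (G - torch.diag(torch.diag(G))).pow(2).sum()
    # Cross terms against every frozen feature: O(b * f) terms.
    # Since f grows by b after each batch, summing over all n/b batches
    # still yields ~n^2/2 pairwise terms for n total features.
    cross = (W_new @ W_frozen.T).pow(2).sum()
    return within + cross

W_new = torch.randn(32, 512)       # b = 32 features in the current batch
W_frozen = torch.randn(4096, 512)  # f = 4096 already-frozen features
loss = batch_orthogonality_penalty(W_new, W_frozen)
```

The per-step cost drops, but the total work summed over training still looks quadratic in the number of features.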
You state that GPT-4 is multi-modal, but my understanding was that it wasn't natively multi-modal. I thought that the extra features like images and voice input were bolted on, i.e. instead of generating an image itself it generates a query to be sent to DALL-E. Is my understanding here incorrect?
In any case, it could just be a matter of scale. Maybe these kinds of tasks are rare enough in terms of internet data that it doesn't improve the loss of the models very much to be able to model them? And perhaps the instruction fine-tuning focused on more practical tasks?
"Previous post" links to localhost.
I think it's helping people realise:
a) That change is happening crazily fast
b) That the change will have major societal consequences, even if it is just a period of adjustment
c) That the speed makes it tricky for society and governments to navigate these consequences
It's worth noting that there are media reports that OpenAI is developing agents that will use your phone or computer. I suppose it's not surprising that this would be their next step given how far a video generation model takes you towards this, although I do wonder how they expect these agents to operate with any reliability given the propensity of ChatGPT to hallucinate.
It seems like there should be a connection here with Karl Friston's active inference. After all, both you and his theory involve taking a predictive engine and using it to produce actions.
IIRC, there was also evidence that Copilot was modulating code quality based on name ethnicity variations in code docs.
You don't know where they heard that?
I'm not saying that people can't ground it out. I'm saying that if you try to think or communicate using really verbose terms it'll reduce your available working memory which will limit your ability to think new thoughts.
You can replace "optimal" with "artifact equilibrated under policy update operations"
I don't think most people can. If you don't like the connotations of existing terms, I think you need to come up with new terms and they can't be too verbose or people won't use them.
One thing that makes these discussions tricky is that the aptness of these names likely depends on your object-level position. If you hold the AI optimist position, then you likely feel these names are biasing people towards an incorrect conclusion. If you hold the AI pessimist position, you likely see many of these connotations as actually a positive, in terms of pointing people towards useful metaphors, even if people occasionally slip up and reify the terms.
Also, have you tried having a moderated conversation with someone who disagrees with you? Sometimes that can help resolve communication barriers.
It might be useful to produce a bidirectional measure of similarity by taking the geometric mean of the transference of A to B and of B to A.
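Concretely, something like this (a minimal sketch; the function name and example values are mine):

```python
import math

def symmetric_transfer(t_ab: float, t_ba: float) -> float:
    # Geometric mean of the two one-way transfer scores;
    # assumes both scores are non-negative.
    return math.sqrt(t_ab * t_ba)

# e.g. transference(A -> B) = 0.9, transference(B -> A) = 0.4
print(symmetric_transfer(0.9, 0.4))  # 0.6
```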
Really cool results!
This ties in nicely with Wittgenstein’s notion of language games. TLDR: Look at the role the phrase serves, rather than the exact words.
I heard via via
How did you hear this?
One of the main challenges I see here is how to calibrate this. In other words, if I can't break a model despite adding an activation vector of strength x, what does this mean in terms of how safe we should consider the model to be? I.e., how much extra adversarial prompting effort is that equivalent to, or how should I modify my probabilities of the model being safe?
Gary Marcus has criticised the results here:
What [C] is referring to is a technique called Bonferroni correction, which statisticians have long used to guard against “fishing expeditions” in which a scientist tries out a zillion different post hoc correlations, with no clear a priori hypothesis, and reports the one random thing that sorta vaguely looks like it might be happening and makes a big deal of it, ignoring a whole bunch of other similar hypotheses that failed. (XKCD has a great cartoon about that sort of situation.)
But that’s not what is going on here, and as one recent review put it, Bonferroni should not be applied “routinely”. It makes sense to use it when there are many uncorrelated tests and no clear prior hypothesis, as in the XKCD cartoon. But here there is an obvious a priori test: does using an LLM make people more accurate? That’s what the whole paper is about. You don’t need a Bonferroni correction for that, and shouldn’t be using it. Deliberately or not (my guess is not), OpenAI has misanalyzed their data in a way which underreports the potential risk. As a statistician friend put it “if somebody was just doing stats robotically, they might do it this way, but it is the wrong test for what we actually care about”.
In fact, if you simply collapsed all the measurements of accuracy, and did the single most obvious test here, a simple t-test, the results would (as Footnote C implies) be significant. A more sophisticated test would be an ANCOVA, which as another knowledgeable academic friend with statistical expertise put it, having read a draft of this essay, “would almost certainly support your point that an omnibus measure of AI boost (from a weighted sum of the five dependent variables) would show a massively significant main effect, given that 9 out of the 10 pairwise comparisons were in the same direction.”
Also, there was likely an effect, but sample sizes were too small to detect this:
There were 50 experts; 25 with LLM access, 25 without. From the reprinted table we can see that 1 in 25 (4%) experts without LLMs succeeded in the formulation task, whereas 4 in 25 with LLM access succeeded (16%).
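For what it's worth, running the reported counts through a Fisher exact test (my own quick sketch, assuming scipy is available) illustrates the power problem:

```python
from scipy.stats import fisher_exact

# Reported counts: 4/25 experts with LLM access succeeded vs 1/25 without.
table = [[4, 21],
         [1, 24]]
odds_ratio, p_value = fisher_exact(table)
print(p_value)  # ~0.35: a 4x difference in success rates is not
                # statistically detectable at these sample sizes
```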
If I'm being honest, I don't see Beff as worthy of debating Yoshua Bengio.
Also: it seems like there would be an easier way to get the observation that this post makes, i.e. directly showing that kV and V get mapped to the same point by layer norm (excluding the epsilon).
Don't get me wrong, the circle is cool, but seems like it's a bit of a detour.
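For instance, a minimal numpy check of the kV/V claim (my own sketch; real LayerNorm's epsilon makes the equality only approximate):

```python
import numpy as np

def layer_norm(x, eps=0.0):
    # LayerNorm without the learned affine parameters.
    return (x - x.mean()) / np.sqrt(x.var() + eps)

v = np.random.randn(4096)
# Scaling is normalised away: kV and V map to the same point (eps = 0).
print(np.allclose(layer_norm(v), layer_norm(3.0 * v)))  # True
```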
Just to check I understand this correctly: from what I can gather it seems that this shows that LayerNorm is monosemantic if your residual stream activation is just that direction. It doesn't show that it is monosemantic for the purposes of doing vector addition where we want to stack multiple monosemantic directions at once. That is, if you want to represent other dimensions as well, these might push the LayerNormed vector into a different spline. Am I correct here?
That said, maybe we can model the other dimensions as random jostling in such a way that it all cancels out if a lot of dimensions are activated?
- What do you see as the low-hanging co-ordination fruit?
- Bringing up the counter-culture movement seems strange. I didn’t really see them as focused on co-ordination.
Do you think it is likely that techniques like RLHF result in over-developed persuasiveness relative to other capabilities? If so, do you think we can modify the training to make this less of an issue or that it is otherwise manageable?
Also: How are funding and attention "arbitrary" factors?
You mean where they said that it was unlikely to succeed?
Good on you for doing this research, but to me it's a lot less interesting because you had the supervisor say: "In theory you can send them fake protocol, or lie about the biosecurity risk level, but it's a gamble, they might notice it or they might not." Okay, they didn't explicitly say to lie, but they explicitly told the AI to consider that possibility.
Regardless of whether or not it's AI Safety Camp, I think it's important to have at least one intro-level research program, particularly because applications for programs like SERI MATS ask about previous research experience in the application.
I can see merit both in Oliver's views about the importance of nudging people down useful research directions and Linda's views on assuming that participants are adults. Still undecided on who I ultimately end up agreeing with, so would love to hear other people's opinions.
Having just read through this, one key point that I haven't seen people mention is that the results are for LLMs that need to be jail-broken.
So these results are more relevant to the release of a model over an API rather than open-source, where you'd just fine-tune away the safeguards or download a model without safeguards in the first place.
I think it’s worth also raising the possibility of a Kuhnian scenario where the “mature science” is actually missing something and a further breakthrough is required to move it into a new paradigm.