Comments
I think the dojo analogy is very good and useful. Some unstructured thoughts: It gets at a core feature of humans, namely being able to adjust our personalities based on context. I suspect there is a semi-stable equilibrium here that is important. This is a big reason people underestimate company/community culture: it can give some amount of herd immunity to bad behavior. If sufficiently many people "defect", the culture changes. This is also an issue as communities grow, of course: policing gets harder and nuances of behavior get lost.
Great work! I don't know a lot about steering vectors and have some questions. I am also happy for you to just send me to a different resource.
1) From my understanding, steering vectors are just things you add to the activations. At what layer do you do this?
2) You write "we optimized four different “harmful code” vectors". How did you "combine" the resulting four different vectors?
3) I would also be interested in how similar these vectors are to each other, and how similar they are to the refusal vector.
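To make the three questions concrete, here is a minimal NumPy sketch of what "adding a steering vector to the activations" and "comparing vectors" could look like. Everything in it is an assumption for illustration: the shapes, the random stand-in data, the `scale` parameter, and especially the averaging step, which is just one possible way to combine four vectors and not necessarily what the post did. The layer at which the vector is added is left unspecified, since that is exactly the open question.

```python
import numpy as np

def add_steering_vector(activations, steering_vector, scale=1.0):
    """Add a scaled steering vector to the activation at every token position.

    activations: array of shape (tokens, d_model) taken from some layer
                 (which layer is an open question, not specified here).
    steering_vector: array of shape (d_model,).
    """
    return activations + scale * steering_vector

def cosine_similarity(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
d_model = 8

# Hypothetical stand-ins: random activations and four random "steering vectors".
activations = rng.normal(size=(5, d_model))
vectors = rng.normal(size=(4, d_model))

# ASSUMPTION: averaging is one possible way to combine several vectors.
mean_vector = vectors.mean(axis=0)
steered = add_steering_vector(activations, mean_vector, scale=2.0)

# Pairwise cosine similarities between the four vectors (4 x 4 matrix).
normed = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
pairwise = normed @ normed.T
print(np.round(pairwise, 2))
```

The same `cosine_similarity` helper could be applied between each vector and a refusal vector, if one is available.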
That is fair, I should have probably left some seed statements regarding the definition of AGI / ASI.
EDIT: I have added additional statements.
I just want to say that Amazon is fairly close to a universal recommendation app!
Where can I read about this 2-state HMM? By "learn" I just mean approximate via an algorithm. The UAT is not sufficient, as it talks about learning a known function. Baum-Welch is such an algorithm, but as far as I am aware it gives no guarantees on anything really.
Is there some theoretical result along the lines of "A sufficiently large transformer can learn any HMM"?
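For reference, a minimal sketch of Baum-Welch on a discrete 2-state HMM, assuming NumPy and made-up parameters (the "true" model, the initial guess, the sequence length, and all function names are hypothetical). As an instance of EM, each iteration does not decrease the data log-likelihood, but that is essentially the only guarantee: it can converge to a local optimum and need not recover the true parameters.

```python
import numpy as np

def sample_hmm(T, pi, A, B, rng):
    """Sample T observations from a discrete HMM (pi: initial, A: transition, B: emission)."""
    obs = np.zeros(T, dtype=int)
    state = rng.choice(len(pi), p=pi)
    for t in range(T):
        obs[t] = rng.choice(B.shape[1], p=B[state])
        state = rng.choice(len(pi), p=A[state])
    return obs

def forward_backward(obs, pi, A, B):
    """Scaled forward-backward pass; returns alpha, beta, scaling factors c."""
    T, K = len(obs), len(pi)
    alpha, beta, c = np.zeros((T, K)), np.ones((T, K)), np.zeros(T)
    alpha[0] = pi * B[:, obs[0]]
    c[0] = alpha[0].sum()
    alpha[0] /= c[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        c[t] = alpha[t].sum()
        alpha[t] /= c[t]
    for t in range(T - 2, -1, -1):
        beta[t] = (A @ (B[:, obs[t + 1]] * beta[t + 1])) / c[t + 1]
    return alpha, beta, c

def baum_welch_step(obs, pi, A, B):
    """One EM update. Returns new (pi, A, B) and the log-likelihood of the OLD parameters."""
    alpha, beta, c = forward_backward(obs, pi, A, B)
    gamma = alpha * beta  # posterior state probabilities, each row sums to 1
    # xi[t, i, j]: posterior probability of transition i -> j between t and t+1
    xi = (alpha[:-1, :, None] * A[None, :, :]
          * (B[:, obs[1:]].T * beta[1:])[:, None, :] / c[1:, None, None])
    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.stack([gamma[obs == m].sum(axis=0) for m in range(B.shape[1])], axis=1)
    new_B /= gamma.sum(axis=0)[:, None]
    return new_pi, new_A, new_B, float(np.log(c).sum())

rng = np.random.default_rng(0)
# Hypothetical "true" 2-state HMM with 2 observation symbols.
true_pi = np.array([0.5, 0.5])
true_A = np.array([[0.9, 0.1], [0.2, 0.8]])
true_B = np.array([[0.8, 0.2], [0.1, 0.9]])
obs = sample_hmm(500, true_pi, true_A, true_B, rng)

# Asymmetric initial guess (all entries positive) to break symmetry.
pi = np.array([0.5, 0.5])
A = np.array([[0.6, 0.4], [0.3, 0.7]])
B = np.array([[0.7, 0.3], [0.4, 0.6]])

logliks = []
for _ in range(30):
    pi, A, B, ll = baum_welch_step(obs, pi, A, B)
    logliks.append(ll)
print(np.round(A, 2), np.round(B, 2))
```

The list `logliks` makes the monotonicity easy to inspect, which is about as much as the algorithm promises.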
It would be interesting for people to post current research that they think has some small chance of outputting highly singular results!
Grothendieck seems to have been an extremely singular researcher; several of his discoveries would likely have been significantly delayed without him. His work on sheaves is mind-bending the first time you see it and was seemingly ahead of its time.
As someone who is currently getting a PhD in mathematics, I wish I could use Lean. The main problem for me is that the area I work in hasn't been formalized in Lean yet. I tried for like a week, but didn't get very far... I only managed to implement the definition of a Poisson point process (kinda). I concluded that it wasn't worth spending my time to create this feedback loop and that I'd rather work based on vibes.
I am jealous of the next generation of mathematicians that are forced to write down everything using formal verification. They will be better than the current generation.
I would call this "not thinking on the margins"
Some early results:
- Most people disagreed with the following two statements: "I think the probability of AGI before 2030 is above 50%" and "AI-to-human safety is fundamentally the same kind of problem as any interspecies animal-to-animal cooperation problem".
- Most people agreed with the statements: "Brain-computer interfaces (e.g. neuralink tech) that is strong and safe enough to be disruptive will not be developed before AGI." and "Corporate or academic labs are likely to build AGI before any state actor does."
- There seem to be two large groups whose main disagreement is about the statement "I think the probability of AGI before 2040 is above 50%". We will call people agreeing Group A and people disagreeing Group B.
- Group A agreed with "By 2035 it will be possible to train and run an AGI on fewer compute resources than required by PaLM today (if society survives that long)." and "I think establishing a norm of safety testing new SotA models in secure sandboxes is a near-term priority."
- Group B agreed with "I think the chance of an AI takeover is less than 15%".
- The most uncertainty was around the following two statements: "The 'Long Reflection' seems like a good idea, and I hope humanity manages to achieve that state." and "TurnTrout's 'training story' about a diamond-maximizer seemed fatally flawed, prone to catastrophic failure."