LessWrong 2.0 Reader
Hi Johannes! Thanks for the suggestion :) I'm not sure I'd want it in the middle of a video call, but maybe in a forum context like this it could be cool?
clement-dumas on Refusal in LLMs is mediated by a single direction
I'm wondering: can we make safety tuning more robust to the "add the accept-every-instruction steering vector" attack by training the model adversarially, with an adversarial model trying to learn a steering vector that maximizes harmfulness?
One concern would be that this makes the model less interpretable, but on the other hand it might make the safety tuning much more robust.
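To make the proposal concrete, here is a toy sketch of that adversarial game in PyTorch. A linear map stands in for the LLM's activations and a frozen linear probe for the harmfulness scorer; every name, shape, and hyperparameter is an illustrative assumption, not a real safety-tuning setup.

```python
import torch

torch.manual_seed(0)
d = 64
model = torch.nn.Linear(d, d)               # stand-in for the LLM's activations
harm = torch.nn.Linear(d, 1)                # frozen stand-in harmfulness probe
for p in harm.parameters():
    p.requires_grad = False
steer = torch.zeros(d, requires_grad=True)  # adversarial steering vector
opt_adv = torch.optim.Adam([steer], lr=1e-2)
opt_mod = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(200):
    x = torch.randn(32, d)
    # Adversary step: learn the steering vector that maximizes harmfulness.
    opt_adv.zero_grad()
    (-harm(model(x) + steer).mean()).backward()
    opt_adv.step()
    # Model step: stay "safe" even under the current steering vector.
    opt_mod.zero_grad()
    harm(model(x) + steer.detach()).mean().backward()
    opt_mod.step()
```

In a real setup the model step would also need a capability or task-performance term, otherwise the model just collapses to whatever minimizes the probe's score.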
johannes-c-mayer on Johannes C. Mayer's Shortform
One reason I never finish any blog post is probably that I start writing it immediately. I think it is better to first build a very good understanding of whatever I'm trying to understand. Only once I'm sure I have understood it do I start to create a very narrowly scoped writeup.
Doing this has two advantages. First, it speeds up the research process, because writing down all your thoughts is slow.
Second, it speeds up the writing of the final document. Since you are no longer confused about the thing, you can focus on the best way to communicate it. This reduces how much editing you need to do. You also limit your scope by writing up only the most important things.
See also here for more concrete steps on how to do this [LW(p) · GW(p)].
jacob-dunefsky on Transcoders enable fine-grained interpretable circuit analysis for language models
Do you have any plans of doing something similar for attention layers?
I'm pretty sure that there's at least one other MATS group (unrelated to us) currently working on this, although I'm not certain about any of the details. Hopefully they release their research soon!
Also, do you have any plans to train sparse MLPs at multiple layers in parallel, and to penalise them so that the connections between them activate sparsely, in addition to having sparse activations?
I did try something similar at one point, but it didn't quite work out. In particular: given an SAE for MLP-out activations, you can try to train an MLP transcoder with an additional loss term penalizing the L1 norm of the pullback of the SAE encoder features through the transcoder decoder matrix. This was intended to induce sparse input-independent connections from the transcoder features to the MLP-out SAE features. Unfortunately, this didn't yield great results: the transcoder features were often polysemantic, and the input-independent connections from the transcoder features to the SAE features were somewhat bizarre-looking. Here's an old graph I just dug up, where the x-axis is the transcoder feature index and the y-axis is the input-independent connection strength to a certain SAE feature.
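For concreteness, here is a minimal PyTorch sketch of that penalty; the shapes, names, and coefficients are made up for illustration rather than being the actual training code.

```python
import torch

d_model, n_tc, n_sae = 768, 4096, 4096

tc_enc = torch.nn.Linear(d_model, n_tc)              # transcoder encoder (MLP-in -> features)
tc_dec = torch.nn.Linear(n_tc, d_model, bias=False)  # transcoder decoder (features -> MLP-out)

sae_enc = torch.nn.Linear(d_model, n_sae)            # frozen SAE encoder for MLP-out activations
for p in sae_enc.parameters():
    p.requires_grad = False

def transcoder_loss(mlp_in, mlp_out, l1_feat=1e-3, l1_pull=1e-4):
    feats = torch.relu(tc_enc(mlp_in))               # sparse transcoder features
    recon = tc_dec(feats)                            # predicted MLP-out activations
    mse = (recon - mlp_out).pow(2).mean()
    # Pullback of the SAE encoder through the transcoder decoder:
    # an [n_tc, n_sae] matrix of input-independent connection strengths.
    pullback = tc_dec.weight.T @ sae_enc.weight.T
    return mse + l1_feat * feats.abs().mean() + l1_pull * pullback.abs().mean()
```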
In the end, I decided to pause work on this idea. It may turn out to be workable, but if so, there are probably a few extra tweaks needed to get it working beyond the naive approach that I tried.
fabien-roger on Questions for labs
What do you expect to be expensive? The engineer hours to build the fine-tuning infra? Or the actual compute for fine-tuning?
Given the amount of internal fine-tuning experiments going on for safety stuff, I'd be surprised if the infra were the bottleneck, though maybe there is a large overhead in making these fine-tuned models available through an API.
I'd be even more surprised if the cost of compute were significant compared to the rest of the lab's activity (I think fine-tuning on a few thousand sequences is often enough for capability evaluations; you rarely need massive training runs).
david-james on How An Algorithm Feels From Inside
For Hopfield networks in general, convergence is not guaranteed. See [1] for convergence properties.
[1] J. Bruck, “On the convergence properties of the Hopfield model,” Proc. IEEE, vol. 78, no. 10, pp. 1579–1585, Oct. 1990, doi: 10.1109/5.58341.
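As a quick toy illustration of the non-convergence point (my own example, not taken from [1]): a two-neuron network with symmetric weights oscillates forever under synchronous updates, even though asynchronous updates with symmetric weights are guaranteed to converge.

```python
import numpy as np

W = np.array([[0, 1],
              [1, 0]])   # symmetric weights, zero diagonal
s = np.array([1, -1])    # initial states in {-1, +1}

for step in range(6):
    s = np.sign(W @ s)   # synchronous update of both neurons
    print(step, s)       # alternates between [-1, 1] and [1, -1] forever
```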
In principle, you could use Whisper or any other ASR system with high accuracy to enforce something like this during a live conversation.
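A minimal sketch of what that could look like with the open-source whisper package; the banned-word list, the chunked audio file, and the alerting logic are all assumptions for illustration.

```python
import whisper

model = whisper.load_model("base")

def flag_words(audio_path, banned_words):
    # Transcribe the latest audio chunk and flag any disallowed words.
    text = model.transcribe(audio_path)["text"].lower()
    return [w for w in banned_words if w in text]

hits = flag_words("last_chunk.wav", {"basically", "literally"})
if hits:
    print("Flagged words:", hits)
```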
david-james on How An Algorithm Feels From Inside
The audio reading of this post [1] mistakenly uses the word hexagon instead of pentagon; e.g. "Network 1 is a hexagon. Enclosed in the hexagon is a five-pointed star".
[1] [RSS feed](https://intelligence.org/podcasts/raz); various podcast sources and audiobooks can be found [here](https://intelligence.org/rationality-ai-zombies/)
I thought a lot about what kinds of things make sense for me to do to solve AI alignment. That did not make me confident that any particular narrow idea that I have will eventually lead to something important.
Rather, I'm confident that executing my research process will over time lead to something good. The research process is:
I think being confident, i.e. not feeling hopeless about what you are doing, is important. The key takeaway here is that you don't need to be confident in any particular idea that you come up with. Instead, you can be confident in the broader picture of what you are doing, i.e. your process.
The best way to become confident in this way is to just work a bunch and then reflect back. It is very likely that you will be able to see how you have improved, and you will probably have had a few successes.
mineta-edralis-juraskova on When is a mind me?
I apologize for not getting back to you sooner; I didn't notice your reply until yesterday. And I apologize for the length of my response, too – I bolded the most important parts.
Re: Whether there is an empirical difference between worlds where OI is true and where OI is false. The difference between all experiences being mine and only some being mine is that if all experiences are mine, then they all exist in the same way this experience now exists, i.e. for me (where me = just this immediacy/this-here-now character, i.e. the way it exists, NOT Edralis's memories, personality etc.). There is no empirical difference in the usual sense, since the way experiences exist cannot be objectively assessed. I can't be sure that you even have any experiences – this is not something that is available for empirical investigation in the way I can assess e.g. the number of someone's fingers. And I can't know, given there are experiences from that point of view, that they exist in the same way as this experience does, i.e. for me. That is only clear in those experiences. If I am there, I do ultimately know that I am there (obviously) – but I have no way to know that while experiencing this person, Edralis. So there is no empirical difference in the usual sense between OI being true and not being true.

However, there are facts other than empirical ones. The existential difference between those two worlds is vast. If OI is true, then I (i.e. the thisness, the here-now-this that at least Edralis's experiences have) am Rob Bensinger, and everybody else – if it's not true, then I am not. The difference is in the being of those experiences, in how they exist. But since experiences (consciousness) don't exist empirically (or better: objectively), there is no empirical (objective) difference. There is an existential, subjective difference, though.
“Some of those brain-moments resemble other brain-moments, either by coincidence or because of some (direct or indirect) causal link between the brain-moments. When we talk about Brain-1 "anticipating" or "becoming" a future brain-state Brain-2, we normally mean things like:
That is not what I mean when I think about anticipating a future brain state. What I am interested in is not the content of experience, but how those experiences are – i.e. whether they are mine. And by ‘mine’ I don’t mean Edralis’s, I mean that they exist in the same way, i.e. they have the same immediate character, the same this-here-now as this experience that I currently am, of Edralis writing this. Not just that they are immediate (all experiences are, by definition), but that they are immediate in the same way. If there are no souls or ghosts, that means there is just one way in which they can be immediate, i.e. they are all immediate like this experience is.
Do you think we understand each other? Do you see what I'm referring to when I point to that which THIS1 experience-moment, of Edralis writing "THIS1", and THIS2 experience-moment, of Edralis writing "THIS2", have in common – that which is shared totally between those two experiences, but has nothing to do with their content, including their continuity? I mean the underlying "canvas" where the experiences "take place", which however is nothing outside of those experiences. It's not a Ghost – I mean, you could call it that, but if that is what the Cartesian Ghost is supposed to be, then it's self-evidently real. But I am not postulating an entity that is somehow additional to experiences. Does it make sense? When you watch your experience, do you see what I mean? The THISNESS which carries over in your experience from moment to moment?
From what you say, it seems to me we’re talking past each other; however, I don’t know how to elicit an alignment/understanding in our conceptual frameworks. It seems to me your worldview is very empirical, very physicalist – whereas mine is phenomenology-first. What I see when I “look at the world” is ME, i.e. consciousness assuming different forms, different experiential qualities existing. (There might be something "outside" that I am a picture of.) What you see when you “look at the world” is probably THE WORLD that is NOT yourself. I don’t think you’re wrong - this is a way to conceive of yourself (i.e. identifying yourself with a particular human), but I think you might be missing something - that essential me. However, maybe it’s me whose understanding is lacking – and I really want to understand.
Maybe the update that's happening is something like: "Previously it felt to me like other people's experiences weren't fully real. I was unduly selfish and self-centered, because my experiences seemed to me like they were the center of the universe; I abstractly and theoretically knew that other people have their own point of view, but that fact didn't really hit home for me. Then something happened, and I had a sudden realization that no, it's all real."
It's more than that – it’s not just that they are real, but that they are real for me, where “me” is the this-here-now of this experience. That is, the claim is that the this-here-now/presence/immediacy/thisness is the same in all experiences, not just that they all have such a character (which is true by definition). And since the one that is self-evident is indisputably mine, then, if all are the same, all are mine.
One more question: Can you imagine never having been born as Rob Bensinger, but instead being born as a different person? Does it make sense to imagine a person (or animal, or other conscious being) that would be you, but that wouldn't share any memories or brain-matter with the person that you are, i.e. Rob Bensinger? Because to me, it makes perfect sense – what I imagine this non-Edralis that is me to have, or rather, what the non-Edralis-centered experiences that are me have, is the same THISNESS, i.e. that which is essentially me. I am Edralis contingently – I didn't have to be her, I could be someone else – what I am essentially is that THISNESS that her experiences have. If OI is true, then all experiences simply have this same thisness. And since I am the thisness, all experiences are mine, i.e. I am everyone. This is very shocking. It's outright world-shattering! If it doesn't sound shocking, it's either because a person has known it for a long time and gotten used to it, or because they don't grasp what it means. So – even if you don't agree with OI, can you imagine what it would mean? Do you see in what sense you could be e.g. Albert Einstein or Eliezer Yudkowsky or Queen Victoria?
Thank you!