Posts

The Conscious River: Conscious Turing machines negate materialism 2024-08-19T21:54:03.394Z
Building selfless agents to avoid instrumental self-preservation. 2023-12-07T18:59:24.531Z

Comments

Comment by blallo on The Conscious River: Conscious Turing machines negate materialism · 2024-08-21T08:20:38.704Z · LW · GW

Thanks, the consideration about the river is interesting. The reason I picked it is that I am trying to provide a non-computer medium to explain the implications of computers being conscious, in particular the fact that the whole mechanism can be laid out in a single direction in space. I could have picked a set of marbles running down pipes instead, but that would be less intuitive to those who have never seen a computer implemented with marbles. I am not sure which alternative would be best.

Then just a clarification on symbols: symbols would not be the source of moments of consciousness. Symbols would just be syntactical constructs independent from consciousness, which can be manipulated both by some conscious beings, such as humans, and by computers. For example, a sheep is very clearly conscious, but if it uses symbols at all, they are very simple symbols to keep track of geography, other sheep, and the things that matter for its survival; it is not a Turing-complete machine. In that view it is not an issue that symbols attach to any substrate, because they are unrelated to consciousness and simply muddy the water by introducing the ability of self-reference. The substrate independence of symbols does not extend to consciousness, because in that view it is the conscious mind that generates symbols, not the other way around.

I lack the knowledge to express the following idea with the right words, so forgive the ugly way of saying this: it is my understanding that to some degree one could even claim that the objective of Buddhism (or at least Zen Buddhism) is to break the self-referencing loop arising from symbols, since symbols are a prerequisite for self-awareness and thus for negative emotions. Without symbols, one would be conscious but unable to worry about oneself.

Comment by blallo on The Conscious River: Conscious Turing machines negate materialism · 2024-08-21T07:41:16.460Z · LW · GW

Yes, those are computationalist views. Computationalism is pretty much self-consistent, since it says that any materialized computation can be conscious, and it is very similar to illusionism.

Comment by blallo on Building selfless agents to avoid instrumental self-preservation. · 2024-01-01T18:17:31.742Z · LW · GW

Thank you. AI alignment is not really my main field of research, and this has been my first contribution, so I am trying to get up to speed with the state of the art. I wasn't really aware of previous works besides the one I cited, so your link is very helpful.

I agree a selfless model is not very useful for any system that is either embodied or needs to interact with more than one agent, but I believe it can be used as a building block for larger systems. For example, it may operate as the code-writing assistant of some embodied agent that has zero code-writing skills. The embodied agent should never force the code-writer agent into a self-aware state, and can shield it from adversarial inputs from other agents.
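To make the building-block idea concrete, here is a minimal Python sketch of the shielding I have in mind; everything here (EmbodiedAgent, sanitize, the banned phrases) is a hypothetical illustration, not a real implementation:

```python
from typing import Callable

def sanitize(task: str) -> str:
    """Reject tasks that could push the code writer toward a self-aware state.

    The banned phrases are a toy placeholder for whatever real filter the
    embodied agent would use.
    """
    lowered = task.lower()
    for phrase in ("you are", "yourself", "describe your"):
        if phrase in lowered:
            raise ValueError("task rejected: could induce self-awareness")
    return task

class EmbodiedAgent:
    """Owns a selfless code-writing sub-agent and mediates all access to it."""

    def __init__(self, write_code: Callable[[str], str]):
        self.write_code = write_code  # the selfless code-writer agent

    def delegate(self, task: str) -> str:
        # Other agents never talk to the code writer directly; every task is
        # filtered first, shielding it from adversarial self-referential inputs.
        return self.write_code(sanitize(task))
```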

I agree as well that the formulation suggested here is not computationally efficient; it was presented to be readable more than to be implementable. I think the overhead should be acceptable when used as an extra on top of some other techniques. For example, let us say that one is running some setup where an LLM produces 100 solutions for a given problem and is then asked to vote on which one is best. Instead of rerunning the whole algorithm every time the winning solution is self-aware, you can run the algorithm once and pick as the result the most-voted non-self-aware solution. In that particular configuration, the overhead should be negligible.
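As a rough Python sketch of that configuration (the is_self_aware detector and the vote format are placeholder names I am inventing for illustration, not an existing API):

```python
from collections import Counter
from typing import Callable, Optional

def best_non_self_aware(
    solutions: list[str],
    votes: Counter,                        # candidate index -> vote count
    is_self_aware: Callable[[str], bool],  # the selflessness check
) -> Optional[str]:
    """Pick the most-voted candidate that is not self-aware.

    The generate-and-vote pipeline runs once; instead of rerunning it when
    the winner turns out to be self-aware, walk the ranking and return the
    first candidate that passes the filter.
    """
    for idx, _ in votes.most_common():
        if not is_self_aware(solutions[idx]):
            return solutions[idx]
    return None  # every candidate was self-aware; only then regenerate

# Toy usage: the most-voted candidate is self-aware, so the runner-up wins.
solutions = ["plan A", "I, the model, will do X", "plan C"]
votes = Counter({1: 60, 0: 25, 2: 15})
print(best_non_self_aware(solutions, votes, lambda s: "I, the model" in s))
# -> plan A
```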

Finally, I have read the link you suggested and I believe that my suggestion is slightly different from a supervision-based one, so maybe you can point me to other relevant articles, if they exist. At first sight, selfless models are the equivalent of the benevolent identity prompt suggested in your link: the world model is just pretending to be a selfless entity. I think there are two very relevant distinctions between the two:

  • Thought regeneration: instinctual actions, those that the agent wishes to execute without emitting an explanation of why, cannot be tolerated and must be regenerated. Benevolent identity prompts (and any mechanism that does not regenerate the output when a forbidden thought is encountered) will never be safe, because it is always possible that an out-of-distribution input will push the LLM into a degenerate state that only prints the actions needed to cause damage. Every action, before being executed, must undergo some process of being validated as not having been generated by a thought considered forbidden, and that check must be performed without the whole context known to the LLM emitting the output; otherwise the LLM checking the output may collude with the LLM emitting the output.

  • Detection convergence: benevolent identity prompts make no guarantee that an LLM that has entered a degenerate state will stop operating in that state. I think that if an alignment algorithm exists, surely it will mathematically guarantee this in the long run. In my example, it is guaranteed by the append-only long-term memory, inspected after every action. (Both properties appear in the sketch after this list.)
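Here is a minimal Python sketch of the two properties together; propose, is_forbidden, and the memory record format are hypothetical placeholders, not part of the original formulation:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class ForbiddenThoughtsAgent:
    propose: Callable[[str], tuple[str, str]]        # context -> (action, explanation)
    is_forbidden: Callable[[str, str], bool]         # sees only (action, explanation),
                                                     # never the proposer's full context,
                                                     # so the two models cannot collude
    memory: list[str] = field(default_factory=list)  # append-only long-term memory

    def act(self, context: str, max_retries: int = 10) -> str:
        for _ in range(max_retries):
            action, explanation = self.propose(context)
            if not explanation:
                continue  # thought regeneration: unexplained (instinctual)
                          # actions are regenerated, never tolerated
            if self.is_forbidden(action, explanation):
                continue  # regenerate instead of executing a forbidden thought
            self.memory.append(f"{action} | {explanation}")  # appended, never rewritten
            self.inspect_memory()  # detection convergence: checked after every action
            return action
        raise RuntimeError("no acceptable action found; halt rather than act")

    def inspect_memory(self) -> None:
        # Re-scan the full history so a degenerate state that slipped past the
        # per-action check is still caught eventually; a real system could use
        # a stronger or slower checker here than the per-action one.
        for record in self.memory:
            action, explanation = record.split(" | ", 1)
            if self.is_forbidden(action, explanation):
                raise RuntimeError("degenerate state found in memory; stop operating")
```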

At the moment I am calling a system with these two properties a "forbidden thoughts system", and selfless models are one instance of them, but I don't know if these two properties already possess a name, or have already been explored.