Posts

Comments

Comment by David Jilk (david-jilk) on My AI Model Delta Compared To Yudkowsky · 2024-06-17T15:19:06.813Z · LW · GW

This is probably the wrong place to respond to the notion of incommensurable ontologies. Oh well, sorry.

While I agree that if an agent has a thoroughly incommensurable ontology, alignment is impossible (or perhaps even meaningless or incoherent), it also means that the agent has no access whatsoever to human science. If it can't understand what we want, it also can't understand what we've accomplished. To be more concrete, it will not understand electrons from any of our books, because it won't understand our books. It won't understand our equations, because it won't understand equations, nor will it have referents (theoretical or observational) for the variables and entities they contain.

Consequently, it will have to develop science and technology from scratch. It took a long time for us to do that, and it will take that agent a long time to do it. Sure, it's "superintelligent," but understanding the physical world requires empirical work. That is time-consuming, it requires tools and technology, etc. Furthermore, an agent with an incommensurable ontology can't manipulate humans effectively - it doesn't understand us at all, aside from what it observes, which is a long, slow way to learn about us. Indeed it doesn't even know that we are a threat, nor does it know what a threat is.

Long story short, it will be a long time (decades? centuries?) before such an agent would be able to prevent us from simply unplugging it. Science does not and cannot proceed at the speed of computation, so all of the "exponential improvement" in its "intelligence" is limited by the pace of knowledge growth.

Now, what if it has some purchase on human ontology? Well, then, it seems likely that it can grow that purchase into a sufficient shared subset, so that we can understand each other sufficiently well: it can understand our science, but it can also understand our values.

The point is that if you have one, you're likely to have the other. Of course, this does not mean that it will align with those values. But the incommensurable ontology argument then just reduces to an argument for slow takeoff.

I've published this point as part of a paper in Informatica. https://www.informatica.si/index.php/informatica/article/view/1875

Comment by David Jilk (david-jilk) on Believing In · 2024-02-10T20:41:31.844Z · LW · GW

I've done some similar analysis on this question myself in the past, and I am running a long-term N=1 experiment by opting not to take the attitude of belief toward anything at all. Substituting words like prefer, anticipate, and suspect has worked just fine for me and removes the commitment and brittleness of thought associated with holding beliefs.

Also, in looking into these questions, I learned that other languages do not have in one word the same set of disparate meanings (polysemy) as our word belief. In particular, the way we use it in American English to "hedge" (i.e., meaning "I think but I am not sure") is not a typical usage, and my recollection (possibly flawed) is that it isn't typical in British English either.

Comment by David Jilk (david-jilk) on Goals selected from learned knowledge: an alternative to RL alignment · 2024-01-21T14:08:23.452Z · LW · GW

>> I’ve been trying to understand and express why I find natural language alignment ... so much more promising than any other alignment techniques I’ve found.

Could it be that we humans have millennia of experience aligning our new humans (children) using this method? Every other method, by contrast, is entirely new to us and has never been applied to a GI, even if it has been tested on other AI systems; thus, predictions of outcomes are speculative.

But it still seems like there is something missing from specifying goals directly via expression through language or even representational manipulation. If the representations themselves do not contain any reference to motivational structure (i.e., they are "value-free" representations), then the goals will not be particularly stable. Johnny knows that it's bad to hit his friends because Mommy told him so, but he only cares because it's Mommy who told him, and he has a rather strong psychological attachment to Mommy.

Comment by David Jilk (david-jilk) on Systems that cannot be unsafe cannot be safe · 2023-05-02T18:27:48.700Z · LW · GW

It's worse than that. https://arxiv.org/abs/1604.06963