I like having a list of small, useful things to do that tend to pay off in the long run, like:
- go to the grocery store to make sure you have fresh fruits and vegetables
- meditate for 10 minutes
- do pushups and sit-ups
- journal for 10 minutes
When my brain feels cluttered, it is nice to have a list of time-boxed simple tasks that don’t require planning or assessment.
Verify human designs and automatically create AI-generated designs which provably cannot be opened by mechanical picking.
Such a proof would be subject to its definition of "mechanical picking" and a sufficiently accurate physics model. (For example, would an electronically-controllable key-looking object with adjustable key-cut depths with pressure sensors qualify as a "pick"?)
I don't dispute the value of formal proofs for safety. If accomplished, they move the conversation to "is the proof correct?" and "are we proving the right thing?". Both are steps in the right direction, I think.
Thanks for the references; I'll need some time to review them. In the meanwhile, I'll make some quick responses.
As a side note, I'm not sure how tree search comes into play; in what way does tree search require unbounded steps that doesn't apply equally to linear search?
I intended tree search as just one example, since minimax tree search is a common example for game-based RL research.
No finite agent, recursive or otherwise, can plan over an unbounded number of steps in finite time...
In general, I agree. Though there are notable exceptions for cases such as (not mutually exclusive):
- a closed-form solution is found (for example, where a time-based simulation can calculate some quantity at any arbitrary time step using the same amount of computation)
- approximate solutions using a fixed number of computation steps are viable
- a greedy algorithm can select the immediate next action that is equivalent to following a longer-term planning algorithm
... so it's not immediately clear to me how iteration/recursion is fundamentally different in practice.
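The closed-form exception above can be made concrete with a toy example of my own (compound growth; the numbers are made up): an iterative simulation takes O(steps) work, while the closed form computes the same quantity in O(1) for any step count.

```python
import math

def grow_iterative(principal: float, rate: float, steps: int) -> float:
    """Simulate each time step explicitly: O(steps) work."""
    balance = principal
    for _ in range(steps):
        balance *= 1 + rate
    return balance

def grow_closed_form(principal: float, rate: float, steps: int) -> float:
    """Same quantity in O(1) work, for any step count."""
    return principal * (1 + rate) ** steps

assert math.isclose(grow_iterative(100.0, 0.05, 1000),
                    grow_closed_form(100.0, 0.05, 1000))
```

This is the sense in which a bounded computation can "plan" arbitrarily far ahead: the step count appears only as a parameter, not as a loop bound.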
Yes, like I said above, I agree in general and see your point.
As I'm confident we both know, some algorithms can be written more compactly when recursion/iteration are available. I don't know how much computation theory touches on this; i.e. what classes of problems this applies to and why. I would make an intuitive guess that it is conceptually related to my point earlier about closed-form solutions.
Note that this is different from the (also very interesting) question of what LLMs, or the transformer architecture, are capable of accomplishing in a single forward pass. Here we're talking about what they can do under typical auto-regressive conditions like chat.
I would appreciate if the community here could point me to research that agrees or disagrees with my claim and conclusions, below.
Claim: one pass through a transformer (of a given size) can only do a finite number of reasoning steps.
Therefore: If we want an agent that can plan over an unbounded number of steps (e.g. one that does tree-search), it will need some component that can do an arbitrary number of iterative or recursive steps.
Sub-claim: The above claim does not conflict with the Universal Approximation Theorem.
Claim: the degree to which the future is hard to predict has no bearing on the outer alignment problem.
- If one is a consequentialist (of some flavor), one can still construct a "desirability tree" over various possible future states. Sure, the uncertainty makes the problem more complex in practice, but the algorithm is still very simple. So I don't think that a more complex universe intrinsically has anything to do with alignment per se.
- Arguably, machines will have better computational ability to reason over a vast number of future states. In this sense, they will be more ethical according to consequentialism, provided their valuation of terminal states is aligned.
- To be clear, of course, alignment w.r.t. the valuation of terminal states is important. But I don't think this has anything to do with a harder to predict universe. All we do with consequentialism is evaluate a particular terminal state. The complexity of how we got there doesn't matter.
- (If you are detecting that I have doubts about the goodness and practicality of consequentialism, you would be right, but I don't think this is central to the argument here.)
- If humans don't really carry out consequentialism like we hope they would (and surely humans are not rational enough to adhere to consequentialist ethics -- perhaps not even in principle!), we can't blame this on outer alignment, can we? This would be better described as goal misspecification.
- If one subscribes to deontological ethics, then the problem becomes even easier. Why? One wouldn't have to reason probabilistically over various future states at all. The goodness of an action only has to do with the nature of the action itself.
- Do you want to discuss some other kind of ethics? Is there some other flavor that would operate differentially w.r.t. outer alignment in a more versus less predictable universe?
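The "desirability tree" algorithm mentioned above really is simple; here is a minimal expectimax sketch (node structure, values, and probabilities are all made up for illustration):

```python
def expectimax(node):
    """Evaluate a desirability tree: maximize over the agent's
    choices, take expectations over nature's chance moves."""
    kind = node["kind"]
    if kind == "terminal":
        return node["value"]  # desirability of the end state
    if kind == "chance":      # nature moves with known probabilities
        return sum(p * expectimax(child) for p, child in node["outcomes"])
    if kind == "choice":      # the agent picks the best action
        return max(expectimax(child) for child in node["actions"])

tree = {"kind": "choice", "actions": [
    {"kind": "chance", "outcomes": [
        (0.9, {"kind": "terminal", "value": 1.0}),
        (0.1, {"kind": "terminal", "value": -10.0})]},
    {"kind": "terminal", "value": 0.5}]}

assert expectimax(tree) == 0.5  # the safe option beats 0.9*1 - 0.1*10
```

An unpredictable universe widens the chance nodes but leaves this evaluation rule untouched, which is the point being made: only the valuation of terminal states carries the alignment burden.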
Want to try out a thought experiment? Put that same particular human (who wanted to specify goals for an agent) in the financial scenario you mention. Then ask: how well would they do? Compare the quality of how the person would act versus how well the agent might act.
This raises related questions:
- If the human doesn't know what they would want, it doesn't seem fair to blame the problem on alignment failure. In such a case, the problem would be a person's lack of clarity.
- Humans are notoriously good rationalizers and may downplay their own bad decisions. Making a fair comparison between "what the human would have done" versus "what the AI agent would have done" may be quite tricky. (See the Fundamental Attribution Error, a.k.a. correspondence bias.)
As I understand it, the argument above doesn't account for the agent using the best information available at the time (in the future, relative to its goal specification).
I think there is some confusion around a key point. For alignment, do we need to define what an agent will do in all future scenarios? It depends what you mean.
- In some sense, no, because in the future, the agent will have information we don't have now.
- In some sense, yes, because we want to know (to some degree) how the agent will act with future (unknown) information. Put another way, we want to guarantee that certain properties hold about its actions.
Let's say we define an aligned agent as one that does what we would want, provided that we were in its shoes (i.e. knowing what it knew). Under this definition, it is indeed possible to specify an agent's decision rule in a way that doesn't rely on long-range predictions (where predictive power gets fuzzy, like Alejandro says, due to measurement error and complexity). See also the adjacent comment about a thermostat by eggsyntax.
Note: I'm saying "decision rule" intentionally, because even an individual human does not have a well-defined utility function.
Nevertheless, it seems wrong to say that my liver is optimising my bank balance, and more right to say that it "detoxifies various metabolites, synthesizes proteins, and produces biochemicals necessary for digestion"---even though that gives a less precise account of the liver's behaviour.
I'm not following why this is a less precise account of the liver's behavior.
Here is an example of a systems dynamics diagram showing some of the key feedback loops I see. We could discuss various narratives around it and what to change (add, subtract, modify).
┌───── to the degree it is perceived as unsafe ◀──────────┐
│ ┌──── economic factors ◀─────────┐ │
│ + ▼ │ │
│ ┌───────┐ ┌───────────┐ │ │ ┌────────┐
│ │people │ │ effort to │ ┌───────┐ ┌─────────┐ │ AI │
▼ - │working│ + │make AI as │ + │ AI │ + │potential│ + │becomes │
├─────▶│ in │────▶│powerful as│─────▶│ power │───▶│ for │───▶│ too │
│ │general│ │ possible │ └───────┘ │unsafe AI│ │powerful│
│ │ AI │ └───────────┘ │ └─────────┘ └────────┘
│ └───────┘ │
│ │ net movement │ e.g. use AI to reason
│ + ▼ │ about AI safety
│ ┌────────┐ + ▼
│ │ people │ ┌────────┐ ┌─────────────┐ ┌──────────┐
│ + │working │ + │ effort │ + │understanding│ + │alignment │
└────▶│ in AI │────▶│for safe│─────▶│of AI safety │─────────────▶│ solved │
│ safety │ │ AI │ └─────────────┘ └──────────┘
└────────┘ └────────┘ │
+ ▲ │
└─── success begets interest ◀───┘
I find this style of thinking particularly constructive.
- For any two nodes, you can see a visual relationship (or lack thereof) and ask "what influence do these have on each other and why?".
- The act of summarization cuts out chaff.
- It is harder to fool yourself about the completeness of your analysis.
- It is easier to get to core areas of confusion or disagreement with others.
Personally, I find verbal reasoning workable for "local" (pairwise) reasoning but quite constraining for systemic thinking.
If nothing else, I hope this example shows how easily key feedback loops get overlooked. How many of us claim to have... (a) some technical expertise in positive and negative feedback? (b) interest in Bayes nets? So why don't we take the time to write out our diagrams? How can we do better?
P.S. There are major oversights in the diagram above, such as economic factors. This is not a limitation of the technique itself -- it is a limitation of the space and effort I've put into it. I have many other such diagrams in the works.
I’m curious if your argument, distilled, is: fewer people skilled in technical AI work is better? Such a claim must be examined closely! Think of it from a systems dynamics point of view. We must look at more than just one relationship. (I personally try to press people to share some kind of model that isn’t presented only in words.)
One important role of a criminal justice system is rehabilitation. Another, according to some, is retribution. Those in Azkaban suffer from perhaps the most awful form of retribution. Dementation renders a person incapable of rehabilitation.
Consider this if-then argument:
If:
- Justice is served without error (which is not true)
- The only purpose for criminal justice is retribution
Then: Azkabanian punishment is rational.
Otherwise, assuming there are other ways to protect society from the person, it is irrational to dement people.
Speaking broadly, putting aside the fictional world of Azkaban, there is an argument that suggests retribution for its own sake is wrong. It is simple: inflicting suffering is wrong, all other things equal. Retribution makes sense only to the extent it serves as a deterrent.
First, I encourage you to put credence in the current score of -40 and a moderator saying the post doesn't meet LessWrong's quality bar.
By LD you mean Lincoln-Douglas debate, right? If so, please continue reading.
Second, I'd like to put some additional ideas up for discussion and consideration -- not debate -- I don't want to debate you, certainly not in LD style. If you care about truth-seeking, I suggest taking a hard and critical look at LD. To what degree is Lincoln-Douglas debate organized around truth-seeking? How often does a participant in an LD debate change their position based on new evidence? In my understanding, in practice, LD is quite uninterested in the notion of being "less wrong". It seems to be about a particular kind of "rhetorical art" of fortifying one's position as much as possible while attacking another's. One might hope that somehow the LD debate process surfaces the truth. Maybe, in some cases. But generally speaking, I find it to be a woeful distortion of curious discussion and truth-seeking.
Surprisingly, perhaps, https://dl.acm.org/doi/book/10.5555/534975 has a free link to the full-text PDF.
Reinforcement learning is not required for the analysis above. Only evolutionary game theory is needed.
- In evolutionary game theory, the population's mix of strategies changes via replicator dynamics.
- In RL, each individual agent modifies its policy as it interacts with its environment using a learning algorithm.
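To make the distinction concrete, here is a minimal replicator-dynamics sketch of my own for a hawk-dove game (payoffs V=2, C=4 and the fitness baseline are illustrative choices, not from the original analysis). The population mix shifts toward higher-fitness strategies; no individual agent runs a learning algorithm.

```python
PAYOFF = {("H", "H"): -1.0, ("H", "D"): 2.0,
          ("D", "H"): 0.0,  ("D", "D"): 1.0}
BASELINE = 3.0  # shift payoffs so fitness stays positive

def step(x_hawk: float) -> float:
    """One discrete replicator update on the hawk fraction."""
    f_hawk = BASELINE + x_hawk * PAYOFF[("H", "H")] + (1 - x_hawk) * PAYOFF[("H", "D")]
    f_dove = BASELINE + x_hawk * PAYOFF[("D", "H")] + (1 - x_hawk) * PAYOFF[("D", "D")]
    mean = x_hawk * f_hawk + (1 - x_hawk) * f_dove
    return x_hawk * f_hawk / mean  # grow in proportion to relative fitness

x = 0.9  # start hawk-heavy
for _ in range(200):
    x = step(x)
assert abs(x - 0.5) < 1e-3  # converges to the mixed equilibrium V/C = 0.5
```

The only state is the population fraction; contrast with RL, where each agent carries and updates its own policy.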
Personally, I am most confident in 1, then 4, then 3, then 2 (in each case conditional on all the previous claims).
Oops. A previous version of this comment was wrong, so I edited it. The author’s confidence can be written as:
Also, independent of the author’s confidence:
thereby writing directly into your brain’s long-term storage and bypassing the cache that would otherwise get erased
What do we know about "writing directly" into long-term storage versus a short-term cache? What studies? Any theories about the mechanism(s)?
First, thank you for writing this. I would ask that you continue to think & refine and share back what you discover, prove, or disprove.
To me, it seems quite likely that B will have a lot of regularity to it. It will not be good code from the human perspective, but there will be a lot of structure I think, simply because that structure is in T and the environment.
I'm interested to see if we can (i) do more than claim this is likely and (ii) unpack reasons that might require that it be the case.
One argument for (ii) would go like this. Assume the generating process for A has a preference for shorter-length programs. So we can think of A as tending to find shorter description lengths that match task T.
Claim: shorter (and correct) descriptions reflect some combination of environmental structure and compression.
- by 'environmental structure' I mean the laws underlying the task.
- by 'compression' I mean using information theory embodied in algorithms to make the program smaller
I think this claim is true, but let's not answer that too quickly. I'd like to probe this question more deeply.
- Are there more than two factors (environmental structure & compression)?
- Is it possible that the description gets the structure wrong but makes up for it with great compression? I think so. One can imagine a clever trick by which a small program expands itself into something like a big ball of mud that solves the task well.
- Any expansion process takes time and space. This makes me wonder if we should care not only about description length but also run time and space. If we pay attention to both, it might be possible to penalize programs that expand into a big ball of mud.
- However, penalizing run time and space might be unwise, depending on what we care about. One could imagine a program that starts with first principles and derives higher-level approximations that are good enough to model the domain. It might be worth paying the cost of setting up the approximations because they are used frequently. (In other words, the amortized cost of the expansion is low.)
- Broadly, what mathematical tools can we use on this problem?
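One cheap empirical handle on "shorter descriptions reflect structure": lawful data compresses far better than structureless data of the same size. Here zlib stands in (crudely) for description length; this is just an intuition pump, not a claim about Kolmogorov complexity.

```python
import os
import zlib

structured = bytes(i % 16 for i in range(10_000))  # lawful, periodic
random_ish = os.urandom(10_000)                    # no exploitable structure

short = len(zlib.compress(structured, 9))
long_ = len(zlib.compress(random_ish, 9))

assert short < 200      # the regularity is captured in a tiny description
assert long_ > 9_000    # incompressible data stays near its raw size
```

The "big ball of mud" worry maps onto the gap between these two: a clever self-expanding program is one whose short compressed form hides a large, unstructured expansion.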
See also Nomic, a game by Peter Suber where a move in the game is a proposal to change the rules of the game.
I grant that legalese increases the total page count, but I don't think it necessarily changes the depth of the tree very much (by depth I mean how many documents refer back to other documents).
I've seen spaghetti towers written in very concise computer languages (such as Ruby) that nevertheless involve perhaps 50+ levels (in this context, a level is a function call).
In my experience, programming languages with {static or strong} typing are considerably easier to refactor in comparison to languages with {weak or dynamic} typing.*
* The {static vs dynamic} and {strong vs weak} dimensions are sometimes blurred together, but this Stack Overflow Q&A unpacks the differences pretty well.
No source code
I get the intended meaning, but I would like to make the words a little more precise. While we can find the executable source code (DNA) for an organism, that DNA is far from a high-level language.
I got minimal value from the article as written, but I'm hoping that a steel-man version might be useful. In that spirit, I can grant a narrower claim: Smart people have more capability to fool us, all other things equal. Why? Because increased intelligence brings increased capability for deception.
- This is as close to a tautology as I've seen in a long time. What predictive benefit comes from tautologies? I can't think of any.
- But why focus on capability? Probability of harm is a better metric.
- Now, with that in mind, one should not assume a straight line between capability and probability of harm. One should look at all potential causal factors.
- More broadly, the "all other things equal" part is problematic here. I will try to write more on this topic when I have time. My thoughts are not fleshed out yet, but I think my unease has to do with how ceteris paribus imposes constraints on a system. The claim I want to examine would go something like this: those constraints "bind" the system in ways that prevent proper observation and analysis.
If instead you keep deliberating until the balance of arguments supports your preferred conclusion, you're almost guaranteed to be satisfied eventually!
Inspired by the above, I offer the pseudocode version...
loop {
if assess(args, weights) > 1 { // assess active arguments
break; // preferred conclusion is "proved"
} else {
arg = biased_sample(remaining_args); // without replacement
args.insert(arg);
optimize(args, weights); // mutates weights to maximize `assess(args, weights)`
}
}
... the code above implements "the balance of arguments" as a function parameterized with weights. This allows for using an optimization process to reach one's desired conclusion more quickly :)
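For anyone who wants to run the satire, here is a Python rendering of the same loop (the toy `assess` and `optimize` functions are my own placeholder instantiations):

```python
import random

def motivated_deliberation(remaining, assess, optimize, weights):
    """Sample arguments until the weighted 'balance of arguments'
    crosses the threshold -- a satire of biased deliberation."""
    args = []
    while assess(args, weights) <= 1 and remaining:
        arg = remaining.pop(random.randrange(len(remaining)))  # biased_sample
        args.append(arg)
        weights = optimize(args, weights)  # reweight toward the conclusion
    return args, weights

# Toy instantiation: each accepted argument's weight gets inflated.
assess = lambda args, w: sum(w.get(a, 0.0) for a in args)
optimize = lambda args, w: {a: w.get(a, 0.0) + 0.4 for a in args}

args, weights = motivated_deliberation(["a", "b", "c", "d"],
                                       assess, optimize, {})
assert assess(args, weights) > 1  # the preferred conclusion is "proved"
```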
Thanks for your quick answer -- you answered before I was even done revising my question. :) I can personally relate to Dan Luu's examples.
This immediately makes me want to find potential solutions, but I won't jump to any right now.
For now, I'll just mention the ways in which Jacob Collier can explain music harmony at many levels.
Preface: I feel like I'm wearing the clown suit to a black tie event here. I'm new to LW and respect the high standards for discussion. So, I'll treat this an experiment. I'd rather be wrong, downvoted, and (hopefully) enlightened & persuaded than have this lingering suspicion that the emperor has no clothes.
I should also say that I personally care a lot about the topic of communication and brevity, because I have a tendency to say too much at one time and/or use the wrong medium in doing so. If anyone needs to learn how to be brief, it is me, and I'll write a few hundred words if necessary to persuade you of it.
Ok, that said, here are my top two concerns with the article: (1) This article strikes me as muddled and unclear. (i) I don't understand what "get" five words even means. (ii) I don't understand how coordination relates to the core claims or insight. My confusion leads to my second concern: (2) what can I take from this article?
Let's start with the second part. Is the author saying if I'm a CEO of a company of thousands I only "get" five words?
A quick aside: to me, "get" is an example of muddled language. What does the author mean w.r.t. (a) time period; (b) ... struggling for the right words here ... meaning? As to (a), do I "get" five words per message? Or five words some (unspecified) time frame? As to (b), is "get" a proxy for how many words the recipient/audience will read? But reading isn't enough for coordination, so I expect the author means something more. Does the author mean "read and understand" or "read and internalize" or "read and act on"?
Anyhow, due to the paragraph above, I don't know how to convert "You only get five words" into a prediction. In this sense, to me, the claim isn't even wrong, because I don't know how to put it into practice.
Normally I would stop here, put the article aside, and move on. However, this article is featured here on LW and has many up-votes which suggests that others get a lot of value out of it. So I'm curious: what am I missing? Is there some connection to EA that makes this particularly salient, perhaps?
I have a guess that fans of the article have some translation layer that I'm missing. Perhaps if I could translate what the author means by get and coordination I would have the ah-ha moment.
To that end, would someone be so kind as to (a) summarize the key point(s) as simply as possible; with (b) clear intended meanings for "coordinate" and "get" (as in you only "get" X words) -- including what timeframe we're talking about -- and (c) the logic and evidence for the claims.
It is also possible that I'm not "calibrated" with the stated Epistemic Status:
"all numbers are made up and/or sketchily sourced. Post errs on the side of simplistic poetry – take seriously but not literally."
Ok, but what does this mean for the reader? The standards of rationality still apply, right? There should still be some meaningful, clear, testable takeaway, right?
Would you please expand on how ai-plans.com addresses the question from the post above ... ?
Maybe let's try to make a smart counter-move and accelerate the development of for-profit AI Safety projects [...] ? With the obvious idea to pull some VC money, which is a different pool than AI safety philanthropic funds.
I took a look at ai-plans, but I have yet to find information about:
- How does it work?
- Who created it?
- What is the motivation for building it?
- What problem(s) will ai-plans help solve?
- Who controls / curates / moderates it?
- What is the process/algorithm for: curation? moderation? ranking?
I would suggest (i) answering these questions on the ai-plans website itself then (ii) adding links here.
Let's step back. This thread of the conversation is rooted in this claim: "Let's be honest: all fiction is a form of escapism." Are we snared in the Disputing Definitions trap? To quote from that LW article:
if the issue arises, both sides should switch to describing the event in unambiguous lower-level constituents, like acoustic vibrations or auditory experiences. Or each side could designate a new word, like 'alberzle' and 'bargulum', to use for what they respectively used to call 'sound'; and then both sides could use the new words consistently. That way neither side has to back down or lose face, but they can still communicate. And of course you should try to keep track, at all times, of some testable proposition that the argument is actually about.
I propose that we recognize several lower-level testable claims, framed as questions. How many people read fiction to ...
- entertain?
- distract from an unpleasant reality?
- understand the human condition (including society)?
- think through alternative scenarios?
Now I will connect the conversation to these four points:
- Luke_A_Somers wrote "Why would I ever want to escape from my wonderful life to go THERE?" which relates to #2.
- thomblake mentions The Philosophy of Horror. Consider this quote from the publisher's summary: "... horror not only arouses the senses but also raises profound questions about fear, safety, justice, and suffering. ... horror's ability to thrill has made it an integral part of modern entertainment." which suggests #1 and #3.
- JonInstall pulls out the dictionary in the hopes of "settling" the debate. He's talking about #1.
- Speaking for myself, when reading e.g. the embedded story The Tale of the Omegas in Life 3.0, my biggest takeaway was #4.
Does this sound about right?
If we know a meteor is about to hit earth, having only D days to prepare, what is rational for person P? Depending on P and D, any of the following might be rational: throw an end of the world party, prep to live underground, shoot ICBMs at the meteor, etc.
I listened to part of “Processor clock speeds are not how fast AIs think”, but I was disappointed by the lack of a human narrator. I am not interested in machine readings; I would prefer to go read the article.
For Hopfield networks in general, convergence is not guaranteed. See [1] for convergence properties.
[1] J. Bruck, “On the convergence properties of the Hopfield model,” Proc. IEEE, vol. 78, no. 10, pp. 1579–1585, Oct. 1990, doi: 10.1109/5.58341.
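As a concrete toy illustration of non-convergence (my own example, not taken from Bruck's paper): with symmetric weights, asynchronous updates converge, but fully synchronous (parallel) updates can fall into a 2-cycle.

```python
def sync_step(state, weights):
    """Update all neurons in parallel with sign activation
    (treating 0 as +1)."""
    return tuple(1 if sum(w * s for w, s in zip(row, state)) >= 0 else -1
                 for row in weights)

W = ((0, -1),
     (-1, 0))  # symmetric coupling, zero diagonal

s = (1, 1)
trajectory = [s]
for _ in range(4):
    s = sync_step(s, W)
    trajectory.append(s)

# (1, 1) -> (-1, -1) -> (1, 1) -> ... : a 2-cycle, never a fixed point
assert trajectory[0] == trajectory[2] != trajectory[1]
```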
The audio reading of this post [1] mistakenly uses the word hexagon instead of pentagon; e.g. "Network 1 is a hexagon. Enclosed in the hexagon is a five-pointed star".
[1] [RSS feed](https://intelligence.org/podcasts/raz); various podcast sources and audiobooks can be found [here](https://intelligence.org/rationality-ai-zombies/)
I'm not so sure.
I would expect that a qualified, well-regarded leader is necessary, but I'm not confident it is sufficient. Other factors might dominate, such as: budget, sustained attention from higher-level political leaders, quality and quantity of supporting staff, project scoping, and exogenous factors (e.g. AI progress moving in a way that shifts how NIST wants to address the issue).
What are the most reliable signals for NIST producing useful work, particularly in a relatively new field? What does history show us? What kind of patterns do we find when NIST engages with: (a) academia; (b) industry; (c) the executive branch?
Another failure mode -- perhaps the elephant in the room from a governance perspective -- is national interests conflicting with humanity's interests. For example, actions done in the national interest of the US may ratchet up international competition (instead of collaboration).
Even if one puts aside short-term political disagreements, what passes for serious analysis around US national security seems rather limited in terms of (a) time horizon and (b) risk mitigation. Examples abound: e.g. support of one dictator until he becomes problematic, then switching support and/or spending massively to deal with the aftermath.
Even with sincere actors pursuing smart goals (such as long-term global stability), how can a nation with significant leadership shifts every 4 to 8 years hope to ensure a consistent long-term strategy? This question suggests that an instrumental goal for AI safety involves supporting institutions and mechanisms that promote long-term governance.
One failure mode could be a perception that the USG's support of evals is "enough" for now. Under such a perception, some leaders might relax their efforts in promoting all approaches towards AI safety.
perhaps I should apply Cantor’s Diagonal Argument to my clever construction, and of course it found a counterexample—the binary number (. . . 1111), which does not correspond to any finite whole number.
I’m not following despite having recently reviewed Cantor’s Diagonal Argument. I can imagine constructing a matrix such that the diagonal is all ones… but I don’t see how this connects up to the counterexample claim above.
Also, why worry that an infinite binary representation (of any kind) doesn’t correspond to a finite whole number? I suspect I’m missing something here. A little help please to help close this inferential distance?
How to interpret the comment above? Is it suggesting that EY's behavior was pompous? (As of this writing, the commenter only made one comment, this one, and does not seem to be around LessWrong at this time.) My take: >60% likely. Going "one level up", I would expect a majority of readers would at least wonder.
EY views other people’s irrationality as his problem, and it seems to me this discussion demonstrates a sincere effort to engage with someone he perceived as irrational. The conversation was respectful; as it progressed, each person clarified what they meant, and they ended with a handshake. If I were there at the outset of the conversation, I would not have expected this good of an outcome. (Updated on 2024-May-1)
Regarding the cost of making an incorrect probability estimate, "Overconfidence is just as bad as underconfidence." is not generally true. In binary classification contexts, one leads to more false positives and the other to more false negatives. The costs of each are not equal in general for real-world situations.
The author may simply mean that both are incorrect; this I accept.
My point is more than pedantic; there are too many examples of machine learning systems failing to recognize different misclassification costs.
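A small sketch of the asymmetry, with made-up costs and error rates (roughly mapping one direction of miscalibration to extra false negatives and the other to extra false positives):

```python
COST_FALSE_POSITIVE = 1.0   # e.g. flagging a legitimate transaction
COST_FALSE_NEGATIVE = 50.0  # e.g. missing a fraudulent one

def expected_cost(fp_rate: float, fn_rate: float) -> float:
    """Expected per-decision cost under asymmetric error costs."""
    return fp_rate * COST_FALSE_POSITIVE + fn_rate * COST_FALSE_NEGATIVE

too_lenient = expected_cost(fp_rate=0.01, fn_rate=0.10)  # misses fraud
too_strict = expected_cost(fp_rate=0.10, fn_rate=0.01)   # over-flags

assert too_lenient > too_strict  # 5.01 vs 0.6: far from "just as bad"
```

The two error profiles are equally "incorrect" in a calibration sense, yet their expected costs differ by nearly an order of magnitude under these assumptions.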