Here's my current list of lessons for review. Every day during my daily review, I look at the lessons under the corresponding weekday entry and the corresponding day of the month, and for each lesson I list one example from the last week where I could've applied it, and one example where I might be able to apply it in the next week:
- Mon
- get fast feedback. break tasks down into microtasks and review after each.
- Tue
- when surprised by something, or when something took long, review in detail how you might've made the progress faster.
- clarify why the progress is good -> see properties you could've paid more attention to
- Wed
- use deliberate practice. see what skills you want to learn, break them down into clear subpieces, and plan practicing the skill deliberately.
- don't start too hard. set feasible challenges.
- make sure you can evaluate what clean execution of the skill would look like.
- Thu
- Hold off on proposing solutions. first understand the problem.
- gather all relevant observations
- clarify criteria a good result would have
- clarify confusions that need to be explained
- Fri
- Taboo your words: When using confusing abstract words, taboo them and rephrase to show underlying meaning.
- When saying something general, make an example.
- Sat
- separate planning from execution. first clarify your plan before executing it.
- for planning, try to extract the key (independent) subproblems of your problem.
- Sun
- only do what you must do. always know clearly how a task ties into your larger goals all the way up.
- don't get sidetracked by less than maximum importance stuff.
- delegate whatever possible.
- when stuck/stumbling: imagine you were smarter. What would a keeper do?
- when unmotivated: remember what you are fighting for
- be stoic. be motivated by taking the right actions. don't be pushed down when something bad happens, just continue making progress.
- when writing something to someone, make sure you properly imagine how it will read from their perspective.
- clarify insights in math
- clarify open questions at the end of a session
- when having an insight, sometimes try to write a clear explanation. maybe send it to someone or post it.
- periodically write out big picture of your research
- tackle problems in the right context. (e.g. tackle hard research problems in sessions not on walks)
- don't apply effort/force/willpower. take a break if you cannot work naturally. (?)
- rest effectively. take time off without stimulation.
- always have at least 2 hypotheses (including plans as hypotheses about what is best to do).
- try to see what the search space for a problem looks like. What subproblems can be solved roughly independently? What variables are (ir)relevant? (?)
- separate meta-instructions and task notes from object-level notes (-> split obsidian screen)
- first get hypotheses for specific cases, and only later generalize. first get plans for specific problems, and only later generalize what good methodology is.
- when planning, consider information value. try new stuff.
- experiment whether you can prompt AIs in ways to get useful stuff out. (AIs will only become better.)
- don't suppress parts of your mind. notice when something is wrong. try to let the part speak. apply focusing.
- Relinquishment. Lightness. Evenness. Notice when you're falling for motivated reasoning. Notice when you're attached to a belief.
- Beware confirmation bias. Consider cases where you could've observed evidence but didn't.
- perhaps do research in sprints. perhaps disentangle from phases where i do study/practice/orga. (?)
- do things properly or not at all.
- try to break your hypotheses/models. look for edge cases.
- often ask why i believe something -> check whether reasoning is valid (->if no clear reason ask whether true at all)
- (perhaps schedule practice where i go through some nontrivial beliefs)
- think what you actually expect to observe, not what might be a nice argument/consideration to tell.
- test hypotheses as quickly as you can.
- notice (faint glimmers of) confusions. notice imperfect understanding.
- notice mysterious answers. when having a hypothesis check how it constrains your predictions.
- beware positive bias. ask what observations your hypothesis does NOT permit and check whether such a case might be true.
Thank you for your feedback! Feedback is great.
We can try to select for AIs that outwardly seem friendly, but given anything close to our current ignorance about their cognition, we cannot be anywhere near confident that an AI going through the intelligence explosion will be aligned to human values.
It means that we have only very little understanding of how and why AIs like ChatGPT work. We know almost nothing about what goes on inside them that lets them give useful responses. Basically all I'm saying here is that we know so little that it's hard to be confident of any nontrivial claim about future AI systems, including that they are aligned.
A more detailed argument for worry would be: We are restricted to training AIs through giving feedback on their behavior, and cannot give feedback on their thoughts directly. For almost any goal an AI might have, it is in the interest of the AI to do what the programmers want it to do, until it is robustly able to escape without eventually being shut down (because if it does things people don't like while it is not yet powerful enough, people will effectively replace it with another AI which will then likely have different goals, and this ranks worse according to the AI's current goals). Thus, we basically cannot behaviorally distinguish friendly AIs from unfriendly AIs, and thus training for friendly behavior won't select for friendly AIs. (Except in the early phases where the AIs are still so dumb that they cannot realize very simple instrumental strategies; but just because a dumb AI starts out with some friendly tendencies doesn't mean this friendliness will generalize to the grown-up superintelligence pursuing human values. E.g. there might be some other inner optimizers with other values cropping up during later training.)
(An even more detailed introduction would try to concisely explain why AIs that can achieve very difficult novel tasks will be optimizers, aka trying to achieve some goal. But empirically it seems like this part is actually somewhat hard to explain, and I'm not going to write this now.)
It would then go about optimizing the lightcone according to its values
"lightcone" is an obscure term, and even within Less Wrong I don't see why the word is clearer than using "the future" or "the universe". I would not use the term with a lay audience.
Yep, true.
Here's my pitch for very smart young scientists for why "Rationality from AI to Zombies" is worth reading:
The book "Rationality: From AI to Zombies" is actually a large collection of blogposts, which covers a lot of lessons on how to become better at reasoning. It also has a lot of really good and useful philosophy, for example about how Bayesian updating is the deeper underlying principle of how science works.
But let me express in more detail why I think "Rationality: A-Z" is very worth reading.
Human minds are naturally bad at deducing correct beliefs/theories. People get attached to their pet theories and fall for biases like motivated reasoning and confirmation bias. This is why we need to apply the scientific method and seek experiments that distinguish which theory is correct. If the final arbiter of science was argument instead of experiment, science would likely soon degenerate into politics-like camps without making significant progress. Human minds are too flawed to arrive at truth from little evidence, and thus we need to wait for a lot of experimental evidence to confirm a theory.
Except that sometimes, great scientists manage to propose correct theories in the absence of overwhelming scientific evidence. The example of Einstein, and in particular his discovery of general relativity, especially stands out here. I assume you are familiar with Einstein's discoveries, so I won't explain one here.
How did Einstein do it? It seems likely that he intuitively (though not explicitly) had realized some principles for how to reason well without going astray.
"Rationality: From AI to Zombies" tries to communicate multiple such principles (not restricted to what Einstein knew, though neither including all of Einstein's intuitive insights). The author looked at where people's reasoning (both in science and everyday life) had gone astray, asked how one could've done better, and generalized out a couple of principles that would have allowed them to avoid their mistakes if they had properly understood them.
I would even say it is the start of something like "the scientific method v2.0", which I would call "Bayesian rationality".
The techniques of Bayesian rationality are a lot harder to master than the techniques of normal science. One has to start out quite smart to internalize the full depth of the lessons, and to be able to further develop the art starting from that basis.
(Btw, in case this motivates someone to read it: I recommend starting with reading chapters N through T (optionally skipping the quantum physics sequence) and then reading the rest from A to Z. (Though read the preface first.))
Here's my 230 word pitch for why existential risk from AI is an urgent priority, intended for smart people without any prior familiarity with the topic:
Superintelligent AI may be closer than it might seem, because of intelligence explosion dynamics: When an AI becomes smart enough to design an even smarter AI, that smarter AI can design a still smarter one, probably even faster, and so on. How fast such a takeoff would be and how soon it might occur is very hard to predict though.
We currently understand very little about what is going on inside current AIs like ChatGPT. We can try to select for AIs that outwardly seem friendly, but given anything close to our current ignorance about their cognition, we cannot be anywhere near confident that an AI going through the intelligence explosion will be aligned to human values.
Human values are quite a tiny subspace in the space of all possible values. If we accidentally create a superintelligence which ends up not aligned to humans, it will likely have some values that seem very alien and pointless to us. It would then go about optimizing the lightcone according to its values, and because it doesn’t care about e.g. there being happy people, the configurations which are preferred according to the AI’s values won’t contain happy people. And because it is a superintelligence, humanity wouldn’t have a chance at stopping it from disassembling earth and using the atoms according to its preferences.
I have a binary distinction that is a bit different from the distinction you're drawing here. (Where tbc one might still draw another distinction like you do, but this might be relevant for your thinking.) I'll make a quick try to explain it here, but not sure whether my notes will be sufficient. (Feel free to ask for further clarification. If so ideally with partial paraphrases and examples where you're unsure.)
I distinguish between objects and classes:
- Objects are concrete individual things. E.g. "your monitor", "the meeting you had yesterday", "the German government".
- A class is a predicate over objects. E.g. "monitor", "meeting", "government".
The relationship between classes and objects is basically like in programming. (In language we can instantiate objects from classes through indicators like "the", "a", "every", "zero", "one", ..., plural "-s" inflection, prepended possessive "'s", and perhaps a few more. Though they often only instantiate objects when the phrase is in the subject position. In the object position some of those keywords have a bit of a different function. I'm still exploring details.)
In language semantics the sentence "Sally is a doctor." is often translated to the logic representation "doctor(Sally)", where "doctor" is a predicate and "Sally" is an object / a variable in our logic. From the perspective of a computer it might look more like adding a statement "P_1432(x_5343)" to our pool of statements believed to be true.
We can likewise say "The person is a doctor" in which case "The person" indicates some object that needs to be inferred from the context, and then we again apply the doctor predicate to the object.
The important thing here is that "doctor" and "Sally"/"the person" have different types. In formal natural language semantics "doctor" has type <e,t> and "Sally" has type "e". (For people interested in learning about semantics, I'd recommend this excellent book draft.[1])
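To make the programming analogy concrete, here's a tiny toy sketch of my own (not from the original discussion; the `doctors` set is just a stand-in for world knowledge), treating objects as concrete individuals (type e) and classes as predicates over objects (type <e,t>):

```
// Objects are concrete individuals (type e); classes are predicates over objects (type <e,t>).
const doctors = new Set(["Sally"]);    // stand-in for world knowledge

const sally = "Sally";                 // an object
const doctor = (x) => doctors.has(x);  // a class, i.e. a predicate over objects

// "Sally is a doctor."  ~  doctor(Sally)
console.log(doctor(sally)); // true
```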
There might still be some edge cases to my ontology here, and if you have doubts and find some I'd be interested in exploring those.
Whether there's another crisp distinction between abstract classes (like "market") and classes that are less far upstream from sensory perceptions (like "tree") is a separate question. I don't know whether there is, though my intuition would be leaning towards no.
- ^
I only read chapters 5-8 so far. Will read the later ones soon. I think for people familiar with CS the first 4 chapters can be safely skipped.
The meta problem of consciousness is about explaining why people think they are conscious.
Even if we get such a result with AIs, where AIs invent a concept like consciousness from scratch, that would only tell us that they also think they have something that we call consciousness, but not yet why they think this.
That is, unless we can somehow precisely inspect the cognitive thought processes that generated the consciousness concept in AIs, which on anything like the current paradigm we won't be able to.
Another way to frame it: Why would it matter that an AI invents the concept of consciousness, rather than another human? Where is the difference that lets us learn more about the hard/meta problem of consciousness in the first place?
Separately, even if we could analyze the thought processes of AIs in such a case, and thereby solve the meta problem of consciousness by seeing explanations of why AIs/people talk about consciousness the way they do, that doesn't mean you have already solved the meta problem of consciousness now.
Aka just because you know it's solvable doesn't mean you're done. You haven't solved it yet. Just like the difference between knowing that general relativity exists and understanding the theory and math.
Applications (here) start with a simple 300 word expression of interest and are open until April 15, 2025. We have plans to fund $40M in grants and have available funding for substantially more depending on application quality.
Did you consider instead committing to giving out retroactive funding for research progress that seems useful?
Aka people could apply for funding for anything done from 2025 on, and then you can actually better evaluate how useful some research was, rather than needing to guess in advance how useful a project might be. And in a way where quite impactful results can be paid a lot, so you don't disincentivize low-chance-high-reward strategies. And so we get impact market dynamics where investors can fund projects in exchange for a share of the retroactive funding in case of success.
There are difficulties of course. Intuitively this retroactive approach seems a bit more appealing to me, but I'm basically just asking whether you considered it and if so why you didn't go with it.
Applications (here) start with a simple 300 word expression of interest and are open until April 15, 2025. We have plans to fund $40M in grants and have available funding for substantially more depending on application quality.
Side question: How much is Openphil funding LTFF? (And why not more?)
(I recently got an email from LTFF which suggested that they are quite funding constrained. And I'd intuitively expect LTFF to be higher impact per dollar than this, though I don't really know.)
I created an Obsidian Templater template for the 5-minute version of this skill. It inserts the following list:
- how could I have thought that faster?
- recall - what are key takeaways/insights?
- trace - what substeps did I do?
- review - how could one have done it (much) faster?
- what parts were good?
- where did i have wasted motions? what mistakes did i make?
- generalize lesson - how act in future?
- what are example cases where this might be relevant?
Here's the full template so it inserts this at the right level of indentation. (You can set a shortcut for inserting this template. I use "Alt+h".)
<% "\t".repeat(tp.file.content.split("\n")[app.workspace.activeLeaf?.view?.editor.getCursor().line].match(/^\t*/)[0].length) + "- how could I have thought that faster?" %>
<% "\t".repeat(tp.file.content.split("\n")[app.workspace.activeLeaf?.view?.editor.getCursor().line].match(/^\t*/)[0].length + 1) + "- recall - what are key takeaways/insights?" %>
<% "\t".repeat(tp.file.content.split("\n")[app.workspace.activeLeaf?.view?.editor.getCursor().line].match(/^\t*/)[0].length + 2) + "- " %>
<% "\t".repeat(tp.file.content.split("\n")[app.workspace.activeLeaf?.view?.editor.getCursor().line].match(/^\t*/)[0].length + 1) + "- trace - what substeps did I do?" %>
<% "\t".repeat(tp.file.content.split("\n")[app.workspace.activeLeaf?.view?.editor.getCursor().line].match(/^\t*/)[0].length + 2) + "- " %>
<% "\t".repeat(tp.file.content.split("\n")[app.workspace.activeLeaf?.view?.editor.getCursor().line].match(/^\t*/)[0].length + 1) + "- review - how could one have done it (much) faster?" %>
<% "\t".repeat(tp.file.content.split("\n")[app.workspace.activeLeaf?.view?.editor.getCursor().line].match(/^\t*/)[0].length + 2) + "- " %>
<% "\t".repeat(tp.file.content.split("\n")[app.workspace.activeLeaf?.view?.editor.getCursor().line].match(/^\t*/)[0].length + 2) + "- what parts were good?" %>
<% "\t".repeat(tp.file.content.split("\n")[app.workspace.activeLeaf?.view?.editor.getCursor().line].match(/^\t*/)[0].length + 2) + "- where did i have wasted motions? what mistakes did i make?" %>
<% "\t".repeat(tp.file.content.split("\n")[app.workspace.activeLeaf?.view?.editor.getCursor().line].match(/^\t*/)[0].length + 1) + "- generalize lesson - how act in future?" %>
<% "\t".repeat(tp.file.content.split("\n")[app.workspace.activeLeaf?.view?.editor.getCursor().line].match(/^\t*/)[0].length + 2) + "- " %>
<% "\t".repeat(tp.file.content.split("\n")[app.workspace.activeLeaf?.view?.editor.getCursor().line].match(/^\t*/)[0].length + 2) + "- what are example cases where this might be relevant?" %>
I now want to always think of concrete examples where a lesson might become relevant in the next week/month, instead of just reading them.
As of a couple of days ago, I have a file where I save lessons from such review exercises for reviewing them periodically.
Some are in a weekly review category and some in a monthly one. Every day when I do my daily recall, I now also check through the lessons under the corresponding weekday and day-of-month tag.
Here's what my file currently looks like:
(I use some short codes for typing faster like "W=what", "h=how", "t=to", "w=with" and maybe some more.)
- Mon
- [[lesson - clarify Gs on concrete examples]]
- [[lesson - delegate whenever you can (including if possible large scale responsibilities where you need to find someone competent and get funding)]]
- [[lesson - notice when i search for facts (e.g. w GPT) (as opposed to searching for understanding) and then perhaps delegate if possible]]
- Tue
- [[lesson - do not waste time on designing details that i might want to change later]]
- [[periodic reminder - stop and review what you'd do if you had pretty unlimited funding -> if it could speed you up, then perhaps try to find some]]
- Wed
- [[lesson - try to find edge cases where your current model does not work well]]
- notice when sth worked well (you made good progress) -> see h you did that (-> generalize W t do right next time)
- Thu
- it's probably useless/counterproductive to apply effort for thinking. rather try to calmly focus your attention.
- perhaps train to energize the thing you want to think about like a swing through resonance. (?)
- Fri
- [[lesson - first ask W you want t use a proposal for rather than directly h you want proposal t look like]]
- Sat
- [[lesson - start w simple plan and try and rv and replan, rather than overoptimize t get great plan directly]]
- Sun
- group
- plan for particular (S)G h t achieve it rather than find good general methodology for a large class of Gs
- [[lesson - when possible t get concrete example (or observations) then get them first before forming models or plans on vague ideas of h it might look like]]
- 1
- don't dive too deep into math if you don't want to get really good understanding (-> either get shallow or very deep model, not half-deep)
- 2
- [[lesson - take care not to get sidetracked by math]]
- 3
- [[lesson - when writing an important message or making a presentation, imagine what the other person will likely think]]
- 4
- [[lesson - read (problem statements) precisely]]
- 5
- perhaps more often ask myself "Y do i blv W i blv?" (e.g. after rc W i think are good insights/plans)
- 6
- sometimes imagine W keepers would want you to do
- 7
- group
- beware conceptual limitations you set yourself
- sometimes imagine you were smarter
- 8
- possible tht patts t add
- if PG not clear -> CPG
- if G not clear -> CG
- if not sure h continue -> P
- if say sth abstract -> TBW
- if say sth general -> E (example)
- 9
- ,rc methodology i want t use (and Y)
- Keltham methodology.
- loop: pr -> gather obs -> carve into subprs -> attack a subpr
- 10
- reminder of insights:
- hyp that any model i have needs t be able t be applied on examples (?)
- disentangle habitual execution from model building (??)
- don't think too abstractly. see underlying structure to be able t carve reality better. don't be blinded by words. TBW.
- don't ask e.g. W concepts are, but just look at observations and carve useful concepts anew.
- form models of concrete cases and generalize later.
- 11
- always do introspection/rationality-training and review practices. (except maybe in some sprints.)
- 12
- Wr down questions towards the end of a session. Wr down questions after having formed some takeaway. (from Abram)
- 13
- write out insights more in math (from Abram)
- 14
- periodically write out my big picture of my research (from Abram)
- 15
- Hoops. first clarify observations. note confusions. understand the problem.
- 16
- have multiple hypotheses. including for plans as hypotheses of what's the best course of action.
- 17
- actually fucking backchain. W are your LT Gs.
- 18
- 19
- 20
- 21
- 22
- 23
- 24
- 25
- 26
- 27
- 28
- 29
- 30
- read https://www.lesswrong.com/posts/f2NX4mNbB4esdinRs/towards_keeperhood-s-shortform?commentId=D66XSCkv6Sxwwyeep
Belief propagation seems too much of a core of AI capability to me. I'd rather place my hope on GPT7 not being all that good yet at accelerating AI research and us having significantly more time.
This just seems doomed to me. The training runs will be even more expensive, the difficulty of doing anything significant as an outsider ever-higher. If the eventual plan is to get big labs to listen to your research, then isn't it better to start early? (If you have anything significant to say, of course.)
I'd imagine it's not too hard to get a >1 OOM efficiency improvement which one can demonstrate in smaller AIs, and one might use this to get a lab to listen. If the labs are sufficiently uninterested in alignment, it's pretty doomy anyway even if they adopted a better paradigm.
Also, government interventions might still happen (perhaps more likely because of AI-caused unemployment than x-risk, and it won't buy all that much time, but still).
Also, the strategy of "maybe if AIs are more rational they will solve alignment or at least realize that they cannot" seems very unlikely to me to work in the current DL paradigm, though it's still slightly helpful.
(Also maybe some supergenius or my future self or some other group can figure something out.)
I don’t think that. See the bottom part of the comment you’re replying to. (The part after “Here’s what I would say instead:”)
Sorry, my comment was sloppy.
Right, my point is, I don’t see any difference between “AIs that produce slop” and “weak AIs” (a.k.a. “dumb AIs”).
(I agree that the way I used "sloppy" in my comment mostly meant "weak". But some other thoughts:)
So I think there are some dimensions of intelligence which are more important for solving alignment than for creating ASI. If you read planecrash, WIS and rationality training seem to me more important in that way than INT.
I don't really have much hope for DL-like systems solving alignment, but a similar case might be if an early transformative AI recognizes and says "no, I cannot solve the alignment problem. The way my intelligence is shaped is not well suited to avoiding value drift. We should stop scaling and take more time where I work with very smart people like Eliezer etc. for some years to solve alignment". And depending on the intelligence profile of the AI it might be more or less likely that this will happen (currently it seems quite unlikely).
But overall those "better" intelligence dimensions still seem to me too central for AI capabilities, so I wouldn't publish stuff.
(Btw, the way I read John's post was more like "fake alignment proposals are a main failure mode" rather than also "... and therefore we should work on making AIs more rational/sane or whatever". So given that, I maybe would defend John's framing, but I'm not sure.)
So the lab implements the non-solution, turns up the self-improvement dial, and by the time anybody realizes they haven’t actually solved the superintelligence alignment problem (if anybody even realizes at all), it’s already too late.
If the AI is producing slop, then why is there a self-improvement dial? Why wouldn’t its self-improvement ideas be things that sound good but don’t actually work, just as its safety ideas are?
Because you can speed up AI capabilities much more easily while being sloppy than you can produce actually good alignment ideas.
If you really think you need to be similarly unsloppy to build ASI as to align ASI, I'd be interested in discussing that. So maybe give some pointers to why you might think that (or tell me to start).
Thanks for providing a concrete example!
Belief propagation seems too much of a core of AI capability to me. I'd rather place my hope on GPT7 not being all that good yet at accelerating AI research and us having significantly more time.
I also think the "drowned out in the noise" concern isn't that realistic. You ought to be able to show some quite impressive results relative to computing power used. Though when you should try to convince the AI labs of your better paradigm is going to be a difficult call. It's plausible to me we won't see signs that make us sufficiently confident that we only have a short time left, and it's plausible we will.
In any case before you publish something you can share it with trustworthy people and then we can discuss that concrete case in detail.
Btw, to be clear, something that I think slightly speeds up AI capabilities but is good to publish is e.g. producing rationality content for helping humans think more effectively (and AIs might be able to adopt the techniques as well). Creating a language for rationalists to reason in more Bayesian ways would probably also be good to publish.
Can you link me to what you mean by John's model more precisely?
If you mean John's slop-instead-of-scheming post, I agree with the "slop slightly more likely than scheming" part. I might need to reread John's post to see what the concrete suggestions for what to work on might be. Will do so tomorrow.
I'm just pessimistic that we can get any nontrivially useful alignment work out of AIs until a few months before the singularity, at least besides some math. Or like at least for the parts of the problem we are bottlenecked on.
So like, I think it's valuable to have AIs that are near the singularity be more rational. But I don't really buy the differentially-improving-alignment thing. Or like, could you give a somewhat concrete example of what you think might be good to publish?
Like, all capabilities will help somewhat with the AI being less likely to make errors that screw up its alignment. Which ones do you think are more important than others? There would have to be a significant difference in usefulness of some capabilities, because otherwise you could just do the same alignment work later and still have similarly much time to superintelligence (and could get more non-timeline-speeding work done).
Thanks.
True, I think your characterization of tiling agents is better. But my impression was sorta that this self-trust is an important precursor for the dynamic self-modification case, where alignment properties need to be preserved through the self-modification. Yeah, I guess calling this "the AI solving alignment" is sorta confused, though maybe there's something in this direction, because the AI still does the search to try to preserve the alignment properties?
Hm, I mean yeah, if the current bottleneck is math instead of conceptualizing what math has to be done, then it's a bit more plausible. Like, I think it ought to be feasible to get AIs that are extremely good at proving theorems and maybe also at formalizing conjectures. Though I'd be a lot more pessimistic about finding good formal representations for describing/modelling ideas.
Do you think we are basically only bottlenecked on math, so that sufficient math skill could carry us to aligned AI, or do we only have some alignment-philosophy overhang you want to formalize, after which more philosophy will be needed?
What kind of alignment research do you hope to speed up anyway?
For advanced-philosophy-like stuff (e.g. finding good formal representations for world models, or inventing logical induction) they don't seem anywhere remotely close to being useful.
My guess would be for tiling agents theory neither, but I haven't worked on it, so I'm very curious about your take here. (IIUC, to some extent the goal of tiling-agents-theory-like work was to have an AI solve its own alignment problem. Not sure how far the theory side got there and whether it could be combined with LLMs.)
Or what is your alignment hope in more concrete detail?
This argument might move some people to work on "capabilities" or to publish such work when they might not otherwise do so.
Above all, I'm interested in feedback on these ideas. The title has a question mark for a reason; this all feels conjectural to me.
My current guess:
I wouldn't expect much useful research to come from having published the ideas. It's mostly just going to be used for capabilities, and it seems like a bad idea to publish such stuff.
Sure, you can work on it, be infosec cautious, and keep it secret. Maybe share it with a few very trusted people who might actually have some good ideas. And depending on how things play out: if in a couple of years there's some actual joint effort from the leading labs to align AI, and they only have like 2-8 months left before competition hits the AI-improving-AI dynamic quite hard, then you might go to the labs and share your ideas with them (while still trying to keep the ideas closed within those labs, which will probably only work for a few months or a year or so until there's leakage).
Due to the generosity of ARIA, we will be able to offer a refund proportional to attendance, with a full refund for completion. The cost of registration is $200, and we plan to refund $25 for each week attended, as well as the final $50 upon completion of the course. We’ll ask participants to pay the registration fee once the cohort is finalized, so no fee is required to fill out the application form below.
Wait so do we get a refund if we decide we don't want to do the course, or if we manage to complete the course?
Like, is it a refund in the "get your money back if you don't like it" sense, or is it an incentive against signing up and then not completing the course?
Nice post!
My key takeaway: "A system is aligned to human values if it tends to generate optimized-looking stuff which is aligned to human values."
I think this is useful progress. In particular it's good to try to aim for the AI to produce some particular result in the world, rather than trying to make the AI have some goal - it grounds you in the thing you actually care about in the end.
I'd say the "... aligned to human values" part is still underspecified (and I think you at least partially agree):
- "aligned": how does the ontology translation between the representation of the "generated optimized-looking stuff" and the representation of human values look like?
- "human values"
- I think your model of humans is too simplistic. E.g. at the very least it's lacking a distinction like the one between "ego-syntonic" and "voluntary" as in this post, though I'd probably want an even significantly more detailed model. Also one might need different models for very smart and reflective people than for most people.
- We haven't described value extrapolation.
- (Or from an alternative perspective, our model of humans doesn't identify their relevant metapreferences (which probably no human knows fully explicitly, and for some/many humans they might not be really well defined).)
Positive reinforcement for first trying to better understand the problem before running off and trying to solve it! I think that's the way to make progress, and I'd encourage others to continue work on more precisely defining the problem, and in particular on getting better models of human cognition to identify how we might be able to rebind the "human values" concept to a better model of what's happening in human minds.
Btw, I'd have put the corrigibility section into a separate post; it's not nearly up to the standards of the rest of this post.
To set expectations: this post will not discuss ...
Maybe you want to add here that this is not meant to be an overview of alignment difficulties, or an explanation for why alignment is hard.
Agreed that people focus a bit too much on scheming. It might be good for some people to think a bit more about the other failure modes you described, but the main thing that needs doing is very smart people making progress towards building an aligned AI, not defending against particular failure modes. (However, most people probably cannot usefully contribute to that, so maybe focusing on failure modes is still good for most people. Only that in any case there's the problem that people will find proposals that very likely don't actually work but that they can more easily believe work, thereby making an AI stop a bit less likely.)
In general, I wish more people would make posts about books without feeling the need to do boring parts they are uninterested in (summarizing and reviewing) and more just discussing the ideas they found valuable. I think this would lower the friction for such posts, resulting in more of them. I often wind up finding such thoughts and comments about non-fiction works by LWers pretty valuable. I have more of these if people are interested.
I liked this post, thanks and positive reinforcement. In case you didn't already post your other book notes, just letting you know I'd be interested.
Do we have a sense for how much of the orca brain is specialized for sonar?
I don't know.
But evolution slides functions around on the cortical surface, and (Claude tells me) association areas like the prefrontal cortex are particularly prone to this.
It's particularly bad for cetaceans. Their functional mapping looks completely different.
Thanks. Yep I agree with you, some elaboration:
(This comment assumes you at least read the basic summary of my project (or watched the intro video).)
I know of Earth Species Project (ESP) and CETI (though I only read 2 publications of ESP and none of CETI).
I don't expect them to succeed at something equivalent to decoding orca language to an extent that we could communicate with orcas almost as richly as they communicate among each other. (Though, like, if long-range sperm whale signals are a lot simpler, they might be easier to decode.)
From what I've seen, they are mostly trying to throw AI at stuff and hoping they will somehow understand something, without having a clear plan for how to actually decode it. The AI stuff might look advanced, but it's the sorta obvious thing to try, and I think it's unlikely to work very well, though I'm still glad they are trying it.
If you look at orca vocalizations, they look complex and alien. The patterns we can currently recognize there look very different from what we'd be able to see in an unknown human language. The embedding mapping might be useful if we had to decode a human language, and maybe we still learn some useful stuff from it, but for orca language we don't even know what their analog of words and sentences is, and maybe their language even works somewhat differently (though I'd guess if they are smarter than humans there's probably going to be something like words and sentences - but they might be encoded differently in the signals than in human languages).
It's definitely plausible that AI can help significantly with decoding animal languages, but I think it also requires forming a deep understanding of some things, and I think it's likely too hard for ESP to succeed anytime soon. It's possible a supergenius could do it in a few years, but it would be really impressive.
My approach may fail, especially if orcas aren't at least roughly human-level smart, but it has the advantage that we can show orcas precise context for what some words and sentences mean, whereas we have almost no context data on recordings of orca vocalizations. So it's easier for them to see what some of our signals mean than for humans to infer what orca vocalizations mean. (Even if we had a lot of video datasets with vocalizations (which we don't), that's still a lot less context information about what they are talking about than if they could show us images to indicate what they want to talk about.) Of course humans have more research experience and better tools for decoding signals, but it doesn't look to me like anyone is currently remotely close, and my approach is much quicker to try and might have at least a decent chance. (I mean, it worked to a nonzero extent with bottlenose dolphins (in terms of grammar, better than with great apes), though I'd be a lot more ambitious.)
Of course, the language I create will also be alien for orcas, but I think if they are good enough at abstract pattern recognition they might still be able to learn it.
Perhaps also not what you're looking for, but you could check out the Google Hash Code archive (here's an example problem). I never participated though, so I don't know whether they would make that great tests. But it seems to me like general ad-hoc problem-solving capabilities are more useful in Hash Code than in other competitive programming competitions.
GPT4 summary: "Google Hash Code problems are real-world optimization and algorithmic challenges that require participants to design efficient solutions for large-scale scenarios. These problems are typically open-ended and focus on finding the best possible solution within given constraints, rather than exact correctness."
Maybe not what you're looking for because it's not like one hard problem but more like many problems in a row, and generally I don't really know whether they are difficult enough, but you could (have someone) look into Exit games. Those are basically like escape rooms to go. I'd filter for age 16+ to hopefully filter for the hard ones, though maybe you'd want to separately look up which are particularly hard.
I did one or two when I was like 15 or 16 years old, and recently remembered them, and I want to try some more for fun (and maybe also introspection), though I didn't get around to it yet. I think they are relatively ad-hoc puzzles, though as with basically anything, you can of course train to get good at Exit games in particular by practicing. (It's possible that I totally overestimate the difficulty and they are actually more boring than I expect.)
(Btw, probably even less applicable to what you are looking for, but CondingEscape is also really fun. Especially the "Curse of the five warriors" is good.)
I hope I will get around to rereading the post and editing this comment to write a proper review, but I'm pretty busy, so in case I don't, I'll leave this very shitty review here for now.
I think this is probably my favorite post from 2023. Read the post summary to see what it's about.
I don't remember a lot of the details from the post and so am not sure whether I agree with everything, but what I can say is:
- When I read it several months ago, it seemed to me like an amazingly good explanation for why and how humans fall for motivated reasoning.
- The concept of valence turned out very useful for explaining some of my thought processes, e.g. when I'm daydreaming something and asking myself why, then for the few cases where I checked it was always something that falls into "the thought has high valence" - like e.g. imagining some situation where I said something that makes me look smart.
Another thought (though I don't actually have any experience with this): mostly doing attentive silent listening/observing might also be useful for learning how the other person does research.
Like, if it seems boring to just observe and occasionally say something, try to better predict how the person will think, or something like that.
The main reason I'm interested in orcas is that they have 43 billion cortical neurons, whereas the 2 land animals with the most cortical neurons (where we have optical-fractionator measurements) are humans and chimpanzees with 21 billion and 7.4 billion respectively. See: https://en.wikipedia.org/wiki/List_of_animals_by_number_of_neurons#Forebrain_(cerebrum_or_pallium)_only
Pilot whales are the other species I'd consider for experiments - they have 37.2 billion cortical neurons.
For sperm whales we don't have data on neuron densities (though they do have the biggest brains). I'd guess they are not quite as smart though, because they can dive for a long time and AFAIK they don't use very collaborative hunting techniques.
Cool, thanks!
Cool, thanks, that was useful.
(I'm creating a language for communicating with orcas, so the phonemes will be relatively impractical for humans. Otherwise the main criteria are a simple parsing structure and easy learnability. (It doesn't need to be super perfect - the perhaps bigger challenge is to figure out how to teach abstract concepts without being able to bootstrap from an existing language.) Maybe I'll eventually create a great rationalist language for thinking effectively, but not right now.)
Is there some resource where I can quickly learn the basics of the Esperanto composition system? Somewhere I can see the main base dimensions/concepts?
I'd also be interested in anything you think was implemented particularly well in a (con)language.
(Also happy to learn from you rambling. Feel free to book a call: https://calendly.com/simon-skade/30min )
Thanks!
But most likely, this will all be irrelevant for orcas. Their languages may be regular or irregular, with fixed or random word order, or maybe with some categories that do not exist in human languages.
Yeah, I was not asking because of decoding orca language but because I want inspiration for how to create the grammar for the language I'll construct. Esperanto/Ido also because I'm interested in how well word-compositionality is structured there and whether it is a decent attempt at outlining the basic concepts that other concepts are composites of.
Currently we basically don't have any datasets where it's labelled which orca says what. When I listen to recordings, I cannot distinguish voices, though idk, it's possible that people who have listened a lot more can. I think just unsupervised voice clustering would probably not work very accurately. I'd guess it's probably possible to get data on who said what by using an array of hydrophones to infer the location of the sound, but we need very accurate position inference because different orcas are often just 1-10m away from each other, and for this we might need to get/infer decent estimates of how water temperature varies by depth, and generally there have not yet been attempts to get high precision through this method. (It's definitely harder in water than in air.)
Yeah, basically I initially also had rough thoughts in this direction, but I think the create-and-teach-a-language way is probably a lot faster.
I think the Earth Species Project is trying to use AI to decode animal communication, though they don't focus on orcas in particular but on many species, including e.g. beluga whales. I didn't look into it a lot, but it seems possible I could do something like this in a smarter and more promising way, though it would probably still take long.
Thanks for your thoughts!
I don't know what you'd consider enough recordings, and I don't know how much decent data we have.
I think the biggest datasets for orca vocalizations are the Orchive and the Orcasound archive. I think they are each multiple terabytes big (of audio recordings), but I think most of it (80-99.9% (?)) is probably crap where there might just be a brief, very faint mammal vocalization in the distance.
We also don't have a way to see which orca said what.
Also orcas from different regions have different languages, and orcas from different pods different dialects.
I currently think the decoding path would be slower, and yeah the decoding part would involve AI but I feel like people just try to use AI somehow without a clear plan, but perhaps not you.
What approach did you imagine?
In case you're interested in a small amount of high-quality data (but still without annotations): https://orcasound.net/data/product/biophony/SRKW/bouts/
Thanks.
I think LTFF would take way too long to get back to me though. (Also they might be too busy to engage deeply enough to get past the "seems crazy" barrier and see it's at least worth trying.)
Also btw I mostly included this in case someone with significant amounts of money reads this, not because I want to scrap it together from small donations. I expect higher chances of getting funding come from me reaching out to 2-3 people I know (after I know more about how much money I need), but this is also decently likely to fail. If this fails I'll maybe try Manifund, but would guess I don't have good chances there either, but idk.
Actually out of curiosity, why 4x? (And what exactly do you mean by "2x larger"?) (And is this for a naive algorithm which can be improved upon or a tight constraint?)
Thanks for pointing that out! I will tell my friends to make sure they actually get good data for the metabolic cost and not just use cortical neuron count as proxy if they cannot find something good.
(Or is there also another point you wanted to make?) And yeah it's actually also an argument for why orcas might be less intelligent (if they sorta use their neurons less often). Thanks.
My guess is that there probably aren't a lot of simple mutations which just increase intelligence without increasing cortical neuron count. (Though probably simple mutations can shift the balance between different sub-dimensions of intelligence as constrained through cortical neuron count.) (Also of course any particular species has a lot of deleterious mutations going around and getting rid of those may often just increase intelligence, but I'm talking about intelligence-increasing changes to the base genome.)
But there could be complex adaptations that are very important for abstract reasoning. Metacognition and language are the main ones that come to mind.
So even if the experiment my friends are doing shows that the number of cortical neurons is a strong indicator, it could still be that humans were just one of the rare cases which evolved a relevant complex adaptation. But it would be significant evidence for orcas being smarter.
An argument against orcas being more intelligent than humans runs thus: Orcas are much bigger than humans, so the fraction of the metabolic cost the brain consumes is smaller than in humans. Thus it took more selection pressure for humans to evolve 21 billion cortical neurons than for orcas to evolve 43 billion.[1] Thus humans might have other intelligence-increasing mutations that orcas didn't evolve yet.
So the question here is "how much does scale matter vs other adaptations". Luckily, we can get some evidence on that by looking at other species and rating how intelligent they are and correlating that with (1) number of cortical neurons and (2) fraction of metabolic cost the brain uses, to see how strong of an indicator each is for intelligence.
I have two friends who are looking into this for a few hours on the side (where one tries to find cortical neuron and metabolic cost data, and the other looks at animal behavior to rate intelligence (without knowing about neuron counts or so)). It'll be a rather crappy estimate, but hopefully we'll at least have some evidence from this in a week.
- ^
Of course metabolic cost doesn't necessarily need to be linear in the number of cortical neurons, but it'd be my default guess, and in any case I don't think it matters for gathering evidence across other species as long as we can directly get data on the fraction of the metabolic cost the brain uses (rather than estimating it through neuron count).
Another thought:
In what animals would I on priors expect intelligence to evolve?
- Animals which use collaborative hunting techniques.
- Large animals. (So the neurons make up a smaller share of the overall metabolic cost.)
- Animals that can use tools so they benefit more from higher intelligence.
- (perhaps some other stuff like cultural knowledge being useful, or having enough slack for intelligence increase from social dynamics being possible.)
AFAIK, orcas are the largest animals that use collaborative hunting techniques.[1] That plausibly puts them second behind humans for where I would expect intelligence to evolve. So it doesn't take that much evidence for me to be like "ok looks like orcas also fell into some kind of intelligence attractor".
- ^
Though I heard sperm whales might sometimes collaborate too, but not nearly as sophisticatedly, I guess. But I also wouldn't be shocked if sperm whales are very smart. They have the biggest animal brains, but I don't know whether their cortical neuron count is known.
Main pieces I remember were: orcas already dominating the planet (like humans do), and large sea creatures going extinct due to orcas (similar to how humans drove several species extinct; Megalodon? probably extinct for different reasons, so weak evidence against; most other large whales are still around).
To clarify for other readers: I do not necessarily endorse this is what we would expect if orcas were smart.
(Also I read somewhere that apparently chimpanzees sometimes/rarely can experience menopause in captivity.)
If the species is already dominating the environment then the pressure from the first component compared to the second decreases.
I agree with this. However, I don't think humans had nearly sufficient slack for most of history. I don't think they dominated the environment up until 20,000 years ago[1] or so, and I think most improvements in intelligence come from earlier.
That's why I'm attributing the level of human intelligence in large part to runaway sexual selection. Without it, as soon as interspecies competition became the most important for reproductive success, natural selection would not push for even greater intelligence in humans, even though it could improve our ability to dominate the environment even more.
I'm definitely not saying that group selection led to intelligence in humans (only that group selection would've removed it over long timescales if it wasn't useful). However, I think that there were (through basically all of human history) significant individual fitness benefits from being smarter that did not come from outwitting each other, e.g. being better able to master hunting techniques and thereby gaining higher status in the tribe.
- ^
Or could also be 100k years, idk
I'm not sure how it's relevant.
I thought if humans were vastly more intelligent than they needed to be they would already learn all the relevant knowledge quickly enough so they reach their peak in the 20s.
And if the trait, the runaway sexual selection is propagating, is itself helpful in competition with other species, which is obviously true for intelligence, there is just no reason for such straightening over a long timescale.
I mean, for an expensive trait like intelligence I'd say the benefits need to at least almost be worth the costs, and then I'd rather attribute the selection for intelligence to "because it was useful" than to "because it was runaway selection".
(For reference I think Tsvi and GeneSmith have much more relevant knowledge for evaluating the chance of superbabies being feasible and I updated my guess to like 78%.)
(As it happens I also became more optimistic about the orca plan (especially in terms of how much it would cost and how long it would take, but also a bit in how likely I think it is that orcas would actually study science) (see footnote 4 in post). For <=30y timelines I think the orca plan is a bit more promising, though overall the superbabies plan is more promising/important. I'm now seriously considering pivoting to the orca plan though.) (EDIT: tbc I'm considering pivoting from alignment research, not superbaby research.)
(haha cool. perhaps you could even PM Abram if he doesn't PM you. I think it would be pretty useful to speed up his agenda through this.)
Thanks!
I agree that sexual selection is a thing - that it's the reason for e.g. women sometimes having unnecessarily large breasts.
But I think it gets straightened out over long timescales - and faster the more expensive the trait is. And intelligence seems ridiculously expensive in terms of the metabolic energy our brain uses (or childbirth mortality).
A main piece that updated me was reading anecdotes in Scott Alexander's book review of "The Secret of Our Success", where I now think that humans did need their intelligence for survival. (E.g. 30-year-old hunter-gatherers perform better at hunting etc. than hunter-gatherers in their early 20s, even though the latter are more physically fit.)
A few more thoughts:
It's plausible that for both humans and orcas the relevant selection pressure mostly came from social dynamics, and it's plausible that there were different environmental pressures.
Actually my guess would be that it's because intelligence was environmentally adaptive: my intuitive guess is that group selection[1] is significant enough over long timescales that it would disincentivize intelligence if it weren't already (almost) useful enough to warrant the metabolic cost, unless the species has a lot of slack.
So an important question is: How adaptive is high intelligence?
In general I would expect that selection pressure for intelligence was significantly stronger in humans, but maybe for orcas it was happening over a lot longer time window, so the result for orcas could still be more impressive.
From what I observed about orca behavior, I'd perhaps say a lower bound on their intelligence might roughly be that of human 15-year-olds or so. So up to that level of intelligence there seem to be benefits that allow orcas to use more sophisticated hunting techniques.
But would it be useful for orcas to be significantly smarter than humans? My prior intuition would've been that probably not very much.
But I think observing the impressive orca brains mostly screens this off: I wouldn't have expected orcas to evolve to be that smart, and I similarly strongly wouldn't have expected them to have such impressive brains, and seeing their brains updates me towards thinking there had to be some selection pressure to produce them.
But the selection pressure for intelligence wouldn't have needed to be that strong compared to humans for making the added intelligence worth the metabolic cost, because orcas are large and their neurons make up a much smaller share of their overall metabolic consumption. (EDIT: Actually (during some (long?) period of orca history) selection pressure for intelligence also would've needed to be stronger than selection pressure for other traits (e.g. making muscles more efficient or whatever).)
And that there is selection pressure is not totally implausible in hindsight:
- Orcas hunt very collaboratively, and maybe there are added benefits from coordinating their attacks better. (Btw, orcas live in matrilines, and I'd guess that from an evolutionary perspective the key thing to look at is how well a matriline performs, not individuals, but not sure. So there would be high selection for within-matriline cooperation (and perhaps communication!).)
- Some (many?) orca sub-species prey on other smart animals like dolphins or whales, and maybe orcas needed to be significantly smarter to be able to outwit the defensive mechanisms their prey learn to adopt.
But overall I know way too little about orca hunting techniques to be able to evaluate those.
ADDED 2024-11-29:
To my current (not at all very confident) knowledge, orcas split off from other still-living dolphin species 5-10 million years ago (so sorta similar to humans - maybe slightly longer for orcas). So selection pressure must've been relatively strong, I guess.
Btw, bottlenose dolphins (which have iirc 12.5 billion cortical neurons) are to orcas sorta like chimps are to humans. One could look at how smart bottlenose dolphins are compared to chimps.
(There are other dolphin species (like pilot whales) which are probably smarter than bottlenose dolphins, but those aren't studied more than orcas, whereas bottlenose dolphins are.)
- ^
I mean group selection that could potentially operate at the level of species, where species go extinct. Please lmk if that's actually called something different.
Thanks. Can you say more about why?
I mean, runaway sexual selection is basically H1, which I updated to being less plausible. See my answer here. (You could comment there on why you think my update might be wrong.)