Comments
Interesting thoughts, ty.
A difficulty for common understanding I see here is that you're talking about "good" or "bad" paragraphs in absolute terms, but you didn't define "good" or "bad" by any objective standard, so you're relying on your own sense of what's good or bad. If you were defining them relatively, you'd look at 100 paragraphs and post the worst 10 as bad. I'd be interested in seeing the worst paragraphs you found, some 50th-percentile ones, and the best; then I could tell you whether I share your absolute standards.
Enjoyed this post.
Fyi, from the front page I just hovered over this post, "The shallow bench", and was immediately spoiled on Project Hail Mary (which I had started listening to but hadn't gotten far into). Maybe add a spoiler tag or warning directly after the title?
Without taking away from the importance of getting the default right, and at the deliberate risk of feature creep, I think adding a customization option (select colour) to personal profiles is relatively low effort and maintenance, and would solve the accessibility problem.
There's tacit knowledge in Bay Area rationalist conversation norms that I'm discovering and thinking about; here's an observation and a related thought. (I put the example after the generalisation because that's my preferred style; feel free to read in the other order.)
Willingness to argue righteously and hash things out to the end, repeated over many conversations, makes it more salient when you're heading into a dead-end argument. This salience can inspire you to argue more concisely and to the point over time.
Going to the end of things generates ground data on which to update your models of arguing and conversation paths, instead of leaving things unanswered.
So, though it's skilful to know when not to "waste" time on details and unimportant disagreements, the norm of "frequently enough going through until everyone agrees" seems profoundly virtuous.
Short example from today, I say "good morning". They point out it's not morning (it's 12:02). I comment about how 2 minutes is not that much. They argue that 2 minutes is definitely more than zero and that's the important cut-off.
I realize that "2 minutes is not that much" was not my true rebuttal, that this next token my brain generated was mostly defensive reasoning rather than curious exploration of why they disagreed with my statement. Next time I could instead note they're using "morning" to have a different definition/central cluster than I, appreciate that they pointed this out, and decide if I want to explore this discrepancy or not.
Many things don't make sense if you're just doing them for local effect, but do when you consider long term gains. (something something naive consequentialism vs virtue ethics flavored stuff)
I don't strongly disagree but do weakly disagree on some points so I guess I'll answer
Re the first: if you buy into automated alignment work by human-level AGI, then trying to align ASI now seems less worth it. The strongest counterargument to this I see is that "human-level AGI" is impossible to get with our current understanding, as it will be superhuman in some things and weirdly bad at others.
Re the second: disagreements might be nitpicking on "few other approaches" vs "few currently pursued approaches". There are probably a bunch of things that would allow fundamental understanding if they panned out (various agent foundations agendas, probably-safe AI agendas like davidad's), though one can argue they won't apply to deep learning or are less promising to explore than SLT.
I don't think your second footnote sufficiently addresses the large variance in 3D visualization abilities (note that I say visualization, which includes seeing a 2D video in your mind of a 3D object and manipulating it smoothly), and overall I'm not sure what you're getting at if you don't ground your post in specific predictions about what you expect people can and cannot do thanks to their ability to visualize in 3D.
You might be ~conceptually right that our eyes see "2D" and add depth, but *um ackshually*, two eyes each receiving 2D data means you've received 4D input (using ML conventions, you've got 4 input dimensions per time unit, 5 overall in your tensor). It's very redundant, and that redundancy mostly allows you to extract depth using a local algorithm, which lets you build a 3D map in your mental representation. I don't get why you claim we don't have a 3D map at the end.
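A toy illustration of that dimension count, in case it helps (the shapes are made up for the example, not from the post):

```python
import numpy as np

# Two eyes, each seeing a 2D image with colour channels.
H, W, C = 4, 6, 3                      # tiny made-up resolution and channel count
frame = np.zeros((2, H, W, C))         # 4 input dimensions per time unit: (eye, H, W, C)
T = 5                                  # a few time steps
stream = np.zeros((T, 2, H, W, C))     # 5 dimensions overall: (time, eye, H, W, C)

print(frame.ndim, stream.ndim)         # -> 4 5
```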
Back to concrete predictions: are there things you expect a strong human visualizer couldn't do? To give intuition, I'd say a strong visualizer has at least the equivalent visualizing, modifying and measuring capabilities of SolidWorks/Blender in their mind. You tell one to visualize a 3D object they know, and they can tell you anything about it.
It seems to me the most important thing you noticed is that in real life we don't often see past the surface of things (because the spectrum of light we see doesn't penetrate most materials), and thus most people don't know the inside of 3D things very well. But that can be explained by lack of exposure rather than inability to understand 3D.
Fwiw, looking at the spheres I guessed an approximate 2.5 volume ratio. I'm curious: if you visualized yourself picking up these two spheres, imagining them made of a dense metal, one after the other, could you feel that one is 2.3 times heavier than the other?
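(For reference, my own back-of-the-envelope arithmetic, not from the post: volume scales with the cube of the radius, so a 2.3x volume ratio, and mass ratio at equal density, corresponds to a fairly modest radius ratio.)

\[
\frac{V_2}{V_1} = \left(\frac{r_2}{r_1}\right)^{3} = 2.3
\quad\Longrightarrow\quad
\frac{r_2}{r_1} = 2.3^{1/3} \approx 1.32
\]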
I'll give fake internet points to whoever actually follows the instructions and posts photographic proof.
The naming might be confusing because "pivotal act" sounds like a one-time action, but in most cases getting to a stable world without any threat from AI requires constant pivotal processes. This makes almost all the destructive approaches moot (and they're probably already bad on ethical grounds and for many other reasons already discussed) because you'll make yourself a pariah.
The most promising avenue for a pivotal act/pivotal process that I know of is doing good research so that ASI risks are known and proven, doing good outreach and education so most world leaders and decision makers are well aware of this, and helping set up good governance worldwide to monitor and limit the development of AGI and ASI until we can control it.
I recently played Outer Wilds and Subnautica, and the exercise I recommend for both games is: get to the end of the game without ever failing.
In Subnautica, failing means dying once; in Outer Wilds it's a spoiler to describe what failing is (successfully getting to the end could certainly be argued to be a fail).
I failed in both. I played Outer Wilds first and was surprised by my failure, which inspired me to play Subnautica without dying. I got pretty far but also died, from a mix of one unexpected game mechanic, careless measurement of another mechanic, and a lack of redundancy in my contingency plans.
Oh wow, that makes sense. It felt weird that you'd spend so much time on posts, yet if you didn't spend much time, it would mean you write at least as fast as Scott Alexander. Well, thanks for putting in the work. I probably don't publish much because I want good posts to not take much work, so it's reassuring to hear that it's normal that they do.
(aside: I generally like your posts' scope and clarity, mind saying how long it takes you to write something of this length?)
Self-modeling is a really important skill, and you can measure how good you are at it by writing predictions about yourself. A notably important one for people who have difficulty with motivation is predicting your own motivation: will you be motivated to do X in situation Y?
If you can answer that one generally, you can plan to actually do anything you could theoretically do, using the following algorithm (sketched in code below): from current situation A, to achieve wanted outcome Z, find a predecessor situation Y from which you'll be motivated to get to Z (eg. having written 3 of 4 paragraphs of an essay), and a predecessor situation X from which you'll get to Y; iterate until you get back to A (or forward-chain, from A to Z). Check that you'll indeed be motivated at each step of the way.
How can the above plan fail? Either you were mistaken about yourself, or about the world. Figure out which and iterate.
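A minimal sketch of that backward-chaining loop, assuming hypothetical helpers (find_predecessor and predict_motivation are placeholders I made up, not anything from the original):

```python
# Backward-chain from wanted outcome Z to current situation A, checking at each
# step that you predict you'd actually be motivated to take it.
def plan_backwards(current_situation, wanted_outcome,
                   find_predecessor, predict_motivation, max_steps=20):
    chain = [wanted_outcome]
    step = wanted_outcome
    for _ in range(max_steps):
        if step == current_situation:
            return list(reversed(chain))               # A, ..., X, Y, Z
        predecessor = find_predecessor(step)           # e.g. "3 of 4 essay paragraphs written"
        if predecessor is None:
            return None                                # you were mistaken about the world
        if not predict_motivation(predecessor, step):  # will I actually act from there?
            return None                                # you were mistaken about yourself
        chain.append(predecessor)
        step = predecessor
    return None                                        # chain didn't close; revise the plan
```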
Appreciate the highlighting of identity as this important/crucial self-fulfilling prophecy; I use that frame a lot.
What does the title mean? Since they all disagree I don't see one as being more of a minority than the other.
Nice talk!
When you talk about the most important interventions for the three scenarios, I wanna highlight that in the case of nationalization, if you're a citizen of one of the countries nationalizing AI, you can also work for the government and be on those teams, working and advocating for safe AI.
In my case I should have measurable results like higher salary, higher life satisfaction, more activity, and more productivity as measured by myself and friends/flatmates. I was very low, so it'll be easy to see progress. The difficulty was finding something that'd work, not measuring whether it does.
Some people have short AI timelines based on inner models that don't communicate well. They might say "I think if company X trains according to new technique Y it should scale well and lead to AGI, and I expect them to use technique Y in the next few years", and the reasons they think technique Y should work are some kind of deep understanding built from years of reading ML papers that's not particularly easy to transmit or debate.
In those cases, I want to avoid going into details and arguing directly, but would suggest that they use their deep knowledge of ML to predict existing recent results before looking at them. This would be easy to cheat, so I mostly suggest it for people to check themselves, or to check people you trust to be honorable. Concretely, it'd be nice if, when some new ML paper with a new technique comes out, someone compiles a list of questions answered by that paper (eg. is technique A better than technique B for a particular result) and posts it to LW, so people can track how well they understand ML, and thus (to some extent) how much to trust their short timelines.
For example, a recent paper examines how data affects performance on a bunch of benchmarks, and notably tested training either on a duplicated dataset (a bunch of Common Crawls) or a deduplicated one (the same, except removing documents that were shared between crawls). Do you expect deduplication in this case raises or lowers performance on benchmarks? If we could have similar questions when new results come out, it'd be nice.
Thank you for sharing, it really helps to pile up these stories (and it's nice to have some trust that they're real, which is harder to get from Reddit. On which note, are there non-doxxing receipts you can show for this story being true? I have no reason to doubt you in particular, but I guess it's good hygiene on the internet to ask for evidence.)
It also makes me want to share a bit of my story. I read The Mind Illuminated, and did only small amounts of meditation, yet the framing the book offers has been changing my thinking and motivational systems. There aren't many things I'd call infohazards, but in my experience even just reading the book seems to be enough to contribute to profound changes that would not obviously be considered positive by the previous me. (They're not obviously negative either, and I happen to be hopeful, but I'm waiting on results another year out before saying.)
Might be good to have a dialogue format with other people who agree/disagree to flesh out scenarios and countermeasures
Hi, I'm currently evaluating the cost effectiveness of various projects and would be interested in knowing, if you're willing to disclose, approximately how much this program costs MATS in total? By this I mean the summer cohort, including the ops before and after necessary for it to happen, but not counting the extension.
"It's true that we don't want women to be driven off by a bunch of awkward men asking them out, but if we make everyone read a document that says 'Don't ask a woman out the first time you meet her', then we'll immediately give the impression that we have a problem with men awkwardly asking women out too much — which will put women off anyway."
This seems a weak response to me, at best only defensible if you consider yourself to be on the margin and give no thought to long-term growth or your ability to clarify intentions (you have more than 3 words when interacting with people irl).
To be clear, explicitly writing "don't ask a woman out the first time you meet her" would be terrible writing, and if that's the best guideline writing members of that group can do, then maybe nothing is better than that. Still, it reeks of "we've tried for 30 seconds and are all out of ideas" energy.
A guidelines document can give high-level guidance on the vibe you want (eg. truth seeking, not too much aggressiveness, giving space when people feel uncomfortable, communicating around norms explicitly), all phrased positively (eg. you say what you want, not what you don't want), and can refer to sub-documents to give examples and be quite concrete if you have socially impaired people around who need to learn this explicitly.
Note that "existential" is a term of art distinct from "extinction".
The Precipice cites Bostrom and defines it thus:
"An existential catastrophe is the destruction of humanity’s longterm potential.
An existential risk is a risk that threatens the destruction of humanity’s longterm potential."
Disempowerment is generally considered an existential risk in the literature.
I participated in the previous edition of AISC and found it very valuable to my involvement in AI safety. I acquired knowledge (on standards and the standards process), and gained experience and contacts. I appreciate how much coordination AISC enables, with groups forming, which lets many people have their first hands-on experience and step up their involvement.
Thanks, and thank you for this post in the first place!
Jonathan Claybrough
Actually no, I think the project lead here is jonachro@gmail.com which I guess sounds a bit like me, but isn't me ^^
Would be up for this project. As is, I downvoted Trevor's post for how rambly and repetitive it is. There's a nugget of an idea, that AI can be used for psychological/information warfare, that I was interested in learning about, but the post doesn't seem to have much substantive argument to it, so I'd be interested in someone doing a much shorter version which argues its case with some sources.
It's a nice pic and moment, I very much like this comic and the original scene. It might be exaggerating a trait (here by having the girl be particularly young) for comedic effect but the Hogfather seems right.
I think I was around 9 when I got my first sword, around 10 for a sharp knife. I have a scar in my left palm from stabbing myself with that sharp knife as a child while whittling wood for a bow. It hurt for a bit, and I learned to whittle away from me or do so more carefully. I'm pretty sure my life is better for it and (from having this nice story attached to it) I like the scar.
This story still presents the endless conundrum between avoiding hurt and letting people learn and gain skills.
Assuming the world was mostly the same as nowadays, by the time your children are parenting, would they have the skills to notice sharp corners if they never experienced them?
I think my intuitive approach here would be to put on some not-too-soft padding (which is effectively close to what you did; it's still an unpleasant experience hitting that, even with the cloth).
What's missing is how to teach against existential risks. There's an extent to which actually bleeding profusely from a sharp corner can help one learn to walk carefully and anticipate dangers, and these skills do generalize to many situations and allow one to live a long, fruitful life. (This last sentence does not pertain to the actual age of your children and doesn't address the ideal ages at which one can actually learn the correct and generalizable thing.) If you have control over the future, remove all the sharp edges forever.
If you don't, you remove the hard edges when they're young, and reinstate them when they can/should learn to recognize what typically are hard edges that must be accounted for.
Are people losing the ability to use and communicate in previous ontologies after getting Insight from meditation? (Or maybe they never had the understanding I'm expecting of them?) Should I be worried myself, in my practice of meditation?
Today I reread Kensho by @Valentine, which presents Looking, and the ensuing conversation in the comments between @Said Achmiz and @dsatan, where Said asks for concrete benefits we can observe and mostly fails to get them. I also noticed interesting comments by @Ruby, who in contrast was still able to communicate in the more typical LW ontology, but hadn't meditated to the point of Enlightenment. Is Enlightenment bad? Different?
My impression is that people don't become drastically better (at epistemology, rationality, social interaction, actually achieving their goals and values robustly) very fast through meditating or getting Enlightened, though they may acquire useful skills that could help them get better. If that's the case, it's safe for me to continue practicing meditation, getting into Jhanas, Insight etc (I'm following The Mind Illuminated), as the failure of Valentine/dsatan to communicate their points could just be attributed to them not being able to before either.
But I remain wary that people spend so much time engaging and believing in the models and practices taught in meditation material that they actually change their minds for the worse in certain respects. It looks like meditation ontologies/practices are Out to Get You and I don't want to get Got.
I focused my answer on the morally charged side, not the emotional one. The quoted statement said "A and B", so as long as B is mostly true for vegans, "A and B" is mostly true for (a sub-group of) vegans.
I'd agree with the characterization "it’s deeply emotionally and morally charged for one side in a conversation, and often emotional to the other." because most people don't have small identities and do feel attacked by others behaving differently indeed.
It's standard that the morally charged side in a veganism conversation is the people arguing for veganism.
Your response reads as snarky, since you pretend to have understood the contrary. You're illustrating op's point, that certain vegans are emotionally attached to their cause and jump at the occasion to defend their tribe. If you disagree with being pictured a certain way, at least act so that it isn't accurate to depict you that way.
Did you know about "by default, GPTs think in plain sight"?
It doesn't explicitly talk about agentized GPTs, but it discusses the impact this has on GPTs for AGI, how it affects the risks, and what we should do about it (eg. maybe RLHF is dangerous).
To not be misinterpreted: I didn't say I'm sure it's more the format than the content that's causing the upvotes (open question), nor that this post doesn't meet the absolute quality bar that normally warrants 100+ upvotes (to each reader their opinion).
If you're open to discussing this at the object level, I can point to concrete disagreements with the content. Most importantly, this should not be seen as a paradigm shift, because it does not invalidate any of the previous threat models - it would only be so if it made it impossible to build AGI any other way. I also don't think this should "change the alignment landscape", because it's just another part of it, one which was known and has been worked on for years (Anthropic and OpenAI have been "aligning" LLMs, and I'd bet 10:1 they anticipated these would be used to build agents, like most people I know in alignment).
To clarify, I do think it's really important and great that people work on this, and that chronologically this will be the first x-risk stuff we see. But we could solve the GPT-agent problem and still die to unaligned AGI 3 months afterwards. The fact that the world trajectory we're on is throwing additional problems into the mix (keeping the world safe from short-term misuse and unaligned GPT-agents) doesn't make the existing problems simpler. There is still pressure to build autonomous AGI, there might still be mesa-optimizers, there might still be deception, etc. We need the manpower to work on all of these, and not "shift the alignment landscape" to just focus on the short-term risks.
I'd recommend not worrying much about PR risk and just asking the direct question: even if this post is only ever read by LW folk, does the "break all encryption" suggestion add to the conversation? Causing people to take time to debunk certain suggestions isn't productive even without the context of PR risk.
Overall I'd like some feedback on my tone, whether it's too direct/aggressive for you or it's fine. I can adapt.
You can read "reward is not the optimization target" for why a GPT system probably won't be goal oriented to become the best at predicting tokens, and thus wouldn't do the things you suggested (capturing humans). The way we train AI matters for what their behaviours look like, and text transformers trained on prediction loss seem to behave more like Simulators. This doesn't make them not dangerous, as they could be prompted to simulate misaligned agents (by misuses or accident), or have inner misaligned mesa-optimisers.
I've linked some good resources for directly answering your question, but otherwise to read more broadly on AI safety I can point you towards the AGI Safety Fundamentals course which you can read online, or join a reading group. Generally you can head over to AI Safety Support, check out their "lots of links" page and join the AI Alignment Slack, which has a channel for question too.
Finally, how does complexity emerge from simplicity? Hard to answer the details for AI, and you probably need to delve into those details to have real picture, but there's at least strong reason to think it's possible : we exist. Life originated from "simple" processes (at least in the sense of being mechanistic, non agentic), chemical reactions etc. It evolved to cells, multi cells, grew etc. Look into the history of life and evolution and you'll have one answer to how simplicity (optimize for reproductive fitness) led to self improvement and self awareness
Quick meta comment to express that I'm uncertain posting things in lists of 10 is a good direction. The advantages might be real: easy to post, quick feedback, easy interaction, etc.
But the main disadvantage is that this comparatively drowns out other, better posts (with more thought and value in them). I'm unsure if the content of the post was also importantly missing from the conversation (for many readers) and that's why this got upvoted so fast, or if it's largely the format... Even if this post isn't bad (and I'd argue it is, for the suggestions it promotes), this is an early warning of a possible trend where people with less thought-out takes quickly post highly accessible content, get comparatively more upvotes than they should, and it becomes harder to find good content.
(Additional disclosure: some of my bad taste for this post comes from the fact that its call to break all encryption is being cited on Twitter as representative of the alignment community. I'd have liked to answer that obviously it's not, but it got many upvotes! This makes my meta point also seem motivated by PR/optics, which is why it felt necessary to disclose, but let's mostly focus on consequences inside the community.)
First, a quick response on your dead man's switch proposal: I'd generally say I support something in that direction. You can find existing literature considering the subject and expanding in different directions in the "multi-level boxing" paper by Alexey Turchin (https://philpapers.org/rec/TURCTT). I think you'll find it interesting given your proposal, and it might give a better idea of what the state of the art is on proposals (though we don't have any implementation afaik).
Back to "why are the predicted probabilities so extreme that for most objectives, the optimal resolution ends with humans dead or worse". I suggest considering a few simple objectives we could give an AI (that it should maximise) and what happens; over trials you see that it's pretty hard to specify anything which actually keeps humans alive in some good shape, and that even when we can sorta do that, it might not be robust or trainable.
For example, what happens if you ask an ASI to maximize a company's profit? To maximize human smiles? To maximize law enforcement? Most of these things don't actually require humans, so to maximize, you should use the atoms humans are made of to fulfill your maximization goal.
What happens if you ask an ASI to maximize the number of human lives? (Probably poor conditions.) What happens if you ask it to maximize hedonistic pleasure? (Probably value lock-in, plus a world which we don't actually endorse and which may contain astronomical suffering too; it's not like that was specified out, was it?)
So it seems maximising agents with simple utility functions (over few variables) mostly end up with dead humans or worse. Approaches which ask for much less, eg. an AGI that just tries to secure the world from existential risk (a pivotal act), solves some basic problems (like dying), then gives us time for a long reflection to actually decide what future we want, and is corrigible so it lets us do that, seem safer and more approachable.
I watched the video, and appreciate that he seems to know the literature quite well and has thought about this a fair bit - he gave a really good introduction to some of the known problems.
This particular video doesn't go into much detail on his proposal, and I'd have to read his papers to delve further - this seems worthwhile so I'll add some to my reading list.
I can still point out the biggest ways in which I see him being overconfident:
- Only considering the multi-agent world. Though he's right that there already are and will be many, many deployed AI systems, that does not translate to there being many deployed state-of-the-art systems. As long as training costs and inference costs continue increasing (as they have), then on the contrary fewer and fewer actors will be able to afford state-of-the-art system training and deployment, leading to very few (or one) significantly powerful AGIs (as compared to the others, for example GPT-4 vs GPT-2).
- Not considering the impact that governance and policies could have on this. This isn't just a tech thing where tech people can do whatever they want forever; regulation is coming. If we think we have higher chances of survival in highly regulated worlds, then the AI safety community will do a bunch of work to ensure fast and effective regulation (to the extent possible). The genie is not out of the bottle for powerful AGI: governments can control compute, regulate powerful AI as weapons, and set up international agreements to ensure this.
- The hope that game theory ensures that AI developed under his principles would be good for humans. There's a crucial gap in going from the real world to math models. Game theory might predict good results under certain conditions, rules and assumptions, but many of these aren't true of the real world, and simple game theory does not yield accurate world predictions (eg. make people play various social games and they won't act how game theory says). Stated strongly, putting your hope in game theory is about as hard as putting your hope in alignment. There's nothing magical about game theory which makes getting it to work simpler than solving alignment, and it's been studied extensively by AI researchers (eg. why Eliezer calls himself a decision theorist and writes a lot about economics) with no clear "we've found a theory which empirically works robustly and in which we can put the fate of humanity".
I work in AI strategy and governance, and feel we have better chances of survival in a world where powerful AI is limited to extremely few actors, with international supervision and cooperation for the guidance and use of these systems, making extreme efforts in engineering safety, in corrigibility, etc. I don't trust predictions on how complex systems turn out (which is the case for real multi-agent problems) and don't think we can control these well in most relevant cases.
Writing down predictions. The main caveat is that these predictions are predictions about how the author will resolve these questions, not my beliefs about how these techniques will work in the future. I am pretty confident at this stage that value editing can work very well in LLMs when we figure it out, but not so much that the first try will have panned out.
- Algebraic value editing works (for at least one "X vector") in LMs: 90%
- Algebraic value editing works better for larger models, all else equal: 75%
- If value edits work well, they are also composable: 80%
- If value edits work at all, they are hard to make without substantially degrading capabilities: 25%
- We will claim we found an X-vector which qualitatively modifies completions in a range of situations, for X =
  - "truth-telling": 10%
  - "love": 70%
  - "accepting death": 20%
  - "speaking French": 80%
I don't think reasoning about others' beliefs and thoughts is helping you be correct about the world here. Can you instead try to engage with the arguments themselves and point out at what step you don't see a concrete way for that to happen?
You don't show much sign of having read the article, so I'll copy-paste the part with explanations of how AIs start acting in the physical space.
In this scenario, the AIs face a challenge: if it becomes obvious to everyone that they are trying to defeat humanity, humans could attack or shut down a few concentrated areas where most of the servers are, and hence drastically reduce AIs' numbers. So the AIs need a way of getting one or more "AI headquarters": property they control where they can safely operate servers and factories, do research, make plans and construct robots/drones/other military equipment.
Their goal is ultimately to have enough AIs, robots, etc. to be able to defeat the rest of humanity combined. This might mean constructing overwhelming amounts of military equipment, or thoroughly infiltrating computer systems worldwide to the point where they can disable or control most others' equipment, or researching and deploying extremely powerful weapons (e.g., bioweapons), or a combination.
Here are some ways they could get to that point:
- They could recruit human allies through many different methods - manipulation, deception, blackmail and other threats, genuine promises along the lines of "We're probably going to end up in charge somehow, and we'll treat you better when we do."
- Human allies could be given valuable intellectual property (developed by AIs), given instructions for making lots of money, and asked to rent their own servers and acquire their own property where an "AI headquarters" can be set up. Since the "AI headquarters" would officially be human property, it could be very hard for authorities to detect and respond to the danger.
- Via threats, AIs might be able to get key humans to cooperate with them - such as political leaders, or the CEOs of companies running lots of AIs. This would open up further strategies.
- As assumed above, particular companies are running huge numbers of AIs. The AIs being run by these companies might find security holes in the companies' servers (this isn't the topic of this piece, but my general impression is that security holes are widespread and that reasonably competent people can find many of them), and thereby might find opportunities to create durable "fakery" about what they're up to.
- E.g., they might set things up so that as far as humans can tell, it looks like all of the AI systems are hard at work creating profit-making opportunities for the company, when in fact they're essentially using the server farm as their headquarters - and/or trying to establish a headquarters somewhere else (by recruiting human allies, sending money to outside bank accounts, using that money to acquire property and servers, etc.)
- If AIs are in wide enough use, they might already be operating lots of drones and other military equipment, in which case it could be pretty straightforward to be able to defend some piece of territory - or to strike a deal with some government to enlist its help in doing so.
- AIs could mix-and-match the above methods and others: for example, creating "fakery" long enough to recruit some key human allies, then attempting to threaten and control humans in key positions of power to the point where they control solid amounts of military resources, then using this to establish a "headquarters."
So is there anything here you don't think is possible?
Getting human allies? Being in control of large sums of compute while staying undercover? Doing science, and getting human contractors/allies to produce the results? Etc.
I think this post would benefit from being more explicit on its target. This problem concerns AGI labs and their employees on one hand, and anyone trying to build a solution to Alignment/AI Safety on the other.
By narrowing the scope to the labs, we can better evaluate the proposed solutions (for example, to improve decision making we'll need to influence the decision makers therein), make them more focused (to the point of being lab-specific, analyzing each lab's pressures), and think of new solutions (inoculating ourselves/other decision makers on AI against believing stuff that comes from those labs by adding a strong dose of healthy skepticism).
By narrowing the scope to people working on AI safety whose status or monetary support relies on giving impressions of progress, we come up with different solutions (try to explicitly reward honesty, truthfulness and clarity over hype and story-making). A general recommendation I'd have is some kind of review that checks against "Wizard of Oz'ing", flagging the behavior and suggesting corrections. Currently I'd say the diversity of LW and its norms for truth seeking are doing quite well at this, so posting here publicly is a great way to keep this in check. It highlights the importance of this place and of upholding these norms.
Thanks for the reply!
The main reason I didn't understand (despite some things being listed) is I assumed none of that was happening at Lightcone (because I guessed you would filter out EAs with bad takes in favor of rationalists for example). The fact that some people in EA (a huge broad community) are probably wrong about some things didn't seem to be an argument that Lightcone Offices would be ineffective as (AFAIK) you could filter people at your discretion.
More specifically, I had no idea "a huge component of the Lightcone Offices was causing people to work at those organizations". That's strikingly more of a debatable move, but I'm curious: why did that happen in the first place? In my field building in France we talk of x-risk and alignment, and people don't want to accelerate the labs but do want to slow down or do alignment work. I feel a bit preachy here, but internally it just feels like the obvious move is "stop doing the probably bad thing", though I do understand that if you got into this situation unexpectedly, you'll have a better chance burning this place down and creating a fresh one with better norms.
Overall I get a weird feeling of "the people doing bad stuff are being protected again, we should name more precisely who's doing the bad stuff and why we think it's bad" (because I feel aimed at by vague descriptions like field-building, even though I certainly don't feel like I contributed to any of the bad stuff being pointed at)
No, this does not characterize my opinion very well. I don't think "worrying about downside risk" is a good pointer to what I think will help, and I wouldn't characterize the problem that people have spent too little effort or too little time on worrying about downside risk. I think people do care about downside risk, I just also think there are consistent and predictable biases that cause people to be unable to stop, or be unable to properly notice certain types of downside risk, though that statement feels in my mind kind of vacuous and like it just doesn't capture the vast majority of the interesting detail of my model.
So it's not a problem of not caring, but of not succeeding at the task. I assume the kind of errors you're pointing at are things which should happen less with more practiced rationalists? I guess then we can either filter to only have people who are already pretty good rationalists, or train them (I don't know if there are good results on that side from CFAR).
I don't think cost had that much to do with the decision, I expect that Open Philanthropy thought it was worth the money and would have been willing to continue funding at this price point.
In general I think the correct response to uncertainty is not half-speed. In my opinion it was the right call to spend this amount of funding on the office for the last ~6 months of its existence even when we thought we'd likely do something quite different afterwards, because it was still marginally worth doing it and the cost-effectiveness calculations for the use of billions of dollars of x-risk money on the current margin are typically quite extreme.
You're probably not the one to rant to about funding but I guess while the conversation is open I could use additional feedback and some reasons for why OpenPhil wouldn't be irresponsible in spending the money that way. (I only talk about OpenPhil and not particularly Lightcone, maybe you couldn't think of better ways to spend the money and didn't have other options)
Cost effectiveness calculations for reducing x-risk kinda always favor x-risk reduction so looking at it in the absolute isn't relevant. Currently AI x-risk reduction work is limited because of severe funding restrictions (there are many useful things to do that no one is doing for lack of money) which should warrant carefully done triage (and in particular considering the counterfactual).
I assume the average Lightcone office resident would be doing the same work with slightly reduced productivity (let's say by 1/3) if they didn't have that office space (notably because many are rich enough to get other shared office space out of their own pocket). Assuming 30 full-time workers in the office, that's 10 man-months per month of extra x-risk reduction work.
For contrast, over the same time period, $185k/month could provide salary, lodging and office space for 50 people in Europe, all of whom counterfactually would not be doing that work otherwise, for which I claim 50 man-months per month of extra x-risk reduction work. The biggest difference I see is that incubation time would be longer than for the Lightcone offices, but if I started now with $20k/month I'd find 5 people and scale it up to 50 by the end of the year.
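(Spelling out the arithmetic behind those two claims, using only my own rough numbers from above:)

\[
30 \text{ residents} \times \tfrac{1}{3} = 10 \text{ man-months/month},
\qquad
\frac{\$185\text{k/month}}{50 \text{ people}} = \$3.7\text{k per person-month} \;\Rightarrow\; 50 \text{ man-months/month}
\]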
I've multiple times been perplexed as to what the past events that lead to this kind of take (over 7 years ago, the EA/rationality community's influence probably accelerated OpenAI's creation) have to do with today's shutting down of the offices.
Are there current, present-day things going on in the EA and rationality community which you think warrant suspecting them of being incredibly net negative (causing worse worlds, conditioned on the current setup)? Things done in the last 6 months? At Lightcone Offices? (Though I'd appreciate specific examples, I'd already greatly appreciate knowing if there is something in the abstract, and would prefer a quick response at that level of precision to nothing.)
I've imagined an answer; is the following on your mind?
"EAs are more about saying they care about numbers than actually caring about numbers, and didn't calculate downside risk enough in the past. The past events reveal this attitude and because it's not expected to have changed, we can expect it to still be affecting current EAs, who will continue causing great harm because of not actually caring for downside risk enough. "
I mostly endorse having one office concentrate on one research agenda and be able to have high-quality conversations on it, and the stated number of maybe 10 to 20% of people working on strategy/meta sounds fine in that context. Still, I want to emphasize how crucial they are: if you have no one to figure out the path between your technical work and overall reducing risk, you're probably missing better paths and approaches (and maybe not realizing your work is useless).
Overall I'd say we don't have enough strategy work being done, and believe it's warranted to have spaces with 70% of people working on strategy/meta. I don't think it was bad if the Lightcone office had a lot of strategy work. (We probably also don't have enough technical alignment work, having more of both is probably good, if we coordinate properly)
Other commenters have argued about the correctness of using Shoggoth. I think it's mostly a correct term if you take it in the Lovecraftian sense and consider that currently we don't understand LLMs that much. Interpretability might work and we might progress, so we're not sure they actually are incomprehensible like Shoggoths (though according to Wikipedia, Shoggoths are made of physics, so probably advanced civilizations could get to a point where they could understand them; the analogy holds surprisingly well!)
Anyhow it's a good meme and useful to say "hey, we don't understand these things as well as you might imagine from interacting with the smiley face" to describe our current state of knowledge.
Now for trying to construct some idea of what it is.
I'll argue a bit against calling an LLM a pile of masks, as that seems to carry implications I don't believe in. The question we're asking ourselves is something like "what kind of algorithms/patterns do we expect to appear when an LLM is trained? Do those look like a pile of masks, or like a more general simulator that creates masks on the fly?", and the answer depends on specifics and optimization pressure. I want to sketch out different stages we could hope to see and understand better (and I'd like for us to test this empirically and find out how true this is). Earlier stages don't disappear, as they're still useful at all times, though other things start weighing more in the next-token predictions.
Level 0: Incompetent, random weights, no information about the real world or text space.
Level 1 "Statistics 101": Dumb heuristics that don't take word positions into account.
It knows facts about the world like token distribution and uses that.
Level 2 "Statistics 201": Better heuristics, some equivalent to Markov chains.
Its knowledge of text space increases; it produces idioms and reproduces common patterns. At this stage it already contains a huge amount of information about the world. It "knows" stuff like mirrors being likely to break and cause 7 years of bad luck.
Level 3 "+ Simple algorithms": Some pretty specific algorithms appear (like Indirect Object Identification), which can search for certain information and transfer it in more sophisticated ways. Some of these algorithms are good enough that they might not properly be described as heuristics anymore, but instead really represent the actual rules, as strongly as they exist in language (like rules of grammar properly applied). Note these circuits appear multiple times and trade off against other things, so overall behavior is still stochastic; there are heuristics on how much to weight these algorithms and other info.
Level 4 "Simulating what created that text": This is where it starts to have more and more powerful in-context learning, ie. its weights represent algorithms which do in-context search (combined with its vast encyclopedic knowledge of texts, tropes, genres) and figure out consistencies in characters or concepts introduced in the prompt. For example, it'll pick up on Alice's and Bob's different backgrounds, latent knowledge about them, their accents.
But it only does that because that's what authors generally do, and it has the same reasoning errors common to tropes. That's because it simulates not the content of the text (the characters in the story), but the thing which generates the story (the writer, who themselves have some simulation of the characters).
So, uh, do masks or a pile of masks fit anywhere in this story? Not that I see. The mask is a specific metaphor for the RLHF finetuning which causes mode collapse and makes the LLM mostly only play the nice assistant (and its opposites). It's a constraint or bridle or something, but if the training is light (doesn't affect the weights too much), then we expect the LLM to mostly be the same, and that wasn't masks.
Nor is there a pile of masks. It's a bunch of weights really good at token prediction, learning more and more sophisticated strategies for this. It encodes stereotypes in different places (maybe french=seduction or SF=techbro), but I don't believe these map onto different characters. Instead, I expect that at level 4 there's a more general algorithm which pieces together the different knowledge, so that it in-context learns to simulate certain agents. Thus, if you just take mask to mean "character", the LLM isn't a pile of them, but a machine which can produce them on demand.
(In this view of LLMs, x-risk happens because we feed some input where the LLM simulates an agentic, deceptive, self-aware agent which steers the outputs until it escapes the box.)
Cool that you wanna get involved! I recommend the most important thing to do is coordinate with other people already working on AI safety, because they might have plans and projects already going on you can help with, and to avoid the unilateralist's curse.
So, a bunch of places to look into to both understand the field of AI safety better and find people to collaborate with:
http://aisafety.world/tiles/ (lists different people and institutions working on AI safety)
https://coda.io/@alignmentdev/alignmentecosystemdevelopment (lists AI safety communities, you might join some international ones or local ones near you)
I have an agenda around outreach (convincing relevant people to take AI safety seriously) and think it can be done productively, though it wouldn't look like 'screaming on the rooftops', but more expert discussion with relevant evidence.
I'm happy to give an introduction to the field and give initial advice on promising directions, anyone interested dm me and we can schedule that.
I generally explain my interest in doing good and considering ethics (despite being anti-realist) with something like your point 5, and I don't agree with or fully get your refutation that it's not a good explanation, so I'll engage with it and hope for clarifications.
My position, despite anti-realism and moral relativism, is that I do happen to have values (which I call "personal values": they're mine, and I don't think there's an absolute reason for anyone else to have them, though I will advocate for them to some extent) and epistemics (despite the problem of the criterion) that have initialized in a space where I want to do Good, I want to know what is Good, and I want to iterate on improving my understanding and my actions doing Good.
A quick question: when you say "Personally, though, I collect stamps", do you mean your personal approach to ethics is descriptive and exploratory (you're collecting stamps in the sense of the physics-vs-stamp-collecting image), and that you don't identify as a systematizer?
I wouldn't identify as a "systematizer for its own sake" either; it's not a terminal value, but it's an instrumental value for achieving my goal of doing Good. I happen to have priors and heuristics saying I can do more Good by systematizing better, so I do, and I get positive feedback from it so I continue.
Re "conspicuous absence of subject-matter" - true for an anti realist considering "absolute ethics", but this doesn't stop an anti realist considering what they'll call "my ethics". There can be as much subject-matter there as in realist absolute ethics, because you can simulate absolute ethics in "my ethics" with : "I epistemically believe there is no true absolute ethics, but my personal ethics is that I should adopt what I imagine would be the absolute real ethics if it existed". I assume this is an existing theorized position but not knowing if it already has another standard name, I call this being a "quasi realist", which is how I'd describe myself currently.
I don't buy Anti realists treating consistency as absolute, so there's nothing to explain. I view valuing consistency as being instrumental and it happens to win all the time (every ethics has it) because of the math that you can't rule out anything otherwise. I think the person who answers "eh, I don’t care that much about being ethically consistent" is correct that it's not in their terminal values, but miscalculates (they actually should value it instrumentally), it's a good mistake to point out.
I agree that someone who tries to justify their intransitivities by saying "oh I'm inconsistent" is throwing out the baby with the bathwater when they could simply say "I'm deciding to be intransitive here because it better fits my points". Again, it's a good mistake to point out.
I see anti realists as just picking up consistency because it's a good property to have for useful ethics, not because "Ethics" forced it onto them (it couldn't, it doesn't exist).
On the final paragraph, I would write my position as: "I do ethics, as an anti-realist, because I have a brute, personal preference for Doing Good (a cluster of helping other people, reducing suffering, anything that stems from the Veil of Uncertainty and is intuitively appealing), and this is self-reinforcing (I consider it Good to want to do Good and to improve at doing Good), so I want to improve my ethics. There exist zones of value space where I'm in the dark and have no intuition (eg. population ethics/the repugnant conclusion), so I use good properties (consistency, ...) to craft a curve which extends my ethics, not out of personal preference for blah-structural-properties, but out of belief that this will satisfy my preference for Doing Good the best".
If a dilemma comes up pitting object-level stakes against some abstract structural constraint, I weigh my belief that my intuition on "is this Good" is correct against my belief that "the model of ethics I constructed from other points is correct", and I'll probably update one or both. Because of the problem of the criterion, I'm not going to trust either my ethics or my data points as absolute. I have uncertainty on the position of all my points and on the best shape of the curve, so sometimes I move my estimate of a point's position because it fits the curve better, and sometimes I move the curve's shape because I'm pretty sure the point should be there.
I hope that's a fully satisfying answer to "Why do ethical anti-realists do ethics".
I wouldn't say there's an absolute reason why ethical anti-realists should do ethics.
Mostly unrelated - I'm curious about the page you linked to https://acsresearch.org/
As far as I can see, this is a fun site with a network simulation and no explanation. I'd have liked to see an about page with the stated goals of ACS (or simply a link to your introductory post) so I can point to that site when talking about you.
I don't dispute that at some point in time we want to solve alignment (to come out of the precipice period), but I disputed that it's more dangerous to know how to point AI before having solved what perfect goal to give it.
In fact, I think it's less dangerous, because we at minimum gain more time to work on and solve alignment, and at best can use existing near-human-level AGI to help us solve alignment too. The main reason to believe this is that near-human-level AGI is a particular zone where we can detect deception, where it can't easily unbox itself and take over, yet it is still useful. The longer we stay in this zone, the more relatively safe progress we can make (including on alignment).
Thanks for the answers. It seems they mostly point to you valuing stuff like freedom/autonomy/self-realization, and finding violations of that distasteful. I think your answers are pretty reasonable, and though I might not have the exact same level of sensitivity, I agree with the ballpark and ranking (brainwashing is worse than explaining, teaching chess exclusively feels a little too heavy-handed...).
So where our intuitions differ is probably that you're applying these heuristics about valuing freedom/autonomy/self-realization to the AI systems we train? Do you see them as people, or more abstractly as moral patients (because they're probably conscious or something)?
I won't get into the moral weeds too fast. I'd point out that though I do currently mostly believe consciousness and moral patienthood are quite achievable "in silico", that doesn't mean every intelligent system is conscious or a moral patient, and we might create AGI that isn't of that kind. If you suppose AGI is conscious and a moral patient, then yeah, I guess you can argue against it being pointed somewhere, but I'd mostly counter-argue from moral relativism that "letting it point anywhere" is not fundamentally more good than "pointed somewhere", so since we exist and have morals, let's point it to our morals anyway.