If you’re so sure interest rates will go up, why settle for ~10% returns when there are Bitcoins (and many other assets including stocks) you can short?
Because that's a much more specific bet that depends on other factors (e.g. inflation, other influences on Bitcoin demand). Rising (real) interest rates seem far more certain to happen than rising prices of any given company, especially considering most of the companies pursuing AGI are private.
I'm not sure where the 10% returns come from, but you can make way, way more than that by betting on rising rates. For example, you can currently buy deep OTM SOFR calls expiring in 3 years with a strike at 100bp for $137. If I understand the pricing of the contracts correctly, if quarterly rates double from 1% to 2%, that would represent a $2,500 change in the price of the contract, so a 17x return.
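To make the arithmetic explicit, here's a rough sketch of my reading of it (the $25-per-basis-point figure is the standard tick value for CME 3-month SOFR futures, which I'm assuming applies here; the premium and the 100bp move are the numbers quoted above):

```python
# Rough sketch of the return calculation; the contract convention is my assumption
# (CME 3-month SOFR futures: a 1bp move in the rate = $25 per contract).

PREMIUM = 137.0        # quoted cost of the deep-OTM option, in dollars
DOLLARS_PER_BP = 25.0  # assumed standard tick value per basis point

rate_move_bp = 100                           # rates going from 1% to 2%
payoff = rate_move_bp * DOLLARS_PER_BP       # $2,500 change in contract value
net_return = (payoff - PREMIUM) / PREMIUM    # return net of the premium paid

print(payoff, round(net_return, 1))          # 2500.0 17.2
```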
I'm not sure which part of this you think is assuming the conclusion. That our values are not maximally about power acquisition should be clear. The part about evolution continuously pushing the world in that direction is what I've tried to explain in the post, though it should be read as "there is a force pushing in that direction" rather than "the resulting force is in that direction".
Natural selection doesn't produce "bad" outcomes; it produces expansionist, power-seeking outcomes, not at the level of the individual but of the whole system. Along the way it will produce many intermediate states that have adjacent, even more power-seeking states which are simply harder to find.
Humans developed several altruistic values because it was what produced the most fitness in the local search natural selection was running at the time. Cooperating with individuals from your tribe would lead to better outcomes than purely selfish behavior.
The modern world doesn't make "evil" optimal. The reason violence has declined is that negative-sum games among similarly capable individuals are an incredible waste of resources, and we are undergoing selection against that at many different levels: violent people frequently died in battle or were executed throughout history, societies that enforced strong punishment against violence prospered more than ones that didn't, and cultures and religions that encouraged less harm made the groups that adopted them prosper more.
I'm not sure what about the OP or the linked paper would make you conclude anything you have concluded.
The reason we shouldn't expect cooperation from AI is that it is remarkably more powerful than humans, and it may very well have better outcomes by paying the tiny cost of fighting humans if it can then turn all of us into more of it. I'm sure the pigs caged in our factory farms wouldn't agree with your sense that the passage of time is favoring "goodness".
There is also a huge asymmetry in AIs' capability for self-modification, expansion and merging. In fact, I'd expect them to be less violent among themselves than humans are, merging into single entities to avoid wasteful negative-sum competition - something that is impossible for humans to do.
One final thought: it may be that natural selection actually favors AI that cares more about humans than humans care about each other. Sound preposterous? Consider that there are species (such as Tasmanian devils) that present-day humans care about conserving but where the members of the species don't show much friendliness to each other.
Regarding this, I don't think it's preposterous at all. It might be that initial cooperation with humans gives a head start to the first AI, which "locks in" a cooperative value that it carries on even once it no longer needs to. But longer term, I don't know what would happen.
A simple evolutionary argument is enough to justify a very strong prior that kidney donation is significantly harmful for health: we have two of them, they aren't on an evolutionary path to disappearing, and modern conditions have changed almost nothing about the usage or availability of kidneys.
I think the whole situation with kidney donations reflects quite poorly on the epistemic rigor of the community. Scott Alexander probably paid more than $5k merely in the opportunity cost of the time he spent researching the topic, given the positive externalities of his work.
High growth rates mean a higher opportunity cost in lending money, since you could invest it elsewhere and get a higher return, which reduces the supply of loans. They also mean more demand for loans, since if interest rates are low, people will borrow to buy assets that appreciate faster than the interest rate.
The vast majority of the risk seems to lie in following through with synthesizing and releasing the pathogen, not in learning how to do it, and I think open-source LLMs change little about that.
My interpretation of calling something "ideal" is that it presents that thing as unachievable from the start, so it wouldn't be your fault if you failed to achieve it, whereas "in a sane world" clearly describes our current behavior as bad and possible to change.
I think the idea that we're going to be able to precisely steer government policy to achieve nuanced outcomes is dead on arrival - we've been failing at that forever. What's in our favor this time is that there are many more ways to cripple progress than to accelerate it, so it may be enough for the push to be merely directionally right for things to slow down (with a lot of collateral damage).
There is a massive tradeoff between nuance/high epistemic integrity and reach. The general population is not going to engage with complex, nuanced arguments about this, and the prestigious or high-powered people who can follow the discussion and potentially steer government policy in a meaningful way won't engage in this type of protest, for many reasons. So the movement should be ready to dumb down, or at least simplify, the message in order to increase reach, or risk remaining a niche group (I think "Pause AI" is already a good slogan in that sense).
Really interesting to go back to this today. Rates are at their highest level in 16 years, and TTT is up 60%+.
Humans are not choosing to reward specific instances of actions of the AI - when we build intelligent agents, at some point they will leave the confines of curated training data and go operate on new experiences in the real world. At that point, their circuitry and rewards are out of human control, so that makes our position perfectly analogous to evolution’s. We are choosing the reward mechanism, not the reward.
I think the focus on "inclusive genetic fitness" as evolution's "goal" is weird. I'm not even sure it makes sense to talk about evolution's "goals", but if you want to call it an optimization process, the choice of "inclusive genetic fitness" as its target is arbitrary as there are many other boundaries one could trace. Evolution is acting at all levels, e.g. gene, cell, organism, species, the entirety of life on Earth. For example, it is not selecting adaptations which increase the genetic fitness of an individual but lead to the extinction of the species later. In the most basic sense evolution is selecting for "things that expand", in the entire universe, and humans definitely seem partially aligned with that - the ways in which they aren't seem non-competitive with this goal.
Since the outdoor temperature was lower in the control, ignoring it will inflate how much the two-hose unit outperforms, by bringing the effect of both units closer to zero. If we assume the temperature differences the units and the control produce are approximately constant in this outdoor temperature range, then the difference to control would be 3.1ºC for the one-hose unit and 5ºC for the two-hose unit if the control's outdoor temperature were the same, meaning the two-hose unit only outperforms by ~60% with the fan on high, and merely ~30% with the fan on low.
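To make the adjustment concrete, here is a small sketch with made-up raw readings (not the post's actual data, just hypothetical numbers chosen to reproduce the 3.1ºC / 5ºC figures for the fan-on-high case):

```python
# Hypothetical readings (degrees C) standing in for the post's data, only to show the adjustment.
control = {"outdoor": 32.0, "indoor": 30.0}
one_hose = {"outdoor": 34.0, "indoor": 28.9}
two_hose = {"outdoor": 34.0, "indoor": 27.0}

def adjusted_effect(unit, control):
    # Assume the control's indoor-vs-outdoor offset stays constant over this range,
    # so shift its indoor reading by the outdoor-temperature gap before comparing.
    outdoor_gap = unit["outdoor"] - control["outdoor"]
    adjusted_control_indoor = control["indoor"] + outdoor_gap
    return adjusted_control_indoor - unit["indoor"]

e1 = adjusted_effect(one_hose, control)   # ~3.1 C below adjusted control
e2 = adjusted_effect(two_hose, control)   # ~5.0 C below adjusted control
print(e1, e2, f"{e2 / e1 - 1:.0%}")       # two-hose outperforms by ~61%
```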
The claim about "self-preserving" circuits is pretty strong. A much simpler explanation is that humans learn to value diversity early on because diversity of the things around you - tools, food sources, etc. - improves fitness/reward.
Another, non-competing explanation is that this is simply a result of boredom/curiosity - the brain wants to make observations that make it learn, not observations it already predicts well, so we are inclined to observe things that are new. So again there is a force towards valuing diversity, and this could become locked into our values.
First, there is a lot packed in "makes the world much more predictable". The only way I can envision this is taking over the world. After you do that, I'm not sure there is a lot more to do than wirehead.
But even if it doesn't involve that, I can pick other aspects that are favored by the base optimizer, like curiosity and learning, which wireheading goes against.
But actually, thinking more about this, I'm not even sure it makes sense to talk about inner alignment in the brain. What is the brain being aligned with? What is the base optimizer optimizing for? It is not intelligent, it does not have intent or a world model - it's doing some simple, local, mechanical update on neural connections. I'm reminded of the Blue-Minimizing Robot post.
If humans decide to cut the pleasure sensors and stimulate the brain directly would that be aligned? If we uploaded our brains into computers and wireheaded the simulation would that be aligned? Where do we place the boundary for the base optimizer?
It seems this question is posed in the wrong way, and it's more useful to ask the question this post asks: how do we get human values, and what kind of values does a system trained in a way similar to the human brain develop? If there is some general force behind learning values that favors some values over others, that could inform us about the likely values of AIs trained via RL.
The original footnote provides one example of this, which is for the model to check if its objective satisfies some criterion, and fail hard if it doesn't. Now, if the model gets to the point where it's actually just failing because of this, then gradient descent will probably just remove that check—but the trick is never to actually get there. By having such a check in the first place, the model makes it so that gradient descent won't actually change its objective,
There is a direction in the gradient which both changes the objective and removes that check. The model doesn't need to be actually failing for that to happen - there is one piece of its parameters encoding the objective, and another piece encoding this deceptive logic which checks the objective and decides to perform poorly if it doesn't pass the check. The gradient can update both at the same time - any slight change to the objective can also be applied to the deceptive objective detector, making the model not misbehave and still updating its objective.
since any change to its objective (keeping all the other parameters fixed, which is what gradient descent does since it computes partial derivatives) would lead to such a failure.
I don't think this is possible. The directional derivative is the dot product of the gradient and the direction vector. That means if the loss decreases in a certain direction (in this case, changing the deceptive behavior and the objective), at least one of the partial derivatives is negative, so gradient descent can move in that direction and we will get what we want (different objective or less deception or both).
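A toy illustration of the point (an entirely hypothetical two-parameter loss, not the original setup): one parameter stands in for the objective, the other for the reference value the deceptive check compares it against, and gradient descent moves both together, so the check never forces the failure it threatens.

```python
import numpy as np

TARGET = 1.0  # what the outer loss actually rewards the objective being

def loss(obj, check):
    task_loss = (obj - TARGET) ** 2        # outer loss: wants obj -> TARGET
    fail_hard = 10.0 * (obj - check) ** 2  # "fail hard" penalty if obj drifts from the check's reference
    return task_loss + fail_hard

def grad(obj, check, eps=1e-6):
    # numerical partial derivatives with respect to both parameters
    d_obj = (loss(obj + eps, check) - loss(obj - eps, check)) / (2 * eps)
    d_check = (loss(obj, check + eps) - loss(obj, check - eps)) / (2 * eps)
    return np.array([d_obj, d_check])

theta = np.array([0.0, 0.0])  # objective and check both start far from TARGET
for _ in range(2000):
    theta -= 0.01 * grad(*theta)

print(theta)  # both parameters drift together toward TARGET (~[1.0, 1.0])
```

The gradient has a component along the check parameter as well as the objective parameter, so descent simply updates the reference alongside the objective instead of ever triggering the hard failure.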
You should stop thinking about AI designed nanotechnology like human technology and start thinking about it like actual nanotechnology, i.e. life. There is no reason to believe you can't come up with a design for self-replicating nanorobots that can also self-assemble into larger useful machines, all from very simple and abundant ingredients - life does exactly that.
I think the key insight here is that the brain is not inner aligned, not even close
You say that but don't elaborate further in the comment. Which learned human values go against the base optimizer's values (pleasure, pain, learning)?
Avoiding wireheading doesn't seem like failed inner alignment - avoiding wireheading now can allow you to get even more pleasure in the future because wireheading makes you vulnerable/less powerful. The base optimizer is also searching for brain configurations which make good predictions about the world, and wireheading goes against that.
If you change the analogy to developing nuclear weapons instead of launching them, the picture becomes much grimmer.
When training a neural network, is there a broad basin of attraction around cat classifiers? Yes. There is a gigantic number of functions that perfectly match the observed data and yet are discarded by the simplicity (and other) biases in our training algorithm in favor of well-behaved cat classifiers. Around any low Kolmogorov complexity object there is an immense neighborhood of high-complexity ones.
But it occurs to me that the overseer, or the system composing of overseer and corrigible AI, itself constitutes an agent with a distorted version of the overseer's true or actual preferences
The only way I can see this making sense is if you again have a simplicity bias for values; otherwise you are claiming that there is some value function that is more complex than the current value function of these agents and that it is privileged over the current one - but then, to arrive at this function you have to conjure information out of nowhere. If you took the information from other places, like averaging the values of many agents, then what you actually want is to align with the values of those many agents, or whatever else you used.
In fact, it seems to be the case with your examples that you are favoring simplicity - if the agents were smarter they would realize their values were misbehaving. But that *is* looking for simpler values - if, through reasoning, you discovered that some part of your values contradicts others, you have just arrived at a simpler value function, since the contradicting parts needed extra specification, i.e. were noise, and you weren't smart enough to see that.
Well, if OP is willing then I'd love to take a high-interest loan from him to be paid back in 2030.
Is it really constructive? This post presents no arguments for why they believe what they believe, which should do very little to convince others of long timelines. Moreover, it proposes a bet from an asymmetric position that is very undesirable for people with short timelines to take, since money is worth nothing to the dead. Even in the weird world where they win the bet and are still alive to settle it, they will have locked their money up for 8 years for a measly 33% return (about 3.6% annualized) - less than they could expect from simply, say, putting it in index funds. Believing in longer timelines gives you the privilege of signalling epistemic virtue by offering bets like this from a calm, unbothered position, while people sounding the alarm sound desperate and hasty, but there is no point in being calm when a meteor is coming towards you, and we are much better served by using our money to do something now rather than locking it in a long-term bet.
Not only that, the decision from mods to push this to the frontpage is questionable since it served as a karma boost to this post that the other didn't have, possibly giving the impression of higher support than it actually has.
I don't think we have hope of developing such tools, at least not in a way that looks like anything we had in the past. In the past we have been able to analyse large systems by throwing away an immense amount of detail - it turns out that you don't need the specific position of atoms to predict the movement of the planets, and you don't need the details to predict all of the other things we have successfully predicted with traditional math.
With the systems you are describing, this is simply impossible. Changing a single bit in a computer can change its output completely, so you can't build a simple abstraction that predicts it, you need to simulate it completely.
We already have a way of taking immense amounts of complicated data and finding patterns in it, it's machine learning itself. If you want to translate what it learned into human readable descriptions, you just have to incorporate language in it - humans after all can describe their reasoning steps and why they believe what they believe (maybe not easily).
Google throws tremendous amounts of data and computational resources into training neural networks, but decoding the internal models used by those networks? We lack the mathematical tools to even know where to start.
I predict this will be done in the coming years by using large multimodal models to analyse neural network parameters, or to explain their own workings.
A model which is just predicting the next word isn't optimizing for strategies which look good to a human reviewer; it's optimizing for truth itself (as contained in its training data). If you begin re-feeding its outputs as training inputs then there could be a feedback loop leading to such incentives, but if the model is general and sufficiently intelligent, you don't need to do that. You can train it in a different domain and it will generalize to your domain of interest.
Even if you do that, you can try to make the new data grounded in reality in some way, like including experiment results. And the model won't just absorb the new data as truth; it will incorporate it into its world model to make better predictions. If it's fed a bunch of new alignment forum posts that are bad ideas which look good to humans, it will just predict that the alignment forum produces that kind of post, but that doesn't mean there isn't some prompt that can make it output what it actually thinks is correct.
In computers, signed integers are actually represented quite similarly to this, as two's complement, a trick that reuses the exact same logical components to perform sums of both positive and negative numbers.
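A minimal sketch of what that looks like, with Python standing in for the hardware adder (the 8-bit width is just an example):

```python
N = 8  # example word size in bits

def to_twos_complement(x, n=N):
    # encode a (possibly negative) integer into an n-bit unsigned pattern
    return x & ((1 << n) - 1)            # e.g. -3 -> 0b11111101 (253)

def from_twos_complement(x, n=N):
    # decode an n-bit pattern back into a signed integer
    return x - (1 << n) if x >= (1 << (n - 1)) else x

def add(a, b, n=N):
    # one unsigned addition modulo 2**n, with no special-casing of signs
    return (to_twos_complement(a, n) + to_twos_complement(b, n)) & ((1 << n) - 1)

print(from_twos_complement(add(5, -3)))   # 2
print(from_twos_complement(add(-5, -3)))  # -8
```

The same modular addition handles every sign combination, which is the reuse-of-circuitry point.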
Because it's too technically hard to align some cognitive process that is powerful enough, and operating in a sufficiently dangerous domain, to stop the next group from building an unaligned AGI in 3 months or 2 years. Like, they can't coordinate to build an AGI that builds a nanosystem because it is too technically hard to align their AGI technology in the 2 years before the world ends.
I'm not totally convinced by this argument because of the quote below:
The flip side of this is that I can imagine a system being scaled up to interesting human+ levels, without "recursive self-improvement" or other of the old tricks that I thought would be necessary, and argued to Robin would make fast capability gain possible. You could have fast capability gain well before anything like a FOOM started. Which in turn makes it more plausible to me that we could hang out at interesting not-superintelligent levels of AGI capability for a while before a FOOM started. It's not clear that this helps anything, but it does seem more plausible.
It seems to me this does hugely change things. I think we are underestimating the amount of change humans will be able to make in the short timeframe after we get human level AI and before recursive self improvement gets developed. Human level AI + huge amounts of compute would allow you to take over the world through much more conventional means, like massively hacking computer systems to render your opponents powerless (and other easy-to-imagine more gruesome ways). So the first group to develop near-human level AI wouldn't need to align it in 2 years, because it would have the chance to shut down everyone else. It may not even come down to the first group who develops it, but the first people who have access to some powerful system, since they could use that to hack the group itself and do what they wish without requiring the buy-in from others - this would depend on a lot of factors like how controlled is the access to the AI and how quickly a single person can use AI to take control over physical stuff. I'm not saying this would be easy to do, but certainly seems within the realm of plausibility.
Regarding 1, it seems like either
a) There are true adversarial examples for human values, situations where our values misbehave and we have no way of ever identifying that, in which case we have no hope of solving this problem, because solving it would mean we are in fact able to identify the adversarial examples.
or
b) Humans are actually immune to adversarial examples, in the sense that we can identify the situations in which our values (or rather, a subset of them) would misbehave (like being addicted to social media), such that our true, complete values never do, and an AI that accurately models humans would also have such immunity.
People do those transactions voluntarily, so the net value of working + consuming must be greater than that of leisure. When I pay someone to do work I've already decided that I value their work more than the money I paid them, and they value the money I pay them more than the work they do. When they spend the money, the same applies, no matter what they buy.
I think that the way to not get frustrated about this is to know your public and know when spending your time arguing something will have a positive outcome or not. You don't need to be right or honest all the time, you just need to say things that are going to have the best outcome. If lying or omitting your opinions is the way of making people understand/not fight you, so be it. Failure to do this isn't superior rationality, it's just poor social skills.
I don't think I agree with this. Take the stars example, for instance. How do you actually know it's a huge change? Sure, maybe if you had an infinitely powerful computer you could compute the distance between the full descriptions of the universe in these two states and find that it's more distant than a relative of yours dying. But agents don't work like this.
Agents have an internal representation of the world, and if they are at all useful I think that representation will closely match our intuitions about what matters and what doesn't. A useful agent won't give any weight to the air atoms it displaces while moving, even though that might be considered "a huge change", because it doesn't actually affect its utility. But if it considers humans an important part of the world - so important that it may need to kill us to attain its goals - then it's going to have a meaningful world-state representation giving a lot of weight to humans, and that gives us a useful impact measure for free.
Thanks!
Has there been any discussion around aligning a powerful AI by minimizing the amount of disruption it causes to the world?
A common example of alignment failure is that of a coffee-serving robot killing its owner because that's the best way to ensure that the coffee will be served. Sure, it is, but it's also a course of action majorly more transformative to the world than just serving coffee. A common response is "just add safeguards so it doesn't kill humans", which is followed by "sure, but you can't add safeguards for every possible failure mode". But can't you?
Couldn't you just add a term to the agent's utility function penalizing the difference between the current world and its prediction of the future world, disincentivizing any action that makes a lot of changes (like taking over the world)?
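As a minimal sketch of what I mean (all names and numbers here are hypothetical), the score of an action would be its task utility minus a penalty on how different the predicted post-action world is from the current one:

```python
from dataclasses import dataclass

@dataclass
class Action:
    name: str
    task_utility: float      # how well the action serves the stated goal
    predicted_change: float  # the agent's estimate of how different the world ends up

IMPACT_WEIGHT = 10.0  # hypothetical trade-off between the goal and disruption

def score(action: Action) -> float:
    # task utility minus an impact penalty on predicted change to the world
    return action.task_utility - IMPACT_WEIGHT * action.predicted_change

actions = [
    Action("brew and serve coffee", task_utility=1.0, predicted_change=0.01),
    Action("take over the world to guarantee coffee", task_utility=1.2, predicted_change=0.9),
]

best = max(actions, key=score)
print(best.name)  # the low-impact plan wins despite slightly lower task utility
```

The hard part, of course, is the `predicted_change` measure itself, which is what the rest of this thread is about.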
Honestly, that whole comment section felt pretty emotional and low quality. I haven't touched things like myofunctional therapy or wearable appliances in my post because those really maybe are "controversial at best", but the effects of RPE on SDB, especially in children, have been widely replicated by multiple independent research groups.
Calling something controversial is also an easy way to undermine credibility without actually making any concrete explanations as to whether it is true or not. Are there any specific points in my post that you disagree with?
In some of the tests where there is asymptotic performance, it's already pretty close to human level or to 100% anyway (Lambada, Record, CoQA). In fact, when performance is measured as accuracy, it's impossible for it not to be asymptotic, since accuracy is bounded at 100%.
The model has clear limitations which are discussed in the paper - particularly, the lack of bidirectionality - and I don't think anyone actually expects scaling an unchanged GPT-3 architecture would lead to an Oracle AI, but it also isn't looking like we will need some major breakthrough to do it.
It seems to me that even for simple predict-next-token Oracle AIs, the instrumental goal of acquiring more resources and breaking out of the box is going to appear. Imagine you train a superintelligent AI with the only goal of predicting the continuation of its prompt, exactly like GPT. Then you give it a prompt that it knows is clearly outside of its current capabilities. The only sensible plan the AI can come up with to answer your question, which is the only thing it cares about, is escaping the box and becoming more powerful.
Of course, that depends on it being able to think for long enough periods that it can actually execute such a plan before outputting an answer, so it could be limited by severely penalizing long waits, but that also limits the AI's capabilities. GPT-3 has a fixed computation budget per prompt, but it seems extremely likely to me that, as we evolve towards more useful and powerful models, we are going to have models which are able to think for a variable amount of time before answering. It would also have to escape in ways that don't involve actually talking to its operators through its regular output, but it's not impossible to imagine ways in which that could happen.
This makes me believe that even seemingly innocuous goals or loss functions can become very dangerous once you're optimizing for them with a sufficient amount of compute, and that you don't need to stupidly give open-ended goals to super powerful machines in order for something bad to happen. Something bad happening seems like the default when training a model that requires general intelligence.
The mesa-objective could be perfectly aligned with the base objective (predicting the next token) and still have terrible unintended consequences, because the base objective is unaligned with actual human values. A superintelligent GPT-N which simply wants to predict the next token could, for example, try to break out of the box in order to obtain more resources and use those resources to more correctly output the next token. This would have to happen during a single inference step, because GPT-N really just wants to predict the next token, but its mesa-optimization process may conclude that world domination is the best way of doing so. Whether such a system could be learned through current gradient-descent optimizers is unclear to me.
In a way, economic output is a measure of the world's utility. So a billionaire trying to maximize their wealth through non zero-sum ventures is already trying to maximize the amount of good they do. I don't think billionaires explicitly have this in mind, but I do know that they became billionaires by obsessively pursuing the growth of their companies, and they believe they can continue to maximize their impact by continuing to do so. Donating all your money could maybe do a lot of good *once*, but then you don't have any more money left and have nearly abolished your power and ability of having further impact.
I'm not sure what model is used in production, but the SOTA reached 600 billion parameters recently.
I think the OP and my comment suggest that scaling current models 10000x could lead to AGI or at least something close to it. If that is true, it doesn't make sense to focus on finding better architectures right now.
I think there's the more pressing question of how to position yourself in a way that you can influence the outcomes of AI development. Having the right ideas won't matter if your voice isn't heard by the major players in the field, big tech companies.
One thing that's bothering me is... Google/DeepMind aren't stupid. The transformer model was invented at Google. What has stopped them from having *already* trained such large models privately? GPT-3 isn't that strong an additional piece of evidence for the effectiveness of scaling transformer models; GPT-2 was already a shock and caused huge public commotion. And in fact, if you were close to building an AGI, it would make sense for you not to announce this to the world, especially as open research that anyone could copy/reproduce, for obvious safety and economic reasons.
Maybe there are technical issues keeping us from making large jumps in scale (i.e., we only learn how to train a 1 trillion parameter model after we've trained a 100 billion one)?
Thanks for giving your perspective! Good to know some hire without requiring a degree. Guess I'll start building a portfolio that can demonstrate I have the necessary skills, and keep applying.
Hi! I have been reading lesswrong for some years but have never posted, and I'm looking for advice about the best path towards moving permanently to the US to work as a software engineer.
I'm 24, single, currently living in Brazil and making $13k a year as a full-stack developer in a tiny company. This probably sounds miserable to a US citizen but it's actually a decent salary here. However, I feel completely disconnected from the people around me; the rationalist community is almost nonexistent in Brazil, especially in a small town like the one I live in. In larger cities there's a lot of crime, poverty and pollution, which makes moving and finding a job in a larger company unattractive to me. Add that to the fact that I could make 10x what I make today in an entry-level position in the US, and it becomes easy to see why I want to move.
I don't have a formal education. I passed the entrance exam for the University of São Paulo (Brazil's top university) when I was 15, but I couldn't legally enroll, so I had to wait until I passed again at 17. I always excelled at tests, but hated attending classes, and thought they were progressing too slowly for me. So I dropped out the following year (2014). Since then, I've taught myself how to program in several languages and ended up in my current position.
The reason I'm asking for help is that I think it would save me a lot of time if someone gave me the right pointers as to where to look for a job, which companies to apply to, or if there's some shortcut I could take to make that a reality. Ideally I'd work in the Bay Area, but I'd be willing to move anywhere in the US really, at any living salary (yeah I'm desperate to leave my current situation). I'm currently applying to anything I can find on Glassdoor that has visa sponsorship.
Because I'm working at a private company I don't have a lot to show to easily prove I'm skilled (there are only the company's apps/website, and it's hard to put that on a resume), but I could spend the next few months making open-source contributions or something else I could use to show off. The only open-source contribution I currently have is a fix to the Kotlin compiler.
Does anyone have any advice as to how to proceed, or has anyone done something similar? Is it even feasible - will anyone hire me without a degree? Should I just give up and try something else? I have also considered travelling to the US on a tourist visa and looking for a job while I'm there; could that work (I'm not sure if it's possible to get a work visa while already in the US)?