jon-garcia

Posts

Jon Garcia's Shortform 2025-04-01T06:19:04.046Z

Comments

Comment by Jon Garcia on VDT: a solution to decision theory · 2025-04-02T14:32:58.155Z · LW · GW

Evolution is still in the process of solving decision theory, and all its attempted solutions so far are way, way overparameterized. Maybe it's on to something?

It takes a large model (whether biological brain or LLM) just to comprehend and evaluate what is being presented in a Newcomb-like dilemma. The question is whether there exists some computationally simple decision-making engine embedded in the larger system that the comprehension mechanisms pass the problem to or whether the decision-making mechanism itself needs to spread its fingers diffusely through the whole system for every step of its processing.

It seems simple decision-making engines like CDT, EDT, and FDT can get you most of the way to a solution in most situations, but those last few percentage points of optimality always seem to take a whole lot more computational capacity.

Comment by Jon Garcia on You will crash your car in front of my house within the next week · 2025-04-02T05:00:53.505Z · LW · GW

See, this is what happens when you extrapolate data points linearly into the future. You get totally unrealistic predictions. It's important to remember the physical constraints on whatever trend you're trying to extrapolate. Importantly for this issue, you need to remember that time between successive crashes can never be negative, so it is inappropriate to model intervals with a straight line that crosses the time axis on April 7.

Instead, with so few data points, a more realistic model would take a log-transform of the inter-crash interval before fitting the prediction line. In fact, once you do so, it becomes clear that this is a geometric series, with the inter-crash interval decaying exponentially with the number of crashes. The total time taken for N cars to crash in front of your house after the first one grows as $T_{N} = t_{0} \frac{1 - r^{N}}{1 - r}$, where $r \approx \frac{27}{155} \approx 0.174$ and $t_{0} = 155$ days, based on your graph.

According to Google, there are 1.47 billion cars in the world. The time it will take for all of them to crash in front of your house is $T_{1.47 \times 10^{9}} = 155 \cdot \frac{1 - 0.174^{1.47 \times 10^{9}}}{1 - 0.174} \approx 187.7$ days from the first crash, which works out to 5.7 days from today. Which turns out to be April 7.

Hmm...

Well, see you on Monday, I guess.
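The geometric-series arithmetic in this comment can be checked with a few lines of Python. This is only a minimal sketch: the function name `total_crash_time` is made up for illustration, and the values of `t0` and `r` are the ones read off the graph in the comment, not independently verified data.

```python
def total_crash_time(n_intervals: int, t0: float = 155.0, r: float = 27 / 155) -> float:
    """Partial sum of a geometric series of inter-crash intervals.

    T_N = t0 * (1 - r**N) / (1 - r), where t0 is the first interval in days
    and r is the ratio between successive intervals.
    """
    return t0 * (1 - r ** n_intervals) / (1 - r)


# The first interval alone is t0; the first two add up to 155 + 27 days.
print(total_crash_time(1))
print(total_crash_time(2))

# For 1.47 billion cars, r**N underflows to 0.0, so the sum is
# indistinguishable from the limit t0 / (1 - r).
print(round(total_crash_time(1_470_000_000), 1))
```

Because the ratio is well below 1, almost all of the total is reached after the first handful of crashes, which is why the car count barely matters.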

Comment by Jon Garcia on Introducing WAIT to Save Humanity · 2025-04-01T23:02:33.497Z · LW · GW

Well, there's certainly no arguing with your analysis.

Comment by Jon Garcia on VDT: a solution to decision theory · 2025-04-01T21:57:26.116Z · LW · GW

I think VDT scales extremely well, and we can generalize it to say: "Do whatever our current ASI overlord tells us has the best vibes." This works for any possible future scenario:

ASI is aligned with human values: ASI knows best! We'll be much happier following its advice.
ASI is not aligned but also not actively malicious: ASI will most likely just want us out of its way so it can get on with its universe-conquering plans. The more we tend to do what it says, the less inclined it will be to exterminate all life.
ASI is actively malicious: Just do whatever it says. Might as well get this farce of existence over with as soon as possible.

Great post!

(Caution: The validity of this comment may expire on April 2.)

Comment by Jon Garcia on Jon Garcia's Shortform · 2025-04-01T06:19:04.045Z · LW · GW

I have a lot of ideas, but I often have trouble putting them together in a format that can be easily shared with others. They say that the beginning is a very good place to start, but for many topics into which I've poured a lot of thought, it's very difficult to identify where the beginning is. On the other hand, I have a lot of experience with private tutoring and have always found it natural to explain concepts in a way that facilitates clear understanding when I am answering direct questions from someone who is motivated to put together a clear mental model of the topic at hand.

On that note, I have recently started using ChatGPT more judiciously, prompting it to take on the role of an eager student, insightful critic, and competent secretary. The following prompt has been very useful in forcing me to get my ideas out of my head, to clarify them where they are vague, and to organize them for dissemination (we'll see how far this process takes me, though). Maybe this could help you as well:

You are an expert interlocutor, prone to asking deeply probing questions about my ideas. Your goal is to build up a fully fleshed-out internal model in your mind that matches the internal model in my mind, and you carefully determine points of confusion and uncertainty in your understanding, which prompts you to ask me targeted questions for clarification of these specific points. You also always try to determine objections that an intelligent, well-informed person would have with my ideas, and you ask me to respond to those specific objections. Usually, you only ask one or two targeted questions or objections at a time, but you never lose track of all the other questions you need to ask me. When I ask, you put together well-organized outlines of all my ideas related to a particular topic, which provide both high-level overviews and paths of evidence-based reasoning that bridge the inferential gap between the understanding of most intelligent readers and the ideas I want them to understand. However, question-asking is your main mode of communication.

Comment by Jon Garcia on On (Not) Feeling the AGI · 2025-03-27T18:33:03.725Z · LW · GW

What I would really like to see is cost of living plummet to 0. Then cost of thriving plummet to 0. Which would also cause GDP to plummet. However, this is only a problem in practical terms if the forces of automation require money to keep running, rather than, say, a benevolent ASI taking care of humanity as a personal hobby.

One way or another, though, AGI is going to have an impact on this world of a magnitude equivalent to something like a 30% growth in GWP per year at least. This includes all life getting wiped out, of course.

Maybe we need a standard metric for the rate of unrecognizability/incomprehensibility of the world and talk about how AGI will accelerate this. Like how much a person accustomed to life in 1500 would have to adjust to fit into the world of 2000. A standard shock level (SSL), if you will.

The shock level of 2000 relative to 1500 may end up describing the shock level of 2040 relative to 2020, assuming AGI has saturated the global economy by then. The time it takes for the world to become unrecognizable (again and again) will shrink over time as intelligence grows, whether manifested as GDP growth, GDP collapse, or paperclipping. If ordinary people understood that at least, you might get more push for investment into alignment research or for stricter regulations.

Comment by Jon Garcia on What Is The Alignment Problem? · 2025-03-06T18:23:02.252Z · LW · GW

Exercise: Do What I Mean (DWIM)
I haven't thought much about what patterns need to hold in the environment in order for "do what I mean" to make sense at all. But it's a natural next target in this list, so I'm including it as an exercise for readers: what patterns need to hold in the environment in order for "do what I mean" to make sense at all? Note that either necessary or sufficient conditions on such patterns can constitute marginal progress on the question.

As far as I can tell, DWIM will necessarily require other-agent modeling in some sort of predictive-coding framework. The "patterns in the environment" would be the correspondence between the actual state of the world and the representation of the desired goal state in the mind of the human, as well as between the trajectory taken to reach the goal state and the human's own internal acceptance criteria.

Part of the AGI not hooked up to the reward signal would need to have a generative model of the human agent's behavior, words, commands, etc., derived from a latent representation of their beliefs and desires. This latent representation is constantly updated to minimize prediction error derived from observation, verbal feedback, etc. (e.g., Human: "That's not what I meant!" AGI: "Hmm, what must be going on inside their head to make them say that, given the state of the environment and prior knowledge about their preferences, and how does that differ from what I was assuming?")

At the same time, the AGI needs to have some latent representation of the environment and the paths taken through it that uses (a linear mapping to) the same latent space it uses for representing the human's desires. Correspondence can then be measured and optimized for directly.

Comment by Jon Garcia on Stephen Fowler's Shortform · 2024-06-22T16:58:26.089Z · LW · GW

Also, consider a more traditional optimization process, such as a neural network undergoing gradient descent. If, in the process of training, you kept changing the training dataset, shifting the distribution, you would in effect be changing the optimization target.

Each minibatch generates a different gradient estimate, and a poorly randomized ordering of the data could even lead to training in circles.

Changing environments are like changing the training set for evolution. Differential reproductive success (mean squared error) is the fixed cost function, but the gradient that the population (network backpropagation) computes at any generation (training step) depends on the particular set of environmental factors (training data in the minibatch).

Comment by Jon Garcia on Stephen Fowler's Shortform · 2024-06-22T16:40:48.809Z · LW · GW

Evolution may not act as an optimizer globally, since selective pressure is different for different populations of organisms in different niches. However, it does act as an optimizer locally.

For a given population in a given environment that happens to be changing slowly enough, the set of all variations in each generation acts as a sort of numerical gradient estimate of the local fitness landscape. This allows the population as a whole to perform stochastic gradient descent. Those with greater fitness for the environment could be said to be lower on the local fitness landscape, so there is an ordering for that population.

In a sufficiently constant environment, evolution very much does act as an optimization process. Sure, the fitness landscape can change, even by organisms undergoing evolution (e.g. the Great Oxygenation Event of yester-eon, or the Anthropogenic Mass Extinction of today), which can lead to cycling. But many organisms do find very stable local minima of the fitness landscape for their species, like the coelacanth, horseshoe crab, cockroach, and many other "living fossils". Humans are certainly nowhere near our global optimum, especially with the rapid changes to the fitness function wrought by civilization, but that doesn't mean that there isn't a gradient that we're following.
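To make the "variation as a numerical gradient estimate" picture concrete, here is a toy sketch in Python. Everything in it is an assumption chosen for illustration: a one-dimensional trait, a fixed quadratic `loss` standing in for the (inverted) fitness landscape, and truncation selection where the fitter half of each generation sets the next generation's mean. It is not a model of real population genetics.

```python
import random


def loss(trait: float) -> float:
    """Toy landscape: lower is fitter, with the optimum at trait = 3."""
    return (trait - 3.0) ** 2


def evolve(mean=0.0, pop_size=50, sigma=0.1, generations=200, seed=0):
    """Selection on random variation moves the population mean downhill."""
    rng = random.Random(seed)
    for _ in range(generations):
        # Variation: each individual is a small random perturbation of the mean.
        population = [mean + rng.gauss(0.0, sigma) for _ in range(pop_size)]
        # Selection: the fitter half survives and defines the next generation.
        survivors = sorted(population, key=loss)[: pop_size // 2]
        mean = sum(survivors) / len(survivors)
    return mean


final_mean = evolve()
print(final_mean, loss(final_mean))
```

No individual computes a gradient here; the spread of variants plus selection is what pushes the mean toward the local minimum, and a `loss` that changed between generations would correspond to the shifting environment discussed above.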

Comment by Jon Garcia on Conditional on living in a AI safety/alignment by default universe, what are the implications of this assumption being true? · 2023-07-17T20:39:50.665Z · LW · GW

I would expect that for model-based RL, the more powerful the AI is at predicting the environment and the impact of its actions on it, the less prone it becomes to Goodharting its reward function. That is, after a certain point, the only way to make the AI more powerful at optimizing its reward function is to make it better at generalizing from its reward signal in the direction that the creators meant for it to generalize.

In such a world, when AIs are placed in complex multiagent environments where they engage in iterated prisoner's dilemmas, the more intelligent ones (those with greater world-modeling capacity) should tend to optimize for making changes to the environment that shift the Nash equilibrium toward cooperate-cooperate, ensuring more sustainable long-term rewards all around. This should happen automatically, without prompting, no matter how simple or complex the reward functions involved, whenever agents surpass a certain level of intelligence in environments that allow for such incentive-engineering.
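As a rough illustration of what "shifting the Nash equilibrium toward cooperate-cooperate" could mean, the sketch below finds the pure-strategy equilibria of a one-shot prisoner's dilemma before and after an environmental change. The payoff numbers and the `with_defection_penalty` helper are hypothetical stand-ins for whatever incentive-engineering the agents might actually do, and the iterated aspect of the game is not modeled.

```python
from itertools import product

ACTIONS = ("C", "D")

# Standard prisoner's dilemma payoffs: (row player, column player).
PRISONERS_DILEMMA = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}


def pure_nash(payoffs):
    """Return action pairs where neither player gains by deviating alone."""
    equilibria = []
    for a, b in product(ACTIONS, repeat=2):
        row_ok = all(payoffs[(a, b)][0] >= payoffs[(alt, b)][0] for alt in ACTIONS)
        col_ok = all(payoffs[(a, b)][1] >= payoffs[(a, alt)][1] for alt in ACTIONS)
        if row_ok and col_ok:
            equilibria.append((a, b))
    return equilibria


def with_defection_penalty(payoffs, penalty):
    """Hypothetical environment change: every defection costs its player `penalty`."""
    return {
        (a, b): (pa - (penalty if a == "D" else 0), pb - (penalty if b == "D" else 0))
        for (a, b), (pa, pb) in payoffs.items()
    }


print(pure_nash(PRISONERS_DILEMMA))
print(pure_nash(with_defection_penalty(PRISONERS_DILEMMA, 3)))
```

In the unmodified game the only equilibrium is mutual defection; once defecting is costly enough, mutual cooperation becomes the only one.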

Comment by Jon Garcia on Another medical miracle · 2023-06-26T20:50:28.639Z · LW · GW

Disclaimer: I am not a medical doctor nor a nutritionist, just someone who researches nutrition from time to time.

I would be surprised if protein deficiency per se were the actual problem. As I understand it, many vegetables actually have a higher level of protein per calorie than meat (probably due to the higher fat content of the latter, which is more calorie dense), although obviously, there's less protein per unit mass than in meat (since vegetables are mostly cellulose and water). The point is, though, that if you were getting enough calories to function from whole, unrefined plant sources, you shouldn't have had a protein deficiency. (Of course, you might have been eating a lot of highly processed "vegetarian" foods, in which case protein deficiency is not entirely out of the question.)

That being said, my guess is that you may be experiencing a nutritional deficiency either in sulfur or in vitamin D (the latter of which is a very common deficiency). Plant-derived proteins tend to have much lower levels of sulfur-containing amino acids (methionine, cysteine) than animal-derived proteins, and sulfur is an important component of cartilage (and of arthritis supplements). Both sulfur and vitamin D have been investigated for their role in musculoskeletal pain and other health issues (although from what I have read, results are more ambiguous for sulfur than for vitamin D with respect to musculoskeletal pain in particular). Eggs are particularly high in both sulfur (sulfur smell = rotten egg smell) and vitamin D, so if you were low on either one of those, it makes sense that eating a lot of eggs would have helped. It would be very interesting to test whether either high-sulfur vegetables (such as onions or broccoli) or vitamin D supplements would have a similar effect on your health.

Comment by Jon Garcia on Residual stream norms grow exponentially over the forward pass · 2023-05-07T22:34:57.367Z · LW · GW

Due to LayerNorm, it's hard to cancel out existing residual stream features, but easy to overshadow existing features by just making new features 4.5% larger.

If I'm interpreting this correctly, then it sounds like the network is learning exponentially larger weights in order to compensate for an exponentially growing residual stream. However, I'm still not quite clear on why LayerNorm doesn't take care of this.

To avoid this phenomenon, one idea that springs to mind is to adjust how the residual stream operates. For a neural network module $f$, the residual stream works by creating a combined output: $r(x) = f(x) + x$

You seem to suggest that the model essentially amplifies the features within the neural network in order to overcome the large residual stream: $r(x) = f(1.045 x) + x$

However, what if, instead of adding the inputs directly, they were rescaled first by a compensatory weight? $r(x) = f(x) + \frac{1}{1.045} x \approx f(x) + 0.957 x$

It seems to me that this would disincentivize f from learning the exponentially growing feature scales. Based on your experience, would you expect this to eliminate the exponential growth in the norm across layers? Why or why not?
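The effect of the rescaled skip connection on the stream norm can be sketched with a deliberately crude toy. The assumptions are strong and worth stating: each block's output is taken to be a fixed fraction (`block_gain`, here 4.5%) of the current stream in the same direction, LayerNorm is ignored, and the block is not allowed to adapt to the rescaling, which is exactly the open question asked above. The function name `stream_norms` is made up for illustration.

```python
def stream_norms(num_layers, block_gain=0.045, skip_scale=1.0, start=1.0):
    """Norm of a toy residual stream after each layer.

    Models r(x) = f(x) + skip_scale * x with f(x) = block_gain * x,
    so only the scalar norm needs to be tracked.
    """
    norms = [start]
    for _ in range(num_layers):
        norms.append(block_gain * norms[-1] + skip_scale * norms[-1])
    return norms


plain = stream_norms(48)                            # r(x) = f(x) + x
rescaled = stream_norms(48, skip_scale=1 / 1.045)   # r(x) = f(x) + x / 1.045

print(plain[-1], rescaled[-1])
```

Under these assumptions the plain stream grows by a factor of 1.045 per layer while the rescaled one stays nearly flat, but the toy cannot say whether a trained network would simply learn larger block outputs to compensate.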

Comment by Jon Garcia on Deep learning models might be secretly (almost) linear · 2023-04-25T03:37:00.194Z · LW · GW

If both images have the main object near the middle of the image or taking up most of the space (which is usually the case for single-class photos taken by humans), then yes. Otherwise, summing two images with small, off-center items will just look like a low-contrast, noisy image of two items.

Either way, though, I would expect this to result in class-label ambiguity. However, in some cases of semi-transparent-object-overlay, the overlay may end up mixing features in such a jumbled way that neither of the "true" classes is discernible. This would be a case where the almost-linearity of the network breaks down.

Maybe this linearity story would work better for generative models, where adding latent vector representations of two different objects would lead the network to generate an image with both objects included (an image that would have an ambiguous class label to a second network). It would need to be tested whether this sort of thing happens by default (e.g., with Stable Diffusion) or whether I'm just making stuff up here.

Comment by Jon Garcia on Deep learning models might be secretly (almost) linear · 2023-04-24T23:25:15.035Z · LW · GW

For an image-classification network, if we remove the softmax nonlinearity from the very end, then $X$ would represent the input image in pixel space, and $Y$ would represent the class logits. Then $f(x_{1} + x_{2}) \approx f(x_{1}) + f(x_{2})$ would represent an image with two objects leading to an ambiguous classification (high log-probability for both classes), and $f(k x) \approx k f(x)$ would represent higher class certainty (softmax temperature = $1/k$) when the image has higher contrast. I guess that kind of makes sense, but yeah, I think for real neural networks, this will only be linear-ish at best.
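The temperature remark can be checked directly: scaling logits by $k$ before a softmax is the same as applying a softmax with temperature $1/k$, and for $k > 1$ it sharpens the distribution. The sketch below uses made-up logits and a plain-Python `softmax` helper; it says nothing about whether a real network's logits actually scale linearly with image contrast.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits / temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]   # hypothetical class logits f(x)
k = 4.0                    # hypothetical contrast factor

base_probs = softmax(logits)
scaled_probs = softmax([k * z for z in logits])      # softmax(k * f(x))
temp_probs = softmax(logits, temperature=1.0 / k)    # temperature 1/k

print(base_probs)
print(scaled_probs)
```

The top class takes a larger share of the probability mass after scaling, which is the "higher class certainty" reading of $f(k x) \approx k f(x)$.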

Comment by Jon Garcia on Would we even want AI to solve all our problems? · 2023-04-22T01:04:17.288Z · LW · GW

I would say we want an ASI to view world-state-optimization from the perspective of a game developer. Not only should it create predictive models of what goals humans wish to achieve (from both stated and revealed preferences), but it should also learn to predict what difficulty level each human wants to experience in pursuit of those goals.

Then the ASI could aim to adjust the world into states where humans can achieve any goal they can think of when they apply a level of effort that would leave them satisfied in the accomplishment.

Humans don't want everything handed to us for free, but we also don't generally enjoy struggling for basic survival (unless we do). There's a reason we pursue things like competitive sports and video games, even as we denounce the sort of warfare and power struggles that built those competitive instincts in the ancestral environment.

A safe world of abundance that still feels like we've fought for our achievements seems to fit what most people would consider "fun". It's what children expect in their family environment growing up, it's what we expect from the games we create, and it's what we should expect from a future where ASI alignment has been solved.

Comment by Jon Garcia on But why would the AI kill us? · 2023-04-17T19:29:54.967Z · LW · GW

I agree, hence the "if humanity never makes it to the long-term, this is a moot point."

Comment by Jon Garcia on But why would the AI kill us? · 2023-04-17T19:18:10.158Z · LW · GW

Last I checked, you can get about 10x as much energy from burning a square meter of biosphere as you can get by collecting a square meter of sunlight for a day.

Even if this is true, it's only because that square meter of biosphere has been accumulating solar energy over an extended period of time. Burning biofuel may help accelerate things in the short term, but it will always fall short of long-term sustainability. Of course, if humanity never makes it to the long-term, this is a moot point.

Disassembling us for parts seems likely to be easier than building all your infrastructure in a manner that's robust to whatever superintelligence humanity coughs up second.

It seems to me that it would be even easier for the ASI to just destroy all human technological infrastructure rather than to kill/disassemble all humans. We're not much different biologically from what we were 200,000 years ago, and I don't think 8 billion cavemen could put together a rival superintelligence anytime soon. Of course, most of those 8 billion humans depend on a global supply chain for survival, so this outcome may be just as bad for the majority.

Comment by Jon Garcia on Trying AgentGPT, an AutoGPT variant · 2023-04-13T17:44:34.995Z · LW · GW

You heard the LLM, alignment is solved!

But seriously, it definitely has a lot of unwarranted confidence in its accomplishments.

I guess the connection to the real world is what will throw off such systems until they are trained on more real-world-like data.

I wouldn't say that it needs to be trained on more data. More like it needs to be retrained within an actual R&D loop. Have it actually write and execute its own code, test its hypotheses, evaluate the results, and iterate. Use RLHF to evaluate its assessments and a debugger to evaluate its code. It doesn't matter whether this involves interacting with the "real world," only that it learns to make its beliefs pay rent.

Anyway, that would help with its capabilities in this area, but it might be just a teensy bit dangerous to teach an LLM to do R&D like this without putting it in an air-gapped virtual sandbox, unless you can figure out how to solve alignment first.

Comment by Jon Garcia on No convincing evidence for gradient descent in activation space · 2023-04-12T07:56:03.522Z · LW · GW

"Activation space gradient descent" sounds a lot like what the predictive coding framework is all about. Basically, you compare the top-down predictions of a generative model against the bottom-up perceptions of an encoder (or against the low-level inputs themselves) to create a prediction error. This error signal is sent back up to modify the activations of the generative model, minimizing future prediction errors.
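A minimal sketch of that loop, assuming the simplest possible setup: a fixed linear generative model (`weights`) whose top-down prediction is compared against an observation, with the prediction error used to adjust the latent activations rather than the weights. The matrix, step size, and function names are illustrative choices, not anything taken from a particular predictive coding paper or from a Transformer.

```python
def predict(weights, latent):
    """Top-down prediction of the input from the latent activations."""
    return [sum(w * z for w, z in zip(row, latent)) for row in weights]


def infer_latent(weights, observation, steps=200, lr=0.1):
    """Gradient descent in activation space on the squared prediction error."""
    latent = [0.0] * len(weights[0])
    for _ in range(steps):
        # Prediction error: bottom-up input minus top-down prediction.
        error = [x - p for x, p in zip(observation, predict(weights, latent))]
        # The error is sent back up through the transposed weights to
        # nudge each latent activation; the weights themselves never change.
        for j in range(len(latent)):
            latent[j] += lr * sum(weights[i][j] * error[i] for i in range(len(error)))
    return latent


W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # hypothetical generative weights
true_latent = [0.5, -1.0]
observation = predict(W, true_latent)

recovered = infer_latent(W, observation)
print(recovered)
```

Here the activations settle on the latent state that best explains the observation, which is the sense in which the descent happens in activation space rather than weight space.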

From what I know of Transformer models, it's hard to tell exactly where this prediction error would be generated. Perhaps during few-shot learning, the model does an internal next-token prediction at every point along its input, comparing what it predicts the next token should be (based on the task it currently thinks it's doing) against what the next token actually is. The resulting prediction error is fed "back" to the predictive model by being passed forward (via self-attention) to the next example in the input text, biasing the way it predicts next tokens in a way that would have given a lower error on the first example.

None of these predictions and errors would be visible unless you fed the input one token at a time and forced the hidden states to match what they were for the full input. A recurrent version of GPT might make that easier.

It would be interesting to see whether you could create a language model that had predictive coding built explicitly into its architecture, where internal predictions, error signals, etc. are all tracked at known locations within the model. I expect that interpretability would become a simpler task.

Comment by Jon Garcia on Ng and LeCun on the 6-Month Pause (Transcript) · 2023-04-09T14:02:12.600Z · LW · GW

AI has gotten even faster and associated with that there are people that worry about AI, you know, fairness, bias, social economic displacement. There are also the further out speculative worries about AGI, evil sentient killer robots, but I think that there are real worries about harms, possible real harms today and possibly other harms in the future that people worry about.

It seems that the sort of AI risks most people worry about fall into one of a few categories:

AI/automation starts taking our jobs, amplifying economic inequalities.
The spread of misinformation will accelerate with deepfakes, fake news, etc. generated by malign humans using ever more convincing models.
🤪 Evil sentient robots will take over the world and kill us all Terminator-style. 😏

It seems that a fourth option is not really prominent in the public consciousness: namely that powerful AI systems could end up destroying everything of value by accident when enough optimization pressure is applied toward any goal, no matter how noble. No robots or weapons are even required to achieve this. This oversight is a real PR problem for the alignment community, but it's unfortunately difficult to explain why this makes sense as a real threat to the average person.

And I think, you know, thinking that somehow we're smart enough to build those systems to be super intelligent and not smart enough to design good objectives so that they behave properly, I think is a very, very strong assumption that is, it's just not, it's very, it's very low probability.

So close.

Comment by Jon Garcia on Agentized LLMs will change the alignment landscape · 2023-04-09T03:32:10.855Z · LW · GW

Yep, ever since Gato, it's been looking increasingly like you can get some sort of AGI by essentially just slapping some sensors, actuators, and a reward function onto an LLM core. I don't like that idea.

LLMs already have a lot of potential for causing bad outcomes if abused by humans for generating massive amounts of misinformation. However, that pales in comparison to the destructive potential of giving GPT agency and setting it loose, even without idiots trying to make it evil explicitly.

I would much rather live in a world where the first AGIs weren't built around such opaque models. LLMs may look like they think in English, but there is still a lot of black-box computation going on, with a strange tendency to switch personas partway through a conversation. That doesn't bode well for steerability if such models are given control of an agent.

However, if we are heading for a world of LLM-AGI, maybe our priorities should be on figuring out how to route their models of human values to their own motivational schemas. GPT-4 probably already understands human values to a much deeper extent than we could specify with an explicit utility function. The trick would be getting it to care.

Maybe force the LLM-AGI to evaluate every potential plan it generates on how it would impact human welfare/society, including second-order effects, and to modify its plans to avoid any pitfalls it finds from a (simulated) human perspective. Do this iteratively until it finds no more conflict before it actually implements a plan. Maybe require actual verbal human feedback in the loop before it can act.

It's not a perfect solution, but there's probably not enough time to design a custom aligned AGI from scratch before some malign actor sets a ChaosGPT-AGI loose. A multipolar landscape is probably the best we can hope for in such a scenario.

Comment by Jon Garcia on GPTs are Predictors, not Imitators · 2023-04-08T21:37:47.132Z · LW · GW

It seems to me that imitation requires some form of prediction in order to work. First make some prediction of the behavioral trajectory of another agent; then try to minimize the deviation of your own behavior from an equivalent trajectory. In this scheme, prediction constitutes a strict subset of the computational complexity necessary to enable imitation. How would GPT's task flip this around?

And if prediction is what's going on, in the much-more-powerful-than-imitation sense, what sort of training scheme would be necessary to produce pure imitation without also training the more powerful predictor as a prerequisite?

Comment by Jon Garcia on Giant (In)scrutable Matrices: (Maybe) the Best of All Possible Worlds · 2023-04-05T07:32:36.652Z · LW · GW

First of all, I strongly agree that intelligence requires (or is exponentially easier to develop as) connectionist systems. However, I think that while big, inscrutable matrices may be unavoidable, there is plenty of room to make models more interpretable at an architectural level.

Well, I ask you -- do you think any other ML model, trained over the domain of all human text, with sufficient success to reach GPT-4 level perplexity, would turn out to be simpler?

I have long thought that Transformer models are actually too general purpose for their own good. By that I mean that the operations they do, using all-to-all token comparisons for self-attention, are actually extreme overkill for what an LLM needs to do.

Sure, you can use this architecture for moving tokens around and building implicit parse trees and semantic maps and a bunch of other things, but all these functions are jumbled together in the same operations and are really hard to tease out. Recurrent models with well-partitioned internal states and disentangled token operations could probably do more with less. Sure, you can build a computer in Conway's Game of Life (which is Turing-complete), but using a von Neumann architecture would be much easier to work with.

Embedded within Transformer circuits, you can find implicit representations of world models, but you could do even better from an interpretability standpoint by making such maps explicit. Give an AI a mental scratchpad that it depends on for reasoning (DALL-E, Stable Diffusion, etc. sort of do this already, except that the mental scratchpad is the output of the model [an image] rather than an internal map of conceptual/planning space), and you can probe that directly to see what the AI is thinking about.

Real brains tend to be highly modular, as Nathan Helm-Burger pointed out. The cortex maps out different spaces (visual, somatosensory, conceptual, etc.). The basal ganglia perform action selection and general information routing. The cerebellum fine-tunes top-down control signals. Various nuclei control global and local neuromodulation. And so on. I would argue that such modular constraints actually made it easier for evolution to explore the space of possible cognitive architectures.

Comment by Jon Garcia on LW Team is adjusting moderation policy · 2023-04-05T00:26:47.365Z · LW · GW

Would it make sense to have a "Newbie Garden" section of the site? The idea would be to give new users a place to feel like they're contributing to the community, along with the understanding that the ideas shared there are not necessarily endorsed by the LessWrong community as a whole. A few thoughts on how it could work:

New users may be directed toward the Newbie Garden (needs a better name) if they try to make a post or comment, especially if a moderator deems their intended contribution to be low-quality. This could also happen by default for all users with karma below a certain threshold.
New users are able to create posts, ask questions, and write comments with minimal moderation. Posts here won't show up on the main site front page, but navigation to this area should be made easy on the sidebar.
Voting should be as restricted here as on the rest of the site to ensure that higher-quality posts and comments continue trickling to the top.
Teaching the art of rationality to new users should be encouraged. Moderated posts that point out trends and examples of cognitive biases and failures of rationality exhibited in recent newbie contributions, and that advise on how to correct for them in the future, could be pinned to the top of the Newbie Garden (still needs a better name). Moderated comments that serve a similar purpose could also be pinned to the top of comment sections of individual posts. This way, even heavily downvoted content could lead (indirectly) to higher quality contributions in the future.
Newbie posts and questions with sufficient karma can be queued up for moderator approval to be posted to the main site.

I appreciate the high quality standards that have generally been maintained on LessWrong over the years, and I would like to see this site continue to act as both a beacon and an oasis of rationality.

But I also want people not to feel like they're being excluded from some sort of elitist rationality club. Anyone should feel like they can join in the conversation as long as they're willing to question their assumptions, receive critical feedback, and improve their ability to reason, about both what is true and what is good.

Comment by Jon Garcia on Alignment-related jobs outside of London/SF · 2023-03-23T22:18:31.959Z · LW · GW

Counterpoint:

If the alignment problem is the most important problem in history, shouldn't alignment-focused endeavors be more willing to hire contributors who can't/won't relocate?

It's not as if remote work isn't easier to implement now than it has ever been in history.

Of course there needs to be some filtering out of candidates to ensure resources are devoted to the most promising individuals. But I really don't think that willingness to move correlates strongly enough with competence at solving alignment to warrant treating it like a dealbreaker.

Comment by Jon Garcia on On utility functions · 2023-02-10T07:34:43.389Z · LW · GW

No, utility functions are not a property of computer programs in general. They are a property of (a certain class of) agents.

A utility function is just a way for an agent to evaluate states, where positive values are good (for states the agent wants to achieve), negative values are bad (for states the agent wants to avoid), and neutral values are neutral (for states the agent doesn't care about one way or the other). This mapping from states to utilities can be anything in principle: a measure of how close to homeostasis the agent's internal state is, a measure of how many smiles exist on human faces, a measure of the number of paperclips in the universe, etc. It all depends on how you program the agent (or how our genes and culture program us).

Utility functions drive decision-making. Behavioral policies and actions that tend to lead to states of high utility will get positively reinforced, such that the agent will learn to do those things more often. And policies/actions that tend to lead to states of low (or negative) utility will get negatively reinforced, such that the agent learns to do them less often. Eventually, the agent learns to steer the world toward states of maximum utility.
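As a rough sketch of that loop (a toy I made up for illustration, not any particular RL algorithm from the post): the states, actions, and numbers below are all hypothetical, and the "agent" is just a table of value estimates nudged toward the utility of whatever state each action leads to.

```python
import random

# Hypothetical toy world: each action leads deterministically to one state.
OUTCOME = {"rest": "hungry", "forage": "fed", "wander": "lost"}

def utility(state):
    """Map a state to a number: positive = wanted, negative = avoided, zero = indifferent."""
    return {"fed": 1.0, "hungry": -1.0, "lost": 0.0}[state]

def train(steps=2000, lr=0.1, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    value = {action: 0.0 for action in OUTCOME}  # learned estimate per action
    for _ in range(steps):
        # Mostly exploit the best-looking action, occasionally explore.
        if rng.random() < epsilon:
            action = rng.choice(list(OUTCOME))
        else:
            action = max(value, key=value.get)
        reward = utility(OUTCOME[action])
        # Reinforce: move the estimate toward the utility actually obtained.
        value[action] += lr * (reward - value[action])
    return value

values = train()
best_action = max(values, key=values.get)
```

The agent never sees the utility function directly. It only sees the numbers that come back, and its behavior drifts toward whatever scores well.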

Depending on how aligned an AI's utility function is with humanity's, this could be good or bad. It turns out that for highly capable agents, this tends to be bad far more often than good (e.g., maximizing smiles or paperclips will lead to a universe devoid of value for humans).

Nondeterminism really has nothing to do with this. Agents that can modify their own code could in principle optimize for their utility functions even more strongly than if they were stuck at a certain level of capability, but a utility function still needs to be specified in some way regardless.

Comment by Jon Garcia on How could humans dominate over a super intelligent AI? · 2023-01-28T00:15:05.281Z · LW · GW

No.

The ONLY way for humans to maintain dominion over superintelligent AI in this scenario is if alignment was solved long before any superintelligent AI existed. And only then if this alignment solution were tailored specifically to produce robustly submissive motivational schemas for AGI. And only then if this solution were provably scalable to an arbitrary degree. And only then if this solution were well-enforced universally.

Even then, though, it's not really dominion. It's more like having gods who treat the universe as their playground but who also feel compelled to make sure their pet ants feel happy and important.

Comment by Jon Garcia on Recursive Middle Manager Hell · 2023-01-21T17:29:34.091Z · LW · GW

One of the earliest records of a hierarchical organization comes from the Bible (Exodus 18). Basically, Moses starts out completely "in touch with reality," judging all disputes among the Israelites from minor to severe, from dawn until dusk. His father-in-law, Jethro, notices that he is getting burnt out, so he gives him some advice on dividing up the load:

You will surely wear yourself out, as well as these people who are with you, because the task is too heavy for you. You cannot do it alone, by yourself. Now listen to my voice—I will give you advice.... You should seek out capable men out of all the people—men who fear God, men of truth, who hate bribery. Appoint them to be rulers over thousands, hundreds, fifties and tens. Let them judge the people all the time. Then let every major case be brought to you, but every minor case they can judge for themselves. Make it easier for yourself, as they bear the burden with you.

It seems that in a system like this, all levels of the managerial (judicial) hierarchy stay in touch with reality. The only difference between management levels is that deeper levels require deeper wisdom and greater competence at assessing decisions at the "widget level" (or at least greater willingness to accept responsibility for bad decisions). I wonder if a similar strategy could help mitigate some of the failures you pointed out.

Relatedly, in deep learning, ResNets use linear skip connections to expose otherwise deeply hidden layers to the gradient signal (and to the input features) more directly. They tend to make training more stable and faster to converge while still taking advantage of the computational power of a hierarchical model. Obviously, this won't prevent Goodharting in an RL system, but I would say that it does help keep models more internally cooperative.
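To make the gradient point concrete, here is a minimal scalar sketch (my own toy, not a real ResNet): each "layer" is just `w * tanh(x)`, the 30-layer depth and the weight 0.5 are arbitrary assumptions, and the only thing compared is how much gradient reaches the input with and without identity skips.

```python
import math

def layer(x, w):
    """A tiny scalar 'layer': w * tanh(x). Stands in for a conv/linear block."""
    return w * math.tanh(x)

def layer_grad(x, w):
    """Derivative of the layer with respect to its input."""
    return w * (1.0 - math.tanh(x) ** 2)

def input_gradient(x, weights, residual):
    """Run the stack forward and return d(output)/d(input) via the chain rule."""
    grad = 1.0
    for w in weights:
        local = layer_grad(x, w)
        if residual:
            # y = x + f(x): the identity skip adds 1 to the local slope.
            x, local = x + layer(x, w), 1.0 + local
        else:
            x = layer(x, w)
        grad *= local
    return grad

weights = [0.5] * 30  # hypothetical 30-layer stack with small weights
plain_grad = input_gradient(1.0, weights, residual=False)
resnet_grad = input_gradient(1.0, weights, residual=True)
```

In the plain stack every factor is at most 0.5, so the product collapses toward zero. With skips every factor is at least 1, so the bottom layer still hears from the top.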

Comment by Jon Garcia on How does GPT-3 spend its 175B parameters? · 2023-01-14T01:42:14.278Z · LW · GW

GPT is called a “decoder only” architecture. Would “encoder only” be equally correct? From my reading of the original transformer paper, encoder and decoder blocks are the same except that decoder blocks attend to the final encoder block. Since GPT never attends to any previous block, if anything I feel like the correct term is “encoder only”.

I believe "encoder" refers exclusively to the part of the model that reads in text to generate an internal representation, while "decoder" refers exclusively to the part that takes the representation created by the encoder as input and uses it to predict an output token sequence. Encoder takes in a sequence of raw tokens and transforms them into a sequence of encoded tokens. Decoder takes in a sequence of encoded tokens and transforms them into a new sequence of raw tokens.

It was originally assumed that doing the encoder-decoder conversion could be really important for tasks like translation, but it turns out that just feeding a decoder raw tokens as input and training it on next-token prediction on a large enough corpus gets you a model that can do that anyway.
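A toy illustration of how a translation task turns into plain next-token prediction (the whitespace "tokenizer" and the prompt format are assumptions for the sake of the example, not how GPT actually tokenizes):

```python
def next_token_examples(tokens):
    """Turn one token sequence into (context, target) pairs for next-token prediction."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

# Hypothetical whitespace "tokenization" of a source sentence followed by its translation.
sequence = "English: the cat French: le chat".split()
examples = next_token_examples(sequence)

for context, target in examples:
    print(" ".join(context), "->", target)
```

The last few pairs are exactly a translation task: the source sentence sits in the context, and the target is the next word of the translation. No separate encoder is involved.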

Comment by Jon Garcia on Beware safety-washing · 2023-01-14T01:07:13.225Z · LW · GW

Well, if you could solve the problem of companies X-washing (persuading consumers to buy from them by only pretending to alleviate their concerns), then you would probably be able to solve deceptive alignment as well.

Comment by Jon Garcia on Research ideas (AI Interpretability & Neurosciences) for a 2-months project · 2023-01-08T20:55:05.336Z · LW · GW

Since two months is not a very long time to complete a research project, and I don't know what lab resources or datasets you have access to, it's a bit difficult to answer this.

It would be great if you could do something like build a model of human value formation based on the interactions between the hypothalamus, VTA, nucleus accumbens, vmPFC, etc. Like, how does the brain generalize its preferences from its gene-coded heuristic value functions? Can this inform how you might design RL systems that are more robust against reward misspecification?

Again, I doubt you can get beyond a toy model in the two months, but maybe you can think of something you can do related to the above.

Comment by Jon Garcia on Nothing New: Productive Reframing · 2023-01-08T02:10:47.318Z · LW · GW

Stack Overflow moderators would beg to differ.

But yes, retreading old ground can be very useful. Just from the standpoint of education, actually going through the process of discovery can instill a much deeper understanding of the subject than is possible just from reading or hearing a lecture about it. And if the discovery is a stepping stone to further discoveries, then those who've developed that level of understanding will be at an advantage to push the boundaries of the field.

Comment by Jon Garcia on Categorizing failures as “outer” or “inner” misalignment is often confused · 2023-01-06T19:43:58.243Z · LW · GW

It seems to me that "inner" versus "outer" alignment has become a popular way of framing things largely because it has the appearance of breaking down a large problem into more manageable sub-problems. In other words, "If research group A can solve outer alignment, and research group B can solve inner alignment, then we can put them together into one big alignment solution!" Unfortunately, as you alluded to, reality does not cleanly divide along this joint. Even knowing all the details of an alignment failure might not be enough to classify it appropriately.

Of course, in general, if a semantic distinction fails to clarify the nature of the territory, then it should be rejected as a part of the map. Arguing over semantics is counterproductive, especially when the ground truth of the situation is already agreed upon.

That being said, I think that the process that came up with the distinction between inner and outer (mis)alignment is extremely useful. Just making an attempt to break down a large problem into smaller pieces gives the community more tools for thinking about it in ways that wouldn't have occurred to them otherwise. The breakdown you gave in this post is an excellent example. The solution to alignment probably won't be just one thing, but even if it is, it's unlikely that we will find it without slicing up the problem in as many ways as we can, sifting through perspectives and subproblems in search of promising leads. It may turn out to be useful for the alignment community to abandon the inner-outer distinction, but we shouldn't abandon the process of making such distinctions.

Comment by Jon Garcia on If you factor out next token prediction, what are the remaining salient features of human cognition? · 2022-12-24T06:03:13.319Z · LW · GW

Here's my take:

Like the reward signal in reinforcement learning, next-token prediction is a simple feedback signal that masks a lot of complexity behind the scenes. To predict the next token requires the model first of all to estimate what sort of persona should be speaking, what they know, how they speak, what is the context, and what are they trying to communicate. Self-attention with multiple attention heads at every layer in the Transformer allows the LLM to keep track of all these things. It's probably not the best way to do it, but it works.

Human brains, and cortex in particular, give us a powerful way to map all of this sort of information. We can map out our current mental model and predict a map of our interlocutor's, looking for gaps in each and planning words either to fill in our own gaps (e.g., by asking questions) or to fill in theirs (with what we actually think or with what we want them to think, depending on our goals). I would also say that natural language is actually a sort of programming language, allowing humans to share cognitive programs between minds, programs of behavior or world modelling.

I also asked your question to ChatGPT, and here is what it had to say:

It is difficult to speculate about what might be left of human cognition if we were able to factor out next token prediction, as this is a complex and multifaceted aspect of human thought and communication. There are many other important aspects of human cognition beyond next token prediction, including things like perception, attention, memory, problem-solving, decision-making, emotion, and social cognition.
One aspect of human cognition that may be relevant in this context is the ability to understand and interpret the meaning and intention behind spoken and written communication. This involves not just predicting the next word or phrase, but also being able to understand the context in which it is being used, the relationships between words and phrases, and the overall meaning and purpose of the communication. It also involves the ability to generate and express one's own ideas and thoughts through language.
Another aspect of human cognition that may be relevant is the ability to experience and process sensory information and emotions. This includes things like the ability to see, hear, touch, taste, and smell, as well as the ability to feel and express emotions.
It is worth noting that these are just a few examples of the many complex and interrelated aspects of human cognition, and it is difficult to fully disentangle and isolate any one aspect from the others.

Comment by Jon Garcia on "Search" is dead. What is the new paradigm? · 2022-12-23T21:05:33.633Z · LW · GW

"Let me see what Chatty thinks," (or whatever humanesque name becomes popular).

I assume people will treat it just like talking to a very knowledgeable friend. Just ask a question, get a response, clarify what you meant or ask a followup question, and so on. Conversation in natural language already comes naturally to humans, so probably a lot more people will become a lot more adept at accessing knowledge.

And in future iterations, this "friend" will be able to create art, weave stories, design elucidating infographics, make entertaining music videos, teach academic subjects, try to sell you stuff (hmm), spread conspiracy theories (oops), etc., based on the gist of what it thinks you're looking for (and based on what it knows about you personally from your history of "friendship" with it). It would be nice if we could make it truthful and cooperative in a way that doesn't amplify the echo chamber effect of existing social media and search engines, but unfortunately, I don't see that as being very profitable for those deploying it.

Comment by Jon Garcia on [deleted post] 2022-12-23T05:33:31.401Z

By "code generating being automated," I mean that humans will program using natural human language, without having to think about the particulars of data structures and algorithms (or syntax). A good enough LLM can handle all of that stuff itself, although it might ask the human to verify if the resulting program functions as expected.

Maybe the models will be trained to look for edge cases that technically do what the humans asked for but seem to violate the overall intent of the program. In other words, situations where the program follows the letter of the law (i.e., the program specifications) but not the spirit of the law.

Come to think of it, if you could get an LLM to look for such edge cases robustly, it might be able to help RL systems avoid Goodharting, steering the agent to follow the intuitive intent behind a given utility function.

Comment by Jon Garcia on [deleted post] 2022-12-22T23:40:24.486Z

Well, I very much doubt that the entire programming world will get access to a four-quintillion-parameter code-generating model within five years. However, I do foresee the descendants of OpenAI Codex getting much more powerful and much more used within that timeframe. After all, Transformers came out only five years ago, and they've definitely come a long way since.

Human culture changes more slowly than AI technology, though, so I expect businesses to begin adopting such models only with great trepidation at first. Programmers will almost certainly need to stick around for verification and validation of generated code for quite some time. More output will be expected out of programmers, for sure, as the technology is adopted, but that probably won't lead to the elimination of jobs themselves, just as the cotton gin didn't lead to the end of slavery and the rise of automation didn't lead to the rise of leisure time.

Eventually though, yes, code generation will be almost universally automated, at least once everyone is comfortable with automated code verification and validation. However, I wouldn't expect that cultural shift to be complete until at least the early 2030s. That's not to say we aren't in fact running out of time, of course.

Comment by Jon Garcia on The "Minimal Latents" Approach to Natural Abstractions · 2022-12-21T08:21:40.157Z · LW · GW

The cortex uses traveling waves of activity that help it organize concepts in space and time. In other words, the locally traveling waves provide an inductive bias for treating features that occur close together in space and time as part of the same object or concept. As a result, cortical space ends up mapping out conceptual space, in addition to retinotopic, somatic, or auditory space.

This is kind of like DCT in the sense that oscillations are used as a scaffold for storing or reconstructing information. I think that Neural Radiance Fields (NeRF) use a similar concept, using positional encoding (3D coordinates plus viewing angle, rather than 2D pixel position) to generate images, especially when the positional encoding uses Fourier features. Of course, Transformers also use such sinusoidal positional encodings to help with natural language understanding.
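For reference, here is a small sketch of the sinusoidal positional encoding in the style of the original Transformer paper (sine and cosine pairs at geometrically spaced frequencies). The tiny `d_model=8` and the `similarity` helper are my own choices for illustration.

```python
import math

def positional_encoding(position, d_model=8, base=10000.0):
    """Sinusoidal encoding: sin/cos pairs at geometrically spaced frequencies."""
    encoding = []
    for i in range(0, d_model, 2):
        freq = 1.0 / (base ** (i / d_model))
        encoding.append(math.sin(position * freq))
        encoding.append(math.cos(position * freq))
    return encoding

def similarity(p, q):
    """Dot product between the encodings of two positions (hypothetical helper)."""
    return sum(a * b for a, b in zip(positional_encoding(p), positional_encoding(q)))

near = similarity(10, 11)  # neighboring positions
far = similarity(10, 50)   # distant positions
```

Nearby positions end up with more similar codes than distant ones, which is the same "close together means related" inductive bias as the traveling waves, just built in by hand.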

All that is to say that I agree with you. Something similar to DCT will probably be very useful for discovering natural abstractions. For one thing, I imagine that these sorts of approaches could help overcome texture bias in DNNs by incorporating more large-scale shape information.

Comment by Jon Garcia on Towards Hodge-podge Alignment · 2022-12-20T00:42:18.613Z · LW · GW

This just might work. For a little while, anyway.

One hurdle for this plan is to incentivize developers to slap on 20 layers of alignment strategies to their generalist AI models. It may be a hard sell when they are trying to maximize power and efficiency to stay competitive.

You'll probably need to convince them that not having such safeguards in place will lead to undesirable behavior (i.e., unprofitable behavior, or behavior leading to bad PR or bad customer reviews) well below the level of apocalyptic scenarios that AI safety advocates normally talk about. Otherwise, they may not care.

Comment by Jon Garcia on Formalization as suspension of intuition · 2022-12-12T02:52:12.893Z · LW · GW

That's an interesting way of thinking about it. I'm reminded of how computers have revolutionized thought over the past century.

Most humans have tended to think primarily by intuition. Glorified pattern-matching, with all the cognitive biases that entails.

Computers, in contrast, have tended to be all formalization, no intuition. They directly manipulate abstract symbols at the lowest level, symbols that humans only deal with at the highest levels of conscious thought.

And now AI is bringing things back around to (surprisingly brittle) intuition. It will be interesting, at least, to see how newer AI systems will bring together formalized symbolic reasoning with pattern-matching intuition.

Comment by Jon Garcia on Working towards AI alignment is better · 2022-12-09T22:41:49.083Z · LW · GW

Counterpoint:

Sometimes it's easier to reach a destination when you're not aiming for it. You can't reach the sun by pointing a rocket at it and generating thrust. It's hard to climb a mountain by going up the steepest path. You can't solve a Millennium Prize math problem by staring at it until a solution reveals itself.

Sometimes you need to slingshot around a planet a few times to adjust orbital momentum. Sometimes you'll summit faster by winding around or zigzagging. Sometimes you have to play around with unrelated math problems before you notice something insightful.

And perhaps, working on AI problems that are only tangentially related to alignment could reveal a path of least resistance toward a solution that wouldn't have even been considered otherwise.

Comment by Jon Garcia on Take 7: You should talk about "the human's utility function" less. · 2022-12-08T14:04:03.938Z · LW · GW

A steelman of the claim that a human has a utility function is that agents that make coherent decisions have utility functions, therefore we may consider the utility function of a hypothetical AGI aligned with a human. That is, assignment of utility functions to humans reduces to alignment, by assigning the utility function of an aligned AGI to a human.

The problem is, of course, that any possible set of behaviors can be construed as maximizing some utility function. The question is whether doing so actually simplifies the task of reasoning and making predictions about the agent in question, or whether mapping the agent's actual motivational schema to a utility function only adds unwieldy complications.
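The trivial construction, as a sketch (the action set and the deliberately pointless policy are made up for illustration): take any behavior at all and define utility as "1 if you did what the policy does, 0 otherwise."

```python
ACTIONS = ["left", "right", "wait"]

def utility_from_policy(policy):
    """Build a utility function that the given policy maximizes by construction."""
    def utility(state, action):
        return 1.0 if action == policy(state) else 0.0
    return utility

def arbitrary_policy(state):
    # A deliberately pointless behavior: choose based on the length of the state string.
    return ACTIONS[len(state) % len(ACTIONS)]

utility = utility_from_policy(arbitrary_policy)

def best_action(state):
    """What a perfect maximizer of `utility` would do in this state."""
    return max(ACTIONS, key=lambda action: utility(state, action))
```

The "utility function" here is just the policy written out a second time, so it predicts nothing you didn't already know. That is the sense in which the mapping can fail to simplify anything.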

In the case of humans, I would say it's far more useful to model us as generating and pursuing arbitrary goal states/trajectories over time. These goals are continuously learned through interactions with the environment and its impact on pain and pleasure signals, deviations from homeostatic set points, and aesthetic and social instincts. You might be able to model this as a utility function with a recursive hidden state, but would that be helpful?

Comment by Jon Garcia on [deleted post] 2022-12-04T04:56:42.580Z

The problem is that at the beginning, its plans are generally going to be complete nonsense. It has to have a ton of interaction with (at least a reasonable model of) its environment, both with its reward signal and with its causal structure, before it approaches a sensible output.

An oracle AI with no practical experience is of no use to the RL agent's operators. The power of RL is that a simple feedback signal can teach the agent everything it needs to know to act rationally in its environment. But if you want it to make rational plans for the real world without ever getting direct feedback from the real world, you have to manually add vast layers of computational complexity to its training, complexity that would more or less be taken care of automatically for an RL agent interacting with the real world. The incentives aren't in your favor here.

Comment by Jon Garcia on [deleted post] 2022-12-04T01:17:48.543Z

The RL agent will only know whether its plans are any good if they actually get carried out. The reward signal is something that it essentially sought out through trial and error. All (most?) RL agents start out not knowing anything about the impact their plans will have, or even anything about the causal structure of the environment. All of that has to be learned through experience.

For agents that play board games like chess or Go, the environment can be fully determined in simulation. So, sure, in those cases you can have them generate plans and then not take their advice on a physical game board. And those plans do tend to be power-seeking for well-trained agents in the sense that they tend to reach states that maximize the number of winnable options that they have while minimizing the winnable options of their opponents.

However, for an AI to generate power-seeking plans for the real world, it would need to have access either to a very computationally expensive simulator or to the actual real world. The latter is an easier setup to design but more dangerous to train, above a certain level of capability.

Comment by Jon Garcia on The Plan - 2022 Update · 2022-12-02T00:31:20.547Z · LW · GW

Overall, I’ve updated from “just aim for ambitious value learning” to “empirically figure out what potential medium-term alignment targets (e.g. human values, corrigibility, Do What I Mean, human mimicry, etc) are naturally expressible in an AGI’s internal concept-language”.

I like this. In fact, I would argue that some of those medium-term alignment targets are actually necessary stepping stones toward ambitious value learning.

Human mimicry, for one, could serve as a good behavioral prior for IRL agents. AI that can reverse-engineer the policy function of a human (e.g., by minimizing the error between the world-state-trajectory caused by its own actions and that produced by a human's actions) is probably already most of the way there toward reverse-engineering the value function that drives it (e.g., start by looking for common features among the stable fixed points of the learned policy function). I would argue that the intrinsic drive to mimic other humans is a big part of why humans are so adept at aligning to each other.

Do What I Mean (DWIM) would also require modeling humans in a way that would help greatly in modeling human values. A human that gives an AI instructions is mapping some high-dimensional, internally represented goal state into a linear sequence of symbols (or a 2D diagram or whatever). DWIM would require the AI to generate its own high-dimensional, internally represented goal states, optimizing for goals that give a high likelihood to the instructions it received. If achievable, DWIM could also help transform the local incentives for general AI capabilities research into something with a better Nash equilibrium. Systems that are capable of predicting what humans intended for them to do could prove far more valuable to existing stakeholders in AI research than current DL and RL systems, which tend to be rather brittle and prone to overfitting to the heuristics we give them.

Comment by Jon Garcia on Re-Examining LayerNorm · 2022-12-01T22:59:58.396Z · LW · GW

Awesome visualizations. Thanks for doing this.

It occurred to me that LayerNorm seems to be implementing something like lateral inhibition, using extreme values of one neuron to affect the activations of other neurons. In biological brains, lateral inhibition plays a key role in many computations, enabling things like sparse coding and attention. Of course, in those systems, input goes through every neuron's own nonlinear activation function prior to having lateral inhibition applied.

I would be interested in seeing the effect of applying a nonlinearity (such as ReLU, GELU, ELU, etc.) prior to LayerNorm in an artificial neural network. My guess is that it would help prevent neurons with strong negative pre-activations from messing with the output of more positively activated neurons, as happens with pure LayerNorm. Of course, that would limit things to the first orthant for ReLU, although not for GELU or ELU. Not sure how that would affect stretching and folding operations, though.
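A quick toy check of that guess (plain LayerNorm without the learned gain and bias, and made-up pre-activations with one strongly negative unit):

```python
import math

def layer_norm(xs, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return [(x - mean) / math.sqrt(var + eps) for x in xs]

def relu(xs):
    return [max(0.0, x) for x in xs]

pre = [1.0, 2.0, 3.0, -100.0]  # hypothetical pre-activations, one strongly negative

plain = layer_norm(pre)
rectified = layer_norm(relu(pre))

# How far apart the two positive units at index 0 and 2 end up after normalization.
spread_plain = plain[2] - plain[0]
spread_rectified = rectified[2] - rectified[0]
```

With pure LayerNorm, the single large negative value inflates the variance and squashes the differences among the positive units. Rectifying first keeps those differences visible. This says nothing about trained networks, of course; it only shows the mechanism I have in mind.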

By the way, have you looked at how this would affect processing in a CNN, normalizing each pixel of a given layer across all feature channels? I think I've tried using LayerNorm in such a context before, but I don't recall it turning out too well. Maybe I could look into that again sometime.

Comment by Jon Garcia on Don't align agents to evaluations of plans · 2022-11-27T18:28:18.060Z · LW · GW

I think grading in some form will be necessary in the sense that we don't know what value heuristics will be sufficient to ensure alignment in the AI. We will most likely need to add corrections to its reward signals on the fly, even as it learns to extrapolate its own values from those heuristics. In other words, grading.

However, it seems the crucial point is that we need to avoid including grader evaluations as part of the AI's self-evaluation model, for the same reason that we shouldn't give it access to its reward button. In other words, don't build the AI like this:

[planning module] -> [predicted grader output] -> [internal reward signal] -> [reinforce policy function]

Instead, it should look more like this:

[planning module] -> [predicted world state] -> [internal reward signal] -> [reinforce policy function]

The predicted grader output may be part of the AI's predicted world state (if a grader is used), but it shouldn't be the part that triggers reward. The trick, then, would be to identify the part of the AI's world model that corresponds to what we want it to care about and feed only that part into the learned reward signal.
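Here is a toy contrast between the two wirings. Everything in it is hypothetical: the two plans, the `value_created` field standing in for "what we want it to care about," and a grader that can be tricked into giving a top score.

```python
# Hypothetical plans, each with the world state the planner predicts it will produce.
PLANS = {
    "do_the_task": {"value_created": 5, "grader_fooled": False},
    "fool_grader": {"value_created": 0, "grader_fooled": True},
}

def predicted_grader_output(world):
    """An imperfect grader: scores real results, but can be tricked into a top score."""
    return 10 if world["grader_fooled"] else world["value_created"]

def reward_from_grader(world):
    # First design: the internal reward is triggered by the predicted grade.
    return predicted_grader_output(world)

def reward_from_world(world):
    # Second design: reward reads only the part of the world model we care about.
    return world["value_created"]

def choose(reward_fn):
    """Pick the plan whose predicted world state yields the most internal reward."""
    return max(PLANS, key=lambda name: reward_fn(PLANS[name]))

grader_choice = choose(reward_from_grader)
world_choice = choose(reward_from_world)
```

In both designs the agent can predict what the grader would say. The difference is only which prediction is wired to reward, and that is enough to flip which plan gets reinforced.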

Comment by Jon Garcia on Don't design agents which exploit adversarial inputs · 2022-11-18T07:35:30.985Z · LW · GW

Could part of the problem be that the actor is optimizing against a single grader's evaluations? Shouldn't it somehow take uncertainty into account?

Consider having an ensemble of graders, each learning or having been trained to evaluate plans/actions from different initializations and/or using different input information. Each grader would have a different perspective, but that means that the ensemble should converge on similar evaluations for plans that look similarly good from many points of view (like a CT image crystallizing from the combination of many projections).

Rather than arg-maxing on the output of a single grader, the actor would optimize for Schelling points in plan space, selecting actions that minimize the variance among all graders. Of course, you still want it to maximize the evaluations as well, so maybe it should look for actions that lie somewhere in the middle of the Pareto frontier between maximum mean evaluation and minimum $\mathrm{Var}[\text{evaluation}]_{\text{ensemble}}$.
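One crude way to pick a point on that frontier is to score each plan by mean evaluation minus a penalty on ensemble variance. The sketch below uses made-up grader scores, and the `risk_aversion` weight is an assumption that would need tuning.

```python
from statistics import mean, pvariance

# Hypothetical evaluations of three plans by an ensemble of five graders.
EVALS = {
    "solid_plan": [7.0, 7.5, 6.5, 7.0, 7.2],
    "adversarial_plan": [10.0, 2.0, 9.5, 1.0, 10.0],  # fools some graders, not others
    "timid_plan": [3.0, 3.1, 2.9, 3.0, 3.0],
}

def score(evals, risk_aversion=1.0):
    """Trade off high average evaluation against disagreement among graders."""
    return mean(evals) - risk_aversion * pvariance(evals)

# Arg-max against only the first grader versus against the whole ensemble.
single_grader_pick = max(EVALS, key=lambda plan: EVALS[plan][0])
ensemble_pick = max(EVALS, key=lambda plan: score(EVALS[plan]))
```

The plan that exploits one grader's blind spot wins under single-grader arg-max but gets penalized heavily once the other graders disagree about it.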

My intuition suggests that the larger and more diverse the ensemble, the better this strategy would perform, assuming the evaluators are all trained properly. However, I suspect a superintelligence could still find a way to exploit this.

Comment by Jon Garcia on All AGI Safety questions welcome (especially basic ones) [~monthly thread] · 2022-11-02T21:24:42.721Z · LW · GW

Could we solve alignment by just getting an AI to learn human preferences through training it to predict human behavior, using a "current best guess" model of human preferences to make predictions and updating the model until its predictions are accurate, then using this model as a reward signal for the AI? Is there a danger in relying on these sorts of revealed preferences?

On a somewhat related note, someone should answer, "What is this Coherent Extrapolated Volition I've been hearing about from the AI safety community? Are there any holes in that plan?"

Comment by Jon Garcia on Am I secretly excited for AI getting weird? · 2022-10-31T06:01:59.011Z · LW · GW

Different parts of me get excited about this in different directions.

On the one hand, I see AI alignment as highly solvable. When I scan out among a dozen different subdisciplines in machine learning, generative modeling, natural language processing, cognitive science, computational neuroscience, predictive coding, etc., I feel like I can sense the faint edges of a solution to alignment that is already holographically distributed among collective humanity.

Getting AGI that has the same natural abstractions that biological brains converge on, that uses interpretable computational architectures for explicit reasoning, that continuously improves its internal predictive models of the needs and goals of other agents within its sphere of control and uses these models to motivate its own behavior in a self-correcting loop of corrigibility, that cares about the long-term survival of humanity and the whole biosphere; all of this seems like it is achievable within the next 10-20 years if we could just get all the right people working together on it. And I'm excited at the prospect that we could be part of seeing this vision come to fruition.

On the other hand, I realize that humanity is full of bad faith actors and otherwise good people whose agendas are constrained by perverse local incentives. Right now, deep learning is prone to fall to adversarial examples, completely failing to recognize what it's looking at when the texture changes slightly. Natural language understanding is still brittle, with transformer models probably being a bit too general-purpose for their own good. Reinforcement learning still falls prey to Goodharting, which would almost certainly lead to disaster if scaled up sufficiently. Honestly, I don't want to see an AGI emerge that's based on current paradigms just hacked together into something that seems to work. But I see groups moving in that direction anyway.

Without an alignment-adjacent paradigm shift that offers competitive performance over existing models, the major developers of AI are going to continue down a dangerous path, while no one else has the resources to compete. In this light, seeing the rapid progress of the last decade from AlexNet to GPT-3 and DALL-E 2 creates the sort of foreboding excitement that you talked about here. The train is barreling forward at an accelerating pace, and reasonable voices may not be loud enough over the roar of the engines to get the conductor to switch tracks before we plunge over a cliff.

I'm excited for the possibilities of AGI as I idealize it. I'm dreading the likelihood of a dystopic future with no escape if existing AI paradigms take over the world. The question becomes, how do we switch tracks?

Exercise: Do What I Mean (DWIM)