Posts

[Paper Blogpost] When Your AIs Deceive You: Challenges with Partial Observability in RLHF 2024-10-22T13:57:41.125Z
We Should Prepare for a Larger Representation of Academia in AI Safety 2023-08-13T18:03:19.799Z
Andrew Ng wants to have a conversation about extinction risk from AI 2023-06-05T22:29:07.510Z
Evaluating Language Model Behaviours for Shutdown Avoidance in Textual Scenarios 2023-05-16T10:53:32.968Z
[Appendix] Natural Abstractions: Key Claims, Theorems, and Critiques 2023-03-16T16:38:33.735Z
Natural Abstractions: Key claims, Theorems, and Critiques 2023-03-16T16:37:40.181Z
Andrew Huberman on How to Optimize Sleep 2023-02-02T20:17:12.010Z
Experiment Idea: RL Agents Evading Learned Shutdownability 2023-01-16T22:46:03.403Z
Disentangling Shard Theory into Atomic Claims 2023-01-13T04:23:51.947Z
Citability of Lesswrong and the Alignment Forum 2023-01-08T22:12:02.046Z
A Short Dialogue on the Meaning of Reward Functions 2022-11-19T21:04:30.076Z
Leon Lang's Shortform 2022-10-02T10:05:36.368Z
Distribution Shifts and The Importance of AI Safety 2022-09-29T22:38:12.612Z
Summaries: Alignment Fundamentals Curriculum 2022-09-18T13:08:05.335Z

Comments

Comment by Leon Lang (leon-lang) on Leon Lang's Shortform · 2024-11-20T12:17:44.708Z · LW · GW

After the US election, the Twitter competitor Bluesky is suddenly getting a surge of new users:

https://x.com/robertwiblin/status/1858991765942137227

Comment by Leon Lang (leon-lang) on U.S.-China Economic and Security Review Commission pushes Manhattan Project-style AI initiative · 2024-11-19T23:20:40.708Z · LW · GW

How often do such recommendations usually get implemented? Are there already Manifold markets on questions related to these recommendations?

Comment by Leon Lang (leon-lang) on Bogdan Ionut Cirstea's Shortform · 2024-11-19T20:30:50.058Z · LW · GW

In the Reuters article they highlight Jacob Helberg: https://www.reuters.com/technology/artificial-intelligence/us-government-commission-pushes-manhattan-project-style-ai-initiative-2024-11-19/

He seems quite influential in this initiative and recently also wrote this post:

https://republic-journal.com/journal/11-elements-of-american-ai-supremacy/

Wikipedia has the following paragraph on Helberg:

“He grew up in a Jewish family in Europe.[9] Helberg is openly gay.[10] He married American investor Keith Rabois in a 2018 ceremony officiated by Sam Altman.”

Might this be an angle to understand the influence that Sam Altman has on recent developments in the US government?

Comment by Leon Lang (leon-lang) on Leon Lang's Shortform · 2024-11-18T14:40:05.735Z · LW · GW

Why I think scaling laws will continue to drive progress

Epistemic status: This is a thought I've had for a while. I never discussed it with anyone in detail; a brief conversation could convince me otherwise.

According to recent reports, there seem to be some barriers to continued scaling. We don't know exactly what is going on, but it seems like scaling up base models doesn't bring as much of a capability gain as people hoped.

However, I think labs are probably still scaling the wrong thing in some way: the model learns to predict a static dataset from the internet, whereas what it needs to do later is interact with users and the world. To perform well at such a task, the model needs to understand the consequences of its actions, which means modeling interventional distributions P(X | do(A)) instead of static data P(X | Y). This is related to causal confusion as an argument against the scaling hypothesis.
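
As a toy illustration of that difference (a made-up structural causal model, nothing to do with any actual training setup), conditioning on an action in passively collected data is not the same as intervening:

```python
# Toy structural causal model (my own sketch): P(X | A) estimated from passive
# data can differ from the interventional P(X | do(A)) that an acting model needs.
import numpy as np

rng = np.random.default_rng(0)
N = 1_000_000

def sample(do_a=None):
    u = rng.binomial(1, 0.5, N)                    # hidden confounder
    a = u if do_a is None else np.full(N, do_a)    # observational: A copies U; interventional: A is set
    x = rng.binomial(1, 0.2 + 0.3 * a + 0.4 * u)   # X depends on both A and U
    return u, a, x

# observational estimate of P(X=1 | A=1): confounded by U
u, a, x = sample()
p_cond = x[a == 1].mean()        # ≈ 0.2 + 0.3 + 0.4 = 0.9

# interventional P(X=1 | do(A=1)): U stays at its natural distribution
_, _, x_do = sample(do_a=1)
p_do = x_do.mean()               # ≈ 0.2 + 0.3 + 0.4*0.5 = 0.7

print(p_cond, p_do)
```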

This viewpoint suggests that if big labs figure out how to predict observations in an online way, through ongoing interaction of the models with users and the world, then this should drive further progress. It's possible that labs are already doing this, but I'm not aware of it, so I guess they haven't yet fully figured out how to do that.

What triggered me to write this is that there is a new paper on scaling laws for world modeling that is about exactly what I'm talking about here.

Comment by Leon Lang (leon-lang) on OpenAI Email Archives (from Musk v. Altman) · 2024-11-17T09:35:03.737Z · LW · GW

Do we know anything about why they were concerned about an AGI dictatorship created by Demis?

Comment by Leon Lang (leon-lang) on johnswentworth's Shortform · 2024-11-15T21:22:55.590Z · LW · GW

What’s your opinion on the possible progress of systems like AlphaProof, o1, or Claude with computer use?

Comment by Leon Lang (leon-lang) on Leon Lang's Shortform · 2024-11-14T10:19:12.991Z · LW · GW

"Scaling breaks down", they say. By which they mean one of the following wildly different claims with wildly different implications:

  • When you train on a normal dataset, with more compute/data/parameters, subtract the irreducible entropy from the loss, and then plot in a log-log plot: you don't see a straight line anymore.
  • Same setting as before, but you see a straight line; it's just that downstream performance doesn't improve.
  • Same setting as before, and downstream performance improves, but: it improves so slowly that the economics is not in favor of further scaling this type of setup instead of doing something else.
  • A combination of one of the last three items and "btw., we used synthetic data and/or other more high-quality data, still didn't help".
  • Nothing in the realm of "pretrained models" and "reasoning models like o1" and "agentic models like Claude with computer use" profits from a scale-up in a reasonable sense.
  • Nothing which can be scaled up in the next 2-3 years, when training clusters are mostly locked in, will demonstrate a big enough success to motivate the next scale of clusters costing around $100 billion.

Be precise. See also.

Comment by Leon Lang (leon-lang) on The Compendium, A full argument about extinction risk from AGI · 2024-11-06T17:31:23.327Z · LW · GW

Thanks for this compendium, I quite enjoyed reading it. It also motivated me to read the "Narrow Path" soon.

I have a bunch of reactions/comments/questions at several places. I focus on the places that feel most "cruxy" to me. I formulate them without much hedging to facilitate a better discussion, though I feel quite uncertain about most things I write. 

On AI Extinction

The part on extinction from AI seems badly argued to me. Is it fair to say that you mainly want to convey a basic intuition, with the hope that the readers will find extinction an "obvious" result?

To be clear: I think that for literal god-like AI, as described by you, an existential catastrophe is likely if we don't solve a very hard case of alignment. For levels below (superintelligence, AGI), I become progressively more optimistic. Some of my hope comes from believing that humanity will eventually coordinate to not scale to god-like AI unless we have enormous assurances that alignment is solved; I think this is similar to your wish, but you hope that we already stop before even AGI is built. 

On AI Safety 

When we zoom out from the individual to groups, up to the whole of humanity, the complexity of “finding what we want” explodes: when different cultures, different religions, different countries disagree about what they want on key questions like state interventionism, immigration, or what is moral, how can we resolve these into a fixed set of values? If there is a scientific answer to this problem, we have made little progress on it.

If we cannot find, build, and reconcile values that fit with what we want, we will lose control of the future to AI systems that ardently defend a shadow of what we actually care about.

This is a topic where I'm pretty confused, but I still try to formulate a counterposition: I think we can probably align AI systems to constitutions, which then makes it unnecessary to solve all value differences. Whenever someone uses the AI, the AI needs to act in accordance with the constitution, which already has mechanisms for how to resolve value conflicts.

Additionally, the constitution could have mechanisms for how to change the constitution itself, so that humanity and AI could co-evolve to better values over time. 

Progress on our ability to predict the consequences of our actions requires better science in every technical field.

ELK might circumvent this issue: Just query an AI about its latent knowledge of future consequences of our actions. 

Process design for alignment: [...]

This section seems quite interesting to me, but somewhat different from technical discussions of alignment I'm used to. It seems to me that this section is about problems similar to "intent alignment" or creating valid "training stories", only that you want to define alignment as working correctly in the whole world, instead of just individual systems. Thus, the process design should also prevent problems like "multipolar failure" that might be overlooked by other paradigms. Is this a correct characterization?

Given that this section mainly operates at the level of analogies to politics, economics, and history, I think this section could profit from making stronger connections to AI itself.

Just as solving neuroscience would be insufficient to explain how a company works, even full interpretability of an LLM would be insufficient to explain most research efforts on the AI frontier.

That seems true, and it reminds me of deep deceptiveness, where an AI engages in deception without having any internal process that "looks like" deception. 

The more powerful AI we have, the faster things will go. As AI systems improve and automate their own learning, AGI will be able to improve faster than our current research, and ASI will be able to improve faster than humanity can do science. The dynamics of intelligence growth means that it is possible for an ASI “about as smart as humanity” to move to “beyond all human scientific frontiers” on the order of weeks or months. While the change is most dramatic with more advanced systems, as soon as we have AGI we enter a world where things begin to move much quicker, forcing us to solve alignment much faster than in a pre-AGI world.

I agree that such a fast transition from AGI to superintelligence or god-like AI seems very dangerous. Thus, one either shouldn't build AGI, or should somehow ensure that one has lots of time after AGI is built. Some possibilities for having lots of time:

  1. Sufficient international cooperation to keep things slow.
  2. A sufficient lead of the West over countries like China to have time for alignment.

Option 2 leads to a race against China, and even if we end up with a lead, it's unclear whether it will be sufficient to solve the hard problems of alignment. It's also unclear whether the West could already use AGI (pre-superintelligence) for a robust military advantage, and absent such an advantage, option 2 seems very unstable.

So a very cruxy question seems to be how feasible option 1 is. I think this compendium doesn't do much to settle this debate, but I hope to learn more in the "Narrow Path".

Thus we need to have humans validate the research. That is, even automated research runs into a bottleneck of human comprehension and supervision.

That seems correct to me. Some people in EA claim that AI Safety is not neglected anymore, but I would say if we ever get confronted with the need to evaluate automated alignment research (possibly on a deadline), then AI Safety research might be extremely neglected.

AI Governance

The reactive framework reverses the burden of proof from how society typically regulates high-risk technologies and industries. In most areas of law, we do not wait for harm to occur before implementing safeguards.

My impression is that companies like Anthropic, DeepMind, and OpenAI talk about mechanisms that are proactive rather than reactive. E.g., responsible scaling policies define an ASL level before it exists, including evaluations for these levels. Then, mitigations need to be in place once the level is reached. Thus, this framework decidedly does not want to wait until harm has occurred.

I'm curious whether you disagree with this narrow claim (that RSP-like frameworks are proactive), or whether you just want to make the broader claim that it's unclear how RSP-like frameworks could become widespread enforced regulation. 

AI is being developed extremely quickly and by many actors, and the barrier to entry is low and quickly diminishing.

I think that the barrier to entry is not diminishing: to be at the frontier requires increasingly enormous resources.

Possibly your claim is that the barrier to entry for a given level of capabilities diminishes. I agree with that, but I'm unsure if it's the most relevant consideration. I think for a given level of capabilities, the riskiest period is when it's reached for the first time since humanity then won't have experience in how to mitigate potential risks.

Paul Graham estimates training price for performance has decreased 100x in each of the last two years, or 10000x in two years. 

If GPT-4's costs were 100 million dollars, then it could be trained and released by March 2025 for 10k dollars. That seems quite cheap, so I'm not sure if I believe the numbers.
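
Spelling out the arithmetic behind that:

$$\$100{,}000{,}000 \times \frac{1}{100} \times \frac{1}{100} = \$10{,}000.$$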

The reactive framework incorrectly assumes that an AI “warning shot” will motivate coordination.

I never saw this assumption explicitly expressed. Is your view that this is an implicit assumption?

Companies like Anthropic, OpenAI, etc., seem to have facilitated quite a bit of discussion with the USG even without warning shots.

But history shows that it is exactly in such moments that these thresholds are most contested –- this shifting of the goalposts is known as the AI Effect and common enough to have its own Wikipedia page. Time and again, AI advancements have been explained away as routine processes, whereas “real AI” is redefined to be some mystical threshold we have not yet reached.

I would have found this paragraph convincing before ChatGPT. But now, with efforts like the USG national security memorandum, it seems like AI capabilities are being taken almost adequately seriously.

we’ve already seen competitors fight tooth and nail to keep building.

OpenAI thought that their models would be considered high-risk under the EU AI Act. I think arguing that this is inconsistent with OpenAI's commitment to regulation would require looking at what the EU AI Act actually said. I didn't engage with it, but e.g. Zvi doesn't seem to be impressed.

The AI Race

Anthropic released Claude, which they proudly (and correctly) describe as a state-of-the-art pushing model, contradicting their own Core Views on AI Safety, claiming “We generally don’t publish this kind of work because we do not wish to advance the rate of AI capabilities progress.”

The full quote in Anthropic's article is:

"We generally don’t publish this kind of work because we do not wish to advance the rate of AI capabilities progress. In addition, we aim to be thoughtful about demonstrations of frontier capabilities (even without publication). We trained the first version of our headline model, Claude, in the spring of 2022, and decided to prioritize using it for safety research rather than public deployments. We've subsequently begun deploying Claude now that the gap between it and the public state of the art is smaller."

This added context sounds quite different and seems to make clear that with "publish", Anthropic means the publication of the methods to get to the capabilities. Additionally, I agree with Anthropic that releasing models now is less of a race-driver than it would have been in 2022, and so the current decisions seem more reasonable.

These policy proposals lack a roadmap for government enforcement, making them merely hypothetical mandates. Even worse, they add provisions to allow the companies to amend their own framework as they see fit, rather than codifying a resilient system. See Anthropic’s Responsible Scaling Policy: [...]

I agree that it is bad that there is no roadmap for government enforcement. But without such enforcement, and assuming Anthropic is reasonable, I think it makes sense for them to change their RSP in response to new evidence for what works. After all, we want the version that will eventually be encoded in law to be as sensible as possible.

I think Anthropic also deserves some credit for communicating changes to the RSPs and learnings.

Mechanistic interpretability, which tries to reverse-engineer AIs to understand how they work, which can then be used to advance and race even faster. [...] Scalable oversight, which is another term for whack-a-mole approaches where the current issues are incrementally “fixed” by training them away. This incentivizes obscuring issues rather than resolving them. This approach instead helps Anthropic build chatbots, providing a steady revenue stream.

This seems poorly argued. It's unclear how mechanistic interpretability would be used to advance the race further (unless you mean that it leads to safety-washing for more government and public trust?). Also, scalable oversight is such a broad collection of strategies that I don't think it's fair to call them whack-a-mole strategies. E.g., I'd say many of the 11 proposals fall under this umbrella.

I'd be happy for any reactions to my comments!

Comment by Leon Lang (leon-lang) on Ryan Kidd's Shortform · 2024-10-31T20:11:32.306Z · LW · GW

Then the MATS stipend today is probably much lower than it used to be? (Which would make sense since IIRC the stipend during MATS 3.0 was settled before the FTX crash, so presumably when the funding situation was different?)

Comment by Leon Lang (leon-lang) on Ryan Kidd's Shortform · 2024-10-31T18:11:07.028Z · LW · GW

Is “CHAI” being a CHAI intern, PhD student, or something else? My MATS 3.0 stipend was clearly higher than my CHAI internship stipend.

Comment by Leon Lang (leon-lang) on Alexander Gietelink Oldenziel's Shortform · 2024-10-23T12:16:47.765Z · LW · GW

I have a similar feeling, but there are some forces in the opposite direction:

  • Nvidia seems to limit how many GPUs a single competitor can acquire.
  • Training frontier models becomes cheaper over time. Thus, those who build competitive models some time after the absolute frontier have to invest far fewer resources.

Comment by Leon Lang (leon-lang) on Dario Amodei — Machines of Loving Grace · 2024-10-12T13:12:30.113Z · LW · GW

My impression is that Dario (somewhat intentionally?) plays the game of saying things he believes to be true about the 5-10 years after AGI, conditional on AI development not continuing beyond that point.

What happens after those 5-10 years, or if AI gets even vastly smarter? That seems out of scope for the article. I assume he's doing that since he wants to influence a specific set of people, maybe politicians, to take a radical future more seriously than they currently do. Once a radical future is more viscerally clear in a few years, we will likely see even more radical essays. 

Comment by Leon Lang (leon-lang) on Leon Lang's Shortform · 2024-10-03T21:05:21.580Z · LW · GW

It's something I remember having been said on a podcast, but I don't remember which one, and there is a chance that it was never said in the sense I interpreted it.

Also, quote from this post:

"DeepMind says that at large quantities of compute the scaling laws bend slightly, and the optimal behavior might be to scale data by even more than you scale model size. In which case you might need to increase compute by more than 200x before it would make sense to use a trillion parameters."

Comment by Leon Lang (leon-lang) on Leon Lang's Shortform · 2024-10-03T15:27:41.776Z · LW · GW

Are the straight lines from scaling laws really bending? People are saying they are, but maybe that's just an artefact of the fact that the cross-entropy is bounded below by the data entropy. If you subtract the data entropy, then you obtain the Kullback-Leibler divergence, which is only bounded below by zero, and so in a log-log plot it can actually approach negative infinity. I visualized this with the help of ChatGPT:

Here, f represents the Kullback-Leibler divergence, and g the cross-entropy loss with the entropy offset. 
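
A minimal recreation of that visualization (with made-up constants):

```python
# Sketch of the visualization described above (made-up constants): if the
# "true" scaling behaviour is a clean power law in the KL divergence,
# f(C) = a * C**(-alpha), then the observed cross-entropy g(C) = f(C) + H
# (H = irreducible data entropy) bends in a log-log plot even though f doesn't.
import numpy as np
import matplotlib.pyplot as plt

C = np.logspace(0, 6, 200)       # compute, arbitrary units
a, alpha, H = 1.0, 0.3, 0.05     # hypothetical constants

f = a * C**(-alpha)              # KL divergence: a straight line in log-log
g = f + H                        # cross-entropy: flattens out towards H

plt.loglog(C, f, label="f: KL divergence")
plt.loglog(C, g, label="g: cross-entropy = KL + entropy")
plt.xlabel("compute")
plt.ylabel("loss")
plt.legend()
plt.show()
```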

Comment by Leon Lang (leon-lang) on Daniel Kokotajlo's Shortform · 2024-09-30T19:10:41.739Z · LW · GW

Agreed.

To understand your usage of the term “outer alignment” a bit better: often, people have a decomposition in mind where solving outer alignment means technically specifying the reward signal/model or something similar. It seems that to you, the writeup of a model-spec or constitution also counts as outer alignment, which to me seems like only part of the problem. (Unless perhaps you mean that model specs and constitutions should be extended to include a whole training setup or similar?)

If it doesn’t seem too off-topic to you, could you comment on your views on this terminology?

Comment by Leon Lang (leon-lang) on Leon Lang's Shortform · 2024-09-29T20:44:28.529Z · LW · GW

https://www.wsj.com/tech/ai/californias-gavin-newsom-vetoes-controversial-ai-safety-bill-d526f621

“California Gov. Gavin Newsom has vetoed a controversial artificial-intelligence safety bill that pitted some of the biggest tech companies against prominent scientists who developed the technology.

The Democrat decided to reject the measure because it applies only to the biggest and most expensive AI models and leaves others unregulated, according to a person with knowledge of his thinking”

Comment by Leon Lang (leon-lang) on Leon Lang's Shortform · 2024-09-25T16:43:46.725Z · LW · GW

New Bloomberg article on data center buildouts pitched to the US government by OpenAI. Quotes:

- “the startup shared a document with government officials outlining the economic and national security benefits of building 5-gigawatt data centers in various US states, based on an analysis the company engaged with outside experts on. To put that in context, 5 gigawatts is roughly the equivalent of five nuclear reactors, or enough to power almost 3 million homes.”
- “Joe Dominguez, CEO of Constellation Energy Corp., said he has heard that Altman is talking about building 5 to 7 data centers that are each 5 gigawatts. “
- “John Ketchum, CEO of NextEra Energy Inc., said the clean-energy giant had received requests from some tech companies to find sites that can support 5 GW of demand, without naming any specific firms.”

Compare with the prediction by Leopold Aschenbrenner in Situational Awareness:

- "The trillion-dollar cluster—+4 OOMs from the GPT-4 cluster, the ~2030 training cluster on the current trend—will be a truly extraordinary effort. The 100GW of power it’ll require is equivalent to >20% of US electricity production"

Comment by Leon Lang (leon-lang) on Stephen McAleese's Shortform · 2024-09-16T12:45:45.101Z · LW · GW

OpenAI would have mentioned if they had reached gold on the IMO.

Comment by Leon Lang (leon-lang) on Leon Lang's Shortform · 2024-09-13T07:48:17.287Z · LW · GW

I think it would be valuable if someone would write a post that does (parts of) the following:

  • summarize the landscape of work on getting LLMs to reason.
  • sketch out the tree of possibilities for how o1 was trained and how it works in inference.
  • select a “most likely” path in that tree and describe in detail a possibility for how o1 works.

I would find it valuable because it seems important for external safety work to know how frontier models work; otherwise it is impossible to point out theoretical or conceptual flaws in their alignment approaches.

One caveat: writing such a post could be considered an infohazard. I’m personally not too worried about this since I guess that every big lab is internally doing the same independently, so that the post would not speed up innovation at any of the labs.

Comment by Leon Lang (leon-lang) on Are LLMs on the Path to AGI? · 2024-08-30T04:49:13.144Z · LW · GW

Thanks for the post, I agree with the main points.

There is another claim on causality one could make, which would be: LLMs cannot reliably act in the world as robust agents since by acting in the world, you change the world, leading to a distributional shift from the correlational data the LLM encountered during training.

I think that argument is correct, but misses an obvious solution: once you let your LLM act in the world, simply let it predict and learn from the tokens that it receives in response. Then suddenly, the LLM does not model correlational, but actual causal relationships.

Comment by Leon Lang (leon-lang) on Leon Lang's Shortform · 2024-08-29T07:47:07.471Z · LW · GW

Agreed.

I think the most interesting part was that she made a comment that one way to predict a mind is to be a mind, and that that mind will not necessarily have the best of all of humanity as its goal. So she seems to take inner misalignment seriously. 

Comment by Leon Lang (leon-lang) on Leon Lang's Shortform · 2024-08-29T05:22:51.323Z · LW · GW

40-minute podcast with Anca Dragan, who leads safety and alignment at Google DeepMind: https://youtu.be/ZXA2dmFxXmg?si=Tk0Hgh2RCCC0-C7q

Comment by Leon Lang (leon-lang) on Defining alignment research · 2024-08-24T20:47:17.143Z · LW · GW

To clarify: are you saying that since you perceive Chris Olah as mostly intrinsically caring about understanding neural networks (instead of mostly caring about alignment), you conclude that his work is irrelevant to alignment?

Comment by Leon Lang (leon-lang) on Vanessa Kosoy's Shortform · 2024-07-28T08:13:07.677Z · LW · GW

I can see that research into proof assistants might lead to better techniques for combining foundation models with RL. Is there anything more specific that you imagine? Outside of math there are very different problems because there is no easy way to synthetically generate a lot of labeled data (as opposed to formally verifiable proofs).

Not much more specific! I guess from a certain level of capabilities onward, one could create labels with foundation models that evaluate reasoning steps. This is much more fuzzy than math, but I still guess a person who created a groundbreaking proof assistant would be extremely valuable for any effort that tries to make foundation models reason reliably. And if they worked at a company like Google, then I think their ideas would likely diffuse even if they didn't want to work on foundation models.

Thanks for your details on how someone could act responsibly in this space! That makes sense. I think one caveat is that proof assistant research might need enormous amounts of compute, and so it’s unclear how to work on it productively outside of a company where the ideas would likely diffuse.

Comment by Leon Lang (leon-lang) on Vanessa Kosoy's Shortform · 2024-07-27T21:09:29.088Z · LW · GW

I think the main way that proof assistant research feeds into capabilities research is not through the assistants themselves, but through the transfer of the proof assistant research to creating foundation models with better reasoning capabilities. I think researching better proof assistants can shorten timelines.

  • See also Demis Hassabis' recent tweet. Admittedly, it's unclear whether he refers to AlphaProof itself being accessible from Gemini, or the research into AlphaProof feeding into improvements of Gemini.
  • See also an important paragraph in the blogpost for AlphaProof: "As part of our IMO work, we also experimented with a natural language reasoning system, built upon Gemini and our latest research to enable advanced problem-solving skills. This system doesn’t require the problems to be translated into a formal language and could be combined with other AI systems. We also tested this approach on this year’s IMO problems and the results showed great promise."

Comment by Leon Lang (leon-lang) on Leon Lang's Shortform · 2024-07-27T20:59:48.213Z · LW · GW

https://www.washingtonpost.com/opinions/2024/07/25/sam-altman-ai-democracy-authoritarianism-future/

Not sure if this was discussed on LW before. This is an opinion piece by Sam Altman, which sounds like a toned-down version of "situational awareness" to me.

Comment by Leon Lang (leon-lang) on "AI achieves silver-medal standard solving International Mathematical Olympiad problems" · 2024-07-25T22:45:54.363Z · LW · GW

The news is not very old yet. Lots of potential for people to start freaking out.

Comment by Leon Lang (leon-lang) on Leon Lang's Shortform · 2024-07-21T11:23:59.126Z · LW · GW

One question: Do you think Chinchilla scaling laws are still correct today, or are they not? I would assume these scaling laws depend on the data set used in training, so that if OpenAI found/created a better data set, this might change scaling laws.
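
For concreteness, the parametric form I have in mind when saying "Chinchilla scaling laws" is a sketch like the following; the constants are the published Chinchilla fit as I remember them, so treat them as approximate (and, per the question, presumably dataset-dependent):

```python
# Rough sketch of the Chinchilla parametric loss (Hoffmann et al. 2022).
# Constants quoted from memory; treat them as approximate.
def chinchilla_loss(N, D, E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    """Predicted pretraining loss for N parameters trained on D tokens."""
    return E + A / N**alpha + B / D**beta

# e.g. roughly Chinchilla itself: 70B parameters, 1.4T tokens
print(chinchilla_loss(N=70e9, D=1.4e12))
```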

Do you agree with this, or do you think it's false?

Comment by Leon Lang (leon-lang) on Leon Lang's Shortform · 2024-07-19T11:15:27.155Z · LW · GW

https://x.com/sama/status/1813984927622549881

According to Sam Altman, GPT-4o mini is much better than text-davinci-003 was in 2022, but 100 times cheaper. In general, we see increasing competition to produce smaller-sized models with great performance (e.g., Claude Haiku and Sonnet, Gemini 1.5 Flash and Pro, maybe even the full-sized GPT-4o itself). I think this trend is worth discussing. Some comments (mostly just quick takes) and questions I'd like to have answers to:

  • Should we expect this trend to continue? How much efficiency gains are still possible? Can we expect another 100x efficiency gain in the coming years? Andrej Karpathy expects that we might see a GPT-2 sized "smart" model.
  • What's the technical driver behind these advancements? Andrej Karpathy thinks it is based on synthetic data: Larger models curate new, better training data for the next generation of small models. Might there also be architectural changes? Inference tricks? Which of these advancements can continue?
  • Why are companies pushing into small models? I think in hindsight, this seems easy to answer, but I'm curious what others think: If you have a GPT-4 level model that is much, much cheaper, then you can sell the service to many more people and deeply integrate your model into lots of software on phones, computers, etc. I think this has many desirable effects for AI developers:
    • Increase revenue, motivating investments into the next generation of LLMs
    • Increase market-share. Some integrations are probably "sticky" such that if you're first, you secure revenue for a long time. 
    • Make many people "aware" of potential usecases of even smarter AI so that they're motivated to sign up for the next generation of more expensive AI.
    • The company's inference compute is probably limited (especially for OpenAI, as the market leader) and not many people are convinced to pay a large amount for very intelligent models, meaning that all these reasons outweigh the reasons for publishing larger models instead, or in addition.
  • What does all this mean for the next generation of large models? 
    • Should we expect that efficiency gains in small models translate into efficiency gains in large models, such that a future model with the cost of text-davinci-003 is massively more capable than today's SOTA? If Andrej Karpathy is right that the small model's capabilities come from synthetic data generated by larger, smart models, then it's unclear to me whether one can train SOTA models with these techniques, as this might require an even larger model to already exist. 
    • At what point does it become worthwhile for e.g. OpenAI to publish a next-gen model? Presumably, I'd guess you can still do a lot of "penetration of small model usecases" in the next 1-2 years, leading to massive revenue increases without necessarily releasing a next-gen model. 
    • Do the strategies differ for different companies? OpenAI is the clear market leader, so possibly they can penetrate the market further without first making a "bigger name for themselves". In contrast, I could imagine that for a company like Anthropic, it's much more important to get out a clear SOTA model that impresses people and makes them aware of Claude. I thus currently (weakly) expect Anthropic to more strongly push in the direction of SOTA than OpenAI.

Comment by Leon Lang (leon-lang) on Fully booked - LessWrong Community weekend · 2024-07-16T19:54:25.357Z · LW · GW

I went to this event in 2022 and it was lovely. Will come again this year. I recommend coming!

Comment by Leon Lang (leon-lang) on A simple case for extreme inner misalignment · 2024-07-14T10:53:40.863Z · LW · GW

Thanks for the answer!

But basically, by "simple goals" I mean "goals which are simple to represent", i.e. goals which have highly compressed representations

It seems to me you are using "compressed" in two very different meanings in part 1 and 2. Or, to be fairer, I interpret the meanings very differently.

I try to make my view of things more concrete to explain:

Compressed representations: A representation is a function f from observations of the world state (or sequences of such observations) into a representation space Z of "features". That this is "compressed" means (a) that in Z, only a small number of features are active at any given time and (b) that this small number of features is still sufficient to predict/act in the world.

Goals building on compressed representations: A goal is a (maybe linear) function g from the representation space Z into the real numbers. The goal "likes" some features and "dislikes" others. (Or if it is not entirely linear, then it may like/dislike some simple combinations/compositions of features.)

It seems to me that in part 2 of your post, you view goals as compositions g ∘ f. Part 1 says that f is highly compressed. But it's totally unclear to me why the composition g ∘ f should then have the simplicity properties you claim in part 2, which in my mind don't connect with the compression properties of f as I just defined them.

A few more thoughts:

  • The notion of "simplicity" in part  seems to be about how easy it is to represent a function -- i.e., the space of parameters with which the function  is represented is simple in your part 2.
  • The notion of "compression" in part 1 seems to be about how easy it is to represent an input -- i.e., is there a small number of features such that their activation tells you the important things about the input?
  • These notions of simplicity and compression are very different. Indeed, if you have a highly compressed representation  as in part 1, I'd guess that  necessarily lives in a highly complex space of possible functions with many parameters, thus the opposite of what seems to be going on in part 2.
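
The toy sketch I mentioned (my own illustration, not from your post): f outputs a sparse, low-dimensional feature code (the sense in which I mean "compressed"), g is a simple linear readout on those features, and yet the composition g(f(x)) is as complicated a function of the raw input as f itself:

```python
# Toy numpy sketch (my own illustration): a "compressed" representation f
# (sparse feature code) plus a simple linear goal g still give a complicated
# composition g(f(x)) as a function of the raw input x.
import numpy as np

rng = np.random.default_rng(0)

D, K = 10_000, 64               # raw input dim, number of features
W1 = rng.normal(size=(K, D))

def f(x):
    """Representation: keep only the top-5 most active features (sparse code)."""
    z = np.maximum(W1 @ x, 0.0)          # some nonlinear feature detector
    top = np.argsort(z)[-5:]             # sparsity: only 5 features stay active
    sparse = np.zeros_like(z)
    sparse[top] = z[top]
    return sparse

v = rng.normal(size=K)

def g(z):
    """Goal: a simple linear readout on the feature space."""
    return float(v @ z)

x = rng.normal(size=D)
print(g(f(x)))   # the goal is simple *given* f, but g∘f is a complex function of x
```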

This is largely my fault since I haven't really defined "representation" very clearly, but I would say that the representation of the concept of a dog should be considered to include e.g. the neurons representing "fur", "mouth", "nose", "barks", etc. Otherwise if we just count "dog" as being encoded in a single neuron, then every concept encoded in any neuron is equally simple, which doesn't seem like a useful definition.

(To put it another way: the representation is the information you need to actually do stuff with the concept.)

I'm confused. Most of the time, when seeing a dog, most of what I need is actually just to know that it is a "dog", so this is totally sufficient to do something with the concept. E.g., if I walk on the street and wonder "will this thing bark?", then knowing "my dog neuron activates" is almost enough. 

I'm confused for a second reason: It seems like here you want to claim that the "dog" representation is NOT simple (since it contains "fur", "mouth", etc.). However, the "dog" representation needs lots of intelligence and should thus come along with compression, and if you equate compression and simplicity, then it seems to me like you're not consistent. (I feel a bit awkward saying "you're not consistent", but I think it's probably good if I state my honest state of mind at this moment).

To clarify my own position, in line with my definition of compression further above: I think that whether a representation is simple/compressed is NOT a property of a single input-output relation (like "pixels of a dog get mapped to the dog-neuron being activated"), but instead a property of the whole FUNCTION that maps inputs to representations. This function is compressed if for any given input, only a small number of neurons in the last layer activate, and if these can be used (ideally in a linear way) for further predictions and for evaluating goal-achievement.

I agree that most people who say they are hedonic utilitarians are not 100% committed to hedonic utilitarianism. But I still think it's very striking that they at least somewhat care about making hedonium. I claim this provides an intuition pump for how AIs might care about squiggles too.

Okay, I agree with this, fwiw. :) (Though I may not necessarily agree with claims about how this connects to the rest of the post)

Comment by Leon Lang (leon-lang) on A simple case for extreme inner misalignment · 2024-07-13T22:05:01.480Z · LW · GW

Thanks for the post!

a. How exactly do 1 and 2 interact to produce 3?

I think the claim is along the lines of "highly compressed representations imply simple goals", but the connection between compressed representations and simple goals has not been argued, unless I missed it. There's also a chance that I simply misunderstand your post entirely. 

b. I don't agree with the following argument:

Decomposability over space. A goal is decomposable over space if it can be evaluated separately in each given volume of space. All else equal, a goal is more decomposable if it's defined over smaller-scale subcomponents, so the most decomposable goals will be defined over very small slices of space—hence why we're talking about molecular squiggles. (By contrast, you can't evaluate the amount of higher-level goals like "freedom" or "justice" in a nanoscale volume, even in principle.)

The classical ML-algorithm that evaluates features separately in space is a CNN. That doesn't mean that features in CNNs look for tiny structures, though: The deeper into the CNN you are, the more complex the features get. Actually, deep CNNs are an example of what you describe in argument 1: The features in later layers of CNNs are highly compressed, and may tell you binary information such as "is there a dog", but they apply to large parts of the input image.

Therefore, I'd also expect that what an AGI would care about are ultimately larger-scale structures since the AGI's features will nontrivially depend on the interaction of larger parts in space (and possibly time, e.g. if the AGI evaluates music, movies, etc.). 

c. I think this leaves the confusion of why philosophers end up favoring the analog of squiggles when they become hedonic utilitarians. I'd argue that the premise may be false, since it's unclear to me how what philosophers say they care about ("hedonium") connects with what they actually care about (e.g., maybe they still listen to complex music, build a family, build status through philosophical argumentation, etc.).

Comment by Leon Lang (leon-lang) on Leon Lang's Shortform · 2024-07-01T12:04:30.717Z · LW · GW

You should all be using the "Google Scholar PDF reader extension" for Chrome.

Features I like:

  • References are linked and clickable
  • You get a table of contents
  • You can move back after clicking a link with Alt+left

Screenshot: 

Comment by Leon Lang (leon-lang) on Examples of Highly Counterfactual Discoveries? · 2024-04-25T13:49:44.535Z · LW · GW

I guess (but don't know) that most people who downvote Garrett's comment overupdated on intuitive explanations of singular learning theory, not realizing that entire books with novel and nontrivial mathematical theory have been written on it. 

Comment by Leon Lang (leon-lang) on A couple productivity tips for overthinkers · 2024-04-21T17:14:01.039Z · LW · GW

I do all of these except 3, and implementing a system like 3 is among the deprioritized items on my to-do list. Maybe I should prioritize it.

Comment by Leon Lang (leon-lang) on Transformers Represent Belief State Geometry in their Residual Stream · 2024-04-17T20:46:26.981Z · LW · GW

Yes the first! Thanks for the link!

Comment by Leon Lang (leon-lang) on Transformers Represent Belief State Geometry in their Residual Stream · 2024-04-17T17:17:47.891Z · LW · GW

I really enjoyed reading this post! It's quite well-written. Thanks for writing it.

My only critique is that I would have appreciated more details on how the linear regression parameters are trained and what exactly the projection is doing. John's thread clarifies this a bit.

One question: If you optimize the representation in the residual stream such that it corresponds to a particular chosen belief state, does the transformer then predict the next token as if in that belief state? I.e., does the transformer use the belief state for making predictions?

Comment by Leon Lang (leon-lang) on More people getting into AI safety should do a PhD · 2024-03-15T00:57:17.126Z · LW · GW

MATS mentorships are often weekly, but only for a limited time, unlike PhD programs, which offer mentorship for several years. These years are probably often necessary to develop good research taste.

Comment by Leon Lang (leon-lang) on Sharing Information About Nonlinear · 2023-09-08T22:36:06.890Z · LW · GW

(Fwiw, I don’t remember problems with stipend payout at SERI MATS in the winter program. I was a winter scholar in 2022/23.)

Comment by Leon Lang (leon-lang) on Long-Term Future Fund: April 2023 grant recommendations · 2023-08-02T09:41:43.743Z · LW · GW

This is very helpful, thanks! Actually, the post includes several sections, including in the appendix, that might be more interesting to many readers than the grant recommendations themselves. Maybe it would be good to change the title a bit so that people also expect other updates.

Comment by Leon Lang (leon-lang) on DSLT 2. Why Neural Networks obey Occam's Razor · 2023-07-12T06:26:39.865Z · LW · GW

Thanks for the reply!

As I show in the examples in DSLT1, having degenerate Fisher information (i.e. degenerate Hessian at zeroes) comes in two essential flavours: having rank-deficiency, and having vanishing second-derivative (i.e. K''(0) = 0). Precisely, suppose d is the number of parameters, then you are in the regular case if K(w) can be expressed as a full-rank quadratic form near each singularity,

$$K(w) = \sum_{i=1}^d w_i^2.$$

Anything less than this is a strictly singular case. 

So if $K(w) = \sum_{i=1}^d w_i^2$ near a zero $w_0$, then $w_0$ is a singularity but not a strict singularity, do you agree? It still feels like somewhat bad terminology to me, but maybe it's justified from the algebraic-geometry perspective.
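
To check my understanding with toy examples (mine, not from the post), in d = 2 parameters:

$$K_{\text{regular}}(w) = w_1^2 + w_2^2, \qquad K_{\text{rank-deficient}}(w) = w_1^2, \qquad K_{\text{degenerate}}(w) = w_1^4 + w_2^2.$$

The first should be regular (RLCT 1 = d/2), while the second and third should be strictly singular with RLCTs 1/2 and 3/4 respectively, if I computed them correctly.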

Comment by Leon Lang (leon-lang) on Leon Lang's Shortform · 2023-07-05T09:15:29.588Z · LW · GW

Zeta Functions in Singular Learning Theory

In this shortform, I very briefly explain my understanding of how zeta functions play a role in the derivation of the free energy in singular learning theory. This is entirely based on slide 14 of the SLT low 4 talk of the recent summit on SLT and Alignment, so feel free to ignore this shortform and simply watch the video.

The story is this: we have a prior φ(w), a model p(x | w), and there is an unknown true distribution q(x). For model selection, we are interested in the evidence of our model for a data set D_n = (x_1, …, x_n), which is given by

$$Z_n = \int_W \varphi(w) \, e^{-n K_n(w)} \, dw,$$

where $K_n(w) = \frac{1}{n} \sum_{i=1}^n \log \frac{q(x_i)}{p(x_i \mid w)}$ is the empirical KL divergence. In fact, we are interested in selecting the model that maximizes the average of this quantity over all data sets. The average is then given by

$$\bar{Z}_n = \int_W \varphi(w) \, e^{-n K(w)} \, dw,$$

where $K(w) = \int q(x) \log \frac{q(x)}{p(x \mid w)} \, dx$ is the Kullback-Leibler divergence.

But now we have a problem: how do we compute this integral? Computing this integral is what the free energy formula is about.

The answer: by computing a different integral. So now, I'll explain the connection to different integrals we can draw. 

Let

$$v(t) = \int_W \delta\big(t - K(w)\big) \, \varphi(w) \, dw,$$

which is called the state density function. Here, δ is the Dirac delta function. For different t, it measures the density of states (= parameter vectors) that have K(w) = t. It is thus a measure for the "size" of different level sets. This state density function is connected to two different things.

Laplace Transform to the Evidence

First of all, it is connected to the evidence above. Namely, let Z be the Laplace transform of v. It is a function Z: (0, ∞) → ℝ given by

$$Z(n) = \int_0^\infty v(t) \, e^{-nt} \, dt = \int_W \left( \int_0^\infty \delta\big(t - K(w)\big) \, e^{-nt} \, dt \right) \varphi(w) \, dw = \int_W \varphi(w) \, e^{-n K(w)} \, dw.$$

In the first step, we changed the order of integration, and in the second step we used the defining property of the Dirac delta. Great, so this tells us that $Z(n) = \bar{Z}_n$! So this means we essentially just need to understand v(t).

Mellin Transform to the Zeta Function

But how do we compute v? By using another transform. Let ζ be the Mellin transform of v. It is a function ζ: ℂ → ℂ (or maybe only defined on part of ℂ?) given by

$$\zeta(\lambda) = \int_0^\infty v(t) \, t^{\lambda} \, dt = \int_W \left( \int_0^\infty \delta\big(t - K(w)\big) \, t^{\lambda} \, dt \right) \varphi(w) \, dw = \int_W K(w)^{\lambda} \, \varphi(w) \, dw.$$

Again, we used a change in the order of integration and then the defining property of the Dirac delta. This is called a zeta function.

What's this useful for?

The Mellin transform has an inverse. Thus, if we can compute the zeta function, we can also compute the original evidence as

$$\bar{Z}_n = Z(n) = \int_0^\infty \big(M^{-1}\zeta\big)(t) \, e^{-nt} \, dt,$$

where M⁻¹ denotes the inverse Mellin transform. Thus, we essentially changed our problem to the problem of studying the zeta function ζ(λ). To compute the integral of the zeta function, it is then useful to perform blowups to resolve the singularities in the set of minima of K, which is where algebraic geometry enters the picture. For more on all of this, I refer, again, to the excellent SLT low 4 talk of the recent summit on singular learning theory.
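
As a toy sanity check of the zeta function (my own sketch, not from the talk), one can compute ζ explicitly for a one-parameter model with K(w) = w² and uniform prior on [-1, 1]; the largest pole sits at λ = -1/2, i.e. RLCT 1/2, as expected for the regular one-dimensional case:

```python
# Toy sanity check (my own sketch, not from the talk): zeta(lambda) for
# K(w) = w^2 with uniform prior phi(w) = 1/2 on [-1, 1].
import sympy as sp

w = sp.symbols("w", positive=True)
lam = sp.symbols("lambda")

prior = sp.Rational(1, 2)

# zeta(lambda) = int_{-1}^{1} K(w)^lambda phi(w) dw; by symmetry integrate over (0, 1).
# For w > 0 we can write (w^2)^lambda = w^(2*lambda).
zeta = 2 * sp.integrate(prior * w**(2 * lam), (w, 0, 1), conds="none")
print(sp.simplify(zeta))                     # 1/(2*lambda + 1)

# The (largest) pole of zeta is at lambda = -1/2, so the RLCT is 1/2.
pole = sp.solve(sp.denom(sp.together(zeta)), lam)
print(pole)                                  # [-1/2]
```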

Comment by Leon Lang (leon-lang) on DSLT 2. Why Neural Networks obey Occam's Razor · 2023-07-03T23:13:00.661Z · LW · GW

Thanks for the answer! I think my first question was confused because I didn't realize you were talking about local free energies instead of the global one :) 

As discussed in the comment in your DSLT1 question, they are both singularities of K(w) since they are both critical points (local minima).

Oh, I actually may have missed that aspect of your answer back then. I'm confused by that: in algebraic geometry, the zeros of a set of polynomials are not necessarily already singularities. E.g., for the polynomial xy, the zero set consists of the two axes, which form an algebraic variety, but only at the origin is there a singularity because the derivative vanishes.
Now, for the KL divergence, the situation seems more extreme: the zeros are also, at the same time, the minima of K, and thus the derivative vanishes at every point in the set of zeros W_0. This suggests every point in W_0 is singular. Is this correct?

So far, I thought "being singular" means the effective number of parameters around the singularity is lower than the full number of parameters. Also, I thought that it's about the rank of the Hessian, not the vanishing of the derivative. Both perspectives contradict the interpretation in the preceding paragraph, which leaves me confused. 

The uninteresting answer is that SLT doesn't care about the prior (other than its regularity conditions) since it is irrelevant in the n → ∞ limit.

I vaguely remember that there is a part in the MDL book by Grünwald where he explains how using a good prior such as the Jeffreys prior somewhat changes the asymptotic behavior for n → ∞, but I'm not certain of that.

Comment by Leon Lang (leon-lang) on DSLT 4. Phase Transitions in Neural Networks · 2023-07-03T22:48:04.400Z · LW · GW

Thanks also for this post! I enjoy reading the sequence and look forward to post 5 on the connections to alignment :) 

At some critical value, we recognise a phase transition as being a discontinuous change in the free energy or one of its derivatives, for example the generalisation error.

"Discontinuity" might suggest that this happens fast. Yet, e.g. in work on grokking, it actually turns out that these "sudden changes" happen over a majority of the training time (often, the x-axis is on a logarithmic scale). Is this compatible, or would this suggest that phenomena like grokking aren't related to the phase transitions predicted by SLT?

There is, however, one fundamentally different kind of "phase transition" that we cannot explain easily with SLT: a phase transition of SGD in time, i.e. the number of gradient descent steps. The Bayesian framework of SLT does not really allow one to speak of time - the closest quantity is the number of datapoints n, but these are not equivalent. We leave this gap as one of the fundamental open questions of relating SLT to current deep learning practice.

As far as I know, modern transformers are often only trained once on each data sample, which should close the gap between SGD time and the number of data samples quite a bit. Do you agree with that perspective?

In general, it seems to me that we're probably most interested in phase transitions that happen across SGD time or with more data samples, whereas phase transitions related to other hyperparameters (for example, varying the truth as in your examples here) are maybe less crucial. Would you agree with that?

Would you expect that most phase transitions in SGD time or the number of data samples are first-order transitions (as is the case when there is a loss-complexity tradeoff), or can you conceive of second-order phase transitions that might be relevant in that context as well?

Which altered the posterior geometry, but not that of  since  (up to a normalisation factor).

I didn't understand this footnote. 

but the node-degeneracy and orientation-reversing symmetries only occur under precise configurations of the truth.

Hmm, I thought that these symmetries are about configurations of the parameter vector, irrespective of whether it is the "true" vector or not.
Are you maybe trying to say the following? The truth determines which parameter vectors are preferred by the free energy, e.g. those close to the truth. For some truths, we will have more symmetries around the truth, and thus a lower RLCT for the regions preferred by the posterior.

We will use the label weight annihilation phase to refer to the configuration of nodes such that the weights all point into the centre region and annihilate one another.

It seems to me that in the other phase, the weights also annihilate each other, so "non-weight annihilation phase" is somewhat weird terminology. Or did I miss something?

The weight annihilation phase  is never preferred by the posterior

I think there is a typo and you meant .

Comment by Leon Lang (leon-lang) on DSLT 3. Neural Networks are Singular · 2023-07-03T13:48:45.112Z · LW · GW

Thanks Liam also for this nice post! The explanations were quite clear. 

The property of being singular is specific to a model class, regardless of the underlying truth.

This holds for singularities that come from symmetries where the model doesn't change. However, is it correct that we need the "underlying truth" to study symmetries that come from other degeneracies of the Fisher information matrix? After all, this matrix involves the true distribution in its definition. The same holds for the Hessian of the KL divergence. 

Both configurations, non-weight-annihilation (left) and weight-annihilation (right)

What do you mean by non-weight-annihilation here? Don't the weights annihilate in both pictures?

Comment by Leon Lang (leon-lang) on Neural networks generalize because of this one weird trick · 2023-06-27T04:09:03.576Z · LW · GW


In particular, it is the singularities of these minimum-loss sets — points at which the tangent is ill-defined — that determine generalization performance.

To clarify: there is not necessarily a problem with the tangent, right? E.g., a function like f(x) = x^4 has a singularity at x = 0 because the second derivative vanishes there, but the tangent is defined. I think for the same reason, some of the pictures may be misleading to some readers.

  • A model, parametrized by weights w ∈ W, where W is compact;

Why do we want compactness? Neural networks are parameterized in a non-compact set. (Though I guess usually, if things go well, the weights don't blow up. So in that sense it can maybe be modeled to be compact)

The empirical Kullback-Leibler divergence is just a rescaled and shifted version of the negative log likelihood.

I think it is only shifted, and not also rescaled, if I'm not missing something. 

But these predictions of "generalization error" are actually a contrived kind of theoretical device that isn't what we mean by "generalization error" in the typical ML setting.

Why is that? I.e., in what way is the generalization error different from what ML people care about? Because real ML models don't predict using an updated posterior over the parameter space? (I was just wondering if there is a different reason I'm missing)

Comment by Leon Lang (leon-lang) on DSLT 1. The RLCT Measures the Effective Dimension of Neural Networks · 2023-06-27T01:55:13.807Z · LW · GW

Thanks for the answer mfar!

Yeah, I remember also struggling to parse this statement when I first saw it. Liam answered, but in case it's still not clear and/or someone doesn't want to follow up in Liam's thesis: x is a free variable, and the condition is talking about linear dependence of functions of x.

Consider a toy example (not a real model) to help spell out the mathematical structure involved: Let  so that  and . Then let  and  be functions such that  and .. Then the set of functions  is a linearly dependent set of functions because .

Thanks! Apparently the proof of the thing I was wondering about can be found in Lemma 3.4 in Liam's thesis. Also thanks for your other comments!

Comment by Leon Lang (leon-lang) on DSLT 1. The RLCT Measures the Effective Dimension of Neural Networks · 2023-06-27T01:32:41.017Z · LW · GW

Thanks for the answer Liam! I especially liked the further context on the connection between Bayesian posteriors and SGD. Below a few more comments on some of your answers:

The partition function is equal to the model evidence p(D_n), yep. It isn’t equal to p(Y | X) (I assume X is fixed here?) but is instead expressed in terms of the model likelihood and prior (and can simply be thought of as the “normalising constant” of the posterior),

$$Z_n = p(D_n) = \int_W \varphi(w) \, p(D_n \mid w) \, dw,$$

and then under this supervised learning setup where we know q(x), we have $p(D_n \mid w) = \prod_{i=1}^n q(x_i) \, p(y_i \mid x_i, w)$. Also note that this does “factor over the data points” (if I’m interpreting you correctly) since the data is independent and identically distributed.

I think I still disagree. I think everything in these formulas needs to be conditioned on the X-part of the dataset. In particular, I think the notation p(D_n) is slightly misleading, but maybe I'm missing something here.

I'll walk you through my reasoning: When I write X or Y, I mean the whole vectors, e.g., X = (x_1, …, x_n). Then I think the posterior computation works as follows:

$$p(w \mid X, Y) = \frac{p(Y \mid X, w) \, p(w \mid X)}{p(Y \mid X)}.$$

That is just Bayes rule, conditioned on X in every term. Then, p(w | X) = φ(w) because from X alone you don't get any new information about the conditional p(y | x, w) (A more formal way to see this is to write down the Bayesian network of the model and to see that w and X are d-separated). Also, conditioned on w, Y is independent over data points, and so we obtain

$$p(w \mid X, Y) = \frac{\varphi(w) \prod_{i=1}^n p(y_i \mid x_i, w)}{p(Y \mid X)}.$$

So, comparing with your equations, we must have Z_n = p(Y | X). Do you think this is correct?

Btw., I still don't think this "factors over the data points". I think that

$$p(Y \mid X) \neq \prod_{i=1}^n p(y_i \mid x_i).$$

The reason is that old data points should inform the parameter w, which should have an influence on future updates. I think the independence assumption only holds for the true distribution and the model conditioned on w.

If you expand that term out you find that 

because the second integral is the first central moment of a Gaussian. The derivative of the prior is irrelevant. 

Right, that makes sense, thank you! (I think you missed a factor of , but that doesn't change the conclusion.)

Thanks also for the corrected volume formula, it makes sense now :) 

Comment by Leon Lang (leon-lang) on DSLT 2. Why Neural Networks obey Occam's Razor · 2023-06-25T22:46:25.722Z · LW · GW

Thanks for this nice post! I found it slightly more vague than the first post, but I guess that is hard to avoid when trying to distill highly technical topics. I got a lot out of it.

Fundamentally, we care about the free energy F_n because it is a measure of posterior concentration, and as we showed with the BIC calculation in DSLT1, it tells us something about the information geometry of the posterior.

Can you say more about why it is a measure of posterior concentration (it gets a bit clearer further below, but I state my question nonetheless to express that this statement isn't locally clear to me here)? I may lack some background in Bayesian statistics here. In the first post, you wrote the posterior as

$$p(w \mid D_n) = \frac{\varphi(w) \prod_{i=1}^n p(y_i \mid x_i, w)}{Z_n},$$

and it seems like you want to say that if the free energy is low, then the posterior is more concentrated. If I look at this formula, then low free energy corresponds to high Z_n, meaning the prior and likelihood have to "work quite a bit" to ensure that this expression overall integrates to 1. Are you claiming that most of that work happens very localized in a small parameter region?

Additionally, I am not quite sure what you mean with "it tells us something about the information geometry of the posterior", or even what you mean by "information geometry" here. I guess one answer is that you showed in post 1 that the Fisher information matrix appears in the formula for the free energy, which contains geometric information about the loss landscape. But then in the proof, you regarded that as a constant that you ignored in the final BIC formula, so I'm not sure if that's what you are referring to here. More explicit references would be useful to me. 

Since there is a correspondence

we say the posterior prefers a region  when it has low free energy relative to other regions of 

Note to other readers (as this wasn't clear to me immediately): That correspondence holds because one can show that 

Here, $Z_n$ is the global partition function. 
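To make this concrete, the version of the correspondence I have in mind (my notation): writing $Z_n(W) = \int_W \varphi(w)\, e^{-n L_n(w)}\, dw$ for the local partition function of a region $W$ and $F_n(W) = -\log Z_n(W)$ for its local free energy,

$$P(w \in W \mid D_n) = \frac{Z_n(W)}{Z_n} = e^{-(F_n(W) - F_n)},$$

so a region carries exponentially more posterior mass the lower its local free energy is relative to the global one.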

The Bayes generalisation loss is then given by 

I believe the first expression should be an expectation over .
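For reference, the definition I have in mind (my notation, possibly differing from the post's): the Bayes generalisation loss is the expected negative log of the posterior predictive distribution on a fresh sample,

$$G_n = \mathbb{E}_{(x,y)\sim q}\left[-\log \int_W p(y \mid x, w)\, p(w \mid D_n)\, dw\right],$$

which makes explicit the outer expectation over the true distribution $q$ that I take to be missing in the quoted expression.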

It follows immediately that the generalisation loss of a region  is 

I didn't find a definition of the left expression. 

So, the region in  that minimises the free energy has the best accuracy-complexity tradeoff. This is the sense in which singular models obey Occam's Razor: if two regions are equally accurate, then they are preferred according to which is the simpler model. 

Purposefully naive question: can I just choose a region  that contains all singularities? Then it surely wins, but this doesn't help us because this region can be very large.

So I guess you also want to choose small regions. You hinted at that already by saying that  should be compact. But now I of course wonder whether sometimes all of  already lies within a compact set. 

There are two singularities in the set of true parameters, 

which we will label as  and  respectively.

Possible correction: one of those points isn't a singularity, but a regular loss-minimizing point (as you also clarify further below).

Let's consider a one parameter model  with KL divergence defined by 

on the region  with uniform prior 

The prior seems to do some work here: if it doesn't properly support the region with low RLCT, then the posterior cannot converge there. I guess a similar story might a priori hold for SGD, where how you initialize your neural network might matter for convergence.

How do you think about this? What are sensible choices of priors (or network initializations) from the SLT perspective?

Also, I find it curious that in the second example, the posterior will converge to the lowest loss, but SGD would not since it wouldn't "manage to get out of the right valley", I assume. This seems to suggest that the Bayesian view of SGD can at most be true in high dimensions, but not for very low-dimensional neural networks. Would you agree with that, or what is your perspective?

Comment by Leon Lang (leon-lang) on DSLT 1. The RLCT Measures the Effective Dimension of Neural Networks · 2023-06-25T01:06:55.232Z · LW · GW

Thank you for this wonderful article! I read it fairly carefully and have a number of questions and comments. 

where the partition function (or in Bayesian terms the evidence) is given by

Should I think of this as being equal to , and would you call this quantity ? I was a bit confused since it seems like we're not interested in the data likelihood, but only in the conditional data likelihood under the model.

And to be clear: this does not factorize over the data points, because every data point informs $w$ and thereby the next data point, correct?

The learning goal is to find small regions of parameter space with high posterior density, and therefore low free energy.

But the free energy does not depend on the parameter, so how should I interpret this claim? Are you already one step ahead and thinking about the singular case where the loss landscape decomposes into different "phases" with their own free energy?

there is almost sure convergence  as  to a constant  that doesn't depend on [5]

I think the first expression should either be an expectation over , or have the conditional entropy  within the parentheses. 

  • In the realisable case where , the KL divergence is just the euclidean distance between the model and the truth adjusted for the prior measure on inputs, 

I briefly tried showing this and somehow failed. I didn't quite manage to get rid of the integral over $y$. Is this simple? (You don't need to show me how it's done, but maybe mentioning the key idea would be useful.)
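In case the key step helps other readers: assuming the regression model with Gaussian noise of variance $\sigma^2$ (my assumption about the setup), for fixed $x$ the integral over $y$ is just the KL divergence between two Gaussians with the same variance,

$$\int \mathcal{N}(y \mid f(x, w_0), \sigma^2)\, \log \frac{\mathcal{N}(y \mid f(x, w_0), \sigma^2)}{\mathcal{N}(y \mid f(x, w), \sigma^2)}\, dy = \frac{1}{2\sigma^2}\, \lVert f(x, w_0) - f(x, w) \rVert^2,$$

so only the outer integral over $x$ against $q(x)$ remains, which would give the "adjusted for the prior measure on inputs" part.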

A regular statistical model class is one which is identifiable (so  implies that ), and has positive definite Fisher information matrix  for all 

The rest of the article seems to mainly focus on the case of a degenerate Fisher information matrix. In particular, you didn't show an example of a non-regular model where the Fisher information matrix is positive definite everywhere. 

Is it correct to assume that models which are merely non-regular because the map from parameters to distributions is non-injective aren't that interesting, and that you maybe don't even want to call them singular? I found this slightly ambiguous, also because under your definitions further down, it seems like "singular" (degenerate Fisher information matrix) is a stronger condition than "strictly singular" (degenerate Fisher information matrix OR non-injective map from parameters to distributions).

It can be easily shown that, under the regression model,  is degenerate if and only if the set of derivatives

is linearly dependent. 

What is  in this formula? Is it fixed? Or do we average the derivatives over the input distribution?

Since every true parameter is a degenerate singularity[9] of , it cannot be approximated by a quadratic form.

Hmm, I thought having a singular model just means that some singularities are degenerate.

One unrelated conceptual question: when I see people draw singularities in the loss landscape, for example in Jesse's post, they often "look singular": i.e., the set of minimal points in the loss landscape crosses itself. However, this doesn't seem to actually be the case: a perfectly smooth curve of loss-minimizing points will consist of singularities because in the direction of the curve, the derivative does not change. Is this correct?
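To make the question concrete, here is the kind of example I have in mind (my own toy example, not from the post): take $K(w_1, w_2) = w_1^2$ on $\mathbb{R}^2$. The set of minima $\{w_1 = 0\}$ is a perfectly smooth line with no self-crossing, yet the Hessian

$$\nabla^2 K(w) = \begin{pmatrix} 2 & 0 \\ 0 & 0 \end{pmatrix}$$

is degenerate at every point of that line, so each of these minima is a degenerate critical point even though nothing "looks singular" in the picture.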

We can Taylor expand the NLL as 

I think you forgot a  in the term of degree 1. 

In that case, the second term involving  vanishes since it is the first central moment of a normal distribution

Could you explain why that is? I may have missed some assumption on  or not paid attention to something. 

In this case, since  for all , we could simply throw out the free parameter  and define a regular model with  parameters that has identical geometry , and therefore defines the same input-output function, .

Hmm. Is the claim that if the loss of the function does not change along some curve in the parameter space, then the function itself remains invariant? Why is that?

Then the dimension  arises as the scaling exponent of , which can be extracted via the following ratio of volumes formula for some 

This scaling exponent, it turns out, is the correct way to think about dimensionality of singularities. 

Are you sure this is the correct formula? When I tried computing this by hand it resulted in , but maybe I made a mistake. 

General unrelated question: is the following a good intuition for the correspondence between the volume and the effective number of parameters around a singularity? The larger the number of effective parameters $d$ around the singularity, the more the loss $K(w)$ blows up around it in all directions, because we get variation in all directions, and so the smaller the region where $K(w)$ is below some threshold $\epsilon$. So $d$ contributes to this volume. This is in fact what it does in the formulas, by being an exponent for small $\epsilon$.
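As a concrete check on this intuition (my own worked example, not taken from the post): for the regular case $K(w) = \lVert w \rVert^2$ on a bounded region of $\mathbb{R}^d$, the sublevel set $\{K(w) < \epsilon\}$ is a ball of radius $\sqrt{\epsilon}$, so

$$V(\epsilon) = c_d\, \epsilon^{d/2}, \qquad \frac{V(a\epsilon)}{V(\epsilon)} = a^{d/2},$$

and the number of effective parameters $d$ shows up exactly as (twice) the exponent of $\epsilon$. For a singular example like $K(w_1, w_2) = w_1^2 w_2^2$ on $[-1,1]^2$, the same computation gives $V(\epsilon) \approx 4\sqrt{\epsilon}\,\big(1 + \tfrac{1}{2}\log(1/\epsilon)\big)$, i.e. an exponent of $\tfrac{1}{2}$ up to a log factor, corresponding to an effective dimension of $1 < 2$.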

So, in this case the global RLCT is , which we will see in DSLT2 means that the posterior is most concentrated around the singularity 

Do you currently expect that gradient descent will do something similar, where the parameters will move toward singularities with low RLCT? What's the state of the theory regarding this? (If this is answered in later posts, feel free to just refer to them)

Also, I wonder whether this could be studied experimentally even if the theory is not yet ready: one could probably measure the RLCT around minimal loss points by measuring volumes, and then just check whether gradient descent actually ends up in low-RLCT regions. Maybe this is what you do in later posts. If this is the case, I wonder whether I should be surprised or not: it seems like the lower the RLCT, the larger the number of (fractional) directions where the loss is minimal, and so the larger the basin. So for purely statistical reasons, one may end up in such a region instead of isolated loss-minimizing points of high RLCT.
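To sketch what such an experiment might look like (a rough sketch under my own assumptions, using a toy loss rather than a neural network; nothing here is taken from the post): one can estimate the volume-scaling exponent around a minimum by Monte Carlo and then, in principle, compare it between the minima SGD finds and other minima.

```python
# Rough sketch (my own, not from the post) of estimating the volume-scaling
# exponent lambda of a loss K around a minimum by Monte Carlo:
# V(eps) = vol{ w : K(w) < eps } scales like eps^lambda (up to log factors),
# so the slope of log V(eps) against log eps estimates lambda.

import numpy as np

def volume_fraction(K, eps, dim, n_samples=500_000, box=1.0, seed=0):
    """Monte Carlo estimate of vol{ w in [-box, box]^dim : K(w) < eps }."""
    rng = np.random.default_rng(seed)
    w = rng.uniform(-box, box, size=(n_samples, dim))
    inside = K(w) < eps
    return inside.mean() * (2 * box) ** dim

def estimate_lambda(K, dim, eps_grid):
    """Fit the slope of log V(eps) vs log eps over a grid of thresholds."""
    vols = np.array([volume_fraction(K, eps, dim) for eps in eps_grid])
    slope, _ = np.polyfit(np.log(eps_grid), np.log(vols), 1)
    return slope

eps_grid = np.logspace(-3, -1, 10)

# Regular quadratic loss in 2D: K(w) = w1^2 + w2^2, expected exponent d/2 = 1.
K_regular = lambda w: (w ** 2).sum(axis=1)
print("regular :", estimate_lambda(K_regular, dim=2, eps_grid=eps_grid))

# Singular loss in 2D: K(w) = w1^2 * w2^2, expected exponent 1/2 (with log corrections).
K_singular = lambda w: (w[:, 0] ** 2) * (w[:, 1] ** 2)
print("singular:", estimate_lambda(K_singular, dim=2, eps_grid=eps_grid))
```

Running the slope fit at parameters found by SGD versus at, say, randomly sampled minima would then be a crude version of the comparison described above.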