lukemarks's Shortform

post by lukemarks (marc/er) · 2024-07-02T06:56:27.115Z · LW · GW · 19 comments


Comments sorted by top scores.

comment by lukemarks (marc/er) · 2024-07-27T05:03:49.625Z · LW(p) · GW(p)

More people should consider dropping out of high school, particularly if they:

  • Don't find their classes interesting
  • Have self-motivation
  • Don't plan on going to university

In most places, the minimum legal school-leaving age is lower than the typical age of graduation, so you are not obligated to stay until you graduate. Many continue because it's normal, but some brief analysis could reveal that graduating is not worth the investment for you.

Some common objections I heard:

  • It's only a few more months, why not finish?

Why finish?

  • What if 'this whole thing' doesn't pan out?

The mistake in this objection is thinking there was a single reason I wanted to leave school. I was increasing my free time, not making a bet on a particular technology.

  • My parents would never consent to this.

In some cases this is true. You might be surprised, though, if you demonstrate long-term commitment and the ability to secure financial support.

Leaving high school is not the right decision for everyone, but many students won't even consider it. At least make the option available to yourself.

Replies from: lahwran, Viliam, jmh, alexander-gietelink-oldenziel
comment by the gears to ascension (lahwran) · 2024-07-27T07:26:24.065Z · LW(p) · GW(p)

What's the epistemic backing behind this claim, how much data, what kind? Did you do it, how's it gone? How many others do you know of dropping out and did it go well or poorly?

Replies from: marc/er, TsviBT, localdeity
comment by lukemarks (marc/er) · 2024-07-27T11:00:24.290Z · LW(p) · GW(p)

I dropped out one month ago. I don't know anyone else who has dropped out. My comment recommends students consider dropping out on the grounds that it seemed like the right decision for me, but it took me a while to realize this was even a choice.

So far my experience has been pleasant. I am ~twice as productive, and the total time available to me is ~2.5-3x what I had before. The excess time lets me get a healthy amount of sleep and play videogames without sacrificing my most productive hours. I would make the decision again, and earlier if I could.

comment by TsviBT · 2024-07-27T11:46:24.653Z · LW(p) · GW(p)

Anecdata: I got some benefits from school, but the costs were overwhelming. I probably should have dropped out after kindergarten; certainly before fourth grade. https://tsvibt.blogspot.com/2022/05/harms-and-possibilities-of-schooling.html

comment by localdeity · 2024-07-28T00:45:46.893Z · LW(p) · GW(p)

I dropped out after 10th grade.  I messed around at home, doing some math and programming, for ~6 years, then started working programming jobs at age 22 (nearly 10 years ago).  I'd say results were decent.

A friend of mine dropped out after 11th grade.  He has gone back and forth between messing around (to some extent with math and programming; in later years, with meditation) and working programming jobs, and I think is currently doing well with such a job.  Probably also decent.

(And neither of us went to college, although I think my friend may have audited some classes.)

comment by Viliam · 2024-07-27T12:03:08.037Z · LW(p) · GW(p)

It might be useful to have some test for "have self-motivation", to reduce the number of people who believe they have it, quit school, and then find out they actually don't.

Or maybe it's not just whether you feel motivated right now, but how long that feeling lasts, on average.

comment by jmh · 2024-07-27T11:26:53.231Z · LW(p) · GW(p)

I do think you're correct that it would be a good decision for some. I would also say that establishing this as a norm might induce some to take the easy way out, which would be a mistake for them.

It might be the case that counselors should be prepared to have a real conversation with HS students who come to that decision, without schools actively promoting it as a path forward. But I do know I was strongly encouraged to complete HS even when I was not really happy with it (and not doing well by many metrics) but was recognized as an intelligent kid. I often think I should have just dropped out, gotten my GED, worked (which I was already doing, often skipping school to do so), and then later pursued college (which I also did, a few years after I graduated HS). I do feel I probably lost some years playing the expected-path game.

comment by lukemarks (marc/er) · 2024-07-28T08:17:51.503Z · LW(p) · GW(p)

"Neural Redshift: Random Networks are not Random Functions" shows that even randomly initialized neural nets tend to compute simple functions (as measured by frequency content, polynomial order, and compressibility), and that this bias can be partially attributed to ReLUs. Previous speculation about simplicity biases focused mostly on SGD, but SGD is now clearly not the only contributor.
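
To make "simple functions" concrete, here is a minimal numpy sketch of the flavour of measurement involved (my own construction, not the paper's setup; the 1-D input, initialisation scales, and spectral-centroid proxy are all assumptions): sample randomly initialised MLPs and compare how "wiggly" their output functions are under ReLU vs. tanh.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 1024)

def random_mlp_output(activation, width=256, depth=4):
    """Function computed on the grid `x` by a freshly initialised MLP."""
    h = x[:, None]
    for _ in range(depth):
        W = rng.normal(0.0, np.sqrt(2.0 / h.shape[1]), size=(h.shape[1], width))
        b = rng.normal(0.0, 0.1, size=width)
        h = activation(h @ W + b)
    w_out = rng.normal(0.0, np.sqrt(1.0 / width), size=(width, 1))
    return (h @ w_out).ravel()

def mean_frequency(f):
    """Spectral centroid of f: higher means a wigglier, more complex function."""
    spectrum = np.abs(np.fft.rfft(f - f.mean()))
    freqs = np.arange(spectrum.size)
    return (freqs * spectrum).sum() / spectrum.sum()

relu = lambda z: np.maximum(z, 0.0)
for name, act in [("relu", relu), ("tanh", np.tanh)]:
    scores = [mean_frequency(random_mlp_output(act)) for _ in range(20)]
    print(name, float(np.mean(scores)))
```

The absolute numbers depend heavily on the initialisation and on the chosen proxy; the paper itself uses more careful complexity measures (frequency content, polynomial order, compressibility).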

The authors propose that good generalization occurs when an architecture's preferred complexity matches the target function's complexity. We should think about how compatible this is with our projections for how future neural nets might behave. For example: If this proposition were true and a significant decider of generalization ability, would this make mesaoptimization less likely? More likely?

As an aside: research on inductive biases could be very impactful. My impression is that far fewer resources are spent studying inductive biases than interpretability, but inductive bias research could be feasible on small compute budgets and could tell us a lot about what to expect as we scale neural nets.

Replies from: Lblack
comment by Lucius Bushnaq (Lblack) · 2024-07-28T09:39:38.821Z · LW(p) · GW(p)

Singular Learning Theory explains/predicts this. If you go to a random point in the loss landscape, you very likely land in a large region implementing the same behaviour, meaning the network has a small effective parameter count, simply because most of the loss landscape is taken up by the biggest, and thus simplest, behavioural regions.

You can see this happening if you watch proxies for the effective parameter count while models train. E.g., a modular addition transformer or an MNIST MLP starts out with very few effective parameters at initialisation, then gains more as the network trains. If the network goes through a grokking transition, you can watch the effective parameter count go down again.
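
As a sketch of what "watching a proxy" can look like (assumptions: a tiny toy regression task, and counting above-threshold loss-Hessian eigenvalues as a crude stand-in for effective parameter count, which is not the SLT learning coefficient):

```python
import torch

torch.manual_seed(0)
X = torch.rand(256, 1) * 2 - 1          # toy 1-D regression inputs in [-1, 1]
Y = torch.sin(3 * X)                     # toy target function

D_IN, D_H, D_OUT = 1, 8, 1
N_PARAMS = D_IN * D_H + D_H + D_H * D_OUT + D_OUT

def unpack(flat):
    i = 0
    W1 = flat[i:i + D_IN * D_H].reshape(D_H, D_IN); i += D_IN * D_H
    b1 = flat[i:i + D_H]; i += D_H
    W2 = flat[i:i + D_H * D_OUT].reshape(D_OUT, D_H); i += D_H * D_OUT
    b2 = flat[i:i + D_OUT]
    return W1, b1, W2, b2

def loss_fn(flat):
    W1, b1, W2, b2 = unpack(flat)
    h = torch.relu(X @ W1.T + b1)
    pred = h @ W2.T + b2
    return ((pred - Y) ** 2).mean()

def effective_param_proxy(flat, tol=1e-3):
    # Count loss-Hessian eigenvalues with magnitude above tol: a crude proxy
    # for "effective parameter count", not the SLT learning coefficient.
    H = torch.autograd.functional.hessian(loss_fn, flat)
    eigs = torch.linalg.eigvalsh(H)
    return int((eigs.abs() > tol).sum())

flat = torch.nn.Parameter(0.5 * torch.randn(N_PARAMS))
opt = torch.optim.Adam([flat], lr=1e-2)
for step in range(2001):
    opt.zero_grad()
    loss_fn(flat).backward()
    opt.step()
    if step % 500 == 0:
        print(step, float(loss_fn(flat)), effective_param_proxy(flat.detach()))
```

Whether the count actually rises and later falls will depend on the task and hyperparameters; the point is only to show the kind of measurement being described.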


For example: If this proposition were true and a significant decider of generalization ability, would this make mesaoptimization less likely? More likely?

≈ no change I'd say. We already knew neural network training had a bias towards algorithmic simplicity of some kind, because otherwise it wouldn't work. So we knew general algorithms, like mesa-optimisers, would be preferred over memorised solutions that don't generalise out of distribution. SLT just tells us how that works.

One takeaway might be that observations about how biological brains train are more applicable to AI training than one might have previously thought. Previously, you could've figured that since AIs use variants of gradient descent as their updating algorithm, while the brain uses we-don't-even-know-what, their inductive biases could be completely different.

Now, it's looking like the updating rule you use doesn't actually matter that much for determining the inductive bias. Anything in a wide class of local optimisation methods might give you pretty similar stuff. Some methods are a lot more efficient than others, but the real pixie fairy dust that makes any of this possible is in the architecture, not the updating rule. 

(Obviously, it still matters what loss signal you use. You can't just expect that an AI will converge to learn the same desires a human brain would, unless the AI's training signals are similar to those used by the human brain. And we don't know what most of the brain's training signals are.)

Replies from: marc/er
comment by lukemarks (marc/er) · 2024-07-28T13:43:27.493Z · LW(p) · GW(p)

I think the predictions SLT makes are different from the results in the neural redshift paper. For example, if you use tanh instead of ReLU the simplicity bias is weaker. How does SLT explain/predict this? Maybe you meant that SLT predicts that good generalization occurs when an architecture's preferred complexity matches the target function's complexity?

The explanation you give sounds like a different claim, however:

If you go to a random point in the loss landscape, you very likely land in a large region implementing the same behaviour, meaning the network has a small effective parameter count

This is true of all neural nets, but the neural redshift paper claims that specific architectural decisions beat picking random points in the loss landscape. Neural redshift could be true in worlds where the SLT prediction was either true or false.

We already knew neural network training had a bias towards algorithmic simplicity of some kind, because otherwise it wouldn't work. 

We knew this, but the neural redshift paper claims that the simplicity bias is unrelated to training.

So we knew general algorithms, like mesa-optimisers, would be preferred over memorised solutions that don't generalise out of distribution. 

The paper doesn't just show a simplicity bias, it shows a bias for functions of a particular complexity that is simpler than random. To me this speaks against the likelihood of mesaoptimization, because it seems unlikely a mesaoptimizer would be similar in complexity to the training set if that training set did not describe an optimizer.

Replies from: Lblack
comment by Lucius Bushnaq (Lblack) · 2024-07-28T16:22:07.221Z · LW(p) · GW(p)

For example, if you use tanh instead of ReLU the simplicity bias is weaker. How does SLT explain/predict this?
 

It doesn't. It just has neat language to talk about how the simplicity bias is reflected in the way the loss landscapes of ReLU vs. tanh look different. It doesn't let you predict, before checking, that the ReLU loss landscape will look better.

Maybe you meant that SLT predicts that good generalization occurs when an architecture's preferred complexity matches the target function's complexity?

That is closer to what I meant, but it isn't quite what SLT says. The architecture doesn't need to be biased toward the target function's complexity. It just needs to always prefer simpler fits to more complex ones. 

SLT says neural network training works because in a good nn architecture simple solutions take up exponentially more space in the loss landscape. So if you can fit the target function on the training data with a fit of complexity 1, that's the fit you'll get. If there is no function with complexity 1 that matches the data, you'll get a fit with complexity 2 instead. If there is no fit like that either, you'll get complexity 3. And so on. 
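
For reference, the standard asymptotic behind "simple solutions take up exponentially more space" is Watanabe's free energy expansion (stated loosely here from memory, so treat the exact form as a sketch):

```latex
F_n \;=\; -\log Z_n \;\approx\; n\,L_n(w_0) \;+\; \lambda \log n
```

where λ is the learning coefficient of the region containing w_0; regions with lower λ dominate the posterior as the number of samples n grows.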

This is true of all neural nets, but the neural redshift paper claims that specific architectural decisions beat picking random points in the loss landscape. Neural redshift could be true in worlds where the SLT prediction was either true or false.

Sorry, I don't understand what you mean here. The paper takes different architectures and compares what functions you get if you pick a point at random from their parameter spaces, right? 

If you mean this

But unlike common wisdom, NNs do not have an inherent “simplicity bias”. This property depends on components such as ReLUs, residual connections, and layer normalizations.

then that claim is of course true. Making up architectures with bad inductive biases is easy, and I don't think common wisdom thinks otherwise. 

We knew this, but the neural redshift paper claims that the simplicity bias is unrelated to training.

Sure, but for the question of whether mesa-optimisers will be selected for, why would it matter if the simplicity bias came from the updating rule instead of the architecture? 


The paper doesn't just show a simplicity bias, it shows a bias for functions of a particular complexity that is simpler than random. To me this speaks against the likelihood of mesaoptimization, because it seems unlikely a mesaoptimizer would be similar in complexity to the training set if that training set did not describe an optimizer.

What would a 'simplicity bias' be other than a bias towards things simpler than random in whatever space we are referring to? 'Simpler than random' is what people mean when they talk about simplicity biases.


To me this speaks against the likelihood of mesaoptimization, because it seems unlikely a mesaoptimizer would be similar in complexity to the training set if that training set did not describe an optimizer.

What do you mean by 'similar complexity to the training set'? The message length of the training set is very likely going to be much longer than the message length of many mesa-optimisers, but that seems like an argument for mesa-optimiser selection if anything.

Though I hasten to add that SLT doesn't actually say training prefers solutions with low K-complexity. A bias towards low learning coefficients seems to shake out into some sort of mix between a bias toward low K-complexity and a bias towards speed [LW · GW].

Replies from: marc/er
comment by lukemarks (marc/er) · 2024-07-29T01:18:30.130Z · LW(p) · GW(p)

That is closer to what I meant, but it isn't quite what SLT says. The architecture doesn't need to be biased toward the target function's complexity. It just needs to always prefer simpler fits to more complex ones. 

This is why the neural redshift paper says something different from SLT. It says neural nets that generalize well don't just have a simplicity bias; they have a bias for functions with complexity similar to the target function's. This calls mesaoptimization into question, because although mesaoptimization is favored by a simplicity bias, it is not necessarily favored by a bias toward matching the target function's complexity.

comment by lukemarks (marc/er) · 2024-07-02T06:56:27.242Z · LW(p) · GW(p)

Present cryptography becomes redundant when the past can be approximated. Simulating the universe at an earlier point and running it forward to extract information before it's encrypted is a basic, but difficult, way to do this. For some information, even a fuzzy approximation could cause damage if made public. How can you protect information when your adversary can simulate the past?

The information must never exist as plaintext in the past. A bad way to achieve this is to make the information future-contingent. Perhaps it could be acausally inserted into the past by future agents, but you probably would not be able to act on future-contingent information in useful ways. A better method is to run many homomorphically encrypted instances of a random function that might output programs that do computations yielding sensitive information (e.g., an uploaded human). You would then give a plaintext description of the random function, including a proof that it probably output a program doing computations likely adversaries would want. This incentivizes the adversary not to destroy the program output by the random function, because destroying it and replacing it with something that is certainly doing better computations may not be worth the cost.

This method satisfies the following desiderata:
1. The adversary does not know the output of the encrypted random function, or the outputs of the program the random function output
2. There is an incentive to not destroy the program output by the random function

One problem with this is that your adversary might be superintelligent, and prove incorrect the assumptions that made your encryption appear strong. To avoid this, you could base your cryptography on something other than computational hardness.

My first thought was to require computations that would make an adversary incur massive negative utility in order to verify the output of a random function. It's hard to predict what an adversary's preferences might be in advance, so the punishment for verifying the output of the random function would need to be generically bad, such as forcing the adversary to expend massive amounts of computation on useless problems. This idea is bad for obvious reasons, and will probably end up making the same, or equally bad, assumptions about the inseparability of the punishment and the verification.

Replies from: robo
comment by robo · 2024-07-02T08:37:07.254Z · LW(p) · GW(p)

I don't think I understand your hypothetical.  Is your hypothetical about a future AI which has:

  • Very accurate measurements of the state of the universe in the future
  • A large amount of compute, but not exponentially large
  • Very good algorithms for retrodicting* the past

I think it's exponentially hard to retrodict the past.  It's hard in a similar way to how breaking encryption is hard.  If an AI isn't powerful enough to break encryption, it also isn't powerful enough to retrodict the past accurately enough to break secrets.

If you really want to keep something secret from a future AI, I'd look at ways of ensuring the information needed to theoretically reconstruct your secret is carried away from the earth at the speed of light in infrared radiation.  Write the secret in a sealed room, atomize the room to plasma, then cool the plasma by exposing it to the night sky.

*Predicting is using your knowledge of the present to predict the state of the future.  Retrodicting is using your knowledge of the present to retrodict the state of the past.

Replies from: robo, marc/er
comment by robo · 2024-07-02T09:07:23.757Z · LW(p) · GW(p)

Oh, wait, is this "How does a simulation keep secrets from the (computationally bounded) matrix overlords?"

Replies from: marc/er
comment by lukemarks (marc/er) · 2024-07-02T09:08:39.761Z · LW(p) · GW(p)

This should be an equivalent problem, yes.

Replies from: robo
comment by robo · 2024-07-02T09:56:18.128Z · LW(p) · GW(p)

No, that's a very different problem.  The matrix overlords are Laplace's demon, with god-like omniscience about the present and past.  The matrix overlords know the position and momentum of every molecule in my cup of tea.  They can look up the microstate of any time in the past, for free.

The future AI is not Laplace's demon.  The AI is informationally bounded.  It knows the temperature of my tea, but not the position and momentum of every molecule.  Any uncertainties it has about the state of my tea will increase exponentially when trying to predict into the future or retrodict into the past.  Figuring out which water molecules in my tea came from the kettle and which came from the milk is very hard, harder than figuring out which key encrypted a cypher-text.
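
A toy illustration of that blow-up, with the chaotic logistic map standing in for tea thermodynamics (the map and its parameters are just an assumption for the example): two initial states that agree to ten decimal places become macroscopically different within a few dozen iterations.

```python
import numpy as np

def logistic_trajectory(x0, steps, r=3.9):
    """Iterate the chaotic logistic map x -> r * x * (1 - x)."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1 - xs[-1]))
    return np.array(xs)

a = logistic_trajectory(0.2, 60)
b = logistic_trajectory(0.2 + 1e-10, 60)   # an imperceptibly different "measurement"
for t in (0, 10, 20, 30, 40, 50):
    print(t, abs(a[t] - b[t]))
# The gap grows roughly exponentially with t; running the dynamics backwards from
# a coarse-grained measurement faces the same blow-up, which is why accurate
# retrodiction is so hard.
```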

comment by lukemarks (marc/er) · 2024-07-02T09:07:00.496Z · LW(p) · GW(p)

Yes, your description of my hypothetical is correct. I think it's plausible that approximating things that happened in the past is computationally easier than breaking some encryption, especially if the information about the past is valuable even if it's noisy. I strongly doubt my hypothetical will materialize, but I think it is an interesting problem regardless.

My concern with approaches like the one you suggest is that they're restricted to small parts of the universe, so with enough data it might be possible to fill in the gaps.