Posts
Comments
Hmm, so I'm very wary of defending tropical geometry when I know so little about it; if anyone more informed is reading please jump in! But until then, I'll have a go.
tropical geometry might be relevant ML, for the simple reason that the functions coming up in ML with ReLU activation are PL
I'm not sure I agree with this argument.
Hmm, even for a very small value of `might'? I'm not saying that someone who wants to contribute to ML needs to seriously consider learning some tropical geometry, just that if one already knows tropical geometry it's not a crazy idea to poke around a bit and see if there are applications.
The use of PL functions is by no means central to ML theory, and is an incidental aspect of early algorithms.
I agree this is an important point. I don’t actually have a good idea what activation functions people use in practise these days. Thinking about asymptotic linearity makes me think about the various papers appearing using polynomial activation functions. Do you have an opinion on this? For people in algebraic geometry it’s appealing as it generates lots of AG problems (maybe v hard), but I don’t have a good feeling as to whether it’s got anything much to do with `real life’ ML. I can link to some of the papers I’m thinking of if that’s helpful, or maybe you are already a bit familiar.
I don't see why one wouldn't just use ordinary currents here (currents on a PL manifold can be made sense of after smoothing, or in a distribution-valued sense, etc.).
I think you’re right; this paper just came to mind because I was reading it recently.
whether tropical geometry has ever been useful (either in proving something or at least in reconceptualizing something in an interesting way) in linear programming.
A little googling suggests there are some applications. This paper seems to give an application of tropical geometry to complexity of linear programming: https://inria.hal.science/hal-03505719/document and this list of conference abstracts seems to give other applications: https://him-application.uni-bonn.de/fileadmin/him/Workshops/TP3_21_WS1_Abstracts.pdf Whether they are 'convincing' I leave up to you.
1 Algebraic geometry in general (including tropical geometry) isn't good at dealing with deep compositions of functions, and especially approximate compositions.
Fair, though one might also see that as an interesting challenge. I don’t have a feeling as to whether this is for really fundamental reasons, or people haven’t tried so hard yet.
2 [….] I simply can't think of any behavior that is at all meaningful from an AG-like perspective where the questions of fan combinatorics and degrees of polynomials are replaced by questions of approximate equality.
There are plenty of cases where "high degree" is enough (Falting’s Theorem is the first thing that comes to mind, but there are lots). But I agree that "degree approximately 5" feels quite unnatural.
Hi Dmitry,
To me it seems not unreasonable to think that some ideas from tropical geometry might be relevant ML, for the simple reason that the functions coming up in ML with ReLU activation are PL, and people in tropical geometry have thought seriously about PL functions. Of course this does not guarantee that there is anything useful to be said!
One possible example that comes to mind in the context of your post here is the concept of polyhedral currents. As I understand it, here the notion of "density of polygons' is used as a kind of proxy for the derivative of a PL function? But I think the theory of polyhedral currents gives a much more general theory of differentiation of PL functions. Very naively, rather than just recording the locus where the function fails to be linear, one also records how much the derivative changes when crossing the walls. I learnt about this from a paper of Mihatsch: https://arxiv.org/pdf/2107.12067 but I'm certain there are older references.
I'm really a log person, I don't know the tropical world very well; sorry if what I write does not make sense!
Get that agreement in writing.
I'm not sure that would be particularly reassuring to me (writing as one of the contributors). First, how would one check that the agreement had been adhered to (maybe it's possible, I don't know)? Second, people in my experience often don't notice they are training on data (as mentioned in a post above by ozziegooen).
I think this is a key point. Even the best possible curriculum, if it has to work for all students at the same rate, is not going to work well. What I really want (both for my past-self as a student, and my present self as a teacher of university mathematics) is to be able to tailor the learning rate to individual students and individual topics (for student me, this would have meant 'go very fast for geometry and rather slowly for combinatorics'). And while we're at it, can we also customise the learning styles (some students like to read, some like to sit in class, some to work in groups, etc)?
This is technologically more feasible than it was a decade ago, but seems far from common.
Thanks Charlie.
Just to be double-sure, the second process was choosing the weight in a ball (so total L2 norm of weights was <= 1), rather than on a sphere (total norm == 1), right?
Yes, exactly (though for some constant , which may not be , but turn out not to matter).
Is initializing weights that way actually a thing people do?
Not sure (I would like to know). But what I had in mind was initialising a network with small weights, then doing a random walk ('undirected SGD'), and then looking at the resulting distribution. Of course this will be more complicated than the distributions I use above, but I think the shape may depend quite a bit on the details of the SGD. For example, I suspect that the result of something like adaptive gradient descent may tend towards more spherical distributions, but I haven't thought about this carefully.
If training large neural networks only moves the parameters a small distance (citation needed), do you still think there's something interesting to say about the effect of training in this lens of looking at the density of nonlinearities?
I hope so! I would want to understand what norm the movements are 'small' in (L2, L, ...).
LayerNorm looks interesting, I'll take a look.
Maths at my Dutch university also has homework for quite a few of the courses, which often counts for something like 10-20% of final grade. It can usually be submitted online, so you only need to be physically present for exams. However, there are a small number of courses that are exceptions to this, and actually require attendance to some extent (e.g. a course on how to give a scientific presentation, where a large part of the course consists of students giving and commenting on each other's presentations - not so easy to replace the learning experience with a single exam at the end).
But this differs between Dutch universities.
I suspect the arXiv might not be keen on an account that posts papers by a range of people (not including the account-owner as coauthor). This might lead to heavier moderation/whatever. But I could be very wrong!
Some advice for getting papers accepted on arxiv
As some other comments have pointed out, there is a certain amount of moderation on arXiv. This is a little opaque, so below is an attempt to summarise some things that are likely to make it easier to get your paper accepted. I'm sure the list is very incomplete!
In writing this I don't want to give the impression that posting things to arXiv is hard; I have currently 28 papers there, have never had a single problem or delay with moderation, and the submission process generally takes me <15 mins these days.
-
Endorsement. When you first attempt to submit a paper you may need to be endorsed. JanBrauner kindly offered below to help people with endorsements; I might also be able to do the same, but I've never posted in the CS part of arXiv, so not sure how effective this will be. However, even better to avoid need for moderation. To this end, use an academic email address if you have one; this is quite likely to already be enough. Also, see below on subject classes (endorsement requirements depend on which subject class(es) you want to post in).
-
Choosing subject classes. Each paper gets one or more subject classes, like CS.AI; see [https://arxiv.org/category_taxonomy] for a list. Some subject classes attract more junk than others, and the ones that attract more junk are more heavily moderated. In mathematics, it is math.GM (General Mathematics) that attracts most junk, hence is most heavily moderated. I guess most people here are looking at CS.AI, I don't know what this is like. But one easy thing is to minimise cross-listing (adding additional subject classes for your paper), as then you are moderated by all of them.
-
Write in (la)tex, submit the tex file. You don't have to do this, but it is standard and preferred by the arXiv, and I suspect makes it less likely your paper gets flagged for moderation. It is also an easy way to make sure your paper looks like a serious academic paper.
-
It is possible to submit papers on behalf of third parties. I've never done this, and I suspect such papers will be more heavily moderated.
-
If you have multiple authors, it doesn't really matter who submits. After the submission is posted you are sent a 'paper password' allowing coauthors to 'claim' the paper; it is then associated to their arXiv account, orcid etc (orcid is optional, but a really good idea, and free).
Finally, a request: please be nice to the moderators! They are generally unpaid volunteers doing a valuable service to the community (e.g. making sure I don't have to read nonsense proofs of the Riemann hypothesis every morning). Of course it doesn't feel good if your paper gets held up, but please try not to take it personally.
The arXiv really prefers that you upload in tex. For the author this makes it less likely that your paper will be flagged for moderation etc (I guess). So if it were possible to export to Rex I think that for the purposes of uploading to arXiv this would be substantially better. Of course, I don’t know how much more/less work it is…
Hi Charlie, If you can give a short (precise) description for an agent that does the task, then you have written a short programme that solves the task. I think then if you need more space to ‘explain what the agent would do’ then you are saying there also exists a less efficient/compact way to specify the solution. From this perspective I think the latter is then not so relevant. David
- I think that
provable guarantees on the safety of an FHE scheme that do not rely on open questions in complexity theory such as the difficulty of lattice problems.
is far out of reach at present (in particular to the extent that there does not exist a bounty which would affect people’s likeliness to work on it). It is hard to do much in crypto without assuming some kind of problem to be computationally difficult. And there are very few results proving that a given problem is computationally difficult in an absolute sense (rather than just ‘at least as hard as some other problem we believe to be hard’). C.f. P vs NP. Or perhaps I misunderstand your meaning; are you ok with assuming e.g. integer factorisation to be computationally hard?
Personally I also don’t think this is so important; if we could solve alignment modulo assuming e.g. integer factorisation (or some suitable lattice problem) is hard, then I think we should be very happy…
-
More generally, I’m a bit sceptical of the effectiveness a bounty here because the commercial application of FHE are already so great.
-
About 10 years ago when I last talked to people in the area about this I got a bit the impression that FHE schemes were generally expected to be somewhat less secure than non-homomorphic schemes, just because the extra structure gives an attacker so much more to work with. But I have no idea if people still believe this.
P.s. the main thing I have taken so far from the link you posted is that the important part is not exactly about the biases of SGD. Rather, it is about the structure of the DNN itself; the algorithm used to find a (local) optimum plays less of a role than the overall structure. But probably I’m reading too much into your precise phrasing.
Hi Thomas, I agree the proof of the bound is not so interesting. What I found more interesting were the examples and discussion suggesting that, in practise, the upper bound seems often to be somewhat tight.
Concerning differential advancement, I agree this can advance capabilities, but I suspect that advancing alignment is somewhat hopeless unless we can understand better what is going on inside DNNs. On that basis I think it does differentials advance alignment, but of course other people may disagree.
Thanks very much for the link!
If you get the daily arXiv email feeds for multiple areas it automatically removes duplicates (i.e. each paper appears exactly once, regardless of cross-listing). The email is not to everyone's taste of course, but this is a nice aspect of it.
I was about to write approximately this, so thank you! To add one point in this direction, I am sceptical about the value of reducing the expectation for researchers to explain what they are doing. My research is in two fields (arithmetic geometry and enumerative geometry). In the first we put a lot of burden on the writer to explain themselves, and in the latter poor and incomplete explanations are standard. This sometimes allows people in the latter field to move faster, but
- it leaves critical foundational gaps, which we can ignore for a while but which eventually causes lot of pain;
- sometimes really critical points are hidden in the details, and we just miss these if we don’t write the details down properly. Disclaimers:
- while I think a lot of people working in these fields would agree with me that this distinction exists, not so many will agree that it is generally a bad thing.
- I’m generally criticising lack of rigour rather than lack of explanation. I am or claiming these necessarily have to go together, but in my experience they very often do.
p.s.
For the more substantive results in section 4, I do believe the direction is always flat --> sharp.
I agree with this (with 'sharp' replaced by 'generalise', as I think you intend). It seems to me potentially interesting to ask whether this is necessarily the case.
Vacuous sure, but still true, and seems relevant to me. You initially wrote:
Regarding the 'sharp minima can generalize' paper, they show that there exist sharp minima with good generalization, not flat minima with poor generalization, so they don't rule out flatness as an explanation for the success of SGD.
But, allowing reparametrisation, this seems false? I don't understand the step in your argument where you 'rule out reparametrisation', nor do I really understand what this would mean.
Your comment relating description length to flatness seems nice. To talk about flatness at all (in the definitions I have seen) requires a choice of parametrisation. And I guess your coordinate description is also using a fixed parametrisation, so this seems reasonable. A change of parametrisation will then change both flatness and description length (i.e. 'required coordinate precision').
Thank you for the quick reply! I’m thinking about section 5.1 on reparametrising the model, where they write:
every minimum is observationally equivalent to an infinitely sharp minimum and to an infinitely flat min- imum when considering nonzero eigenvalues of the Hessian;
If we stick to section 4 (and so don’t allow reparametrisation) I agree there seems to be something more tricky going on. I initially assumed that I could e.g. modify the proof of Theorem 4 to make a sharp minimum flat by taking alpha to be big, but it doesn’t work like that (basically we’re looking at alpha + 1/alpha, which can easily be made big, but not very small). So maybe you are right that we can only make flat minimal sharp and not conversely. I’d like to understand this better!
I'm not sure I agree with interstice's reading of the 'sharp minima' paper. As I understand it, they show that a given function can be made into a sharp or flat minimum by finding a suitable point in the parameter space mapping to the function. So if one has a sharp minmum that does not generalise (which I think we will agree exists) then one can make the same function into a flat minimum, which will still not generalise as it is the same function! Sorry I'm 2 years late to the party...
if we gave research grants to smart and personable university graduates and gave them carte blanche to do with the money what they wished that would work just as well as the current system
This thought is not unique to you; see e.g. the French CNRS system. My impression is that it works kind of as you would expect; a lot of them go on to do solid work, some do great work, and a few stop working after a couple of years. Of course we can not really know how things would have turned out if the same people had been given more conventional positions,
The request for elaboration concerned how the experience described related to the LCS hierarchy described in the post, which was (and remains) very unclear to me.
Definitely the antagonistic bits - I enjoyed the casual style! Really just the line ‘ Sit down. Sit down. Shut up. Listen. You don’t know nothing yet’ I found quite off-putting - even though in hindsight you were correct!
Thanks! I thought it might be, but was unsure, and didn't want to make an awkward situation for the OP in case it was something very different...
I really liked the content, but I found some of the style (`Sit down!' etc) really off-putting, which I why I only actually read the post on my 3rd attempt. Obviously you're welcome to write in whatever style you want, and probably lots of other people really like it, I just thought it might be useful to mention that a non-empty set of people find it off-putting.
Can you elaborate on this a bit? I'm sorry to hear that you had a bad experience during fieldwork, though I'm afraid I'm not certain what you refer to by 'Active Personal Life'. Can you explain how the experience you relate connects to the LCS hierarchy?
I'm sceptical of your decision to treat tenured and non-tenured faculty alike. As tenured faculty, this has long seemed to me to be perhaps the most important distinction.
More generally, what you write here is not very consistent with my own experience of academia (which is in mathematics and in Europe, though I have friends and collaborators in other countries and fields, so I am not totally clueless about how things work there).
Some points I am not seeing in your post are:
-
For many academics, being able to do their own research and work with brilliant students is their primary motivation. Grants etc are mainly valuable in how they facilitate that. This makes for a confusing situation where 'losers' in the original LCS model do the minimum work necessary for their paycheck, whereas 'losers' in the academic system (as you seem to be defining them?) do the maximum work that is compatible with their health and personal situation. Not only is this conceptually confusing to me, it also means that all other things being equal, the more `losers' one is in academia the more impressive one's CV will tend to be. Which is I think the opposite of the situation in the conventional LCS hierarchy?
-
The fact that I 'perform peer review for nothing at all' apparently makes me clueless. But this is weird; it does not go on my CV, and I do it because I think it is important to the advancement of science. Surely this makes it a `loser' activity?
-
Acceptance of papers and awarding of grants is decided by people external to your university. This makes a huge difference, and I think you miss it by writing `So we might analyze this system at the department level, at the university level, or at the all-academia level, but it doesn't make much of a difference.'.
Perhaps the above makes it sound as if I view academia as an organisational utopia; this is far from the case! But I do not think this post does a good job of identifying problems. I think a post analysing moral mazes in academia would be interesting, but I'm not convinced that the LCS hierarchy is an appropriate model, and this attempt to apply it does not seem to me to make useful category distinctions.
So the set of worlds, , is the set of functions from to ...
I guess the should be a ? Also, you don't seem to define ; perhaps ?
I expect most people on LW to be okay being asked their Cheerful Price to have sex with someone.
I find this a surprising assertion. It does not apply to me, probably it does apply to you. Ordinarily I would ask if you had any other data points, but I don't want to take the conversation in this direction...
Sure, in the end we only really care about what comes top, as that's the thing we choose. My feeling is that information on (relative) strengths of preferences is often available, and when it is available it seems to make sense to use it (e.g. allowing circumvention of Arrow's theorem).
In particular, I worry that, when we only have ordinal preferences, the outcome of attempts to combine various preferences will depend heavily on how finely we divide up the world; by using information on strengths of preferences we can mitigate this.
(actually, my formula doubles the numbers you gave)
Are you sure? Suppose we take with , , then , so the values for should be as I gave them. And similarly for , giving values . Or else I have mis-understood your definition?
I'd simply see that as two separate partial preferences
Just to be clear, by "separate partial preference" you mean a separate preorder, on a set of objects which may or may not have some overlap with the objects we considered so far? Then somehow the work is just postponed to the point where we try to combine partial preferences?
EDIT (in reply to your edit): I guess e.g. keeping conditions 1,2,3 the same and instead minimising
where is proportion to the reciprocal of the strength of the preference? Of course there are lots of variants on this!
This seems really neat, but it seems quite sensitive to how one defines the worlds under consideration, and whether one counts slightly different worlds as actually distinct. Let me try to illustrate this with an example.
Suppose we have a consisting of 7 worlds, , with preferences
and no other non-trivial preferences. Then (from the `sensible case'), I think we get the following utilities:
.
Suppose now that I create two new copies , of the world which each differ by the position of a single atom, so as to give me (extremely weak!) preferences , so all the non-trivial preferences in the new are now summarised as
Then the resulting utilities are (I think):
.
In particular, before adding in these 'trivial copies' we had , and now we get . Is this a problem? It depends on the situation, but to me it suggests that, if using this approach, one needs to be careful in how the worlds are specified, and the 'fine-grainedness' needs to be roughly the same everywhere.
Thanks! I like the way your optimisation problem handles non-closed cycles.
I think I'm less comfortable with how it treats disconnected components - as I understand it you just translate each separately to have `centre of mass' at 0. If one wants to get a utility function out at the end one has to make some kind of choice in this situation, and the choice you make is probably the best one, so in that sense it seems very good.
But for example it seems vulnerable to creating 'virtual copies' of worlds in order to shift the centre of mass and push connected components one way or the other. That was what started me thinking about including strength of preference - if one adds to your setup a bunch of virtual copies of a world between which one is `almost indifferent' then it seems it will shift the centre of mass, and thus the utility relative to come other chain. Of course, if one is actually indifferent then the 'virtual copies' will be collapsed to a single point in your , but if they are just extremely close then it seems it will affect the utility relative to some other chain. I'll try to explain this more clearly in a comment to your post.
Thanks for the comment Charlie.
If I am indifferent to a gamble with a probability of ice cream, and a probability 0.8 of chocolate cake and 0.2 of going hungry
To check I understand correctly, you mean the agent is indifferent between the gambles (probability of ice cream) and (probability 0.8 of chocolate cake, probability 0.2 of going hungry)?
If I understand correctly, you're describing a variant of Von Neumann–Morgenstern where instead of giving preferences among all lotteries, you're specifying a certain collection of special type of pairs of lotteries between which the agent is indifferent, together with a sign to say in which `direction' things become preferred? It seems then likely to me that the data you give can be used to reconstruct preferences between all lotteries...
If one is given information in the form you propose but only for an incomplete' set of special triples (c.f.
weak preferences' above), then one can again ask whether and in how many ways it can be extended to a complete set of preferences. It feels to me as if there is an extra ambiguity coming in with your description, for example if the set of possible outcomes has 6 elements and I am given the value of the Betterness
function on two disjoint triples, then to generate a utility function I have to not only choose a `translation' between the two triples, but also a scaling. But maybe this is better/more realistic!
. By `special types', I mean indifference between pairs of gambles of the form
(probability of A) vs (probability of B and probability of C)
for some , and possible outcomes A, B, C. Then the sign says that I prefer higher probability of B (say).
Thanks for pointing me to this updated version :-). This seems a really neat trick for writing down a utility function that is compatible with the given preorder. I thought a bit more about when/to what extent such a utility function will be unique, in particular if you are given not only the data of a preorder, but also some information on the strengths of the preferences. This ended up a bit too long for a comment, so I wrote a few things in outline here:
https://www.lesswrong.com/posts/7ncFy84ReMFW7TDG6/categorial-preferences-and-utility-functions
It may be quite irrelevant to what you're aiming for here, but I thought it was maybe worth writing down just in case.
Never mind - I had fun thinking about this :-).
Hi Stuart,
I’m working my way through your `Research Agenda v0.9’ post, and am therefore going through various older posts to understand things. I wonder if I could ask some questions about the definition you propose here?
First, that be contained in for some seems not so relevant; can I just assume X, Y and Z are some manifolds ( for some )? And we are given some partial order on X, so that we can refer to `being a better world'?
Then, as I understand it, your definition says the following:
Fix X, and Z. Let Y be a manifold and , . Given a local homomorphism , we say that is partially preferred to if for all , we have .
I’m not sure which inequalities should be strict, but this seems non-essential for now. On the other hand, the dependence of this definition on the choice of Y seems somewhat subtle and interesting. I will try to illustrate this in what follows.
First, let us make a new definition. Fix X, , and Z as before. Let , a two-element set equipped with the discrete topology, and let be an immersion of -manifolds. We say that is weakly partially preferred to if for all , we have .
First, it is clear that partial preference implies weak partial preference. More formally:
Claim 1: Fix X, and Z. Suppose we have a manifold Y, points , , and a local homomorphism such that is partially preferred to . Setting with the subspace topology from (i.e. discrete), and taking to be the restriction of from to , we have that is weakly partially preferred to .
Proof: obvious. $\qed$
However, the converse can fail if Z is not contractible. First, let’s prove that the concepts are equivalent for Z contractible:
Claim 2: Fix X, and Z, and assume that Z is contractible. Suppose we have a two-element set and a map making weakly partially preferred to . Then there exist a manifold Y, an injection , and a local homeomorphism whose restriction to is , making partially preferred to .
Proof: Let’s assume for simplicity of notation that X is equidimensional, say of dimension , and write for the dimension of Z. Let Y be the disjoint union of two open balls of dimension , with the inclusion of the centres of the balls. Then take an -neighbourhood of Z in X; it is diffeomorphic to since the normal bundle to Z in X is trivialisable (c.f. https://math.stackexchange.com/questions/857784/product-neighborhood-theorem-with-boundary). $\qed$
If we want examples where weak partial preference and partial preference don’t coincide, we should look for an example where Z is not contractible, and its normal bundle in X is not contractible.
Example 3: Let X be the disjoint union of two moebius bands, and let Z be a circle. Note that including Z along the centre of either band gives a submanifold whose tubular neighbourhood is not a product. Assume that is such that one component of X is preferred to the other (and is indifferent within each connected component). Then take , and to be the inclusion of the two circles along the centres of the two moebius bands, such that ends up in the preferred band. This yields a situation where is weakly partially preferred to , but the conclusion of Claim 2 fails, i.e. this cannot be extended to a partial preference for over .
What conclusion should we draw from this? To me, it suggests that the notion of partial preference is not yet quite as one would want. In the setting of Example 3, where X consists of two moebius strips, one of which is preferred to the other, then landing in the preferred strip should be preferred to landing in the un-preferred strip?! And yet the `local homeomorphism from a product’ condition gets in the way. This example is obviously quite artificial, and maybe analogous things cannot occur in reality. But I’m not so happy with this as an answer, since our approaches to AI safety should be (so far as possible) robust against the flaws in our understanding of physics.
Apologies for the overly-long comment, and for the imperfect LaTeX (I've not used this type of form much before).
Thanks for the reply, Zack.
The reason this objection doesn't make the post completely useless...
Sorry, I hope I didn't suggest I thought that! You make a good point about some variables being more natural in given applications. I think it's good to keep in mind that sometimes it's just a matter of coordinate choice, and other times the points may be separated but not in a linear way.
Hi Zack,
Can you clarify something? In the picture you draw, there is a codimension-1 linear subspace separating the parameter space into two halves, with all red points to one side, and all blue points to the other. Projecting onto any 1-dimensional subspace orthogonal to this (there is a unique one through the origin) will thus yield a `variable' which cleanly separates the two points into the red and blue categories. So in the illustrated example, it looks just like a problem of bad coordinate choice.
On the other hand, one can easily have much more pathological situations; for examples, the red points could all lie inside a certain sphere, and the blue points outside it. Then no choice of linear coordinates will illustrate this, and one has to use more advanced analysis techniques to pick up on it (e.g. persistent homology).
So, to my vague question: do you have only the first situation in mind, or are you also considering the general case, but made the illustrated example extra-simple?
Perhaps this is clarified by your numerical example, I'm afraid I've not checked.