Nonlinear limitations of ReLUs
post by magfrump · 2023-10-26T18:51:24.130Z
This is a question post.
A neural net of any size using rectified linear unit (ReLU) activation functions is unable to approximate the function sin(x) outside a compact interval.
I am reasonably confident that I can prove that any NN with ReLU activations computes a piecewise linear function. I believe the number of linear pieces that can be achieved is bounded above by 2^(L*D), where L is the number of nodes per layer and D is the number of layers.
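For concreteness, here is a minimal sketch of that limitation (my own illustration, using scikit-learn's MLPRegressor as in the experiments mentioned below; the hyperparameters are arbitrary): fit a ReLU network to sin(x) on a compact interval, then evaluate it far outside, where a piecewise linear function eventually reduces to a single affine piece and cannot track the oscillation.

```python
# Minimal sketch (not from the post): a ReLU MLP fit to sin(x) on [-2*pi, 2*pi]
# fails far outside the training interval, since it is piecewise linear.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
x_train = rng.uniform(-2 * np.pi, 2 * np.pi, size=(2000, 1))
y_train = np.sin(x_train).ravel()

net = MLPRegressor(hidden_layer_sizes=(64, 64, 64), activation="relu",
                   max_iter=5000, random_state=0)
net.fit(x_train, y_train)

x_in = np.linspace(-2 * np.pi, 2 * np.pi, 200).reshape(-1, 1)    # inside training range
x_out = np.linspace(20 * np.pi, 22 * np.pi, 200).reshape(-1, 1)  # far outside it
print("in-range MSE:     ", np.mean((net.predict(x_in) - np.sin(x_in).ravel()) ** 2))
print("out-of-range MSE: ", np.mean((net.predict(x_out) - np.sin(x_out).ravel()) ** 2))
```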
This leads me to two questions:
1. Is the inability to approximate periodic functions of a single variable important?
   a. If not, why not?
   b. If so, is there practical data augmentation that can be used to improve performance at reasonable compute cost?
      i. E.g., naively, augment the input vector {x_i} with {sin(x_i)} whenever x_i is a scalar (a sketch of this augmentation follows the list).
2. Since the number of parameters of a NN scales as L^2*D, while the exponent in the trivial bound on the number of linear pieces scales with L*D, is this why neural nets go deep rather than going "wide"? (For a fixed parameter budget L^2*D, shrinking L and growing D increases L*D.)
   a. Are there established scaling hypotheses for the growth of depth vs. layer size?
3. Are there better (probabilistic) analytic or empirical bounds on the number of linear sections achieved by NNs of given size?
4. Are there activation functions that would avoid this constraint? I imagine a similar analytic constraint replacing "piecewise linear" with "piecewise strictly increasing" for classic activations like sigmoid or arctan.
   a. Something something Fourier transform something something?
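Here is the sketch promised in question 1.b: append sin(x_i) (and, as an extra assumption of mine, cos(x_i)) to each scalar input feature before fitting. The function name and hyperparameters below are illustrative, not anything established.

```python
# Minimal sketch of the augmentation in 1.b.i: give the ReLU net sin/cos of each
# scalar feature as extra inputs, so extrapolating sin(x) only requires learning
# an (affine) selection of the sin feature.
import numpy as np
from sklearn.neural_network import MLPRegressor

def augment_with_sin(X):
    """Concatenate sin and cos of each scalar feature onto the input."""
    return np.hstack([X, np.sin(X), np.cos(X)])

rng = np.random.default_rng(0)
x_train = rng.uniform(-2 * np.pi, 2 * np.pi, size=(2000, 1))
y_train = np.sin(x_train).ravel()

net = MLPRegressor(hidden_layer_sizes=(32, 32), activation="relu",
                   max_iter=5000, random_state=0)
net.fit(augment_with_sin(x_train), y_train)

x_out = np.linspace(20 * np.pi, 22 * np.pi, 200).reshape(-1, 1)  # far outside training range
mse = np.mean((net.predict(augment_with_sin(x_out)) - np.sin(x_out).ravel()) ** 2)
print("out-of-range MSE with sin/cos features:", mse)
```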
Regarding (2a): empirically, when approximating sin(x) with small NNs in scikit-learn, I found that increasing the width of the network caused catastrophic failure of learning (starting at approximately L=10 with D=4, L=30 with D=8, and L=50 with D=50).
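A rough sketch of the kind of width sweep described above; the post does not specify the solver, learning rate, or iteration budget, so those settings here are assumptions.

```python
# Rough sketch of the width sweep: fixed depth D, increasing width L, fit sin(x)
# on a compact interval and report training error. Settings are assumptions.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
x = rng.uniform(-2 * np.pi, 2 * np.pi, size=(2000, 1))
y = np.sin(x).ravel()

D = 4                                  # number of hidden layers
for L in (5, 10, 20, 30, 50):          # nodes per layer
    net = MLPRegressor(hidden_layer_sizes=(L,) * D, activation="relu",
                       max_iter=2000, random_state=0)
    net.fit(x, y)
    mse = np.mean((net.predict(x) - y) ** 2)
    print(f"L={L:3d}, D={D}: training MSE = {mse:.4f}")
```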
Regarding (1), naively this seems relevant to questions of out-of-distribution performance and especially the problem of what it means for an input to be out-of-distribution in large input spaces.
Comments
comment by jacob_cannell · 2023-10-26T23:14:01.144Z
> Is the inability to approximate periodic functions of a single variable important?
Periodic functions are already used as an important encoding in SOTA ANNs, from transformer LLMs to NeRFs in graphics. From the instant-ngp paper:
> For neural networks, input encodings have proven useful in the attention components of recurrent architectures [Gehring et al. 2017] and, subsequently, transformers [Vaswani et al. 2017], where they help the neural network to identify the location it is currently processing. Vaswani et al. [2017] encode scalar positions x ∈ R as a multiresolution sequence of L ∈ N sine and cosine functions enc(x) = (sin(2^0 x), sin(2^1 x), ..., sin(2^(L−1) x), cos(2^0 x), cos(2^1 x), ..., cos(2^(L−1) x)). This has been adopted in computer graphics to encode the spatiodirectionally varying light field and volume density in the NeRF algorithm [Mildenhall et al. 2020].
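For reference, a small sketch of the quoted multiresolution encoding (my own illustration; the function name and the choice of L, which here counts frequency bands rather than the post's layer width, are assumptions):

```python
# Sketch of the sine/cosine positional encoding quoted above (Vaswani et al. 2017,
# as used in NeRF): map each scalar x to sines and cosines at L octave frequencies.
import numpy as np

def positional_encoding(x, L=8):
    """enc(x) = (sin(2^0 x), ..., sin(2^(L-1) x), cos(2^0 x), ..., cos(2^(L-1) x))."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)  # column of scalar positions
    freqs = 2.0 ** np.arange(L)                    # 2^0, 2^1, ..., 2^(L-1)
    return np.hstack([np.sin(x * freqs), np.cos(x * freqs)])

print(positional_encoding([0.0, 1.0, np.pi]).shape)  # (3, 16) for L=8
```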