I see a lot of comments here assuming "linear model" means "can't model nonlinearities." Absolutely not the case. Splines can easily take care of that. The "linear" part of linear model just means "linear in the predictor space." You can add a non-linear predictor easily via spline basis (similar/sometimes identical to "kernels" in ML).
My series of lm/glm/gam/gamm revelations was:
1. All t-tests and ANOVA flavors are just linear models
2. Linear models are just a special case of generalized linear models (GLMs), which can deal with binary or count data too
3. All linear models and GLMs are just special cases of generalized linear mixed models, which can deal with repeated measures, grouped data, and other non-iid clustering
4. Linearity is usually a bad assumption, which can easily be overcome via splines
5. Estimating a spline smoothing penalty is the same thing as estimating the variance for a random effect in a mixed model, so #3 and #4 can be combined for free
And then you end up with a generalized additive mixed model (GAMM), which can model smooth nonlinear functions of many variables, model smooth interaction surfaces between variables (e.g. latitude/longitude), handle repeated measurements and grouping, and deal with many types of outcomes, including continuous, binary yes/no, count, ordinal categories, or survival time measurements.
All while yielding statistically valid confidence intervals, and typically only taking a few minutes of CPU time even on datasets with hundreds of thousands / millions of datapoints.
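Point 1 in the list above can be verified in a few lines. A minimal sketch in plain Python (no libraries; the data are made up for illustration): fitting y = b0 + b1*g by least squares, where g is a 0/1 group dummy, gives a slope b1 equal to the difference in group means, which is exactly the quantity a two-sample t-test examines.

```python
def fit_group_dummy(y0, y1):
    """Least-squares fit of y = b0 + b1*g, where g = 0 for group 0
    and g = 1 for group 1, via the 2x2 normal equations."""
    y = list(y0) + list(y1)
    g = [0] * len(y0) + [1] * len(y1)
    n = len(y)
    sg = sum(g)                       # sum of g   (= len(y1))
    sgg = sum(gi * gi for gi in g)    # sum of g^2 (= len(y1), since g is 0/1)
    sy = sum(y)
    sgy = sum(gi * yi for gi, yi in zip(g, y))
    # Solve [[n, sg], [sg, sgg]] @ [b0, b1] = [sy, sgy]
    det = n * sgg - sg * sg
    b0 = (sgg * sy - sg * sgy) / det
    b1 = (n * sgy - sg * sy) / det
    return b0, b1

control = [4.1, 5.0, 4.6, 5.3]
treated = [6.2, 5.8, 6.9, 6.4, 6.0]
b0, b1 = fit_group_dummy(control, treated)
# b0 is the control-group mean (4.75); b1 is the treated-minus-control
# mean difference (1.51), i.e. the effect a two-sample t-test estimates.
```

The t-statistic and p-value then come from the usual OLS standard error of b1, which is why the two procedures agree exactly.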
The linearization of a nonlinear system isn't guaranteed to preserve the behavior of its equilibria if the Jacobian has any eigenvalues with zero real part.
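A standard illustration of why (my example, not the parent's): consider the pair of one-dimensional systems

```latex
\dot{x} = x^3, \qquad \dot{x} = -x^3 .
```

Both linearize to \(\dot{x} = 0\) at the origin (a single Jacobian eigenvalue equal to 0), yet the origin is unstable in the first and asymptotically stable in the second, so the linearization alone cannot tell them apart.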
Taylor expansion only works for some functions: they must be continuous and differentiable up to order n.
In control theory you split functions into pieces, so discontinuities can also be handled.
In the '70s in CS grad school at USC, I wrote an adaptive least-squares cubic spline fit routine. It kept subdividing intervals and fitting a least-squares cubic in each interval until a criterion was met.
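The idea can be sketched in plain Python (my reconstruction, not the original routine, with a hypothetical stopping criterion of maximum absolute residual): fit a least-squares cubic on an interval, and recursively split the interval wherever the worst residual exceeds a tolerance.

```python
def solve(A, b):
    """Gaussian elimination with partial pivoting (small systems only)."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def cubic_fit(xs, ys):
    """Least-squares cubic via the normal equations (basis 1, x, x^2, x^3)."""
    A = [[sum(x ** (i + j) for x in xs) for j in range(4)] for i in range(4)]
    b = [sum((x ** i) * y for x, y in zip(xs, ys)) for i in range(4)]
    return solve(A, b)  # coefficients c0..c3 of c0 + c1*x + c2*x^2 + c3*x^3

def adaptive_fit(xs, ys, tol, pieces=None):
    """Recursively subdivide until the worst residual on each piece <= tol."""
    if pieces is None:
        pieces = []
    c = cubic_fit(xs, ys)
    resid = max(abs(sum(ci * x ** i for i, ci in enumerate(c)) - y)
                for x, y in zip(xs, ys))
    if resid <= tol or len(xs) < 8:        # stop: good fit or too few points
        pieces.append((xs[0], xs[-1], c))
    else:                                  # split at the midpoint and recurse
        m = len(xs) // 2                   # (the midpoint is shared by both halves)
        adaptive_fit(xs[:m + 1], ys[:m + 1], tol, pieces)
        adaptive_fit(xs[m:], ys[m:], tol, pieces)
    return pieces
```

On data that is exactly cubic this returns a single piece; on data with sharper local structure it keeps halving until each piece is well fit.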
Can you recommend some resources to go from step 3 to 4?
So far I have been successfully using GLMMs where appropriate, but then jumped to implementing completely arbitrary models by fitting them to the data (plus bootstrapping).
GLMs are a nonlinear transformation of the output followed by linear modelling. They are referred to as "linear models," but you might as well then also consider NNs linear models, or any model at all that ends with addition as the final step.
Or, for a more approachable treatment, Semiparametric Regression with R by Harezlak, Ruppert, and Wand. A middle ground between Wood's book (which is comprehensive but can dip into math that's way over my head at times) and H/R/W is Semiparametric Regression by Ruppert, Wand, and Carroll.
I have also heard great things about Frank Harrell's Regression Modeling Strategies which uses a slightly different approach (still spline-based though), but I haven't read it. His other writing is fantastic though.
> Frank Harrell's Regression Modeling Strategies which uses a slightly different approach (still spline-based though), but I haven't read it.
Very very good indeed, with the exception that he basically ignores compute time and efficiency. I learned a bunch, but applying his approach to the kinds of datasets I deal with (much larger and with pretty strict compute budgets) was very difficult.
If you know DNNs, you'll find the introduction of splines in Semiparametric Regression with R very intuitive: splines are introduced using what the authors call a "truncated line basis," but you already know it as the ReLU function, just with a bias. Indeed, even the penalization of splines will look extremely familiar: it's basically just L2 regularization to induce smoothness.
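A minimal sketch of that correspondence (my illustration, not from the book): each knot k contributes the feature max(0, x - k), i.e. ReLU(x - k), and a penalized spline fit is then just ridge regression that shrinks only the knot coefficients.

```python
def relu(z):
    return max(0.0, z)

def spline_row(x, knots):
    """Design row for a truncated-line (ReLU) spline basis:
    [1, x, relu(x - k1), relu(x - k2), ...]."""
    return [1.0, x] + [relu(x - k) for k in knots]

# For x = 2.5 and knots at 1, 2, 3:
row = spline_row(2.5, [1.0, 2.0, 3.0])
# row == [1.0, 2.5, 1.5, 0.5, 0.0]
```

The penalized fit minimizes ||y - Xb||^2 + lam * sum of the squared knot coefficients: L2 regularization applied only to the ReLU terms, which shrinks the kinks toward a straight line as lam grows.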
You might also enjoy reading the "Neural Additive Model" paper from Hinton's lab, which is basically GAMs using a separate DNN as a "spline basis" for each input variable.
No no, linear just means some expression of the form
y = b1*x1 + b2*x2 + ... + bp*xp
which is at the heart of the model (a more specific description would be "linear combination"). Whatever you call it, it's just some quantity y constructed via an additive process from components x1...xp, with each component multiplied by some constant (a coefficient of the linear combination).
This linear combination "core" of the model can be used directly (in which case it is geometrically a hyperplane), but you could also use non-linear transformation of the inputs or the outputs.
E.g. x1 could be a spline basis function, or log-transformed data, or anything.
Similarly, the output y can be passed through some non-linearity to force a particular kind of output (as in logistic regression). Hence the statement is not vacuous.
Well yeah, but what is the nontrivial statement here? A linear model is linear, and if you add non-linear steps to your model (such as pre- or post-processing) then it is no longer linear and can handle non-linear data. In the same spirit, you could say that GPT-4 is a linear model (token logits) with a very large preprocessing step (the transformer).
Maybe what OP meant is that particular non-linearities (splines) let you keep some of the goodies of the linear model in non-linear settings. Still, it's not clear to me in which settings exactly such models are good enough.
Statistics is more than hypothesis testing, but you'll get surprisingly far without straying too far from linear models - I remember a Stats prof saying 'most of classical Statistics is GLM [0]'
I disagree with the implication that linearity is an unnatural concept, it appears whenever the changes being studied are small relative to the key parameters that determine the system. Every system is linear for small perturbations. Even logic gates; in negative feedback they can form passable inverting amplifiers. In a place as big as the universe it is rather common for two things to be very different in scale and yet interacting.
I've never actually seen a physical example of a system without a continuous first derivative. For example phase transitions, commonly touted as an example of discontinuity, don't actually occur until the matter has gone a bit over the point and a transition nucleates somewhere. The probability of a phase transition is a continuous function of temperature, with continuous derivatives.
I'm skeptical that discontinuities can exist because, if they did, they'd serve as infinitely powerful microscopes. If there's a discontinuity in nature, it must exist at absolute zero. I don't have a similarly good argument for continuous first derivatives but I do think it's interesting that there are no examples AFAIK.
Any physical system that makes/breaks contact, such as walking robots. Sure, the foot is not perfectly rigid and technically is a stiff spring. But from a computational perspective, problems still bear all the hallmarks of a discontinuous system such as requiring a very short integration step.
Quantized space is absolutely discontinuous, and tunneling is a discontinuous process. In fact, if the universe is quantum, it is discontinuous in reality and only appears continuous. But these distinctions aren't super useful unless you're dealing with those sorts of effects. Continuity is the approximation, discontinuity is the reality; depending on what's useful, we use the mathematics that helps us.
Tunneling currents are continuous in every parameter, although I admit that when you're dealing with particles you have continuous probability distributions with continuously varying means, rather than continua of matter. (But that should count, because all macroscopic variables are expectation values.)
Specifically, at quantized space and time scales everything is discrete, even distribution functions. There's no sense in having a continuous spatial distribution below Planck lengths.
In chaotic systems, time becomes one of the parameters that has to be small for a linear model to make predictions, but they are still linear for brief times.
My favorite moment in university was in the first class of semester 2, where a prof said, "Let's look at a really small part of our thing, and assume we apply some force to it. This will make it stretch; let's assume the stretching is linear relative to the applied force." I raised my hand and asked, "Is this assumption supported empirically?" and he said, "No, we know it's not always true, but if we don't make it we can't calculate anything."
At the time I was mad at engineers for being non-scientific, but after a few years I understood the deep wisdom in that. Nonlinear materials exist, and materials we use have nonlinear ranges. We just don't build things from those, because the math is too unwieldy. (except in very very specific edge cases where we spend a lot of money building a very limited thing)
It's funny: I had the EXACT SAME PROBLEM!!! Like, how is it science when a prof says, "This equation is too complicated to solve, so let's erase these third and fourth derivatives and pretend it's the same - even if it's not - because that's all we can do"?
So I left the "physics" courses and went the "math" courses... only to learn 3 years later how to prove that this kind of approximation is mathematically sound indeed :-D
Practically all materials behave nonlinearly when stretched or compressed a visible amount. For certain structural applications, though, if that happens we've already failed. Linear models work really well for designing big concrete structures and certain metal structures. Sometimes we try to apply linear models to other things, but that's always kind of fishy.
I knew the linked-in-the-article https://lindeloev.github.io/tests-as-linear/ which is also great. A bit meta on the widespread use of linear models: "Transcending General Linear Reality" by Andrew Abbott, DOI:10.2307/202114
You can state and prove theorems with linear models. You can do inference and testing. This means papers and academics naturally love them. And therefore, they are everywhere.
Not the case with non-linear models. We need to throw computers at them.
I think that's an overstatement. You can consider the nonlinearity as a perturbation of the linear theory, study the topology of the solutions or their symmetries, or sometimes even find exact solutions, like exist for a handful of transistor circuits.
If the model is y = a + b * x + e, then we can precisely state probability distributions of a and b given x and y for reasonably general assumptions about error e.
If the model is y = f(x) + e, where for example f is a neural net, there is not much we can say about the "quality" of the parameters of f. That is, we can't attach confidence intervals. We would have to resort to expensive simulations, and for most practical applications this is not workable (given dataset sizes).
What you say is useful in a different context. Local linearization is a very powerful idea.
Maybe not if there are many more parameters in f than there are samples in the dataset. But if f has only a handful of parameters, a typical step is to invert the matrix of second derivatives of the goodness-of-fit with respect to the fit parameters (the Hessian) to determine the precision with which each parameter was determined by the fit.
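In the simplest case this curvature idea reduces to a textbook formula (a sketch of my own, not tied to any library): for regression through the origin, y = b*x + e, the loss curvature in b is 2*sum(x^2), and inverting it, scaled by the residual variance, gives var(b).

```python
import math

def fit_slope_with_se(xs, ys):
    """OLS through the origin, y = b*x + e.
    The curvature of the squared-error loss in b is 2*sum(x^2);
    its inverse times the residual variance gives var(b)."""
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    b = sxy / sxx
    n = len(xs)
    resid_var = sum((y - b * x) ** 2 for x, y in zip(xs, ys)) / (n - 1)
    se = math.sqrt(resid_var / sxx)   # standard error of the slope
    return b, se

# Made-up noisy data around a true slope of 2:
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
b, se = fit_slope_with_se(xs, ys)
```

With many parameters the same recipe needs the full Hessian matrix inverse, which is exactly where it becomes impractical for large neural nets.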
I find it irritating that the article mentions repeated measures, but does not try them, much less a mixed effects model. Yes, they are linear models with more parameters, but doing it in lm would be a special kind of madness
A cool thing about linear models is that they can be used to model non-linear correlations by using transformations. For example, an exponential function can be made linear if you just take the logarithm.
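For instance (a minimal sketch with made-up noise-free data): to fit y = a * exp(b*x), regress log(y) on x; the slope recovers b and exp(intercept) recovers a.

```python
import math

def fit_exponential(xs, ys):
    """Fit y = a * exp(b * x) by simple linear regression on log(y)."""
    ly = [math.log(y) for y in ys]
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ly) / n
    b = sum((x - mx) * (l - my) for x, l in zip(xs, ly)) / \
        sum((x - mx) ** 2 for x in xs)
    a = math.exp(my - b * mx)
    return a, b

xs = [0.0, 1.0, 2.0, 3.0]
ys = [2.0 * math.exp(0.5 * x) for x in xs]   # noise-free, for illustration
a, b = fit_exponential(xs, ys)
# recovers a = 2.0, b = 0.5
```

One caveat: taking logs also transforms the error structure, so the fit assumes multiplicative noise; a GLM with a log link models the same curve without that side effect.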
Doesn't have to be 'trivial'. It could be a trillion-parameter MoE LLM spitting out 10,000-large embeddings. But a lot of that will act linearly and can, in theory, be put into your linear model easily & profitably.
I thought nonlinearity was very important to be able to make a larger model better than a smaller one? Like so important that tom7 made a half-joke demo with it: https://yewtu.be/watch?v=Ae9EKCyI1xU
Same thing with any feed-forward network: they are all piecewise linear with respect to their inputs.
Layers reduce resource requirements and make some patterns easier or even practical to find, but any feed-forward ANN trained by supervised learning could be represented as a parametric linear regression.
Unsupervised learning, which tends to use clustering, is harder to visualize, but it is the same thing.
You still have ANNs with binary output, which can be viewed through the lens of deciders; they have to have unique successor and predecessor functions.
Really this is just set shattering, which relates to the finite VC dimension required for something to be PAC learnable.
But the title of this is confusing the map for the territory. It isn't that 'Everything is a linear model' but that linear models are the preferred, most practical form.
The efforts to leverage spiking neural networks, which are a more realistic model of cortical neurons and which have continuous output (or, more correctly, the computable reals), tend to run into problems like riddled basins.
I see what you mean. Though in my mind, and this is clearly subjective, piece-wise linear is at least less strict than a linear model. (With enough ReLUs, you could get arbitrarily close to a lookup table, which I think would be best described as a nonlinear model.)
Only when your dimensions are truly independent, and that is a stretch. Really what you are saying is that you are more likely to find a field for your problem, and fields don't exist in more than 2 dimensions.
Consider predator-prey with fear and refuge, which is indeterminate, and not due to a lack of precision but because of a topological feature where ≥3 open sets share the same boundary set.
General relativity, with three spatial dimensions and one temporal, is another. One lens for this is that rotations are hyperbolic due to the lack of independence from the time dimension.
Quantum mechanics would have been much more difficult if it didn't have two exit basins. Which is similar to ANNs and linear regressions being binary output.
(Some exceptions involve orthogonal dimensions, like EM.)
Linear models are a linear combination of possibly non-linear regressors. The linearity is strictly in the parameters, not in whatever you're adding up.
A neural network can be pedantically referred to as a linear model of the form y = a + b*neural_network, for example. Here, y is a linear model (even though neural_network isn't).
A linear relationship between any transformation of the outcome and any transformation of the predictor variables — so the function is linear but the relationship between predictors and outcome can take on almost any shape.
This is why I feel like I couldn't become a credible AI/ML consultant. I would just throw everything into a linear model, make some progress, and call it a day.
Wasn’t there an OpenAI paper where they showed that matrix multiplication with the numerical imprecision of floats was enough to get general function learning?
One almost certainly does not need anything else ... unless what you want is big piles of investor cash, a Scrooge McDuck swimming pool quantity of investor cash - then you need to call whatever maths you do AI.