Yes, I meant "linear least squares" which is "(X.T X)^{-1} X.T y". > Also, somet...

jing · on Oct 19, 2018

I find it hard to believe that SGD would be faster than the closed-form solutions for linear regression (gels, gelsd, etc.). The closed-form solutions also give a lot of other benefits in practical settings, which make them more likely to be used when possible. SGD and related optimizers give benefits with non-convex or non-analytical loss functions, or with non-linear layers / more than one layer.
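For concreteness, here is a minimal NumPy sketch of the two closed-form routes mentioned in this thread: solving the normal equations, and calling np.linalg.lstsq (which is documented as using the LAPACK gelsd driver). The data, seed, and helper names are made up for illustration only.

```python
import numpy as np

def lls_normal_equations(X, y):
    # Solve (X.T X) w = X.T y directly; avoids forming the inverse explicitly.
    return np.linalg.solve(X.T @ X, X.T @ y)

def lls_lstsq(X, y):
    # SVD-based least squares; more robust when X.T X is ill-conditioned.
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

# Hypothetical synthetic data, used only to exercise the two helpers.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=200)

w_ne = lls_normal_equations(X, y)
w_ls = lls_lstsq(X, y)
```

On well-conditioned data like this the two agree to floating-point tolerance; the lstsq route is usually the safer default when columns of X are nearly collinear.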

gnulinux · on Oct 19, 2018

Then why would anyone use TensorFlow with this loss function in practice? In my school's ML class, we used this technique too (in addition to the closed-form solution). Is there any practical reason to use an optimizer to solve a linear problem?

jing · on Oct 19, 2018

Note that it's not just the loss function. It's the loss function combined with a very specific problem formulation - namely a neural network with only linear activations (equivalent to a 0-layer network). Once you go to non-linear layers or a different loss it's no longer solved analytically.
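A quick way to see the "linear activations" point: stacking linear layers without a non-linearity collapses to a single matrix, so the model is still plain linear regression. The shapes and weights below are arbitrary, chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 4))    # 10 samples, 4 features
W1 = rng.normal(size=(4, 8))    # "hidden layer" with linear activation
W2 = rng.normal(size=(8, 1))    # output layer

# Two linear layers applied in sequence...
deep_out = (X @ W1) @ W2

# ...are the same map as one layer with the product of the weights.
W_eff = W1 @ W2
flat_out = X @ W_eff
```

Insert a ReLU between the two layers and this collapse no longer holds, which is where the analytical solution stops applying.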

I do see a lot of people writing tutorials like OP's. See for example:

https://towardsdatascience.com/linear-regression-using-gradi...

The existence of these articles should not be taken as an indication of best practice. They often have the goal of teaching SGD in a simplified setting, not teaching best practice for LLS. I suppose the only nice thing about using TF / SGD for such a simple problem is that you now have a starting point for solving more complex problems (ReLU activation, cross-entropy loss, more layers, etc.).
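To make the tutorial setting concrete, here is a rough sketch of minibatch SGD on the squared loss next to the closed-form answer. The learning rate, epoch count, batch size, and data are all assumptions picked so this toy case converges; it is not taken from any of the linked tutorials.

```python
import numpy as np

def sgd_lls(X, y, seed, lr=0.05, epochs=20, batch_size=16):
    # Plain minibatch SGD on mean squared error, starting from zero weights.
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        order = rng.permutation(n)  # the seed decides which rows share a batch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            grad = 2.0 * X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
            w -= lr * grad
    return w

# Hypothetical synthetic data.
data_rng = np.random.default_rng(0)
X = data_rng.normal(size=(256, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * data_rng.normal(size=256)

w_a = sgd_lls(X, y, seed=1)
w_b = sgd_lls(X, y, seed=2)
w_exact, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Both SGD runs land near w_exact, but they differ from each other because the batches were shuffled differently, while the lstsq call returns the same weights every time.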

A few other points as to why you would never use SGD for LLS:

1) it's always way slower than the closed form matrix solutions

2) if you're doing SGD instead of just GD, there's noise in which "rows" are in a given batch - as a result, repeated runs may not converge to exactly the same final weights. This never happens with the analytical solution which always gets exactly the same result.

3) if you're doing this as part of a data science pipeline, which is likely the case in the real world, you'll likely want to do some cross-validation. In the SGD case you have to recompute the entire solution for each fold, whereas in the LLS case you can compute the CV fits almost immediately once you've calculated the initial X.T X and X.T y terms. This makes the process of using LLS even faster than SGD.
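One possible reading of point 3, as a sketch: compute X.T X and X.T y once on the full data, then for each fold subtract the held-out rows' contribution and solve a small system. The function names and the contiguous fold split are assumptions for illustration; a refit-per-fold version is included only as a cross-check.

```python
import numpy as np

def cv_mse_from_gram(X, y, n_folds=5):
    folds = np.array_split(np.arange(X.shape[0]), n_folds)
    XtX = X.T @ X   # computed once for the whole dataset
    Xty = X.T @ y
    errors = []
    for idx in folds:
        Xf, yf = X[idx], y[idx]
        # Training-set terms = full terms minus the held-out fold's terms.
        w = np.linalg.solve(XtX - Xf.T @ Xf, Xty - Xf.T @ yf)
        errors.append(np.mean((Xf @ w - yf) ** 2))
    return float(np.mean(errors))

def cv_mse_refit(X, y, n_folds=5):
    # Reference version: refit from scratch on each training split.
    n = X.shape[0]
    folds = np.array_split(np.arange(n), n_folds)
    errors = []
    for idx in folds:
        mask = np.ones(n, dtype=bool)
        mask[idx] = False
        w, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
        errors.append(np.mean((X[idx] @ w - y[idx]) ** 2))
    return float(np.mean(errors))

# Hypothetical synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = X @ np.array([1.0, -1.0, 2.0, 0.5]) + 0.1 * rng.normal(size=500)

mse_gram = cv_mse_from_gram(X, y)
mse_refit = cv_mse_refit(X, y)
```

The per-fold solve only touches a d x d system (d = number of features), so the cost per fold is small compared with a full pass over the training rows.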