
Yes, I meant "linear least squares" which is "(X.T X)^{-1} X.T y".

> Also, sometimes it just a lot faster to use SGD or other algorithms.

Right, that's what I thought. Is this because optimization is essentially bounded by the number of epochs, whereas linear least squares is bounded by matrix operations (and so scales with N)? That would mean that if you can solve the problem in a small number of epochs (say, 200) and N is very large, then SGD will be faster. Is this correct?

EDIT: Obviously, I don't think all regression problems can be solved this way but in this blog post their loss function can be solved by linear least squares. If you solve the optimization problem analytically, you'll get "(X.T X)^{-1} X.T y".
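For concreteness, here is a minimal NumPy sketch of that closed form. The synthetic data, `true_w`, and the sizes are made up for illustration; nothing here comes from the blog post. It solves the normal equations with `np.linalg.solve` instead of forming the inverse explicitly, and compares against `np.linalg.lstsq`.

```python
import numpy as np

# Synthetic data (illustrative assumption, not from the blog post).
rng = np.random.default_rng(0)
n, d = 1000, 3
X = rng.normal(size=(n, d))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=n)

# Normal equations: solve (X.T X) w = X.T y rather than inverting X.T X.
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Library least-squares routine; generally the safer choice when X is
# ill-conditioned, because it avoids squaring the condition number.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(w_normal)
print(w_lstsq)
```

On well-conditioned data like this the two agree to floating-point precision; they can diverge when columns of X are nearly collinear.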



I find it hard to believe that SGD would be faster than the closed-form solutions for linear regression (gels, gelsd etc.). The closed-form solutions also give a lot of other benefits in practical settings, which makes them more likely to be used where possible. SGD and related optimizers pay off with non-convex or non-analytical loss functions, or with non-linear layers / more than one layer.


Then why would anyone use TensorFlow with this loss function in practice? In my school's ML class, we used this technique too (in addition to the closed-form solution). Is there any practical reason to use an optimizer to solve a linear problem?


Note that it's not just the loss function. It's the loss function combined with a very specific problem formulation - namely a neural network with only linear activations (equivalent to a 0-layer network). Once you go to non-linear layers or a different loss it's no longer solved analytically.
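To illustrate the "linear activations" point, here is a small sketch with made-up weights (`W1`, `b1`, `W2`, `b2` are hypothetical, randomly generated) showing that two stacked linear layers collapse into a single linear map:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical weights for a 3 -> 4 -> 1 network with no non-linearity.
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(1, 4)), rng.normal(size=1)
x = rng.normal(size=(10, 3))

# Forward pass through both layers, identity activation in between.
two_layer = (x @ W1.T + b1) @ W2.T + b2

# The same function written as one linear layer.
W = W2 @ W1
b = W2 @ b1 + b2
one_layer = x @ W.T + b

print(np.allclose(two_layer, one_layer))
```

So the model class is still plain linear regression, which is why the closed form applies.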

I do see a lot of people writing tutorials like OP's. See for example:

https://towardsdatascience.com/linear-regression-using-gradi...

The existence of these articles should not be taken as an indication of best practice. They often have the goal of teaching SGD in a simplified setting, not teaching best practice for LLS. I suppose the only nice thing about using TF / SGD for such a simple problem is that you now have a starting point for solving more complex problems (ReLU activation, cross-entropy loss, more layers, etc.).

A few other points as to why you would never use SGD for LLS:

1) it's always way slower than the closed form matrix solutions

2) if you're doing SGD instead of just GD, there's noise in which "rows" are in a given batch - as a result, repeated runs may not converge to exactly the same final weights. This never happens with the analytical solution which always gets exactly the same result.

3) if you're doing this as part of a data science pipeline, which is likely the case in the real world, you'll likely want to do some cross-validation. In the SGD case you have to recompute the entire solution for each fold, whereas in the LLS case you can immediately compute the CV fits once you've calculated the initial X.T X and X.T y. This makes the process of using LLS even faster than SGD.
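A rough sketch of points 2 and 3, under my own assumptions: synthetic data, a hand-rolled mini-batch SGD loop (`sgd_fit` is a hypothetical helper, not anything from TF), and one possible reading of the cross-validation shortcut in which each fold's training statistics are obtained by subtracting the held-out rows' contribution from the full X.T X and X.T y.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 600, 3, 5
X = rng.normal(size=(n, d))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=n)

def sgd_fit(X, y, seed, lr=0.05, epochs=20, batch=32):
    """Mini-batch SGD on mean squared error; the seed only controls batch order."""
    order_rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        order = order_rng.permutation(len(y))
        for start in range(0, len(y), batch):
            idx = order[start:start + batch]
            grad = 2.0 * X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
            w -= lr * grad
    return w

# Point 2: same data, different shuffles, slightly different final weights.
w_seed1 = sgd_fit(X, y, seed=1)
w_seed2 = sgd_fit(X, y, seed=2)
print("SGD run-to-run gap:", np.abs(w_seed1 - w_seed2).max())

# Point 3: compute the full-data statistics once...
XtX, Xty = X.T @ X, X.T @ y
w_full = np.linalg.solve(XtX, Xty)

# ...then each fold's training fit subtracts the held-out rows' contribution.
folds = np.array_split(rng.permutation(n), k)
cv_mse = []
for held_out in folds:
    Xf, yf = X[held_out], y[held_out]
    w_fold = np.linalg.solve(XtX - Xf.T @ Xf, Xty - Xf.T @ yf)
    cv_mse.append(float(np.mean((Xf @ w_fold - yf) ** 2)))
print("CV MSE per fold:", cv_mse)
```

Each fold then costs a d x d solve plus work proportional to the held-out rows, rather than a full pass over the training data. The subtraction trick assumes X.T X stays well-conditioned after removing a fold.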



