A misunderstanding of the impact of logistic regression on NNs
Preface #
I’m pissed.
I’m on the cusp of understanding neural networks, yet not quite there. The concept of supervised machine learning makes sense to me. You take a bunch of features and have the model output whatever garbage it can. You tune the weights so that it gets closer to predicting the right answer the next time. You train it again and again, and this is the process of “gradient descent”. You do it until you reach the point of diminishing returns, or until you risk overfitting on your data, so that it generalizes well across your validation and test sets.
However, how does logistic regression handle its loss, and why is it relevant to a neural network? I thought I understood it, but I guess I don’t. (I’m also still under the impression that a neural network is just a multi-layer perceptron, so that might be blinding me.)
The confusion #
First, I’m confused about how a classification model is used in a neural network to deterministically produce outputs that make sense, because I thought we were trying to get values out of it. Maybe the fact that the values we retrieve have to be mapped onto particular categories is where logistic regression/models come into play.
The softmax function used in a neural network, for instance, normalizes the raw outputs into probabilities that fall between 0 and 1 and sum to 1, one per class. In practice, we pick whichever category has the highest probability, and that is our categorical result. However, my confusion stems from the question: “Where is the logistic regression being used in this case then? Isn’t it just the softmax function?”
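To make that concrete, here’s a minimal sketch of what I mean by “softmax, then pick the highest-probability class”. The logits and class count are made up for illustration:

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    return exps / np.sum(exps)

# Hypothetical raw scores (logits) from a network's final layer, for 3 classes.
logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)             # roughly [0.66, 0.24, 0.10], sums to 1
predicted_class = np.argmax(probs)  # the category with the highest probability
```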
A study into logistic regression #
I’m referencing this while re-reading stuff about logistic regressions.
- A logistic regression model is generally used to classify your data.
- It predicts a 0 or a 1.
- The threshold on the logistic function's output determines how much risk you're willing to take on false positives versus false negatives. It’s generally set to 0.5.
- If the predicted probability is below the threshold, the output is labelled 0; otherwise it’s 1 (see the sketch after this list).
- The website I’ve referenced walks through a very good example of predicting a rainy day.
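A minimal sketch of those bullet points (the weights, bias, and feature values are made up): compute a probability with the logistic (sigmoid) function, then threshold it.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical learned weights/bias and a single feature vector.
w = np.array([0.8, -1.2])
b = 0.1
x = np.array([2.0, 0.5])

p = sigmoid(w @ x + b)        # probability of the positive class (e.g. "it rains")
label = 1 if p >= 0.5 else 0  # 0.5 is the usual threshold; move it to trade off FP vs FN
```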
To fit our function with the correct parameters, we need to compute the loss between the predicted and true labels and then differentiate it, so we get better at determining the category. However, we can’t use a (linear) mean squared error loss: combined with the curved sigmoid, it gives a non-convex loss surface. So we opt for the Log-Loss instead.
$$ \text{Log-Loss} = -\sum_{i=1}^{n} \big( y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \big) $$
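That formula maps directly to a few lines of code. A sketch with made-up labels and predicted probabilities (summed rather than averaged, to match the equation above):

```python
import numpy as np

def log_loss(y_true, p_pred, eps=1e-12):
    # Clip probabilities so log(0) never occurs.
    p = np.clip(p_pred, eps, 1 - eps)
    return -np.sum(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

# Toy labels and predicted probabilities.
y = np.array([1, 0, 1, 1])
p = np.array([0.9, 0.2, 0.7, 0.4])
print(log_loss(y, p))  # smaller is better; confident wrong predictions are punished hard
```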
Estimating coefficients #
Finding the coefficients that minimize the loss function for our problem is hard. The two main approaches proposed for logistic regression are gradient descent and maximum likelihood estimation. (In logistic regression, the MLE comes from maximizing a log-likelihood that has no closed-form solution, so we’re forced to fall back on gradient descent to minimize the loss instead.)
Gradient Descent #
This is essentially a method where you compute the log loss cost function over all the samples and then minimize it by fractionally nudging the parameters in the correct direction, which is the direction opposite to the gradient of the error.
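In code, that loop looks roughly like this. A minimal sketch on a made-up dataset, using the standard gradient of the log loss for logistic regression:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy dataset: two features per sample, binary labels.
X = np.array([[0.5, 1.0], [1.5, -0.5], [3.0, 1.2], [-1.0, -2.0]])
y = np.array([0, 1, 1, 0])

w = np.zeros(X.shape[1])
b = 0.0
lr = 0.1  # learning rate: how big each "nudge" is

for _ in range(1000):
    p = sigmoid(X @ w + b)             # current predicted probabilities
    # Gradient of the log loss w.r.t. w and b (averaged over samples).
    grad_w = X.T @ (p - y) / len(y)
    grad_b = np.mean(p - y)
    # Nudge the parameters opposite to the gradient.
    w -= lr * grad_w
    b -= lr * grad_b
```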
There’s this nice slider on the website that showcases how the log loss and gradient descent are used to fit the model so it classifies better.
Going over more iterations reduces the log loss, as it should.
Fin #
And that’s about it. I think I was misunderstanding this stuff. The perceptron is an advancement of logistic regression which allows for multi-class classification using the softmax function.
“The Multi-Layer Perceptron is an advancement that looked just like a neural network.” To the past me, that completely threw me off the rails, because I was under the assumption that a neural network was based off this, rather than understanding that the multi-layer perceptron is a special case of neural networks.
Considering that I had it the other way around, it explains why I was so confused when I did not see a perceptron or logistic regression in some of Andrej’s earlier series. I also had a harder time figuring out why neural networks were working when I couldn’t see a logistic regression in them. (When your layers are a collection of only linear models, the whole thing can be reduced down to a single linear function. It’s still a neural network, but it has none of the advantages that let it model complex features.)
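That last point is easy to check numerically: stacking two linear layers without a nonlinearity between them is the same as one linear layer. The weights here are random and just for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))  # first "layer"
W2 = rng.normal(size=(2, 4))  # second "layer"
x = rng.normal(size=3)

two_layers = W2 @ (W1 @ x)  # two stacked linear layers, no activation in between
one_layer = (W2 @ W1) @ x   # a single equivalent linear layer
print(np.allclose(two_layers, one_layer))  # True
```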