On different notations with Binary Logistic Regression and Cross Entropy Loss
Learning logistic regression can be confusing the first time around. One of my issues early on was working through the different notations you could run into depending on the labels you chose, e.g. (plus/minus one) vs. (zero/one). This won’t be a post explaining logistic regression (you can find tons of Medium articles on that), but rather one about the two different ways you can express your cross-entropy loss function.
Binary Logistic Regression with \(y \in \{0,1\}\), \(x \in \mathbb{R}^d\)
This is probably the notation you could find in most textbooks, and hence I won’t dwell too much on the derivations.
Sigmoid function
This doesn’t change between the notations, but for completeness, I’ve included the sigmoid function we use in logistic regression.
\[\sigma(z) = \frac{1}{1+e^{-z}}\]
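If you want to check things numerically as you go, here’s a minimal NumPy sketch of the sigmoid (my own throwaway version, not from any particular library):

```python
import numpy as np

def sigmoid(z):
    # element-wise 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))
```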
Cross Entropy Loss
Typically, you’d find the cross-entropy loss for binary logistic regression written as:
\[\begin{aligned} L(\boldsymbol{W}, \textbf{X}, \textbf{y}) = \frac{1}{n} \sum_i \left(- y_i \ln (\boldsymbol{\sigma}(\textbf{W}^T\textbf{X}_i)) - (1 - y_i) \ln (1 - \boldsymbol{\sigma}(\textbf{W}^T\textbf{X}_i)) \right) \end{aligned}\]
This is derived by taking the negative log-likelihood of the sample probability of \(x_i\), \(\mathbb{P}(y_i|\textbf{W}^T \textbf{X}_i) =\sigma(\textbf{W}^T \textbf{X}_i)^{y_i}(1-\sigma(\textbf{W}^T \textbf{X}_i))^{1-y_i}\).
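As a rough illustration, here’s how that loss could look in NumPy. The names `W`, `X`, and `y` are just my own and mirror the notation above: `X` is assumed to be an \(n \times d\) matrix, `W` a length-\(d\) weight vector, and `y` a length-\(n\) vector of 0/1 labels. The small `eps` is only there to avoid taking \(\ln(0)\).

```python
import numpy as np

def cross_entropy_01(W, X, y, eps=1e-12):
    # sigma(W^T x_i) for every sample at once
    p = 1.0 / (1.0 + np.exp(-X @ W))
    # average of -y_i ln(p_i) - (1 - y_i) ln(1 - p_i)
    return np.mean(-y * np.log(p + eps) - (1.0 - y) * np.log(1.0 - p + eps))
```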
Gradient of Cross Entropy Loss
You’ll probably need the gradient of the cross-entropy loss w.r.t. \(\textbf{W}\) to perform minimization via gradient descent. Here it is (note that we take the average gradient across the \(n\) samples):
\[\nabla_{\boldsymbol{W}} L(\boldsymbol{W}, \textbf{X}, \textbf{y}) = - \frac{1}{n} \sum_i \left( y_i - \boldsymbol{\sigma}(\textbf{W}^T\textbf{X}_i) \right) \textbf{X}_i\]
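A matching sketch for the gradient, under the same assumed shapes as before (the update step in the comment uses `lr` as a hypothetical learning rate):

```python
import numpy as np

def grad_cross_entropy_01(W, X, y):
    p = 1.0 / (1.0 + np.exp(-X @ W))   # sigma(W^T x_i)
    # -(1/n) * sum_i (y_i - p_i) x_i, vectorized as a matrix-vector product
    return -(X.T @ (y - p)) / len(y)

# a single gradient-descent step would then look like:
# W = W - lr * grad_cross_entropy_01(W, X, y)
```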
Binary Logistic Regression with \(y \in \{-1,1\}\), \(x \in \mathbb{R}^d\)
With the labels marked as \(y_i \in \{-1, 1\}\), the notation changes a bit (it actually simplifies things) for the cross-entropy loss and its gradient.
Cross Entropy Loss
Because -1/1 are now our labels for \(y_i\), we can state the sample probability for \(x_i\) more compactly as
\[\mathbb{P}(y_i|\textbf{W}^T \textbf{X}_i) = \sigma(y_i \textbf{W}^T\textbf{X}_i)\]
With this simplification, we can derive the cross-entropy loss as:
\[L(\boldsymbol{W}, \textbf{X}, \textbf{y}) = \frac{1}{n} \sum_i \ln{ (1+e^{-y_i \textbf{W}^T \textbf{X}_i})}\]
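Here’s a small NumPy sketch of this version of the loss, again assuming the same shapes but with `y` holding -1/+1 labels. I use `np.logaddexp(0, -m)`, which computes \(\ln(1+e^{-m})\) without overflowing for large negative margins:

```python
import numpy as np

def cross_entropy_pm1(W, X, y):
    margins = y * (X @ W)                 # y_i * W^T x_i for each sample
    # ln(1 + e^{-m_i}), averaged over the n samples
    return np.mean(np.logaddexp(0.0, -margins))
```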
Gradient of Cross Entropy Loss
The derived gradient for the cross-entropy loss is then:
\[\nabla_{\boldsymbol{W}} L(\boldsymbol{W}, \textbf{X}, \textbf{y}) = - \frac{1}{n} \sum_i \dfrac{y_i \textbf{X}_i}{(1+e^{y_i \textbf{W}^T \textbf{X}_i})}\]
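And a sketch of the corresponding gradient (same assumed shapes, -1/+1 labels):

```python
import numpy as np

def grad_cross_entropy_pm1(W, X, y):
    margins = y * (X @ W)                  # y_i * W^T x_i
    coeff = 1.0 / (1.0 + np.exp(margins))  # 1 / (1 + e^{y_i W^T x_i})
    # -(1/n) * sum_i y_i x_i / (1 + e^{y_i W^T x_i})
    return -(X.T @ (coeff * y)) / len(y)
```

A nice sanity check is that, if you map the labels consistently (0 to -1, 1 to +1), the two pairs of functions above should return the same loss and gradient on the same data.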
Further Reading
I didn’t explain how I derived the cross-entropy losses / gradients above. However, I’ve listed some resources that should let you follow the derivations. You should probably start with the derivations for the classic (0/1) labels:
With the -1/1 notation, you can follow the derivation in: