Learning logistic regression can be confusing the first time around. One of my early issues was working through the different notations you might see depending on the class labels you choose, e.g. (plus/minus one) vs. (zero/one). This won’t be a post explaining logistic regression (you can find tons of Medium articles on that), but rather one about the two different ways you can express your cross-entropy loss function.

Binary Logistic Regression with \(y \in \{0,1\}\), \(x \in \mathbb{R}^d\)

This is probably the notation you’ll find in most textbooks, so I won’t dwell too much on the derivations.

Sigmoid function

This doesn’t change between the notations, but for completeness, I’ve included the sigmoid function we use in logistic regression.

\[\sigma(z) = \frac{1}{1+e^{-z}}\]
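For reference, here’s a minimal NumPy sketch of the sigmoid (the clipping is just my own addition to avoid overflow in `exp` for large inputs):

```python
import numpy as np

def sigmoid(z):
    """Sigmoid: 1 / (1 + exp(-z))."""
    z = np.clip(z, -500, 500)  # guard against overflow in exp for extreme inputs
    return 1.0 / (1.0 + np.exp(-z))
```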

Cross Entropy Loss

Typically, you’d find the cross-entropy loss for binary logistic regression written as:

\[L(\boldsymbol{W}, \textbf{X}, \textbf{y}) = \frac{1}{n} \sum_i \left(- y_i \ln (\boldsymbol{\sigma}(\textbf{W}^T\textbf{X}_i)) - (1 - y_i) \ln (1 - \boldsymbol{\sigma}(\textbf{W}^T\textbf{X}_i)) \right)\]

derived using the per-sample probability of \(x_i\): \(\mathbb{P}(y_i|\textbf{W}^T \textbf{X}_i) =\sigma(\textbf{W}^T \textbf{X}_i)^{y_i}(1-\sigma(\textbf{W}^T \textbf{X}_i))^{1-y_i}\)
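To make the formula concrete, here’s a small NumPy sketch of this loss, reusing the `sigmoid` helper above. I’m assuming `X` is an \(n \times d\) matrix, `y` a vector of 0/1 labels, and `w` the weight vector; the `eps` clamp is just there to avoid taking \(\ln 0\):

```python
def cross_entropy_01(w, X, y, eps=1e-12):
    """Average cross-entropy loss for labels y in {0, 1}.

    w: (d,) weights, X: (n, d) data matrix, y: (n,) labels in {0, 1}.
    """
    p = sigmoid(X @ w)            # sigma(W^T X_i), i.e. predicted P(y_i = 1 | x_i)
    p = np.clip(p, eps, 1 - eps)  # keep the logs finite
    return np.mean(-y * np.log(p) - (1 - y) * np.log(1 - p))
```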

Gradient of Cross Entropy Loss

To minimize the loss via gradient descent, you’ll need the gradient of the cross-entropy loss w.r.t. \(\textbf{W}\). Here it is (note that we take the average gradient across the \(n\) samples):

\[\nabla_{\boldsymbol{W}} L(\boldsymbol{W}, \textbf{X}, \textbf{y}) = - \frac{1}{n} \sum_i \left( y_i - \boldsymbol{\sigma}(\textbf{W}^T\textbf{X}_i) \right) \textbf{X}_i\]
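And a matching sketch of the gradient, under the same assumptions (the 0.1 learning rate in the comment is arbitrary, just to show how a descent step would look):

```python
def grad_cross_entropy_01(w, X, y):
    """Gradient of the 0/1 cross-entropy loss w.r.t. w, averaged over samples."""
    p = sigmoid(X @ w)                # sigma(W^T X_i) for each sample
    return -(X.T @ (y - p)) / len(y)  # -(1/n) * sum_i (y_i - sigma(W^T X_i)) X_i

# One gradient-descent step would then look like:
# w = w - 0.1 * grad_cross_entropy_01(w, X, y)
```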

Binary Logistic Regression with \(y \in \{-1,1\}\), \(x \in \mathbb{R}^d\)

With the labels taken as \(y_i \in \{-1, 1\}\), the notation for the cross-entropy loss and its gradient changes a bit (it actually simplifies).

Cross Entropy Loss

Because -1/1 are now our labels for \(y_i\), we can state the sample probability for \(x_i\) more compactly as

\[\mathbb{P}(y_i|\textbf{W}^T \textbf{X}_i) = \sigma(y_i \textbf{W}^T\textbf{X}_i)\]

This works because \(\sigma(-z) = 1 - \sigma(z)\), so the \(y_i = -1\) case recovers the \(1 - \sigma(\textbf{W}^T\textbf{X}_i)\) factor from before. With this simplification, taking the negative log-likelihood and averaging gives the cross-entropy loss as:

\[L(\boldsymbol{W}, \textbf{X}, \textbf{y}) = \frac{1}{n} \sum_i \ln{ (1+e^{-y_i \textbf{W}^T \textbf{X}_i})}\]
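As a sketch, the same loss in NumPy, now assuming `y` holds -1/+1 labels; `np.logaddexp(0, -m)` computes \(\ln(1 + e^{-m})\) without overflowing:

```python
def cross_entropy_pm1(w, X, y):
    """Average cross-entropy loss for labels y in {-1, +1}."""
    margins = y * (X @ w)                        # y_i * W^T X_i for each sample
    return np.mean(np.logaddexp(0.0, -margins))  # ln(1 + exp(-margin)), overflow-safe
```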

Gradient of Cross Entropy Loss

The derived gradient for the cross-entropy loss is then:

\[\nabla_{\boldsymbol{W}} L(\boldsymbol{W}, \textbf{X}, \textbf{y}) = - \frac{1}{n} \sum_i \dfrac{y_i \textbf{X}_i}{(1+e^{y_i \textbf{W}^T \textbf{X}_i})}\]
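And a sketch of this gradient under the same assumptions. As a sanity check, converting the labels with `y01 = (y + 1) / 2` and calling `grad_cross_entropy_01` from the 0/1 section should give the same result:

```python
def grad_cross_entropy_pm1(w, X, y):
    """Gradient of the +/-1 cross-entropy loss w.r.t. w, averaged over samples."""
    margins = y * (X @ w)        # y_i * W^T X_i
    weights = sigmoid(-margins)  # 1 / (1 + exp(y_i * W^T X_i))
    return -(X.T @ (y * weights)) / len(y)
```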

Further Reading

I didn’t explain how I derived the cross-entropy losses or their gradients. However, I’ve listed some resources that should let you follow the derivations. You should probably start with the classic (0/1) label derivations:

With the -1/1 notation, you can follow the derivation in:

Slide 13 - Malik (Full Slides)