Scaling Logistic Regression Via Multi-GPU/TPU Training | by Annika Brundyn | Sep, 2020


If you’re already familiar with logistic regression, feel free to skip this section.

We use logistic regression to predict a discrete class label (such as cat vs. dog), which is also known as classification. This differs from regression, where the goal is to predict a continuous real-valued quantity (such as a stock price).

In the simplest case of logistic regression we have just 2 classes; this is called binary classification.

Binary classification

Our goal in logistic regression is to predict a binary target variable Y (i.e. 0 or 1) from a matrix of input values or features, X. For example, say we have a group of pets and we want to find out which is a cat or a dog (Y) based on some features like ear shape, weight, tail length, etc. (X). Let 0 denote cat, and 1 denote dog. We have n samples, or pets, and m features about each pet:

We want to predict the probability that a pet is a dog. To do so, we first take a weighted sum of the input variables; let w denote the weight matrix. The linear combination of the features X and the weights w is given by the vector z.
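As a minimal sketch of this step, the feature values and weights below are made up purely for illustration:

```python
import numpy as np

# Hypothetical pet data: n = 4 samples, m = 3 features
# (ear shape score, weight in kg, tail length in cm)
X = np.array([
    [0.2, 4.0, 25.0],   # cat-like
    [0.9, 30.0, 40.0],  # dog-like
    [0.1, 3.5, 22.0],
    [0.8, 25.0, 35.0],
])

# One weight per feature: w is an m x 1 matrix
w = np.array([[1.5], [0.1], [0.02]])

# Weighted sum of the inputs: z = Xw, an n x 1 vector of scores
z = X @ w
print(z.shape)  # (4, 1)
```

Each entry of z is one pet's score; the scores are still arbitrary real numbers at this point, which is why the next step squashes them into probabilities.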

Next, we apply the sigmoid function to each element of the z vector, which gives the vector y_hat.

The sigmoid function, also known as the logistic function, is an S-shaped function that “squashes” the values of z into the range [0, 1].

Since each value in y_hat is now between 0 and 1, we interpret it as the probability that the given sample belongs to the “1” class, as opposed to the “0” class. In this case, we’d interpret y_hat as the probability that a pet is a dog.
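A short sketch of the squashing step (the scores here are invented examples):

```python
import numpy as np

def sigmoid(z):
    """Squash real-valued scores into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Example scores z for four pets
z = np.array([-2.0, 0.0, 1.2, 4.0])
y_hat = sigmoid(z)  # each entry is P(y=1 | x), i.e. P(dog)
print(y_hat)
```

Note that a score of 0 maps to exactly 0.5, large positive scores approach 1, and large negative scores approach 0.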

Our goal is to find the best choice of the w parameter. We want to find a w such that the probability P(y=1|x) is large when x belongs to the “1” class and small when x belongs to the “0” class (in which case P(y=0|x) = 1 - P(y=1|x) is large).

Notice that each model is fully specified by the choice of w. We can use the binary cross entropy loss, or log loss, function to evaluate how well a particular model is performing. We want to know how “far” our model predictions are from the true values in the training set.

Note that only one of the two terms in the summation is non-zero for each training example (depending on whether the true label y is 0 or 1). In our example, when we have a dog (i.e. y = 1), minimizing the loss means we need to make y_hat = P(y=1|x) large. If we have a cat (i.e. y = 0), we want to make 1 - y_hat = P(y=0|x) large.
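A minimal sketch of the log loss; the predicted probabilities below are made up to show that predictions closer to the true labels score a lower loss:

```python
import numpy as np

def binary_cross_entropy(y, y_hat, eps=1e-12):
    """Average log loss over n training examples.
    Only one of the two terms is non-zero per example."""
    y_hat = np.clip(y_hat, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y    = np.array([1.0, 0.0, 1.0, 0.0])   # true labels (dog = 1, cat = 0)
good = np.array([0.9, 0.1, 0.8, 0.2])   # mostly correct predictions
bad  = np.array([0.2, 0.9, 0.3, 0.8])   # mostly wrong predictions

print(binary_cross_entropy(y, good) < binary_cross_entropy(y, bad))  # True
```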

Now we have a loss function that measures how well a given w fits our training data. We can learn to classify our training data by minimizing L(w) to find the best choice of w.

One way we can search for the best w is through an iterative optimization algorithm like gradient descent. In order to use the gradient descent algorithm, we need to be able to calculate the derivative of the loss function with respect to w for any value of w.

Note that since the sigmoid function is differentiable, the loss function is differentiable with respect to w. This allows us to use gradient descent, and also allows us to use automatic differentiation packages, like PyTorch, to train our logistic regression classifier!
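As a sketch of gradient descent on the log loss, here is a hand-rolled NumPy version (an autodiff package like PyTorch would compute the gradient for you; the synthetic data and learning rate below are illustrative choices). For sigmoid plus binary cross entropy, the gradient has the well-known closed form (1/n) Xᵀ(y_hat - y):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n, m = 200, 3
X = rng.normal(size=(n, m))
true_w = np.array([[1.0], [-2.0], [0.5]])
y = (sigmoid(X @ true_w) > 0.5).astype(float)  # synthetic labels

# Gradient descent on the binary cross entropy loss
w = np.zeros((m, 1))
lr = 0.5
for _ in range(500):
    y_hat = sigmoid(X @ w)
    grad = X.T @ (y_hat - y) / n   # dL/dw
    w -= lr * grad                 # descent step

accuracy = np.mean((sigmoid(X @ w) > 0.5) == y)
print(accuracy)
```

Since the synthetic labels are linearly separable, the learned w recovers the decision boundary and training accuracy approaches 1.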

Multi-class classification

We can generalize the above to the multi-class setting, where the label y can take on K different values, rather than only two. Note that we start indexing at 0.

The goal now is to estimate the probability of the class label taking on each of the K different possible values, i.e. P(y=k|x) for each k = 0, …, K-1. Therefore the prediction function will output a K-dimensional vector whose components sum to 1, giving the K estimated probabilities.

Specifically, the hypothesis class now takes the form:

This is also known as multinomial logistic regression or softmax regression.

A note on dimensions: above we’re looking at one example only, x is an m x 1 vector, y is an integer value between 0 and K-1, and we let w(k) denote an m x 1 vector that represents the feature weights for the k-th class.

Each element of the output vector takes the following form:

This is known as the softmax function. The softmax turns arbitrary real values into probabilities. The outputs of the softmax function are always in the range [0, 1], and the normalization in the denominator means that all the terms sum to 1. Hence, they form a probability distribution. The softmax can also be seen as a generalization of the sigmoid function, and in the binary case it in fact simplifies to the sigmoid function (try to prove this for yourself!).
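A minimal sketch of the softmax, including the binary-case reduction to the sigmoid (the scores are invented examples):

```python
import numpy as np

def softmax(z):
    """Turn arbitrary real scores into a probability distribution."""
    z = z - np.max(z)        # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])   # K = 3 class scores
p = softmax(scores)
print(p, p.sum())                    # components sum to 1

# In the binary case, softmax over [z, 0] reduces to sigmoid(z)
z = 1.7
print(np.isclose(softmax(np.array([z, 0.0]))[0],
                 1.0 / (1.0 + np.exp(-z))))  # True
```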

For convenience, we specify the matrix W to denote all the parameters of the model. We concatenate all the w(k) vectors into columns, so that the matrix W has dimension m x K.

As in the binary case, our goal in training is to learn the W values that minimize the cross entropy loss function (an extension of the binary formula).
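Putting the pieces together, here is a sketch of the multi-class prediction and loss; the features, the m x K parameter matrix W, and the labels are all made up for illustration:

```python
import numpy as np

def softmax(Z):
    """Row-wise softmax: each row becomes a probability distribution."""
    Z = Z - Z.max(axis=1, keepdims=True)
    e = np.exp(Z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(y, P, eps=1e-12):
    """Multi-class log loss: -mean log P(true class)."""
    n = len(y)
    return -np.mean(np.log(P[np.arange(n), y] + eps))

# Hypothetical setup: n = 4 samples, m = 2 features, K = 3 classes
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.5]])
W = np.array([[2.0, -1.0, 0.0],    # m x K parameter matrix
              [0.0,  2.0, -1.0]])

P = softmax(X @ W)                 # n x K predicted probabilities
y = np.array([0, 1, 1, 1])         # true class indices
print(cross_entropy(y, P))
```

Training then means adjusting W to push down this loss, exactly as gradient descent did for w in the binary case.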
