When we want to predict something that is “qualitative” or “categorical”, this problem is known as classification problem. For example:
- Identify if an email is Spam or not.
- Identifying if a person has any particular disease or not.
- Given the data predicting if a person would default the payment or not.
Linear regression is not appropriate for classification problem because it assumes that the outcome variable is continuous and normally distributed. However, in the case of classification, the outcome variable is usually binary (i.e., it can only take on two values such as “positive” or “negative”) or multi-class (i.e., it can take on multiple values such as “red”, “green”, “blue”).
See below python code for example.
If you run above code, you’ll see following plot as output:
Below points describe the output of Linear Regression:
- Error terms tend to be large at the middle values of X (independent variable) and small at the extreme values, which is the violation of linear regression assumptions that errors should be normally distributed.
- Generates nonsensical predictions of greater than 1 and less than 0 at end values of X.
Logistic regression is used for classification problem. Logistic regression models the probability that Y (dependent variable) belongs to a particular category instead of modeling the Y directly.
We don’t use linear regression because it simply does not fulfill the same role. The least squares criterion for fitting a linear regression does not respect the role of the predictions as conditional probabilities, while logistic regression maximizes the likelihood of the training data with respect to the predicted conditional probabilities. Additionally, the predictions from linear regression can be any real number, which negates their use as probabilities.