1. Introduction to Linear Classification
In the field of machine learning, supervised learning represents a class of applications where algorithms learn from training data composed of input vectors and their corresponding target vectors. The objective of these algorithms varies based on the nature of the desired output. For regression problems, the goal is to predict one or more continuous variables. In contrast, for classification problems, the objective is to assign an input to one of a finite number of discrete categories. This whitepaper focuses on the fundamental principles of classification.
The fundamental goal of a linear classifier is to take an input vector, denoted as x, and assign it to one of K discrete classes. It achieves this by learning a function that partitions the feature space with a linear boundary. For example, consider a dataset where the features are a fish's weight ($x_1$) and length ($x_2$), and the target class ($y$) is its species. A linear classifier would learn a boundary in the weight-length feature space to distinguish between different species. Given new measurements for a fish, the model can then predict its species by determining which side of the boundary it falls on.
This document will delve into Logistic Regression, a foundational and powerful algorithm for solving binary classification problems. Its principles not only provide an effective solution for two-class scenarios but also serve as the basis for more advanced classification models.
2. Deconstructing Logistic Regression for Binary Classification
A strategic understanding of the core mechanics of Logistic Regression is essential for any machine learning practitioner. Its principles of probabilistic modeling and linear decision boundaries form the basis for more complex classification models, including neural networks. By deconstructing this algorithm, we gain deep insights into the foundations of modern classification.
Fundamental Differences: Linear vs. Logistic Regression
While both models are linear in nature, they are designed for fundamentally different tasks, which is reflected in their mathematical formulation and output. Linear Regression predicts a continuous value, while Logistic Regression models the probability of a discrete outcome.
| Model | Function $f_w(x)$ | Output Range |
|---|---|---|
| Linear Regression | $f_w(x) = w^T * x$ | $(-∞, +∞)$ |
| Logistic Regression | $f_w(x) = σ(w^T * x) = 1 / (1 + e^(-w^T * x))$ | $(0, 1)$ |
The Role of the Sigmoid (Logistic) Function
The core component that distinguishes Logistic Regression is the Sigmoid (or logistic) function, denoted $σ(z)$. This function maps any real-valued number z to a value between 0 and 1, making it perfectly suited for modeling probability. The behavior of the function is controlled by its input, $w^T*x$. The input vector x determines the position on the sigmoid curve, while the magnitude of the weight vector w controls the steepness of the function's slope. Larger weights lead to a steeper, more decisive transition, while smaller weights result in a gentler slope.
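A minimal sketch of the sigmoid and the effect of weight magnitude (using NumPy; the function and variable names are illustrative):

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued input to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# The same inputs pushed through small vs. large weights:
# larger weights produce a steeper, more decisive transition.
x = np.linspace(-3.0, 3.0, 7)
gentle = sigmoid(0.5 * x)   # small weight: gradual slope
steep = sigmoid(5.0 * x)    # large weight: near step function
```

At x = 1, for instance, the gentle curve gives roughly 0.62 while the steep one is already above 0.99, illustrating how weight magnitude controls decisiveness.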
Probabilistic Interpretation of the Output
The output of the Logistic Regression model has a direct probabilistic interpretation. The value $σ(w^T*x)$ is interpreted as the conditional probability of the positive class (Class 1) given the input x.
- Probability of the Positive Class: $P(Class 1 | x) = σ(w^T*x)$
- Probability of the Negative Class: $P(Class 0 | x) = 1 - σ(w^T*x)$
Formally, the complete probability model is expressed using the Bernoulli distribution, as each classification is a trial with two possible outcomes: $p(y | x, w) = Bernoulli(y | σ(w^T*x))$
The Decision Boundary
The decision boundary is the surface that separates the classes in the feature space. For Logistic Regression, this is the contour where the probability for each class is equal (i.e., 0.5). We derive the equation for this boundary by setting the model's output to 0.5:
- Start with the probability equation: $1 / (1 + e^(-w^T*x)) = 0.5$
- Rearranging the terms gives: $2 = 1 + e^(-w^T*x)$
- This simplifies to: $1 = e^(-w^T*x)$
- Taking the natural logarithm of both sides yields: $0 = -w^T*x$
- Therefore, the decision boundary is defined by the equation: $w^T*x = 0$
This equation defines a hyperplane (a straight line in two dimensions), which confirms that Logistic Regression is a linear classifier. The weight vector w specifies the orientation of this hyperplane, and geometrically, the decision boundary is perpendicular to the vector w. To allow the boundary to be displaced from the origin, the feature vector $x$ is augmented with a constant 1 (i.e., $x' := [1, x]$), and the weight vector is extended with an intercept term $w_0$ (i.e., $w' := [w_0, w]$). The boundary $w'^T * x' = 0$ then corresponds to $w^T * x + w_0 = 0$ in the original feature space.
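As a sketch of this augmentation (NumPy; the weights below are made up for illustration):

```python
import numpy as np

def augment(X):
    """Prepend a constant-1 column so the intercept w0 is learned as w'[0]."""
    return np.hstack([np.ones((X.shape[0], 1)), X])

# A boundary w^T x + w0 = 0 becomes w'^T x' = 0 in the augmented space.
w_aug = np.array([-1.0, 2.0, 2.0])        # [w0, w1, w2], chosen by hand
X = np.array([[0.1, 0.1], [1.0, 1.0]])    # one point on each side of the boundary
scores = augment(X) @ w_aug               # the sign decides the predicted class
```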
Having defined the model and its components, the next critical step is to understand how its parameters are learned from data.
3. Model Optimization: Learning the Optimal Weights
Defining the mathematical form of the Logistic Regression model is only the first step. The crucial next phase is to establish a method for finding the optimal weight vector w that allows the model to best fit the training data. This is achieved through a process of optimization, where we define an objective function and use an algorithm to find the parameters that minimize it.
The Objective Function: Maximum Likelihood Estimation
The objective function for Logistic Regression is derived from the principle of Maximum Likelihood Estimation (MLE). The goal of MLE is to find the model parameters that maximize the likelihood of observing the given training data. This leads to a cost function known as the Negative Log-Likelihood, which measures how poorly the model fits the training data by calculating the negative logarithm of the probability that the model would have correctly assigned each data point's true label.
The formula for the cost function, $E(w)$, is: $E(w) = -Σ_i [y_i * log(σ(w^T * x_i)) + (1 - y_i) * log(1 - σ(w^T * x_i))]$
Here, $y_i$ is the true label (0 or 1) for the i-th training example. This formulation penalizes confident, incorrect predictions while rewarding correct ones. It is fundamentally different from the sum of squared errors (SSE) used in Linear Regression. SSE is unsuitable for classification because it penalizes predictions based on their numeric distance from the true label, a concept that is meaningless for discrete, categorical labels like 0 and 1.
The Learning Algorithm: Gradient Descent
For Logistic Regression's cost function, there is no closed-form solution that allows us to solve for w directly. Instead, we must use an iterative optimization algorithm. The standard choice is Gradient Descent. This algorithm works by repeatedly calculating the gradient (the derivative) of the cost function with respect to the weights and taking a small step in the opposite direction to minimize the cost.
The derivative of the cost function with respect to the weights w, representing the error gradient, is: $dE(w)/dw = Σ_i (σ(w^T * x_i) - y_i) * x_i$
The weight update rule for each iteration is then: $w := w - α * dE(w)/dw$
Here, α is the learning rate, a small positive number that controls the size of each step.
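The full loop can be sketched as follows (NumPy; the toy dataset and hyperparameters are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, y, alpha=0.1, n_iters=1000):
    """Batch gradient descent on the negative log-likelihood."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        # gradient: sum_i (sigma(w^T x_i) - y_i) * x_i
        grad = X.T @ (sigmoid(X @ w) - y)
        w -= alpha * grad                 # step against the gradient
    return w

# Toy 1-D problem with a bias column; the class flips at x = 0.
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
w = train_logistic(X, y)
preds = (sigmoid(X @ w) >= 0.5).astype(float)
```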
The Nature of the Optimization Problem
A critical property of the Negative Log-Likelihood cost function for Logistic Regression is that it is convex. This is a highly desirable characteristic in optimization. For a convex function, any local minimum is also the global minimum. This guarantees that the gradient descent algorithm, given a suitable learning rate, will converge to the unique set of optimal weights that best fit the data.
While this optimization process is robust, it can lead to a potential pitfall—overconfidence in the model's predictions—which requires a specific technique to mitigate.
4. Enhancing Model Generalization with Regularization
A practical challenge in training machine learning models is the risk of overfitting and overconfidence. In the context of Logistic Regression, unconstrained learning can lead to excessively large weight values. This creates a model that is overly certain in its predictions, with probabilities pushed to the extremes of 0 and 1. Such a model may perform well on the training data but often fails to generalize to new, unseen data.
Diagnosing Overconfidence
The magnitude of the learned weights w directly controls the steepness of the sigmoid function. Larger weights result in a steeper curve, meaning the model transitions from a prediction of 0 to 1 very rapidly. If the training data is perfectly linearly separable, the learning algorithm will attempt to maximize the likelihood by increasing the weights indefinitely towards infinity. This drives the predicted probabilities to exactly 0 or 1, making the model maximally confident but also brittle and sensitive to noise.
L2 Regularization as a Solution
To prevent overconfidence and improve the model's ability to generalize, a technique called regularization is used. Regularization works by adding a penalty term to the cost function that discourages the model from learning excessively large weights. The most common form is L2 Regularization.
The augmented cost function with an L2 penalty is: $E(w) = -[Log-Likelihood] + (λ/2) ||w||^2$
The term $(λ/2) ||w||^2$ adds half the squared magnitude of the weight vector to the cost. The hyperparameter λ (lambda) controls the strength of the penalty. The 1/2 factor is a mathematical convenience that simplifies the derivative. From a Bayesian perspective, adding an L2 penalty is equivalent to assuming a zero-mean Gaussian prior on the weights, which expresses a belief that weights should be small and centered around zero.
The Updated Learning Rule
With the addition of the regularization term, the gradient of the cost function must be updated. The new derivative incorporates the derivative of the L2 penalty, leading to a modified gradient descent update rule.
The derivative of the augmented cost function is: $dE(w)/dw = Σ_i (σ(w^T * x_i) - y_i) * x_i + λw$, which gives the update rule $w := w - α * Σ_i (σ(w^T * x_i) - y_i) * x_i - α * λ * w$.
This update rule has an intuitive interpretation: at each step, the algorithm first moves the weights in the direction that reduces the data classification error, and then it shrinks the weight vector w slightly towards the origin. This second step mechanistically prevents the weights from growing uncontrollably, thereby mitigating overconfidence and improving generalization.
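A one-step sketch of the regularized update (NumPy; the numbers are made up to show the shrinkage effect):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_l2(w, X, y, lam):
    """Gradient of the L2-penalized negative log-likelihood."""
    return X.T @ (sigmoid(X @ w) - y) + lam * w

X = np.array([[1.0, -1.0], [1.0, 1.0]])
y = np.array([0.0, 1.0])
w = np.array([0.0, 4.0])          # an already-large weight
alpha, lam = 0.1, 1.0
# One update: reduce the data error, then shrink w toward the origin.
w_new = w - alpha * grad_l2(w, X, y, lam)
```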
After improving the binary classifier, the next logical step is to extend its capabilities to handle problems that involve more than two distinct classes.
5. Extending to Multiclass Classification
While binary classification is a common task, many real-world problems, such as document categorization, object recognition, or medical diagnosis, involve multiple categories. Handling these scenarios requires generalizing the binary model beyond two classes. Two common strategies exist for this purpose.
The "One-versus-All" (1vAll) Approach
A simple and intuitive strategy is the One-versus-All (1vAll) method. This approach involves training K independent binary classifiers for a problem with K classes. Each classifier k is trained to distinguish class k (as the positive class) from all other K-1 classes (lumped together as the negative class). To make a prediction for a new input, all K classifiers are run, and the class corresponding to the classifier with the highest output score is chosen. However, this method has notable issues, including the potential for ambiguous regions in the feature space where multiple or no classifiers predict a positive outcome, and it becomes computationally expensive as K grows.
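The prediction step of 1vAll can be sketched as follows (NumPy; the three weight vectors are invented for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_1vall(weights, x):
    """Run all K binary classifiers and pick the one with the highest score."""
    scores = np.array([sigmoid(w @ x) for w in weights])
    return int(np.argmax(scores))

# K = 3 hand-picked classifiers, one per class.
weights = [np.array([2.0, 0.0]),
           np.array([0.0, 2.0]),
           np.array([-2.0, -2.0])]
label = predict_1vall(weights, np.array([1.0, 0.0]))
```

Note that argmax always returns an answer even when several classifiers fire (or none do), but the K scores are not a calibrated joint distribution, which is exactly the ambiguity mentioned above.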
The MaxEnt (Softmax Regression) Model
A more natural and direct approach to multiclass classification is the Maximum Entropy (MaxEnt) model, also known as Softmax Regression. This model is a direct generalization of Logistic Regression. It replaces the Bernoulli likelihood used for binary problems (a "coin flip") with a Multinomial likelihood (a "dice roll"), which is suitable for K outcomes. The core of this model is the Softmax function, which computes the probability of a specific class y given an input x and a weight matrix W (containing a weight vector for each class):
$p(y|x,W) = exp(w_y^T*x) / Σ_k(exp(w_k^T*x))$
This function takes a vector of scores (one for each class) and normalizes them into a probability distribution where all outputs are between 0 and 1 and sum to 1.
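A direct translation of the Softmax function (NumPy; subtracting the maximum score is a standard numerical-stability trick, not required by the math itself):

```python
import numpy as np

def softmax(scores):
    """Normalize a vector of K class scores into a probability distribution."""
    e = np.exp(scores - np.max(scores))   # shifting scores does not change the result
    return e / e.sum()

# One weight row per class (values invented for illustration).
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.5, 0.5]])
x = np.array([2.0, 1.0])
p = softmax(W @ x)   # p[k] = p(y = k | x, W)
```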
Relationship Between Softmax and Logistic Regression
Logistic Regression is a special case of Softmax Regression. To demonstrate this, consider the binary case where K=2. The Softmax probability for Class 1 is: $p(y=1 | x, W) = exp(w_1^T * x) / (exp(w_1^T * x) + exp(w_2^T * x))$
First, we divide the numerator and denominator by $exp(w_1^T * x)$ to isolate the relationship between the two class weights: $p(y=1 | x, W) = 1 / (1 + exp(w_2^T * x) / exp(w_1^T * x))$
Using the property of exponents, this simplifies to: $p(y=1 | x, W) = 1 / (1 + exp(w_2^T * x - w_1^T * x)) = 1 / (1 + exp(-(w_1 - w_2)^T * x))$
This final form is equivalent to the sigmoid function $σ(w^T*x)$, where the single weight vector w of the binary classifier is defined as the difference between the two weight vectors of the Softmax model, $w = w_1 - w_2$.
The Cost Function for MaxEnt
Similar to its binary counterpart, the cost function for MaxEnt is the negative log-likelihood of the correctly assigned classes across all training data points. The labels are typically represented in a one-hot encoded matrix Y, where $Y(i,c)=1$ if the i-th example belongs to class c, and 0 otherwise. The cost is the sum of the negative logs of the predicted probabilities for the true classes.
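A sketch of this cost with one-hot labels (NumPy; the data and weight matrices are invented):

```python
import numpy as np

def softmax_rows(S):
    e = np.exp(S - S.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def maxent_cost(W, X, Y):
    """Negative log-likelihood: -sum of log-probabilities of the true classes."""
    P = softmax_rows(X @ W.T)      # N x K predicted probabilities
    return -np.sum(Y * np.log(P))  # one-hot Y picks out the true class per row

X = np.array([[1.0, 0.0], [0.0, 1.0]])
Y = np.array([[1.0, 0.0], [0.0, 1.0]])        # one-hot: class 0, then class 1
W_good = np.array([[5.0, 0.0], [0.0, 5.0]])   # scores favour the true classes
W_bad = W_good[::-1]                          # scores favour the wrong classes
```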
With a robust framework for both binary and multiclass problems established, we can now explore further extensions that enhance the model's flexibility and practical applicability.
6. Advanced Extensions and Practical Considerations
The standard Logistic Regression model, with its linear decision boundary, provides a powerful baseline but can be limited in certain scenarios. This section covers techniques to overcome this limitation and addresses other practical challenges, such as automated feature selection and training on large-scale datasets.
Addressing Non-Linearity with Basis Functions
A key limitation of the basic model is that it can only learn a linear decision boundary. To handle data that is not linearly separable, we can apply a fixed non-linear transformation, $φ(x)$, to the input features before they are fed into the model. This technique, known as using basis functions, might involve creating polynomial features (e.g., $x_1^2, x_2^2, x_1*x_2$). The result is a decision boundary that is non-linear in the original feature space but remains linear with respect to the learned weights w. This approach preserves the simplicity and convex optimization of the original model while significantly increasing its flexibility. However, it requires careful feature engineering and appropriate regularization to avoid underfitting or overfitting.
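For instance, a degree-2 polynomial transform lets the model carve out a circular boundary while remaining linear in the weights (NumPy; the transform and hand-picked weights are illustrative):

```python
import numpy as np

def poly_features(X):
    """Fixed transform phi(x) = [1, x1, x2, x1^2, x2^2, x1*x2]."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones_like(x1), x1, x2, x1**2, x2**2, x1 * x2])

# w . phi(x) = x1^2 + x2^2 - 1: a unit circle in the original space,
# but still a linear function of the learned weights.
w = np.array([-1.0, 0.0, 0.0, 1.0, 1.0, 0.0])
X = np.array([[0.1, 0.1],    # inside the circle
              [2.0, 0.0]])   # outside the circle
scores = poly_features(X) @ w
```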
Achieving Sparsity with L1 Regularization
While L2 regularization is effective at preventing overconfidence by shrinking weights, L1 regularization offers a unique advantage: it can shrink weights exactly to zero. This property, known as sparsity, has several benefits: it automatically performs feature selection by "deleting" irrelevant features, saves memory by storing only non-zero weights, and can yield valuable domain insight by identifying the most important predictive features.
The core mechanical reason for this behavior is the behavior of the L1 penalty's gradient near the origin. Unlike the L2 gradient, which diminishes as a weight approaches zero, the L1 gradient remains constant. This provides a consistent "push" for a weight to cross the zero boundary and stay there. The primary optimization challenge with L1 regularization is that its cost function is not differentiable at zero, which requires more complex algorithms.
Differentiating Multi-Label and Multiclass Classification
It is crucial to distinguish between multiclass and multi-label classification, as they address different problem types and require different modeling approaches.
| Multiclass Classification | Multi-Label Classification |
|---|---|
| One input x has exactly one label | One input x can have many labels (from 0 to K) |
| Training data is a scalar y for one x | Training data is a vector y for one x |
| Prediction $y = f(x)$ is a scalar | Prediction $y = f(x)$ is a K-dimensional vector |
| Labels y = 1...K are mutually exclusive | Labels $y_k ∈ {0, 1}$ are independent |
| $P(y\|x)$ is a K-dimensional probability vector that sums to 1 | Each label has its own independent probability $P(y_k\|x)$ |
A simple and effective approach for multi-label classification is to train K independent binary logistic regression classifiers. Each classifier is responsible for predicting the presence or absence of a single label, effectively treating the problem as K separate binary classification tasks.
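A sketch of this K-independent-classifiers approach (NumPy; the weight vectors are invented):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_multilabel(weights, x, threshold=0.5):
    """Each of the K classifiers independently switches its label on or off."""
    probs = np.array([sigmoid(w @ x) for w in weights])
    return (probs >= threshold).astype(int)

# Three independent label detectors.
weights = [np.array([3.0, 0.0]),
           np.array([-3.0, 0.0]),
           np.array([0.0, 3.0])]
y_vec = predict_multilabel(weights, np.array([1.0, 1.0]))
```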
Strategy for Large-Scale Learning
Standard gradient descent requires a full pass over the entire training dataset to compute the gradient for a single weight update. This becomes a significant bottleneck for large-scale datasets that cannot fit into memory. The solution is Stochastic Gradient Descent (SGD) with mini-batches. Instead of using the full dataset, SGD approximates the true gradient using a small, randomly selected subset of data (a mini-batch) for each update. This drastically reduces the number of data reads required for convergence, making it possible to train models efficiently on massive datasets.
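A sketch of mini-batch SGD for this model (NumPy; the batch size, learning rate, and toy data are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_logistic(X, y, alpha=0.1, batch_size=2, epochs=200, seed=0):
    """Each update uses a random mini-batch instead of the full dataset."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(epochs):
        order = rng.permutation(n)                # reshuffle every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            w -= alpha * (Xb.T @ (sigmoid(Xb @ w) - yb))
    return w

X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
w = sgd_logistic(X, y)
```

Because each mini-batch gradient is only an estimate of the full gradient, individual steps are noisier, but far fewer passes over the data are needed overall.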
Having explored the mechanics and extensions of the model, the final step is to understand how to interpret its results and see how it is applied in the real world.
7. Interpretation and Applications
Beyond predictive accuracy, the ability to interpret a model's decisions is crucial for building trust, debugging errors, and extracting actionable insights. Logistic Regression, as a linear model, excels in this regard, offering a transparent view into the relationship between features and outcomes.
A Guide to Interpreting Learned Weights
After the model is trained, the learned weight vector w provides direct insight into the classification logic. Each weight wk corresponds to a feature xk in the input vector.
- A positive weight wk indicates that its corresponding feature xk is a positive indicator of the target class. As the value of xk increases, so does the probability of the positive class.
- A negative weight wk indicates that its feature xk is a contra-indicator. As the value of xk increases, the probability of the positive class decreases.
- If L1 regularization is used and a weight wk is set to zero, the model has determined that the corresponding feature xk is irrelevant to the classification task.
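These rules are mechanical enough to automate (the feature names and weight values below are hypothetical):

```python
import numpy as np

def describe_weights(names, w):
    """Label each feature by the sign of its learned weight."""
    roles = {}
    for name, wk in zip(names, w):
        if wk > 0:
            roles[name] = "positive indicator"
        elif wk < 0:
            roles[name] = "contra-indicator"
        else:
            roles[name] = "irrelevant (zeroed, e.g. by L1)"
    return roles

roles = describe_weights(["weight", "length", "tank_id"],
                         np.array([1.8, -0.7, 0.0]))
```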
Practical Applications of Linear Classifiers
The simplicity, efficiency, and interpretability of linear classifiers make them valuable tools across various domains. The following examples illustrate their utility:
- Text Classification: In an analysis of parliamentary bills, a linear classifier trained on the bill text was able to predict whether a bill would be accepted into law with approximately 85% accuracy, demonstrating its power in natural language processing.
- Computer Vision: For person re-identification, a linear classifier can learn to determine if two image windows contain the same person. This is achieved by extracting feature vectors ($x_1$, $x_2$) from each image and training the classifier on their absolute difference, using a function like $F(x_1, x_2) = w^T * |x_1 - x_2|$.
- Feature Engineering: Linear classifiers can themselves be used as feature extractors. For a forensic sketch identification system, a series of binary classifiers can be trained to detect facial attributes (e.g., "big nose," "long hair," "male"). The outputs of these attribute classifiers then serve as a rich set of high-level features for a subsequent identification model.
8. Conclusion
This whitepaper has provided a comprehensive overview of Logistic Regression, a cornerstone algorithm for linear classification. It is an interpretable and efficient model that serves as a powerful tool for binary classification and as a conceptual building block for more advanced machine learning techniques. Its principles of probabilistic modeling, convex optimization, and regularization are fundamental concepts that extend across the discipline.
The extensibility of Logistic Regression—from a simple binary classifier to a flexible model for non-linear, multiclass, and multi-label problems—underscores its enduring relevance in the machine learning practitioner's toolkit. For professionals seeking to build robust and understandable machine learning solutions, a deep understanding of this algorithm is essential.
Key Takeaways
- Logistic Regression uses the sigmoid function to model class probabilities, constraining its output to the (0, 1) interval suitable for probabilistic interpretation.
- It learns a linear decision boundary by minimizing a convex cost function (Negative Log-Likelihood) via gradient-based optimization, which guarantees convergence to a global minimum.
- Regularization, particularly L2, is critical to prevent the model from becoming overconfident in its predictions, thereby improving its ability to generalize to new data.
- The model can be extended to handle a variety of complex scenarios, including multiclass problems (via Softmax Regression), non-linear boundaries (via basis functions), and multi-label tasks.
- Advanced regularizers like L1 can be used to induce sparsity, which serves as a powerful method for automatic feature selection and model simplification.