逻辑回归 Logistic Regression

One-line definition

A logistic regression model estimates $P(y=1\mid x)$ via $\sigma(w^\top x + b)$ and fits parameters by MLE for Bernoulli labels.

Problem setting

Inputs: feature vector $x \in \mathbb{R}^d$, dataset ${(x_i, y_i)}_{i=1}^n$ with $y_i \in {0,1}$.
Outputs: parameters $w \in \mathbb{R}^d$, bias $b \in \mathbb{R}$.
Assumptions: conditional Bernoulli; log-odds is linear in $x$.
Boundary: binary classification; multiclass needs softmax (multinomial logistic).

Mathematical formulation

Define $\sigma(t) = 1 / (1 + e^{-t})$ and $p_i = \sigma(w^\top x_i + b)$.

From MLE (Bernoulli likelihood): \(\max_{w,b} \; \sum_{i=1}^n \left[y_i \log p_i + (1-y_i) \log(1-p_i)\right]\)

Equivalent minimization of negative log-likelihood (cross-entropy): \(\min_{w,b} \; -\sum_{i=1}^n \left[y_i \log p_i + (1-y_i) \log(1-p_i)\right]\)

Optionally add $\ell_2$ regularization: $\lambda |w|_2^2$.

Algorithmic interpretation

Convex optimization with a linear decision boundary $w^\top x + b = 0$.
Learns calibrated probabilities under correct model assumptions.

Optimization & implementation details

Objective source: MLE for Bernoulli labels.
Solver: Newton/IRLS for small to medium $d$; L-BFGS or SGD for large-scale.
Complexity: Newton step $O(nd^2 + d^3)$; SGD $O(nd)$ per epoch.
Numerical notes: use log-sum-exp tricks; clip probabilities; standardize features.

Connections & boundaries

Compared to linear regression: linear regression minimizes squared loss for continuous targets; logistic uses log-likelihood for Bernoulli labels.
Compared to SVM: SVM maximizes margin with hinge loss; logistic provides probabilistic outputs.
Compared to naive Bayes: logistic is discriminative; naive Bayes is generative.
Boundary: linear log-odds; if relation is nonlinear, use feature maps or nonlinear models.

Failure modes

Perfectly separable data yields unbounded weights without regularization.
Class imbalance biases decision threshold.
Multicollinearity leads to unstable coefficients.
Outliers with extreme features can dominate gradients.

Minimal pseudo-code

Input: X in R^{n x d}, y in {0,1}^n
Initialize w, b
Repeat until convergence:
  p = sigmoid(X w + b)
  grad_w = X^T (p - y) + 2 lambda w
  grad_b = sum(p - y)
  Update w, b
Return w, b

Decision checklist

Labels are Bernoulli and log-odds linearity is acceptable
Class imbalance handled (weights or threshold)
Regularization chosen to prevent separation
Compared to SVM and tree-based baselines

Personal notes

TODO