支持向量机 SVM
One-line definition
A support vector machine finds a maximum-margin linear separator by solving a convex QP with hinge loss, optionally using kernels.
Problem setting
- Inputs: $x \in \mathbb{R}^d$, labels $y \in {-1, +1}$.
- Outputs: parameters $w, b$ (and support vectors in kernelized form).
- Assumptions: linear separability (hard margin) or approximate separability (soft margin).
- Boundary: binary classification; multiclass requires one-vs-rest or structured formulations.
Mathematical formulation
Soft-margin primal: \(\min_{w,b,\xi} \; \frac{1}{2}\|w\|_2^2 + C \sum_{i=1}^n \xi_i\) subject to \(y_i (w^\top x_i + b) \ge 1 - \xi_i, \; \xi_i \ge 0\)
Equivalent unconstrained form with hinge loss: \(\min_{w,b} \; \frac{1}{2}\|w\|_2^2 + C \sum_{i=1}^n \max(0, 1 - y_i(w^\top x_i + b))\)
Dual (kernelized) uses $K(x_i, x_j)$ with support vectors.
Algorithmic interpretation
- Maximizes geometric margin; only support vectors define the decision boundary.
- Kernel trick enables nonlinear decision boundaries without explicit feature maps.
Optimization & implementation details
- Objective source: margin maximization with hinge loss.
- Solver: SMO for dual; primal SGD for large-scale linear SVM.
- Complexity: kernel SVM typically $O(n^2)$ memory and time; linear SVM scales with $O(nd)$.
- Numerical notes: feature scaling is critical; choose $C$ (and kernel params) via validation.
Connections & boundaries
- Compared to logistic regression: SVM optimizes margin, logistic optimizes log-likelihood.
- Compared to perceptron: SVM adds margin and convex objective.
- Compared to kernel ridge: different loss (hinge vs squared), different robustness.
- Boundary: kernel SVM is expensive for large $n$; linear SVM for high-dimensional sparse data.
Failure modes
- Poor kernel or hyperparameters cause under/overfitting.
- Overlap and noise reduce margin effectiveness.
- Unscaled features distort margin geometry.
- Large $n$ makes kernel SVM infeasible.
Minimal pseudo-code
Input: X, y in {-1, +1}
Initialize w, b
Repeat:
For each (x_i, y_i):
if y_i (w^T x_i + b) < 1:
w = w - eta (w - C y_i x_i)
else:
w = w - eta w
Return w, b
Decision checklist
- Linear vs kernel SVM chosen based on data size
- Features scaled
- C (and kernel params) tuned
- Compared to logistic regression baseline
Personal notes
TODO