One-line definition

A support vector machine finds a maximum-margin linear separator by solving a convex QP with hinge loss, optionally using kernels.

Problem setting

  • Inputs: $x \in \mathbb{R}^d$, labels $y \in {-1, +1}$.
  • Outputs: parameters $w, b$ (and support vectors in kernelized form).
  • Assumptions: linear separability (hard margin) or approximate separability (soft margin).
  • Boundary: binary classification; multiclass requires one-vs-rest or structured formulations.

Mathematical formulation

Soft-margin primal: \(\min_{w,b,\xi} \; \frac{1}{2}\|w\|_2^2 + C \sum_{i=1}^n \xi_i\) subject to \(y_i (w^\top x_i + b) \ge 1 - \xi_i, \; \xi_i \ge 0\)

Equivalent unconstrained form with hinge loss: \(\min_{w,b} \; \frac{1}{2}\|w\|_2^2 + C \sum_{i=1}^n \max(0, 1 - y_i(w^\top x_i + b))\)

Dual (kernelized) uses $K(x_i, x_j)$ with support vectors.

Algorithmic interpretation

  • Maximizes geometric margin; only support vectors define the decision boundary.
  • Kernel trick enables nonlinear decision boundaries without explicit feature maps.

Optimization & implementation details

  • Objective source: margin maximization with hinge loss.
  • Solver: SMO for dual; primal SGD for large-scale linear SVM.
  • Complexity: kernel SVM typically $O(n^2)$ memory and time; linear SVM scales with $O(nd)$.
  • Numerical notes: feature scaling is critical; choose $C$ (and kernel params) via validation.

Connections & boundaries

  • Compared to logistic regression: SVM optimizes margin, logistic optimizes log-likelihood.
  • Compared to perceptron: SVM adds margin and convex objective.
  • Compared to kernel ridge: different loss (hinge vs squared), different robustness.
  • Boundary: kernel SVM is expensive for large $n$; linear SVM for high-dimensional sparse data.

Failure modes

  • Poor kernel or hyperparameters cause under/overfitting.
  • Overlap and noise reduce margin effectiveness.
  • Unscaled features distort margin geometry.
  • Large $n$ makes kernel SVM infeasible.

Minimal pseudo-code

Input: X, y in {-1, +1}
Initialize w, b
Repeat:
  For each (x_i, y_i):
    if y_i (w^T x_i + b) < 1:
      w = w - eta (w - C y_i x_i)
    else:
      w = w - eta w
Return w, b

Decision checklist

  • Linear vs kernel SVM chosen based on data size
  • Features scaled
  • C (and kernel params) tuned
  • Compared to logistic regression baseline

Personal notes

TODO