SVM
Notes from professor Patrick Winston’s SVM lecture
SVM is a supervised classification method.
All about decision boundaries
Other methods of drawing decision boundaries

Nearest neighbour

ID trees

Neural nets
SVM takes on the “widest street approach” using the support vectors only
Decision rule
[image goes here]
Let $\bar{w}$ be a vector perpendicular to the decision boundary we set.
Our mission is to classify an arbitrary vector $\bar{u}$.
In order to figure this out, we need a decision rule; some kind of criteria that can tells us if the vector is positive or negative.
An option is by projecting $\bar{u}$ onto $\bar{w}$ through dot product and see if the resulting value plus some constant $b$ is greater than 0. If it is, we say it is a positive sample given our decision boundary. \(\bar{w} \cdot \bar{u} + b \geq 0 \tag{1}\) That seems easy. But the problem is how to determine $\bar{w}$ and constant $b$.
There are infinitely many options for $\bar{w}$ , since there are infinitely many perpendicular lines to the boundary (A vector of any length as long as it is perpendicular).
So we don’t have enough constraints to choose one $\bar{w}$ and $b$.
Constraints
To narrow down our options, let’s add more constraints. \(\overrightarrow{w} \cdot \overrightarrow{x}_{+} + b \geq 1\)
\[\overrightarrow{w} \cdot \overrightarrow{x}_{} + b \leq 1\]This says that a positive sample must be “really positive” and a negative sample must be “really negative”. We want our decision boundary to ensure that there is enough safety margin for separating positive and negative samples.
But we have two equations here. How can we make it into one for mathematical convienience?
Let $y_i = 1$ for positive samples and $y_i = 1$ for negative samples.
We can multiply both sides of the equation by $y$. \(y_i(\overrightarrow{w} \cdot \overrightarrow{x}_{+i} + b) \geq 1\)
\[y_i(\overrightarrow{w} \cdot \overrightarrow{x}_{i} + b) \geq 1\]Note that multiplying $y = 1$ on both sides in the negative sample case flips the direction of the equality symbol, so we have the same looking equation! \(y_i(\overrightarrow{w} \cdot \overrightarrow{x}_i + b) \geq 1\) Organize a little bit and we get, \(y_i(\overrightarrow{w} \cdot \overrightarrow{x}_i + b)  1 \geq 0\) In addition, we say that the samples that are on the gutter (located exactly on the width of the street) meet the following equation. \(y_i(\overrightarrow{w} \cdot \overrightarrow{x}_i + b)  1 = 0 \tag{2}\) This will be the constraint that we will stick to.
By the way, the samples that are on the gutter are called “support vectors”. These samples are the main focus for SVM.
Width of the boundary
Remember, our main goal is to find the widest street that can separate the positive to negative samples.
[drawing]
But we need to find out how to compute the width of the street.
The width of the street can be expressed as a dot product between the unit vector that is perpendicular to our decision boundary and the distance between the positive and the negative sample on the gutter. \(width = (\overrightarrow{x}_+  \overrightarrow{x}_) \cdot \frac{\overrightarrow{w}}{\lVert{\overrightarrow{w}}\rVert}\) From constraint (2), \(\overrightarrow{w} \cdot \overrightarrow{x}_+ = 1  b\)
\[\overrightarrow{w} \cdot \overrightarrow{x}_ = 1 + b\]Then, the width simplifies to \(width = (1b 1 +b) \cdot \frac{\overrightarrow{w}}{\lVert{\overrightarrow{w}}\rVert} = \frac{2}{\lVert{\overrightarrow{w}}\rVert}\)
Our goal is to maximize the width. So, the objective function is simply \(max \frac{1}{\lVert{\overrightarrow{w}}\rVert}\) Or, \(min \lVert{\overrightarrow{w}}\rVert\) Or, even more mathematically convieniently, \(min \frac{1}{2}\lVert{\overrightarrow{w}}\rVert ^2 \tag{3}\)
Using lagrange multiplier
So we have an objective function (3) and an equality constraint function (2). This can be formulated as an optimization problem. \(min \quad \frac{1}{2}\lVert{\overrightarrow{w}}\rVert ^2 \\ s.t \quad y_i(\overrightarrow{w} \cdot \overrightarrow{x}_i + b)  1 = 0 \quad \forall i\) With lagrange multiplier theorem for equality constrained optimization, we can combine these two functions into one. \(L = \frac{1}{2}\lVert{\overrightarrow{w}}\rVert ^2  \Sigma \alpha_i[y_i(\overrightarrow{w} \cdot \overrightarrow{x}_i + b)  1] \tag{4}\) where $\alpha_i$ are lagrange multipliers for each sample.
$\alpha_i$ ‘s are equal to zero for all samples that are not on the gutter.
To find the extremum (minimum or maximum) of $L$, we find the derivative and set it to 0. \(\frac{\delta L}{\delta \overrightarrow{w}} = \overrightarrow{w}  \Sigma \alpha_i y_i \overrightarrow{x}_i = 0\)
\[\overrightarrow{w} = \Sigma \alpha_i y_i \overrightarrow{x}_i \tag{5}\]We can see that $\bar{w}$ that we wanted to find is now a linear sum of samples!
How about differentiating w.r.t to $b$? \(\frac{\delta L}{\delta b} =  \Sigma \alpha_i y_i = 0 \tag{6}\) We have now obtained 2 more valuable functions (eq 5, 6).
We will substitute these into eq (4). \(L = \frac{1}{2}(\Sigma \alpha_i y_i \overrightarrow{x_i})(\Sigma \alpha_j y_j \overrightarrow{x_j})  (\Sigma \alpha_i y_i \overrightarrow{x_i})(\Sigma \alpha_j y_j \overrightarrow{x_j})  \Sigma \alpha_i y_i b + \Sigma \alpha_i\) Since $\Sigma \alpha_i y_i b = b \Sigma \alpha_i y_i = 0 $ from eq (6), \(L = \Sigma \alpha_i  \frac{1}{2}\Sigma \Sigma \alpha_i \alpha_j y_i y_j x_i \cdot x_j \tag{7}\) Why did we get so far as this?
If we find $\alpha_i$, then we can now compute $\overrightarrow{w}$ and $b$ from eq (5) and eq (2) and our samples.
We ultimately found out our objective function is associated in terms of the samples we’ve got. Eq (7), we see that it depends on the dot product of samples in our data.
In addition, our decision rule in eq (1) becomes \(\Sigma \alpha_i y_i \overrightarrow{x}_i \cdot \overrightarrow{u} + b \geq 0 \quad then \quad '+'\) and therefore, it is also dependent on the dot product of our samples and an unknown vector.
So all of our equations we are interested are dependent on the dot product with our samples.
Since $L$ is a quadratic programming problem, finding $\alpha$ will not get stuck in local maximum, unlike neural nets.
Kernel function
However, SVM doesn’t work well for samples that are not linearly separable.
But that’s ok. If samples are not linearly separable, we will then apply some transformation, $\phi(*)$, on our samples to map them into another space, where they become linearly separable and then apply SVM.
Since all we need is the dot product between samples, we actually don’t need to figure out $\phi(*)$ itself, but rather the result of the dot product of the transformation. \(K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)\) So our mission, in case of non linearly separable case, will be to find the kernel function, $K$.
Some examples of common kernel functions are \((u_i \cdot u_j + 1)^n\)
\[e ^{\frac{\lVert u_i  u_j \rVert}{\sigma}}\]Summary
The goal of SVM is to determine the widest street (boundary) using samples that lie on the gutter, called support vectors. SVM is always able to find the solution to the linear classification problem with the given samples.
Tools we have used for computing decision boundary ($\overrightarrow{w}$ and $b$) in SVM
 Decision rule
 Constraints
 “Maximizing the width” of the street
Future notes
 https://www.youtube.com/watch?v=eHsErlPJWUU&hd=1
 A bit more on kernel functions and its uses in other methods