Instance-based Learning
Machine Learning
Jesus A. Gonzalez
February 16, 2020
Instance-based Learning
- Different from the types of learning we have seen so far
- Stores the training examples
- In order to classify a new object
- Extracts the most similar stored objects
- Uses their classifications to classify the new object
Instance-based Learning
- The learning process is trivial
- The classification process takes most of the time
- Also known as:
- Lazy learning or memory-based learning
- The training data is not processed until it is required
- Uses a distance measure
k-nearest neighbors
- k-NN (k-nearest neighbors)
- Robust to noisy training examples
- Retrieves the stored objects closest to a new instance
- Continuous attributes
- Euclidean distance over the \(n\) possible attributes
k-nearest neighbors - Algorithm
\[d(x_i, x_j) = \sqrt{\sum\limits_{r=1}^{n}(a_r(x_i)-a_r(x_j))^2}\]
- The result of \(k\)-NN can be discrete or continuous
- Discrete: the result of the classification is the most common class among the \(k\) nearest neighbors
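- A minimal sketch of the discrete \(k\)-NN rule just described (the data, the query, and \(k\) below are illustrative, not from the slides):

```python
import math
from collections import Counter

def euclidean(a, b):
    # d(x_i, x_j) = sqrt(sum_r (a_r(x_i) - a_r(x_j))^2)
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_classify(training, query, k=3):
    # training: list of (attribute_vector, class_label) pairs
    neighbors = sorted(training, key=lambda ex: euclidean(ex[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]  # most common class among the k neighbors

# Toy usage
data = [((1.0, 1.0), 'A'), ((1.2, 0.8), 'A'), ((5.0, 5.0), 'B'), ((4.8, 5.2), 'B')]
print(knn_classify(data, (1.1, 0.9), k=3))  # -> 'A'
```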
k-nearest neighbors - Algorithm
- Voronoi diagram
- A geometric construction that partitions the Euclidean plane
- Simple interpolation, based on Euclidean distance
- Assigns a new instance the class of its nearest neighbor
k-nearest neighbors - Continuous classifications
- We can take the mean of the outputs of the \(k\) nearest neighbors
\[f(x_q) = \frac{\sum\limits_{i=1}^{k}f(x_i)}{k}\]
k-nearest neighbors - Weighted Distances
- Obvious extension
- Weight the classifications of the neighbors according to their distance to the object to be classified
- The classes of the closest neighbors receive more weight
- Weighted average
- Averages the output of the points inversely weighted by their distance
k-nearest neighbors - Weighted Distances
\[f(x_q) = \arg\max\limits_{v \in V} \sum\limits_{i=1}^{k}w_i\delta(v,f(x_i))\]
where \(w_i = \frac{1}{d(x_q,x_i)^2}\); if \(d(x_q,x_i) = 0\) (the query matches a stored example exactly), we simply return \(f(x_i)\).
k-nearest neighbors - Weighted Distances
- For continuous classes \[f(x_q) = \frac{\sum\limits_{i=1}^{k}w_if(x_i)}{\sum\limits_{i=1}^{k}w_i}\]
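- A sketch of both weighted variants, using \(w_i = 1/d(x_q,x_i)^2\) and returning the stored output directly on an exact match (function and variable names are illustrative):

```python
import math
from collections import defaultdict

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def weighted_knn_classify(training, query, k=3):
    # Discrete case: each of the k neighbors votes with weight w_i = 1 / d(x_q, x_i)^2
    neighbors = sorted(training, key=lambda ex: euclidean(ex[0], query))[:k]
    scores = defaultdict(float)
    for x, label in neighbors:
        d = euclidean(x, query)
        if d == 0:
            return label                 # exact match: return its class directly
        scores[label] += 1.0 / d ** 2
    return max(scores, key=scores.get)

def weighted_knn_regress(training, query, k=3):
    # Continuous case: weighted average of the neighbors' outputs
    neighbors = sorted(training, key=lambda ex: euclidean(ex[0], query))[:k]
    num = den = 0.0
    for x, y in neighbors:
        d = euclidean(x, query)
        if d == 0:
            return y
        w = 1.0 / d ** 2
        num += w * y
        den += w
    return num / den
```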
k-nearest neighbors - Note
- Assumption:
- Nearest neighbors give the best classification
- Using all the attributes
- Problem
- Many of the attributes may be irrelevant, and these can dominate the distance calculation
- For example, 2 relevant attributes can be drowned out by 20 irrelevant ones
k-nearest neighbors - Note
- What should we do?
- We can weight the distance of each attribute, giving more weight to more relevant attributes
- Try to determine the weights from known training examples
- Changing those weights to decrease the error
- We can eliminate the attributes that we consider irrelevant
k-nearest neighbors - Note
- Additional practical element
- Store the examples efficiently
- Use tree-based representations (kd-trees)
- Instances are organized according to their closeness
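- A sketch using SciPy's \(kd\)-tree (assumes SciPy is available; the data are illustrative):

```python
import numpy as np
from scipy.spatial import cKDTree  # kd-tree implementation from SciPy

# Illustrative training set: rows are instances, columns are attributes
X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [4.8, 5.2]])
y = np.array(['A', 'A', 'B', 'B'])

tree = cKDTree(X)                         # instances are organized by closeness
dist, idx = tree.query([1.1, 0.9], k=3)   # 3 nearest neighbors of the query
print(list(zip(dist, y[idx])))            # distances and classes of the neighbors
```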
Locally Weighted Regression
- A generalization that builds an explicit function fitted to the training data in the vicinity of \(x_q\)
- It can be used with linear functions, quadratic functions, neural networks, etc.
Locally Weighted Regression
- If we use a linear function
\[\hat{f}(x)=w_0+w_1a_1(x) + \dots + w_na_n(x)\]
- We can use gradient descent to adjust the weights that minimize the error
- We can express the error as the sum of squared differences between the target values and the predictions
\[E(W) = \frac{1}{2}\sum\limits_{x \in D} (f(x) - \hat{f}(x))^2\]
Locally Weighted Regression
- We want to find the weight vector that minimizes the error \(E\)
- We change the weights in the direction that decreases the error the most
- The direction of this change is obtained through the gradient
- The gradient points in the direction of steepest increase of the error
- So the steepest decrease is the negative of the gradient
Locally Weighted Regression
- Weights update rule
- \(W \leftarrow W + \Delta W\)
- \(\Delta W = -\alpha \nabla E\)
- where \(\alpha\) is the learning rate
- It controls how large a step we take against the gradient of the error
Locally Weighted Regression
\[\Delta w_i = \alpha\sum\limits_{x \in D} (f(x) - \hat{f}(x))\,a_i(x)\]
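- A sketch of this batch update rule applied to a global linear fit (the learning rate, number of epochs, and synthetic data are illustrative):

```python
import numpy as np

def fit_linear_gd(X, y, alpha=0.01, epochs=500):
    """Fit f_hat(x) = w0 + w1*a1(x) + ... + wn*an(x) by batch gradient descent."""
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend a_0(x) = 1 for w0
    w = np.zeros(X1.shape[1])
    for _ in range(epochs):
        error = y - X1 @ w          # f(x) - f_hat(x) for every training example
        w += alpha * X1.T @ error   # Delta w_i = alpha * sum_x (f(x) - f_hat(x)) a_i(x)
    return w

# Illustrative data generated from y = 2 + 3*x plus a little noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(50, 1))
y = 2 + 3 * X[:, 0] + rng.normal(0, 0.05, size=50)
print(fit_linear_gd(X, y))  # approximately [2, 3]
```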
Locally Weighted Regression
- Figure: examples of locally weighted regression fits (source: danupranantha.wordpress.com/)
Locally Weighted Regression
- To modify the weights
- Minimize the square error using the \(k\) nearest neighbors
\[E(W) = \frac{1}{2} \sum\limits_{x \in k-nearest-neighbors} (f(x) - \hat{f}(x))^2\]
- Minimize the squared error using all the examples weighted by their distance to \(x_q\)
\[E(W) = \frac{1}{2}\sum\limits_{x \in D}(f(x) - \hat{f}(x))^2K(d(x_q,x))\]
- Minimize the squared error using the \(k\) nearest neighbors weighted by their distance to \(x_q\)
\[E(W) = \frac{1}{2}\sum\limits_{x \in k-nearest-neighbors}(f(x)-\hat{f}(x))^2K(d(x_q,x))\]
- For the last case, the update rule is:
\[\Delta w_i = \alpha \sum\limits_{x \in k-nearest-neighbors} K(d(x_q,x))(f(x) - \hat{f}(x))\,a_i(x)\]
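- A sketch of the third criterion: at query time, fit a linear model to the \(k\) nearest neighbors, weighting each by a Gaussian kernel of its distance to \(x_q\). The kernel width, \(k\), and the closed-form weighted least-squares solve are implementation choices, not prescribed by the slides:

```python
import numpy as np

def lwr_predict(X, y, xq, k=20, tau=0.5):
    """Locally weighted linear regression evaluated at a single query point xq."""
    d = np.linalg.norm(X - xq, axis=1)           # distances d(x_q, x)
    nn = np.argsort(d)[:k]                       # indices of the k nearest neighbors
    K = np.exp(-(d[nn] ** 2) / (2 * tau ** 2))   # kernel weights K(d(x_q, x))
    Xn = np.hstack([np.ones((k, 1)), X[nn]])     # add the constant term w0
    W = np.diag(K)
    # Weighted least squares: minimizes sum_x K(d(x_q, x)) * (f(x) - f_hat(x))^2
    w = np.linalg.solve(Xn.T @ W @ Xn, Xn.T @ W @ y[nn])
    return np.array([1.0, *xq]) @ w              # evaluate the local model at x_q

# Illustrative usage on a noisy sine curve
rng = np.random.default_rng(1)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)
print(lwr_predict(X, y, np.array([3.0])))  # roughly sin(3.0) = 0.141
```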
Distance Measures
- Distance functions can be classified into:
- Global functions
- The same function is used in all the space
- Query based functions
- The parameters of the distance measure adjust to each query, typically minimizing the error with cross-validation
- Distance functions based on points
- Each data point has its own distance function
- We change / adjust the distance function in order to enhance our predictions
Distance Measures
- For continuous data
Euclidean \(d_E(x,q)=\sqrt{\sum\limits_{j}(x_j-q_j)^2} = \sqrt{(x-q)^T(x-q)}\)
Euclidean, diagonally weighted \(d_m(x,q)=\sqrt{\sum\limits_{j}(m_j(x_j-q_j))^2}=\sqrt{(x-q)^TM^TM(x-q)}=d_E(Mx,Mq)\)
where \(m_j\) is the scale factor in dimension \(j\) and \(M\) is a diagonal matrix with \(M_{jj}=m_j\)
Distance Measures
- For continuous data
- Complete Euclidean or Mahalanobis
- \(d_M(x,q) = \sqrt{(x-q)^TM^TM(x-q)}=d_E(Mx,Mq)\)
- where \(M\) can be arbitrary
- \(L_p\) norm or Minkowski distance
- \(d_p(x,q) = \left(\sum\limits_{i}|x_i - q_i|^p\right)^{\frac{1}{p}}\)
- Diagonally weighted or fully weighted \(L_p\) norm
- Same as the Minkowski distance but including weights
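- A sketch of these distance variants (the scale factors and matrices are illustrative):

```python
import numpy as np

def euclidean(x, q):
    diff = x - q
    return np.sqrt(diff @ diff)

def diagonally_weighted(x, q, m):
    # m[j] is the scale factor for dimension j; equivalent to d_E(Mx, Mq) with M = diag(m)
    return euclidean(m * x, m * q)

def fully_weighted(x, q, M):
    # "Complete" Euclidean / Mahalanobis-style distance: M can be an arbitrary matrix
    return euclidean(M @ x, M @ q)

def minkowski(x, q, p):
    return np.sum(np.abs(x - q) ** p) ** (1.0 / p)

x, q = np.array([1.0, 2.0]), np.array([3.0, 5.0])
M = np.array([[1.0, 0.5], [0.0, 2.0]])
print(euclidean(x, q),
      diagonally_weighted(x, q, np.array([2.0, 0.5])),
      fully_weighted(x, q, M),
      minkowski(x, q, 1), minkowski(x, q, 3))
```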
Applying the Distance Measure
- We may include a range or scale in which to apply the generalization function, \(H\)
- Which instances are used to create the local model?
- Some options are
- Fixed bandwidth selection
- \(H\) is a constant value
- The local model then always covers a region of constant size and shape
- Nearest-neighbor selection
- \(H\) is set to the distance to the \(k\)-th nearest neighbor
- The volume covered changes according to the local density of the data
- Global bandwidth selection
- \(H\) is adjusted globally by an optimization process
Applying the Distance Measure
- We may include a range or scale in which to apply the generalization function
- Some options are
- Query based
- \(H\) is selected for each query through an optimization process
- Based on points
- Each data point has its own associated \(H\)
Weight Functions or Kernels
- The weight function must be maximal at distance 0 and decay smoothly as the distance grows
- It is not necessary to normalize the kernel
- It does not have to be unimodal
- It must always be non-negative
- Examples of kernels
- Take a negative power of the distance (inverse distance)
- To avoid infinite values at \(d = 0\):
- \(K(d) = \frac{1}{1+d^p}\)
- One of the most popular is the Gaussian kernel, \(K(d) = e^{-d^2}\)
Weight Functions or Kernels
- Examples of kernels
- A related one is the exponential kernel, \(K(d) = e^{-|d|}\)
- Quadratic kernel, also called Epanechnikov or Bartlett-Priestley: \[K(d) = \begin{cases} 1-d^2, & \text{if } |d| < 1, \\
0, & \text{otherwise} \end{cases}\]
- It ignores data farther than 1 unit from the query
Weight Functions or Kernels
- Examples of kernels
The tricube kernel \[K(d) = \begin{cases} (1-|d|^3)^3, & \text{if } |d| < 1, \\
0, & \text{otherwise} \end{cases}\]
Uniform weighting kernel \[K(d) = \begin{cases} 1, & \text{if } |d| < 1, \\
0, & \text{otherwise} \end{cases}\]
Triangular kernel \[K(d) = \begin{cases} 1-|d|, & \text{if } |d| < 1, \\
0, & \text{otherwise} \end{cases}\]
Weight Functions or Kernels
- Examples of kernels
- Variants of the triangular kernel \[K(d) = \begin{cases} \frac{1-|d|}{|d|}, & \text{if } |d| < 1, \\
0, & \text{otherwise} \end{cases}\]
- We can create new kernels
- According to the authors, the exact choice of kernel is not that critical
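- Most of the kernels above, written as simple functions of a (possibly scaled) distance \(d\) so they can be compared directly:

```python
import numpy as np

def inverse_distance(d, p=2): return 1.0 / (1.0 + d ** p)
def gaussian(d):              return np.exp(-d ** 2)
def exponential(d):           return np.exp(-np.abs(d))
def epanechnikov(d):          return np.where(np.abs(d) < 1, 1 - d ** 2, 0.0)
def tricube(d):               return np.where(np.abs(d) < 1, (1 - np.abs(d) ** 3) ** 3, 0.0)
def uniform(d):               return np.where(np.abs(d) < 1, 1.0, 0.0)
def triangular(d):            return np.where(np.abs(d) < 1, 1 - np.abs(d), 0.0)

d = np.linspace(0, 1.5, 4)
for K in (inverse_distance, gaussian, exponential, epanechnikov, tricube, uniform, triangular):
    print(K.__name__, np.round(K(d), 3))
```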
Weight Functions or Kernels
- Scarce data and remarks
- A possible problem may arise when we have a small dataset
- Possible solutions
- Try to introduce new examples artificially
- Reduce the dimensionality using an attribute selection process
- The efficiency of LWR depends on how many examples we have
- Use a \(kd\)-tree representation to access nearby examples quickly
Weight Functions or Kernels
- Scarce data and remarks
- In general, LWR is more expensive than nearest neighbors and weighted averages
- On the one hand, any representation can be used to create the local model
- Decision trees, rules, neural networks, etc.
- A simple way to do it is to take the nearest neighbors, train a model with them, and then use that model to classify the query
Weight Functions or Kernels
- To implement LWR we require
- A distance function
- LWR assumes that the nearest data are the most relevant
- The distance function doesn’t have to comply with the requirements of a distance metric
- Enough data to construct the models
- Data with an output \(y_i\)
- Adequate representation
Weight Functions or Kernels
- Scarce data and remarks
- Possible enhancements / future research directions
- Combine different data types (i.e. continuous and discrete)
- Enhance the way to tune parameters
- Local tuning at multiple scales
- Use gradients to tune parameters
- Define how much cross validation is enough
- Use probabilistic methods
- Forget data
- Enhance computational aspects with large amounts of data
- Do not make the learning step completely lazy
Radial Basis Functions
- Radial basis functions (RBF) use a combination of kernel functions that decrease with distance
- These kernels play the role of the \(K(d(x_q,x))\) terms in the previous expressions
- \(\hat{f}(x)=w_0+\sum\limits_{u=1}^{k}w_uK_u(d(x_u,x))\)
- For each instance \(x_u\), there exists a Kernel function that decreases with the distance to \(x_u\)
Radial Basis Functions
- The most common choice is Gaussian (normal) functions for the \(K_u\)
\[K_u(d(x_u,x))=\frac{1}{\sqrt{2\pi}\,\sigma}\,e^{-\frac{d^2(x_u,x)}{2\sigma^2}}\]
- The function \(\hat{f}(x)\) basically consists of two elements:
- One that computes the kernel functions and another that combines them linearly with the weights \(w_u\)
- These can be learned with a two-layer neural network
- Training takes two steps
- Find the center \(x_u\) and width \(\sigma\) of each kernel function
- Find the weights \(w_u\) that minimize the global error
Radial Basis Functions
- Options
- Center one function on each training point and give all of them the same standard deviation
- Select a limited number of functions uniformly distributed over the instance space
- Select functions that are non-uniformly distributed
- Useful when the instances themselves are not uniformly distributed
- We can take a sample of the instances or try to identify prototypes (possibly with a clustering algorithm)
- We can use EM to choose the \(k\) means of the Gaussian distributions that best fit the data
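- A minimal sketch of the two training steps, using one of the center-selection options above (a random sample of the instances); the width \(\sigma\), \(k\), and the least-squares fit of the weights are illustrative choices:

```python
import numpy as np

def rbf_design(X, centers, sigma):
    # One Gaussian kernel K_u per center x_u, plus a constant column for w_0
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    Phi = np.exp(-d2 / (2 * sigma ** 2))
    return np.hstack([np.ones((X.shape[0], 1)), Phi])

def train_rbf(X, y, k=15, sigma=0.7, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: choose the centers x_u (here: a random sample of the instances)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    # Step 2: fit the weights w_u by minimizing the global squared error
    Phi = rbf_design(X, centers, sigma)
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return centers, sigma, w

def predict_rbf(Xq, centers, sigma, w):
    return rbf_design(Xq, centers, sigma) @ w

# Illustrative usage on a noisy sine curve
rng = np.random.default_rng(2)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)
centers, sigma, w = train_rbf(X, y)
print(predict_rbf(np.array([[3.0]]), centers, sigma, w))  # roughly sin(3.0)
```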
Radial Basis Functions
- In the case of RBF, learning is performed in advance (eagerly) with the training instances
- Then, classifying new instances only requires evaluating the learned function
Case-based Reasoning
- Alternative to instance-based learning
- Uses a richer symbolic representation to describe each instance
- A case-based reasoner solves new problems by adapting solutions that were used to solve similar problems
Case-based Reasoning
- Instances or cases usually have:
- A representation of the problem that they solve
- A description of how they solve it
- The obtained result
- It is clear that the distance measures become more complex
- Generating instances (finding hypotheses) also becomes more complicated; it generally involves:
- Domain knowledge
- Mechanisms to perform:
- Search processes
- Sophisticated reasoning