Instance-based Learning
Machine Learning
Jesus A. Gonzalez
February 16, 2020
Instance-based Learning
- Different from the types of learning we have seen so far
- Stores the training examples
- In order to classify a new object
- Extracts the most similar stored objects
- Uses their classifications to classify the new object
Instance-based Learning
- The learning process is trivial
- The classification process takes most of the time
- Also known as:
- Lazy learning or memory-based learning
- The training data is not processed until it is required
- Uses a distance measure
k-nearest neighbors
- k-NN (k-nearest neighbors)
- Robust to noisy training examples
- Retrieves the stored objects closest to a new instance
- Continuous attributes
- Euclidean distance over the \(n\) possible attributes
k-nearest neighbors - Algorithm
\[d(x_i, x_j) = \sqrt{\sum\limits_{r=1}^{n}(a_r(x_i)-a_r(x_j))^2}\]
- The result of \(k\)-NN can be discrete or continuous
- Discrete: the result of the classification is the most common class among the \(k\) nearest neighbors
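- A minimal sketch of the discrete \(k\)-NN rule just described (the data, the query, and \(k\) below are illustrative, not from the slides):

```python
import math
from collections import Counter

def euclidean(a, b):
    # d(x_i, x_j) = sqrt(sum_r (a_r(x_i) - a_r(x_j))^2)
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_classify(training, query, k=3):
    # training: list of (attribute_vector, class_label) pairs
    neighbors = sorted(training, key=lambda ex: euclidean(ex[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]  # most common class among the k neighbors

# Toy usage
data = [((1.0, 1.0), 'A'), ((1.2, 0.8), 'A'), ((5.0, 5.0), 'B'), ((4.8, 5.2), 'B')]
print(knn_classify(data, (1.1, 0.9), k=3))  # -> 'A'
```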
k-nearest neighbors - Algorithm
- Voronoi diagram
- A geometric construction that partitions the Euclidean plane
- Simple interpolation, based on Euclidean distance
- Assigns a new instance the class of its nearest neighbor
k-nearest neighbors - Continuous classifications
- We can take the mean of the outputs of the \(k\) nearest neighbors
\[f(x_q) = \frac{\sum\limits_{i=1}^{k}f(x_i)}{k}\]
k-nearest neighbors - Weighted Distances
- Obvious extension
- Weight the classifications of the neighbors according to their distance to the object to be classified
- The classes of the closest neighbors receive more weight
- Weighted average
- Averages the output of the points inversely weighted by their distance
k-nearest neighbors - Weighted Distances
\[f(x_q) = \arg\max\limits_{v \in V} \sum\limits_{i=1}^{k}w_i\delta(v,f(x_i))\]
where \(w_i = \frac{1}{d(x_q,x_i)^2}\); if \(d(x_q,x_i) = 0\) (the query matches a stored example exactly), we simply return \(f(x_i)\).
k-nearest neighbors - Weighted Distances
- For continuous classes \[f(x_q) = \frac{\sum\limits_{i=1}^{k}w_if(x_i)}{\sum\limits_{i=1}^{k}w_i}\]
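- A sketch of both weighted variants, using \(w_i = 1/d(x_q,x_i)^2\) and returning the stored output directly on an exact match (function and variable names are illustrative):

```python
import math
from collections import defaultdict

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def weighted_knn_classify(training, query, k=3):
    # Discrete case: each of the k neighbors votes with weight w_i = 1 / d(x_q, x_i)^2
    neighbors = sorted(training, key=lambda ex: euclidean(ex[0], query))[:k]
    scores = defaultdict(float)
    for x, label in neighbors:
        d = euclidean(x, query)
        if d == 0:
            return label                 # exact match: return its class directly
        scores[label] += 1.0 / d ** 2
    return max(scores, key=scores.get)

def weighted_knn_regress(training, query, k=3):
    # Continuous case: weighted average of the neighbors' outputs
    neighbors = sorted(training, key=lambda ex: euclidean(ex[0], query))[:k]
    num = den = 0.0
    for x, y in neighbors:
        d = euclidean(x, query)
        if d == 0:
            return y
        w = 1.0 / d ** 2
        num += w * y
        den += w
    return num / den
```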
k-nearest neighbors - Note
- Assumption:
- Nearest neighbors give the best classification
- Using all the attributes
- Problem
- Many of the attributes may be irrelevant, and these can dominate the distance calculation
- For example, 2 relevant attributes can be drowned out by 20 irrelevant ones
k-nearest neighbors - Note
- What should we do?
- We can weight the distance of each attribute, giving more weight to more relevant attributes
- Try to determine the weights from known training examples
- Changing those weights to decrease the error
- We can eliminate the attributes that we consider irrelevant
k-nearest neighbors - Note
- Additional practical element
- Store the examples efficiently
- Use tree-based representations (kd-trees)
- Instances are organized according to their closeness
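- A sketch using SciPy's \(kd\)-tree (assumes SciPy is available; the data are illustrative):

```python
import numpy as np
from scipy.spatial import cKDTree  # kd-tree implementation from SciPy

# Illustrative training set: rows are instances, columns are attributes
X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [4.8, 5.2]])
y = np.array(['A', 'A', 'B', 'B'])

tree = cKDTree(X)                         # instances are organized by closeness
dist, idx = tree.query([1.1, 0.9], k=3)   # 3 nearest neighbors of the query
print(list(zip(dist, y[idx])))            # distances and classes of the neighbors
```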
Locally Weighted Regression
- A generalization that builds an explicit function fitted to the training data in the vicinity of \(x_q\)
- It can be used with linear functions, quadratic functions, neural networks, etc.
Locally Weighted Regression
- If we use a linear function
\[\hat{f}(x)=w_0+w_1a_1(x) + \dots + w_na_n(x)\]
- We can use gradient descent to adjust the weights that minimize the error
- We can express the error as the sum of squared differences between the target values and the predictions
\[E(W) = \frac{1}{2}\sum\limits_{x \in D} (f(x) - \hat{f}(x))^2\]
Locally Weighted Regression
- We want to find the weight vector that minimizes the error \(E\)
- We change the weights in the direction that decreases the error the most
- The direction of this change is obtained through the gradient
- The gradient points in the direction of steepest increase of the error
- So the steepest decrease is the negative of the gradient
Locally Weighted Regression
- Weights update rule
- \(W \leftarrow W + \Delta W\)
- \(\Delta W = -\alpha \nabla E\)
- where \(\alpha\) is the learning rate
- It controls how large a step we take against the gradient of the error
Locally Weighted Regression
\[\Delta w_i = \alpha\sum\limits_{x \in D} (f(x) - \hat{f}(x))\,a_i(x)\]
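- A sketch of this batch update rule applied to a global linear fit (the learning rate, number of epochs, and synthetic data are illustrative):

```python
import numpy as np

def fit_linear_gd(X, y, alpha=0.01, epochs=500):
    """Fit f_hat(x) = w0 + w1*a1(x) + ... + wn*an(x) by batch gradient descent."""
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend a_0(x) = 1 for w0
    w = np.zeros(X1.shape[1])
    for _ in range(epochs):
        error = y - X1 @ w          # f(x) - f_hat(x) for every training example
        w += alpha * X1.T @ error   # Delta w_i = alpha * sum_x (f(x) - f_hat(x)) a_i(x)
    return w

# Illustrative data generated from y = 2 + 3*x plus a little noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(50, 1))
y = 2 + 3 * X[:, 0] + rng.normal(0, 0.05, size=50)
print(fit_linear_gd(X, y))  # approximately [2, 3]
```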
Locally Weighted Regression
- Figure: examples of locally weighted regression fits (source: danupranantha.wordpress.com/)
Locally Weighted Regression
- To modify the weights
- Minimize the square error using the \(k\) nearest neighbors
\[E(W) = \frac{1}{2} \sum\limits_{x \in k-nearest-neighbors} (f(x) - \hat{f}(x))^2\]
- Minimize the squared error using all the examples weighted by their distance to \(x_q\)
\[E(W) = \frac{1}{2}\sum\limits_{x \in D}(f(x) - \hat{f}(x))^2K(d(x_q,x))\]
- Minimize the squared error using the \(k\) nearest neighbors weighted by their distance to \(x_q\)
\[E(W) = \frac{1}{2}\sum\limits_{x \in k-nearest-neighbors}(f(x)-\hat{f}(x))^2K(d(x_q,x))\]
- For the last case, the update rule is:
\[\Delta w_i = \alpha \sum\limits_{x \in k-nearest-neighbors} K(d(x_q,x))(f(x) - \hat{f}(x))\,a_i(x)\]
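- A sketch of the third criterion: at query time, fit a linear model to the \(k\) nearest neighbors, weighting each by a Gaussian kernel of its distance to \(x_q\). The kernel width, \(k\), and the closed-form weighted least-squares solve are implementation choices, not prescribed by the slides:

```python
import numpy as np

def lwr_predict(X, y, xq, k=20, tau=0.5):
    """Locally weighted linear regression evaluated at a single query point xq."""
    d = np.linalg.norm(X - xq, axis=1)           # distances d(x_q, x)
    nn = np.argsort(d)[:k]                       # indices of the k nearest neighbors
    K = np.exp(-(d[nn] ** 2) / (2 * tau ** 2))   # kernel weights K(d(x_q, x))
    Xn = np.hstack([np.ones((k, 1)), X[nn]])     # add the constant term w0
    W = np.diag(K)
    # Weighted least squares: minimizes sum_x K(d(x_q, x)) * (f(x) - f_hat(x))^2
    w = np.linalg.solve(Xn.T @ W @ Xn, Xn.T @ W @ y[nn])
    return np.array([1.0, *xq]) @ w              # evaluate the local model at x_q

# Illustrative usage on a noisy sine curve
rng = np.random.default_rng(1)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)
print(lwr_predict(X, y, np.array([3.0])))  # roughly sin(3.0) = 0.141
```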
Distance Measures
- Distance functions can be classified into:
- Global functions
- The same function is used in all the space
- Query based functions
- The parameters of the distance measure adjust to each query, typically minimizing the error with cross-validation
- Distance functions based on points
- Each data point has its own distance function
- We change / adjust the distance function in order to enhance our predictions
Distance Measures
- For continuous data
Euclidean \(d_E(x,q)=\sqrt{\sum\limits_{j}(x_j-q_j)^2} = \sqrt{(x-q)^T(x-q)}\)
Euclidean, diagonally weighted \(d_m(x,q)=\sqrt{\sum\limits_{j}(m_j(x_j-q_j))^2}=\sqrt{(x-q)^TM^TM(x-q)}=d_E(Mx,Mq)\)
where \(m_j\) is the scale factor in dimension \(j\) and \(M\) is a diagonal matrix with \(M_{jj}=m_j\)
Distance Measures
- For continuous data
- Complete Euclidean or Mahalanobis
- \(d_M(x,q) = \sqrt{(x-q)^TM^TM(x-q)}=d_E(Mx,Mq)\)
- where \(M\) can be arbitrary
- \(L_p\) norm or Minkowski distance
- \(d_p(x,q) = \left(\sum\limits_{i}|x_i - q_i|^p\right)^{\frac{1}{p}}\)
- Diagonally weighted or fully weighted \(L_p\) norm
- Same as the Minkowski distance but including weights
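- A sketch of these distance variants (the scale factors and matrices are illustrative):

```python
import numpy as np

def euclidean(x, q):
    diff = x - q
    return np.sqrt(diff @ diff)

def diagonally_weighted(x, q, m):
    # m[j] is the scale factor for dimension j; equivalent to d_E(Mx, Mq) with M = diag(m)
    return euclidean(m * x, m * q)

def fully_weighted(x, q, M):
    # "Complete" Euclidean / Mahalanobis-style distance: M can be an arbitrary matrix
    return euclidean(M @ x, M @ q)

def minkowski(x, q, p):
    return np.sum(np.abs(x - q) ** p) ** (1.0 / p)

x, q = np.array([1.0, 2.0]), np.array([3.0, 5.0])
M = np.array([[1.0, 0.5], [0.0, 2.0]])
print(euclidean(x, q),
      diagonally_weighted(x, q, np.array([2.0, 0.5])),
      fully_weighted(x, q, M),
      minkowski(x, q, 1), minkowski(x, q, 3))
```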
Applying the Distance Measure
- We may include a range or scale in which to apply the generalization function, \(H\)
- Which instances are used to create the local model?
- Some options are
- Fixed bandwidth selection
- \(H\) is a constant value
- The local model then always covers a region of constant size and shape
- Nearest-neighbor selection
- \(H\) is set to the distance to the \(k\)-th nearest neighbor
- The volume covered changes according to the local density of the data
- Global bandwidth selection
- \(H\) is adjusted globally by an optimization process
Applying the Distance Measure
- We may include a range or scale in which to apply the generalization function
- Some options are
- Query based
- \(H\) is selected for each query through an optimization process
- Based on points
- Each data point has its own associated \(H\)
Weight Functions or Kernels
- The weight function must be maximal at distance 0 and decay smoothly as the distance grows
- It is not necessary to normalize the kernel
- It does not have to be unimodal
- It must always be non-negative
- Examples of kernels
- Take a negative power of the distance (inverse distance)
- To avoid infinite values at \(d = 0\):
- \(K(d) = \frac{1}{1+d^p}\)
- One of the most popular is the Gaussian kernel, \(K(d) = e^{-d^2}\)
Weight Functions or Kernels
- Examples of kernels
- A related one is the exponential kernel, \(K(d) = e^{-|d|}\)
- Quadratic kernel, also called Epanechnikov or Bartlett-Priestley: \[K(d) = \begin{cases} 1-d^2, & \text{if } |d| < 1, \\
0, & \text{otherwise} \end{cases}\]
- It ignores data farther than 1 unit from the query
Weight Functions or Kernels
- Examples of kernels
The tricube kernel \[K(d) = \begin{cases} (1-|d|^3)^3, & \text{if } |d| < 1, \\
0, & \text{otherwise} \end{cases}\]
Uniform weighting kernel \[K(d) = \begin{cases} 1, & \text{if } |d| < 1, \\
0, & \text{otherwise} \end{cases}\]
Triangular kernel \[K(d) = \begin{cases} 1-|d|, & \text{if } |d| < 1, \\
0, & \text{otherwise} \end{cases}\]
Weight Functions or Kernels
- Examples of kernels
- Variants of the triangular kernel \[K(d) = \begin{cases} \frac{1-|d|}{|d|}, & \text{if } |d| < 1, \\
0, & \text{otherwise} \end{cases}\]
- We can create new kernels
- According to the authors, the exact choice of kernel is not that critical
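- Most of the kernels above, written as simple functions of a (possibly scaled) distance \(d\) so they can be compared directly:

```python
import numpy as np

def inverse_distance(d, p=2): return 1.0 / (1.0 + d ** p)
def gaussian(d):              return np.exp(-d ** 2)
def exponential(d):           return np.exp(-np.abs(d))
def epanechnikov(d):          return np.where(np.abs(d) < 1, 1 - d ** 2, 0.0)
def tricube(d):               return np.where(np.abs(d) < 1, (1 - np.abs(d) ** 3) ** 3, 0.0)
def uniform(d):               return np.where(np.abs(d) < 1, 1.0, 0.0)
def triangular(d):            return np.where(np.abs(d) < 1, 1 - np.abs(d), 0.0)

d = np.linspace(0, 1.5, 4)
for K in (inverse_distance, gaussian, exponential, epanechnikov, tricube, uniform, triangular):
    print(K.__name__, np.round(K(d), 3))
```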
Weight Functions or Kernels
- Scarce data and remarks
- A possible problem may arise when we have a small dataset
- Possible solutions
- Try to introduce new examples artificially
- Reduce the dimensionality using an attribute selection process
- The efficiency of LWR depends on how many examples we have
- Use a \(kd\)-tree representation to access nearby examples quickly
Weight Functions or Kernels
- Scarce data and remarks
- In general, LWR is more expensive than nearest neighbors and weighted averages
- On the one hand, any representation can be used to create the local model
- Decision trees, rules, neural networks, etc.
- A simple way to do it is to take the nearest neighbors, train a model with them, and then use that model to classify the query
Weight Functions or Kernels
- To implement LWR we require
- A distance function
- LWR assumes that the nearest data are the most relevant
- The distance function doesn’t have to comply with the requirements of a distance metric
- Enough data to construct the models
- Data with an output \(y_i\)
- Adequate representation
Weight Functions or Kernels
- Scarce data and remarks
- Possible enhancements / future research directions
- Combine different data types (i.e. continuous and discrete)
- Enhance the way to tune parameters
- Local tuning at multiple scales
- Use gradients to tune parameters
- Define how much cross validation is enough
- Use probabilistic methods
- Forget data
- Enhance computational aspects with large amounts of data
- Do not make the learning step completely lazy
Radial Basis Functions
- Radial basis functions (RBF) use a combination of kernel functions that decrease with distance
- These kernels play the role of the \(K(d(x_q,x))\) terms in the previous expressions
- \(\hat{f}(x)=w_0+\sum\limits_{u=1}^{k}w_uK_u(d(x_u,x))\)
- For each instance \(x_u\), there exists a Kernel function that decreases with the distance to \(x_u\)
Radial Basis Functions
- The most common choice is Gaussian (normal) functions for the \(K_u\)
\[K_u(d(x_u,x))=\frac{1}{\sqrt{2\pi}\,\sigma}\,e^{-\frac{d^2(x_u,x)}{2\sigma^2}}\]
- The function \(\hat{f}(x)\) basically consists of two elements:
- One that computes the kernel functions and another that combines them linearly with the weights \(w_u\)
- These can be learned with a two-layer neural network
- Training takes two steps
- Find the center \(x_u\) and width \(\sigma\) of each kernel function
- Find the weights \(w_u\) that minimize the global error
Radial Basis Functions
- Options
- Center one function on each training point and give all of them the same standard deviation
- Select a limited number of functions uniformly distributed over the instance space
- Select functions that are non-uniformly distributed
- Useful when the instances themselves are not uniformly distributed
- We can take a sample of the instances or try to identify prototypes (possibly with a clustering algorithm)
- We can use EM to choose the \(k\) means of the Gaussian distributions that best fit the data
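- A minimal sketch of the two training steps, using one of the center-selection options above (a random sample of the instances); the width \(\sigma\), \(k\), and the least-squares fit of the weights are illustrative choices:

```python
import numpy as np

def rbf_design(X, centers, sigma):
    # One Gaussian kernel K_u per center x_u, plus a constant column for w_0
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    Phi = np.exp(-d2 / (2 * sigma ** 2))
    return np.hstack([np.ones((X.shape[0], 1)), Phi])

def train_rbf(X, y, k=15, sigma=0.7, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: choose the centers x_u (here: a random sample of the instances)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    # Step 2: fit the weights w_u by minimizing the global squared error
    Phi = rbf_design(X, centers, sigma)
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return centers, sigma, w

def predict_rbf(Xq, centers, sigma, w):
    return rbf_design(Xq, centers, sigma) @ w

# Illustrative usage on a noisy sine curve
rng = np.random.default_rng(2)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)
centers, sigma, w = train_rbf(X, y)
print(predict_rbf(np.array([[3.0]]), centers, sigma, w))  # roughly sin(3.0)
```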
Radial Basis Functions
- In the case of RBF, learning is performed in advance (eagerly) with the training instances
- Then, classifying new instances only requires evaluating the learned function
Case-based Reasoning
- Alternative to instance-based learning
- Uses a richer symbolic representation to describe each instance
- A case-based reasoner solves new problems by adapting solutions that were used to solve similar problems
Case-based Reasoning
- Instances or cases usually have:
- A representation of the problem that they solve
- A description of how they solve it
- The obtained result
- It is clear that the distance measures become more complex
- Generating instances (finding hypotheses) also becomes more complicated; it generally involves:
- Domain knowledge
- Mechanisms to perform:
- Search processes
- Sophisticated reasoning