Clustering
Machine Learning
Jesus A. Gonzalez
February 7, 2020
Clustering
- The process of grouping data into classes or clusters in such a way that objects of one cluster
- Have high similarity with each other
- Have low similarity with objects of other clusters
- Similarity measure based on the attributes that describe the objects
- Groups can be
- Exclusive
- With overlaps
- Probabilistic
- Hierarchical
Clustering
- Can be applied to
- Characterize customers
- Create taxonomies
- Document classification
- more…
Challenges
- Scalability
- Many algorithms work well only with small datasets
- Capacity to deal with different types of attributes
- Numeric (most common)
- Binary
- Nominal
- Ordinal
- others
- Clusters with arbitrary shapes
- Those based on numeric distances tend to find spherical clusters
Challenges
- Minimal requirements for input parameters
- Such as the number of clusters
- Ability to deal with noise
- Many approaches are sensitive to noisy data
- Independence from the order of the data
- Ability to work efficiently with high-dimensional data
- Ability to incorporate constraints
- Clusters that are easy to interpret and use
Similarity Measures
- Usually defined by proximity in a multi-dimensional space
- For numeric data, we usually perform a standardization process
Similarity Measures
- The \(z\) measure (z-score) eliminates the units of the data
- \(z_{if}=\frac{x_{if}-\mu_f}{\sigma_f}\)
- where \(\sigma_f\) is the mean absolute deviation of variable \(f\), \(\mu_f\) is its mean, and \(x_{if}\) is the \(i^{th}\) value of \(f\).
- \(\sigma_f = \frac{1}{n}(|x_{1f}-\mu_f| + |x_{2f}-\mu_f| + \dots + |x_{nf}-\mu_f|)\)
- \(\mu_f = \frac{1}{n}(x_{1f} + x_{2f} + \dots + x_{nf})\)
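A minimal sketch of this standardization, assuming NumPy is available and `X` is a hypothetical data matrix with one column per variable \(f\):

```python
import numpy as np

def z_scores(X):
    """Standardize each column with its mean and mean absolute deviation."""
    mu = X.mean(axis=0)                  # mu_f: mean of each variable f
    sigma = np.abs(X - mu).mean(axis=0)  # sigma_f: mean absolute deviation of f
    return (X - mu) / sigma              # z_if = (x_if - mu_f) / sigma_f

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])
print(z_scores(X))
```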
Similarity Measures
- For numerical variables (linear):
- Euclidean distance
- \(d(i,j) = \sqrt{|x_{i1}-x_{j1}|^2 + |x_{i2}-x_{j2}|^2 + \dots + |x_{in}-x_{jn}|^2}\)
- Manhattan distance
- \(d(i,j) = |x_{i1}-x_{j1}| + |x_{i2}-x_{j2}| + \dots + |x_{in}-x_{jn}|\)
- Minkowski distance
- \(d(i,j) = (|x_{i1}-x_{j1}|^q + |x_{i2}-x_{j2}|^q + \dots + |x_{in}-x_{jn}|^q)^{1/q}\)
- If \(q = 1\) it is the Manhattan distance and if \(q = 2\) it is the Euclidean distance
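A minimal sketch of these distances with NumPy; the vectors `xi` and `xj` are hypothetical objects:

```python
import numpy as np

def minkowski(xi, xj, q=2):
    """Minkowski distance; q=1 gives Manhattan, q=2 gives Euclidean."""
    return np.sum(np.abs(xi - xj) ** q) ** (1.0 / q)

xi = np.array([1.0, 2.0, 3.0])
xj = np.array([4.0, 0.0, 3.0])
print(minkowski(xi, xj, q=1))  # Manhattan: 5.0
print(minkowski(xi, xj, q=2))  # Euclidean: ~3.61
```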
Similarity Measures
- For numerical variables (linear)
- Weighted distance (e.g. Euclidean):
- \(d(i,j) = \sqrt{w_1|x_{i1}-x_{j1}|^2 + w_2|x_{i2}-x_{j2}|^2 + \dots + w_n|x_{in}-x_{jn}|^2}\)
- Properties of distances
- \(d(i,j) \geq 0\)
- \(d(i,i) = 0\)
- \(d(i,j) = d(j,i)\)
- \(d(i,j) \leq d(i,h) + d(h,j)\).
Similarity Measures
- Binary variables (0, 1):
- Symmetric (both values have the same weight):
- \(d(i,j) = \frac{r+s}{q+r+s+t}\)
- where:
- \(q\) = number of values that are 1 in both
- \(r\) = number of values that are 1 in \(i\) and 0 in \(j\)
- \(s\) = number of values that are 0 in \(i\) and 1 in \(j\)
- \(t\) = number of values that are 0 in both
- Asymmetric (the most important and rarer value is coded as 1); the corresponding similarity \(\frac{q}{q+r+s}\) is known as the Jaccard coefficient
- \(d(i,j)=\frac{r+s}{q+r+s}\)
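A minimal sketch of both binary dissimilarities, assuming the objects are given as NumPy arrays of 0/1 values:

```python
import numpy as np

def binary_dissimilarity(xi, xj, symmetric=True):
    q = np.sum((xi == 1) & (xj == 1))   # 1 in both
    r = np.sum((xi == 1) & (xj == 0))   # 1 in i, 0 in j
    s = np.sum((xi == 0) & (xj == 1))   # 0 in i, 1 in j
    t = np.sum((xi == 0) & (xj == 0))   # 0 in both
    if symmetric:
        return (r + s) / (q + r + s + t)
    return (r + s) / (q + r + s)        # asymmetric: 1 - Jaccard similarity

xi = np.array([1, 0, 1, 1, 0])
xj = np.array([1, 1, 0, 1, 0])
print(binary_dissimilarity(xi, xj))                   # 0.4
print(binary_dissimilarity(xi, xj, symmetric=False))  # 0.5
```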
Similarity Measures
- Nominal variables (e.g. color):
- \(d(i,j) = \frac{p-m}{p}\)
- where
- \(m\) = number of equal values
- \(p\) = total number of variables
- May add weights to give more importance to \(m\)
- May create new asymmetric binary variables from the nominal ones (e.g. is yellow or not)
- We can also use one-hot encoding
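A minimal sketch of this simple matching dissimilarity, assuming each object is a sequence of nominal attribute values:

```python
import numpy as np

def nominal_dissimilarity(xi, xj):
    """Fraction of nominal attributes on which the two objects disagree."""
    xi, xj = np.asarray(xi), np.asarray(xj)
    p = len(xi)                # total number of nominal attributes
    m = np.sum(xi == xj)       # number of matching values
    return (p - m) / p

print(nominal_dissimilarity(["red", "small", "round"],
                            ["red", "large", "round"]))  # 1/3
```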
Similarity Measures
- Ordinal variables: nominal variables with a meaningful order. The order is important, but not the magnitude. Steps:
- Replace the value of each variable by a rank \(r_{if} \in \{1, \dots, M_f\}\), where \(M_f\) is the rank of the highest value of the variable
- Map the rank to the interval \([0, 1]\) so that each variable has equal weight
- \(z_{if} = \frac{r_{if}-1}{M_f - 1}\)
- Use any of the previous numeric measures
- Example:
- Educational level: first, second, third
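A minimal sketch of the rank mapping for this example, assuming the ordered levels are listed explicitly:

```python
# z_if = (r_if - 1) / (M_f - 1), with ranks assigned from the given order
levels = ["first", "second", "third"]                     # ordered educational levels
rank = {level: i + 1 for i, level in enumerate(levels)}   # r_if in {1, ..., M_f}
M = len(levels)

def ordinal_to_numeric(value):
    return (rank[value] - 1) / (M - 1)

print([ordinal_to_numeric(v) for v in levels])  # [0.0, 0.5, 1.0]
```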
Similarity Measures
- Scalar non-linear variables, for example, variables that follow an exponential scale
- Possibilities
- Treat them as normal numeric variables
- Apply a logarithm (or some other transformation) to convert them to a linear scale
- Consider them as ordinal variables
Similarity Measures
- Mixed variables
- A possibility is to scale all variables to a common interval (between 0 and 1)
- \(d(i,j) = \frac{\sum\limits_{f=1}^{p}\delta_{ij}^{(f)} d_{ij}^{(f)}} {\sum\limits_{f=1}^{p} \delta_{ij}^{(f)}}\)
- where:
- \(\delta_{ij}^{(f)} = 0\)
- If either \(x_{if}\) or \(x_{jf}\) is missing, or if both values are 0 and variable \(f\) is asymmetric binary.
- Otherwise, \(\delta_{ij}^{(f)} = 1\)
- \(d_{ij}^{(f)}\) depends on the type:
- If \(f\) is binary or nominal: \(d_{ij}^{(f)} = 0\) if \(x_{if}=x_{jf}\), if not, \(d_{ij}^{(f)} = 1\)
- If \(f\) is numeric-linear: \(d_{ij}^{(f)}= \frac{|x_{if}-x_{jf}|}{\max_h x_{hf} - \min_h x_{hf}}\)
- If \(f\) is ordinal or numeric-non-linear: compute indexes \(r_{if}\) and \(z_{if}=\frac{r_{if}-1}{M_{f}-1}\) and take \(z_{if}\) as a linear-numeric variable
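A minimal sketch of this mixed-variable dissimilarity, under the simplifying assumptions that missing values are encoded as `None`, every variable is either nominal or (rank-mapped) numeric, and the numeric ranges \(\max_h x_{hf} - \min_h x_{hf}\) are supplied:

```python
def mixed_dissimilarity(xi, xj, types, ranges):
    """types[f] in {"nominal", "numeric"}; ranges[f] = max - min for numeric f."""
    num, den = 0.0, 0.0
    for f, t in types.items():
        if xi[f] is None or xj[f] is None:
            continue                      # delta_ij^(f) = 0: skip this variable
        if t == "nominal":
            d = 0.0 if xi[f] == xj[f] else 1.0
        else:                             # numeric-linear (or already rank-mapped)
            d = abs(xi[f] - xj[f]) / ranges[f]
        num += d                          # delta_ij^(f) = 1
        den += 1.0
    return num / den if den > 0 else 0.0

types  = {"color": "nominal", "weight": "numeric"}
ranges = {"weight": 10.0}
a = {"color": "red", "weight": 2.0}
b = {"color": "blue", "weight": 7.0}
print(mixed_dissimilarity(a, b, types, ranges))   # (1 + 0.5) / 2 = 0.75
```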
Clustering Methods
- There exist many clustering methods
- We will only cover some of them
Clustering Methods
- Methods based on partitions
- Build \(k\) partitions from the data
- Each partition represents a group or cluster
- Each group has at least one element and each element belongs to a single group
- These methods create an initial partition and iterate until a stop criterion is reached
- The most popular are \(k\)-means and \(k\)-medoids
- Others
Clustering Methods
- Hierarchical methods
- Agglomerative method or bottom-up
- Begins with a group for each object and joins the most similar groups until reaching a single group or other stop criterion
- AGNES, BIRCH, CURE, ROCK
- Divisive method or top-down
- Begins with a single group and divides it into smaller groups until creating groups with a single element or other stop criterion
- DIANA, MONA
Clustering Methods
- Methods based on Densities
- Grow clusters as long as the density (number of objects) in the neighborhood exceeds a certain threshold
Clustering Methods
- Methods based on grids
- Divide the space into a grid structure at different levels of resolution
Clustering Methods
- Methods based on models
- Find a model for each cluster, the one that best fits the data of that group
Clustering Methods
- Methods based on graph theory
- Use graph-based representations
- Chameleon, Delaunay triangulation graph (DTG), highly connected subgraphs (HCS), clustering identification via connectivity kernels (CLICK), cluster affinity search technique (CAST)
Clustering Methods
- Techniques based on combinatoric search
- Genetically guided algorithm (GGA)
- TS clustering
- SA clustering
Clustering Methods
- Fuzzy techniques
- Fuzzy c-means (FCM)
- Mountain method (MM)
- Possibilistic c-means clustering algorithm (PCM)
- Fuzzy c-shells (FCS)
Clustering Methods
- Techniques based on neural networks
- Learning vector quantization (LVQ)
- Self-organizing feature map (SOFM)
- ART
- Simplified ART (SART)
- Hyperellipsoidal clustering network (HEC)
- Self-splitting competitive learning network (SPLL)
Clustering Methods
- Techniques based on Kernels
- Kernel k-means
- Support vector clustering (SVC)
Clustering Methods
- Techniques for sequential data
- Sequential similarity
- Indirect sequential clustering
- Statistical sequential clustering
Clustering Methods
- Techniques for large datasets
- CLARA
- CURE
- CLARANS
- BIRCH
- DBSCAN
- DENCLUE
- WaveCluster
- FC
- ART
The k-Means Algorithm
- Takes \(k\) as a parameter and randomly chooses \(k\) elements
- These represent the center or mean of each cluster
- Each of the remaining objects is assigned to the most similar cluster
- Based on the distance between the object and the mean of the cluster
- Then, the new mean of each cluster is computed
- Iterates until the means no longer change
The k-Means Algorithm
- Usually, we use a criterion based on the squared error
- \(E = \sum\limits_{i=1}^{k}\sum\limits_{p \in C_i} |p-m_i|^2\)
- where: \(p\) represents the object and \(m_i\) the mean of cluster \(C_i\) (both are multidimensional objects).
The k-Means Algorithm - the algorithm
- randomly select \(k\) objects as the initial cluster means
- repeat
- (re)assign each object to the cluster with the most similar mean; update the means of the clusters
- until there is no change
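A minimal sketch of the algorithm above with NumPy, assuming `X` is an \((n, d)\) data matrix and Euclidean distance is used:

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=k, replace=False)]   # k random objects
    for _ in range(max_iter):
        # assignment step: each object goes to the closest mean
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: recompute the mean of each cluster (keep old mean if empty)
        new_means = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                              else means[j] for j in range(k)])
        if np.allclose(new_means, means):
            break
        means = new_means
    return labels, means

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
labels, means = k_means(X, k=2)
print(means)
```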
The k-Means Algorithm
- \(k\)-means is susceptible to extreme values (outliers) because they distort the data distribution
- We can also use the modes (\(k\)-modes) to group categorical objects
The k-Medoids Algorithm
- Another possibility is the use of medoids (\(k\)-medoids)
- To group based on the most representative object in each cluster
- Basic idea
- Find representative objects (medoids)
- The strategy is to replace one of the medoids with another object chosen at random and measure whether the quality of the resulting clusters improves
- Quality is evaluated with a cost function that measures the average dissimilarity between an object and the medoid of its cluster
- In order to see if a random object (\(O_{random}\)) is a good replacement for the current medoid (\(O_{current}\))
- Consider all objects that are not medoids
- Analyze the re-distribution of the objects, from which we compute a cost (e.g. based on the squared error)
- This is repeated until there is no improvement
- As with other methods, this doesn’t guarantee finding the global minimum
- Recommended to run the algorithm several times with different initial values
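A minimal sketch of this random-swap strategy, assuming a precomputed pairwise distance matrix `D`; a swap between a medoid and a non-medoid is kept only when the total cost (sum of distances to the closest medoid) decreases:

```python
import numpy as np

def k_medoids(D, k, max_iter=200, seed=0):
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    medoids = rng.choice(n, size=k, replace=False)

    def cost(meds):
        # each object is assigned to its closest medoid; sum those distances
        return D[:, meds].min(axis=1).sum()

    best = cost(medoids)
    for _ in range(max_iter):
        i = rng.integers(k)                                   # medoid to replace
        candidate = rng.choice(np.setdiff1d(np.arange(n), medoids))
        trial = medoids.copy()
        trial[i] = candidate
        c = cost(trial)
        if c < best:                                          # keep improving swaps only
            medoids, best = trial, c
    labels = D[:, medoids].argmin(axis=1)
    return medoids, labels

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(4, 1, (30, 2))])
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
print(k_medoids(D, k=2)[0])
```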
Hierarchical k-Means
- Another variant is to create a hierarchical k-means
- Starts with \(k\) = 2
- Continues creating successive clusters in each branch
- If we want to scale to large databases
- We can take samples of the data
COBWEB (a hierarchical clustering method)
- Creates a hierarchical clustering with a classification tree
- In a classification tree, each node is a concept
- Each node has a probabilistic description of the concept that summarizes the objects classified under it
- The probabilistic description includes
- The probability of the concept (\(P(C_i)\))
- The conditional probabilities of attribute-value pairs given the concept (\(P(A_i = v_{ij} | C_k)\))
COBWEB (a hierarchical clustering method)
- Uses a measure called category utility in order to construct the tree
\(CU=\frac{\sum\limits_{k=1}^{n}P(C_k)\left[\sum\limits_i\sum\limits_j P(A_i=v_{ij}|C_k)^2-\sum\limits_{i}\sum\limits_{j} P(A_i = v_{ij})^2\right]}{n}\)
- where: \(n\) is the number of classes in a level of the tree
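A minimal sketch of this category utility for nominal attributes, assuming the clusters are given as plain Python lists of attribute-value tuples and the probabilities are estimated from counts:

```python
from collections import Counter

def category_utility(clusters):
    """clusters: list of clusters; each cluster is a list of objects,
    each object a tuple of nominal attribute values."""
    all_objects = [o for c in clusters for o in c]
    N = len(all_objects)
    n_attrs = len(all_objects[0])

    def sum_sq_probs(objects):
        # sum_i sum_j P(A_i = v_ij)^2 estimated from the given objects
        total = 0.0
        for i in range(n_attrs):
            counts = Counter(o[i] for o in objects)
            total += sum((c / len(objects)) ** 2 for c in counts.values())
        return total

    baseline = sum_sq_probs(all_objects)          # guesses without the partition
    cu = 0.0
    for c in clusters:
        p_c = len(c) / N                          # P(C_k)
        cu += p_c * (sum_sq_probs(c) - baseline)
    return cu / len(clusters)                     # divide by n (number of classes)

clusters = [[("red", "small"), ("red", "small")],
            [("blue", "large"), ("blue", "small")]]
print(category_utility(clusters))
```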
COBWEB (a hierarchical clustering method)
- Measures the expected number of attribute values that can be correctly guessed given the partition, compared with the number that can be guessed without that partition
- If the partition doesn’t help with this, then it isn’t a good partition
- As the proportion of elements of the class having an attribute-value grows, that attribute-value becomes more predictive of the class
COBWEB (a hierarchical clustering method)
- The algorithm descends the tree looking for the best place or node for each object
- It is based on placing the object in each existing node and in a new node, and measuring which of them yields the highest category utility gain
- At each iteration, it also considers merging the two best evaluated nodes and splitting the best evaluated node
- That is, each time it considers a place in a level for a new object, it takes the two best nodes (with the highest utility) and considers merging them
- Another case is when the best place for the new object is found
- But merging nodes doesn’t yield a benefit
- Then, it considers splitting that node
- COBWEB depends on the order of the objects
- It is convenient to try it with the objects in different orders
- The denominator \(n\) in the category utility equation was added to encourage clusters with more than one element
- COBWEB assumes that the probability distributions of the attributes are independent of each other
COBWEB (a hierarchical clustering method)
- The algorithm can be extended to deal with numerical values using Gaussian distributions (CLASSIT)
\(f(a) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(a-\mu)^2}{2\sigma^2}}\)
- The equivalent of the summation of probabilities is now:
\(\sum\limits_{j} P(A_i = v_{ij})^2 \sim \int f(a_i)^2\, da_i = \frac{1}{2\sigma_i\sqrt{\pi}}\)
COBWEB (a hierarchical clustering method)
- Now we estimate the standard deviation of the numerical attribute with the data in the cluster and with the data for all clusters
\(CU = \frac{1}{n} \sum\limits_{k=1}^{n}P(C_k)\, \frac{1}{2\sqrt{\pi}}\sum\limits_{i}\left(\frac{1}{\sigma_{ik}} - \frac{1}{\sigma_i}\right)\)
- If the standard deviation is zero, the utility becomes infinite
- Then a minimum variance for each attribute is imposed (acuity)
- Another parameter is the cutoff, which is used to stop the generation of new nodes
Clustering based on Probabilities
- From a bayesian point of view
- We look for the set of clusters that is most probable given the data
- Now the objects have some probability of belonging to a group or cluster
- The basis of probabilistic clustering is a statistical model called finite mixtures (a mixture of distributions)
- A mixture is a set of \(k\) distributions, representing \(k\) clusters
Clustering based on Probabilities
- Each distribution gives the probability that an object has a particular set of attribute-value pairs
- Given that the object is member of that cluster
- The simplest mixture has only numerical attributes with Gaussian distributions with different means and variances
Clustering based on Probabilities
- The idea is that, given a set of data
- Determine the \(k\) normal distributions (means and variances) and the prior probability of each distribution (they can be different)
- If we had two distributions \(A\) and \(B\) with \(\mu_A\), \(\sigma_A\), \(\mu_B\), \(\sigma_B\), and \(P_A\) and \(P_B\) (with \(P_A + P_B = 1\)), we could generate a dataset.
- If we knew which distribution gave rise to each data point, it would be easy to compute each mean and variance, as well as \(P_A\) and \(P_B\).
\(\mu=\frac{x_1+x_2+\dots+x_n}{n}\)
\(\sigma^2=\frac{(x_1-\mu)^2+(x_2-\mu)^2+\dots+(x_n-\mu)^2}{n-1}\)
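A minimal sketch of generating data from such a two-distribution mixture with assumed parameters, and of recovering the parameters when the source of each point is known:

```python
import numpy as np

rng = np.random.default_rng(0)
mu_A, sigma_A, mu_B, sigma_B, P_A = 0.0, 1.0, 5.0, 1.5, 0.4   # assumed parameters

n = 1000
from_A = rng.random(n) < P_A                      # choose the component of each point
x = np.where(from_A,
             rng.normal(mu_A, sigma_A, n),
             rng.normal(mu_B, sigma_B, n))

# if the component labels were known, the parameters would be easy to recover
print(x[from_A].mean(), x[from_A].std(ddof=1))    # estimates of mu_A, sigma_A
print(from_A.mean())                              # estimate of P_A
```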
Clustering based on Probabilities
- The probability that an object \(x\) belongs to a cluster (e.g. \(A\)) is:
\(P(A|x) = \frac{P(x|A)P(A)}{P(x)} = \frac{f(x;\mu_A,\sigma_A)P_A}{P(x)}\)
where \(f(x;\mu_A,\sigma_A)\) is a normal distribution
\(f(x;\mu_A,\sigma_A) = \frac{1}{\sqrt{2\pi\sigma_A^2}}e^{-\frac{(x-\mu_A)^2}{2\sigma_A^2}}\)
- We can ignore \(P(x)\) and normalize at the end
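A minimal sketch of this computation for a hypothetical two-cluster mixture, normalizing at the end instead of dividing by \(P(x)\):

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

def membership(x, params):
    """params: dict cluster -> (mu, sigma, prior). Returns normalized P(cluster|x)."""
    scores = {c: normal_pdf(x, mu, s) * p for c, (mu, s, p) in params.items()}
    z = sum(scores.values())                 # normalization replaces P(x)
    return {c: v / z for c, v in scores.items()}

params = {"A": (0.0, 1.0, 0.4), "B": (5.0, 1.5, 0.6)}
print(membership(2.0, params))
```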
Expectation Maximization Algorithm
- The problem is that we don’t know the distribution from which each data point comes
- We don’t know the parameters of their distributions
- EM (Expectation Maximization)
- Starts by guessing the parameters of the distributions
- It uses these distributions to compute the probabilities that each object belongs to a cluster
- Uses the objects in the clusters to re-estimate the parameters of the distributions representing each cluster
- Until convergence (alternatively, we can start by guessing the probabilities that each object belongs to a class)
Expectation Maximization Algorithm
- The computation of the probabilities of the classes or expected values of the classes (to assign objects to clusters) is the expectation step
- The step of computing the values of the parameters of the distributions (given the objects in the clusters) is the maximization step: it maximizes the likelihood of the distributions given the data
Expectation Maximization Algorithm
- How to estimate the parameters
- Consider that we only have the probabilities of belonging to each cluster but not the actual clusters
- These probabilities act as weights
\(\mu_A=\frac{w_1x_1+w_2x_2+\dots+w_nx_n}{w_1+w_2+\dots+w_n}\)
\(\sigma^2_A=\frac{w_1(x_1-\mu_A)^2+w_2(x_2-\mu_A)^2+\dots+w_n(x_n-\mu_A)^2}{w_1+w_2+\dots+w_n}\)
\(w_i\) is the probability that object \(i\) belongs to cluster \(A\), and we sum over all objects (not only those of \(A\))
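A minimal sketch of EM for a two-component, one-dimensional Gaussian mixture using these weighted estimates; the initial guesses and the synthetic data are illustrative assumptions:

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

def em_two_gaussians(x, n_iter=100):
    # initial guesses for the parameters
    mu = np.array([x.min(), x.max()])
    sigma = np.array([x.std(), x.std()])
    prior = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: probability (weight) that each point belongs to each cluster
        w = np.stack([prior[j] * normal_pdf(x, mu[j], sigma[j]) for j in range(2)])
        w /= w.sum(axis=0)
        # M-step: weighted re-estimation of the parameters
        for j in range(2):
            mu[j] = np.sum(w[j] * x) / w[j].sum()
            sigma[j] = np.sqrt(np.sum(w[j] * (x - mu[j]) ** 2) / w[j].sum())
        prior = w.sum(axis=1) / len(x)
    return mu, sigma, prior

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0, 1, 300), rng.normal(5, 1.5, 700)])
print(em_two_gaussians(x))
```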
Expectation Maximization Algorithm
- The algorithm tends to converge, but it never actually reaches a fixed point
- We can see how much it improves by computing the overall likelihood of the data given the current parameters, multiplying the probabilities of the individual objects \(x_i\):
\(\prod\limits_i(P_AP(x_i|A)+P_BP(x_i|B))\)
- This measure grows at each iteration; we iterate until the growth is negligible
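A minimal sketch of this convergence check; in practice the logarithm of the product is used to avoid numerical underflow:

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

def log_likelihood(x, mu, sigma, prior):
    # sum over clusters of P_j * P(x_i | j), then product over objects (in log form)
    p = sum(prior[j] * normal_pdf(x, mu[j], sigma[j]) for j in range(len(prior)))
    return np.sum(np.log(p))

# e.g. call log_likelihood(x, mu, sigma, prior) after each EM iteration
# and stop when the increase becomes negligible
```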
Expectation Maximization Algorithm
- Although EM is guaranteed to converge, it may converge to a local maximum; it is recommended to repeat the process several times with different initial values.
How many Clusters?
- In some applications it is difficult to determine the number of clusters \(k\)
- It may be set according to domain knowledge
- In most cases \(k\) is unknown
- It is estimated from the data
- Many clustering algorithms ask for \(k\) as input parameter
- The quality of the results is highly related to this value
How many Clusters?
- A partition with too many clusters complicates the results, making them difficult to interpret and analyze
- A partition with too few clusters makes us lose information and may lead to bad decisions
- The problem to determine the number of clusters is known as:
- The fundamental cluster validity problem
How many Clusters?
- Some methods to find \(k\)
- Visualization of the dataset
- Works well for 2 dimensions
- Usually, our datasets are more complicated
- Construction of indexes (or stop rules)
- Use indexes to emphasize
- Intra-cluster compactness
- Inter-cluster isolation
- Considers effects such as
- Squared error
- Geometric or statistical properties of the data
- Number of patterns
- Dissimilarity or similarity
- Number of clusters
- Optimization of some criterion under the framework of probability mixture models
- e.g., running the EM algorithm and finding the value of \(k\) that maximizes or minimizes the criterion defined as optimal
- Akaike Information Criterion (AIC)
- Bayesian Information Criterion (BIC)
- Other heuristic methods based on a variety of techniques and theory
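As an illustration of the criterion-based approach, a minimal sketch that selects \(k\) with the BIC of fitted Gaussian mixtures, assuming scikit-learn is available; the data is synthetic:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(6, 1, (100, 2))])

# fit a mixture for each candidate k and keep its BIC (lower is better)
bics = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
        for k in range(1, 6)}
best_k = min(bics, key=bics.get)
print(bics, best_k)
```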