Clustering
Machine Learning
Jesus A. Gonzalez
February 7, 2020
Clustering
- The process of grouping data into classes or clusters in such a way that objects of one cluster
- Have high similarity with each other
- Have low similarity with objects of other clusters
- Similarity measure based on the attributes that describe the objects
- Groups can be
- Exclusive
- With overlaps
- Probabilistic
- Hierarchical
Clustering
- Can be applied to
- Characterize customers
- Create taxonomies
- Document classification
- more…
Challenges
- Scalability
- Many algorithms work well only with small datasets
- Capacity to deal with different types of attributes
- Numeric (most common)
- Binary
- Nominal
- Ordinal
- others
- Clusters with arbitrary shapes
- Those based on numeric distances tend to find spherical clusters
Challenges
- Minimal requirements for input parameters
- Such as the number of clusters
- Ability to deal with noise
- Many approaches are sensitive to noisy data
- Independence from the order of the data
- Ability to work efficiently with high-dimensional data
- Ability to incorporate constraints
- Clusters that are easy to interpret and use
Similarity Measures
- Usually defined by proximity in a multi-dimensional space
- For numeric data, we usually perform a standardization process
Similarity Measures
- The \(z\) measure (z-score) eliminates the units of the data
- \(z_{if}=\frac{x_{if}-\mu_f}{\sigma_f}\)
- where \(\sigma_f\) is the mean absolute deviation of variable \(f\), \(\mu_f\) is its mean, and \(x_{if}\) is the \(i^{th}\) value of \(f\).
- \(\sigma_f = \frac{1}{n}(|x_{1f}-\mu_f| + |x_{2f}-\mu_f| + \dots + |x_{nf}-\mu_f|)\)
- \(\mu_f = \frac{1}{n}(x_{1f} + x_{2f} + \dots + x_{nf})\)
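A minimal sketch of this standardization, assuming NumPy is available and `X` is a hypothetical data matrix with one column per variable \(f\):

```python
import numpy as np

def z_scores(X):
    """Standardize each column with its mean and mean absolute deviation."""
    mu = X.mean(axis=0)                  # mu_f: mean of each variable f
    sigma = np.abs(X - mu).mean(axis=0)  # sigma_f: mean absolute deviation of f
    return (X - mu) / sigma              # z_if = (x_if - mu_f) / sigma_f

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])
print(z_scores(X))
```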
Similarity Measures
- For numerical variables (linear):
- Euclidean distance
- \(d(i,j) = \sqrt{|x_{i1}-x_{j1}|^2 + |x_{i2}-x_{j2}|^2 + \dots + |x_{in}-x_{jn}|^2}\)
- Manhattan distance
- \(d(i,j) = |x_{i1}-x_{j1}| + |x_{i2}-x_{j2}| + \dots + |x_{in}-x_{jn}|\)
- Minkowski distance
- \(d(i,j) = (|x_{i1}-x_{j1}|^q + |x_{i2}-x_{j2}|^q + \dots + |x_{in}-x_{jn}|^q)^{1/q}\)
- If \(q = 1\) it is the Manhattan distance and if \(q = 2\) it is the Euclidean distance
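A minimal sketch of these distances with NumPy; the vectors `xi` and `xj` are hypothetical objects:

```python
import numpy as np

def minkowski(xi, xj, q=2):
    """Minkowski distance; q=1 gives Manhattan, q=2 gives Euclidean."""
    return np.sum(np.abs(xi - xj) ** q) ** (1.0 / q)

xi = np.array([1.0, 2.0, 3.0])
xj = np.array([4.0, 0.0, 3.0])
print(minkowski(xi, xj, q=1))  # Manhattan: 5.0
print(minkowski(xi, xj, q=2))  # Euclidean: ~3.61
```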
Similarity Measures
- For numerical variables (linear)
- Weighted distance (e.g. Euclidean):
- \(d(i,j) = \sqrt{w_1|x_{i1}-x_{j1}|^2 + w_2|x_{i2}-x_{j2}|^2 + \dots + w_n|x_{in}-x_{jn}|^2}\)
- Properties of distances
- \(d(i,j) \geq 0\)
- \(d(i,i) = 0\)
- \(d(i,j) = d(j,i)\)
- \(d(i,j) \leq d(i,h) + d(h,j)\).
Similarity Measures
- Binary variables (0, 1):
- Symmetric (both values have the same weight):
- \(d(i,j) = \frac{r+s}{q+r+s+t}\)
- where:
- \(q\) = number of values that are 1 in both
- \(r\) = number of values that are 1 in \(i\) and 0 in \(j\)
- \(s\) = number of values that are 0 in \(i\) and 1 in \(j\)
- \(t\) = number of values that are 0 in both
- Asymmetric (the most important and rarer value is coded as 1); the corresponding similarity \(\frac{q}{q+r+s}\) is known as the Jaccard coefficient
- \(d(i,j)=\frac{r+s}{q+r+s}\)
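A minimal sketch of both binary dissimilarities, assuming the objects are given as NumPy arrays of 0/1 values:

```python
import numpy as np

def binary_dissimilarity(xi, xj, symmetric=True):
    q = np.sum((xi == 1) & (xj == 1))   # 1 in both
    r = np.sum((xi == 1) & (xj == 0))   # 1 in i, 0 in j
    s = np.sum((xi == 0) & (xj == 1))   # 0 in i, 1 in j
    t = np.sum((xi == 0) & (xj == 0))   # 0 in both
    if symmetric:
        return (r + s) / (q + r + s + t)
    return (r + s) / (q + r + s)        # asymmetric: 1 - Jaccard similarity

xi = np.array([1, 0, 1, 1, 0])
xj = np.array([1, 1, 0, 1, 0])
print(binary_dissimilarity(xi, xj))                   # 0.4
print(binary_dissimilarity(xi, xj, symmetric=False))  # 0.5
```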
Similarity Measures
- Nominal variables (e.g. color):
- \(d(i,j) = \frac{p-m}{p}\)
- where
- \(m\) = number of equal values
- \(p\) = total number of variables
- May add weights to give more importance to \(m\)
- May create new asymmetric binary variables from the nominal ones (e.g. is yellow or not)
- We can also use one-hot encoding
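A minimal sketch of this simple matching dissimilarity, assuming each object is a sequence of nominal attribute values:

```python
import numpy as np

def nominal_dissimilarity(xi, xj):
    """Fraction of nominal attributes on which the two objects disagree."""
    xi, xj = np.asarray(xi), np.asarray(xj)
    p = len(xi)                # total number of nominal attributes
    m = np.sum(xi == xj)       # number of matching values
    return (p - m) / p

print(nominal_dissimilarity(["red", "small", "round"],
                            ["red", "large", "round"]))  # 1/3
```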
Similarity Measures
- Ordinal variables: nominal variables with a meaningful order. The order is important, but not the magnitude. Steps:
- Replace the value of each variable by a rank \(r_{if} \in \{1, \dots, M_f\}\), where \(M_f\) is the rank of the highest value of the variable
- Map the rank to the interval \([0, 1]\) so that each variable has equal weight
- \(z_{if} = \frac{r_{if}-1}{M_f - 1}\)
- Use any of the previous numeric measures
- Example:
- Educational level: first, second, third
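A minimal sketch of the rank mapping for this example, assuming the ordered levels are listed explicitly:

```python
# z_if = (r_if - 1) / (M_f - 1), with ranks assigned from the given order
levels = ["first", "second", "third"]                     # ordered educational levels
rank = {level: i + 1 for i, level in enumerate(levels)}   # r_if in {1, ..., M_f}
M = len(levels)

def ordinal_to_numeric(value):
    return (rank[value] - 1) / (M - 1)

print([ordinal_to_numeric(v) for v in levels])  # [0.0, 0.5, 1.0]
```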
Similarity Measures
- Scalar non-linear variables, for example, variables that follow an exponential scale
- Possibilities
- Treat them as normal numeric variables
- Apply a logarithm (or some other transformation) to convert them to a linear scale
- Consider them as ordinal variables
Similarity Measures
- Mixed variables
- A possibility is to scale all variables to a common interval (between 0 and 1)
- \(d(i,j) = \frac{\sum\limits_{f=1}^{p}\delta_{ij}^{(f)} d_{ij}^{(f)}} {\sum\limits_{f=1}^{p} \delta_{ij}^{(f)}}\)
- where:
- \(\delta_{ij}^{(f)} = 0\)
- If either \(x_{if}\) or \(x_{jf}\) is missing, or if both values are 0 and variable \(f\) is asymmetric binary.
- Otherwise, \(\delta_{ij}^{(f)} = 1\)
- \(d_{ij}^{(f)}\) depends on the type:
- If \(f\) is binary or nominal: \(d_{ij}^{(f)} = 0\) if \(x_{if}=x_{jf}\), if not, \(d_{ij}^{(f)} = 1\)
- If \(f\) is numeric-linear: \(d_{ij}^{(f)}= \frac{|x_{if}-x_{jf}|}{\max_h x_{hf} - \min_h x_{hf}}\)
- If \(f\) is ordinal or numeric-non-linear: compute indexes \(r_{if}\) and \(z_{if}=\frac{r_{if}-1}{M_{f}-1}\) and take \(z_{if}\) as a linear-numeric variable
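A minimal sketch of this mixed-variable dissimilarity, under the simplifying assumptions that missing values are encoded as `None`, every variable is either nominal or (rank-mapped) numeric, and the numeric ranges \(\max_h x_{hf} - \min_h x_{hf}\) are supplied:

```python
def mixed_dissimilarity(xi, xj, types, ranges):
    """types[f] in {"nominal", "numeric"}; ranges[f] = max - min for numeric f."""
    num, den = 0.0, 0.0
    for f, t in types.items():
        if xi[f] is None or xj[f] is None:
            continue                      # delta_ij^(f) = 0: skip this variable
        if t == "nominal":
            d = 0.0 if xi[f] == xj[f] else 1.0
        else:                             # numeric-linear (or already rank-mapped)
            d = abs(xi[f] - xj[f]) / ranges[f]
        num += d                          # delta_ij^(f) = 1
        den += 1.0
    return num / den if den > 0 else 0.0

types  = {"color": "nominal", "weight": "numeric"}
ranges = {"weight": 10.0}
a = {"color": "red", "weight": 2.0}
b = {"color": "blue", "weight": 7.0}
print(mixed_dissimilarity(a, b, types, ranges))   # (1 + 0.5) / 2 = 0.75
```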
Clustering Methods
- There exist many clustering methods
- We will only cover some of them
Clustering Methods
- Methods based on partitions
- Build \(k\) partitions from the data
- Each partition represents a group or cluster
- Each group has at least one element and each element belongs to a single group
- These methods create an initial partition and iterate until a stop criterion is reached
- The most popular are \(k\)-means and \(k\)-medoids
- Others
Clustering Methods
- Hierarchical methods
- Agglomerative method or bottom-up
- Begins with a group for each object and joins the most similar groups until reaching a single group or other stop criterion
- AGNES, BIRCH, CURE, ROCK
- Divisive method or top-down
- Begins with a single group and divides it into smaller groups until creating groups with a single element or other stop criterion
- DIANA, MONA
Clustering Methods
- Methods based on Densities
- Grow clusters as long as the density (number of objects) in the neighborhood exceeds a certain threshold
Clustering Methods
- Methods based on grids
- Divide the space into a grid structure at different levels of resolution
Clustering Methods
- Methods based on models
- Find a model for each cluster, the one that best fits the data of that group
Clustering Methods
- Methods based on graph theory
- Use graph-based representations
- Chameleon, Delaunay triangulation graph (DTG), highly connected subgraphs (HCS), clustering identification via connectivity kernels (CLICK), cluster affinity search technique (CAST)
Clustering Methods
- Techniques based on combinatoric search
- Genetically guided algorithm (GGA)
- TS clustering
- SA clustering
Clustering Methods
- Fuzzy techniques
- Fuzzy c-means (FCM)
- Mountain method (MM)
- Possibilistic c-means clustering algorithm (PCM)
- Fuzzy c-shells (FCS)
Clustering Methods
- Techniques based on neural networks
- Learning vector quantization (LVQ)
- Self-organizing feature map (SOFM)
- ART
- Simplified ART (SART)
- Hyperellipsoidal clustering network (HEC)
- Self-splitting competitive learning network (SPLL)
Clustering Methods
- Techniques based on Kernels
- Kernel k-means
- Support vector clustering (SVC)
Clustering Methods
- Techniques for sequential data
- Sequential similarity
- Indirect sequential clustering
- Statistical sequential clustering
Clustering Methods
- Techniques for large datasets
- CLARA
- CURE
- CLARANS
- BIRCH
- DBSCAN
- DENCLUE
- WaveCluster
- FC
- ART
The k-Means Algorithm
- Takes \(k\) as a parameter and randomly chooses \(k\) elements
- These represent the center or mean of each cluster
- Each of the remaining objects is assigned to the most similar cluster
- Based on the distance between the object and the mean of the cluster
- Then, the new mean of each cluster is computed
- Iterates until the means no longer change
The k-Means Algorithm
- Usually, we use a criterion based on the squared error
- \(E = \sum\limits_{i=1}^{k}\sum\limits_{p \in C_i} |p-m_i|^2\)
- where: \(p\) represents the object and \(m_i\) the mean of cluster \(C_i\) (both are multidimensional objects).
The k-Means Algorithm - the algorithm
- randomly select \(k\) objects as the initial cluster means
- repeat
- (re)assign each object to the cluster with the most similar mean; update the means of the clusters
- until there is no change
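A minimal sketch of the algorithm above with NumPy, assuming `X` is an \((n, d)\) data matrix and Euclidean distance is used:

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=k, replace=False)]   # k random objects
    for _ in range(max_iter):
        # assignment step: each object goes to the closest mean
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: recompute the mean of each cluster (keep old mean if empty)
        new_means = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                              else means[j] for j in range(k)])
        if np.allclose(new_means, means):
            break
        means = new_means
    return labels, means

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
labels, means = k_means(X, k=2)
print(means)
```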
The k-Means Algorithm
- \(k\)-means is susceptible to extreme values (outliers) because they distort the data distribution
- We can also use the modes (\(k\)-modes) to group categorical objects
The k-Medoids Algorithm
- Another possibility is the use of medoids (\(k\)-medoids)
- To group based on the most representative object in each cluster
- Basic idea
- Find representative objects (medoids)
- The strategy is to replace one of the medoids with another object chosen at random and measure whether the quality of the resulting clusters improves
- Quality is evaluated with a cost function that measures the average dissimilarity between an object and the medoid of its cluster
- In order to see if a random object (\(O_{random}\)) is a good replacement for the current medoid (\(O_{current}\))
- Consider all objects that are not medoids
- Analyze the re-distribution of the objects, from which we compute a cost (e.g. based on the squared error)
- This is repeated until there is no improvement
- As with other methods, this doesn’t guarantee finding the global minimum
- Recommended to run the algorithm several times with different initial values
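A minimal sketch of this random-swap strategy, assuming a precomputed pairwise distance matrix `D`; a swap between a medoid and a non-medoid is kept only when the total cost (sum of distances to the closest medoid) decreases:

```python
import numpy as np

def k_medoids(D, k, max_iter=200, seed=0):
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    medoids = rng.choice(n, size=k, replace=False)

    def cost(meds):
        # each object is assigned to its closest medoid; sum those distances
        return D[:, meds].min(axis=1).sum()

    best = cost(medoids)
    for _ in range(max_iter):
        i = rng.integers(k)                                   # medoid to replace
        candidate = rng.choice(np.setdiff1d(np.arange(n), medoids))
        trial = medoids.copy()
        trial[i] = candidate
        c = cost(trial)
        if c < best:                                          # keep improving swaps only
            medoids, best = trial, c
    labels = D[:, medoids].argmin(axis=1)
    return medoids, labels

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(4, 1, (30, 2))])
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
print(k_medoids(D, k=2)[0])
```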
Hierarchical k-Means
- Another variant is to create a hierarchical k-means
- Starts with \(k\) = 2
- Continues creating successive clusters in each branch
- If we want to scale to large databases
- We can take samples of the data
COBWEB (a hierarchical clustering method)
- Creates a hierarchical clustering with a classification tree
- In a classification tree, each node is a concept
- Each node has a probabilistic description of the concept that summarizes the objects classified under it
- The probabilistic description includes
- The probability of the concept (\(P(C_i)\))
- The conditional probabilities of attribute-value pairs given the concept (\(P(A_i = v_{ij} | C_k)\))
COBWEB (a hierarchical clustering method)
- Uses a measure called category utility in order to construct the tree
\(CU=\frac{\sum\limits_{k=1}^{n}P(C_k)\left[\sum\limits_i\sum\limits_j P(A_i=v_{ij}|C_k)^2-\sum\limits_{i}\sum\limits_{j} P(A_i = v_{ij})^2\right]}{n}\)
- where: \(n\) is the number of classes in a level of the tree
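A minimal sketch of this category utility for nominal attributes, assuming the clusters are given as plain Python lists of attribute-value tuples and the probabilities are estimated from counts:

```python
from collections import Counter

def category_utility(clusters):
    """clusters: list of clusters; each cluster is a list of objects,
    each object a tuple of nominal attribute values."""
    all_objects = [o for c in clusters for o in c]
    N = len(all_objects)
    n_attrs = len(all_objects[0])

    def sum_sq_probs(objects):
        # sum_i sum_j P(A_i = v_ij)^2 estimated from the given objects
        total = 0.0
        for i in range(n_attrs):
            counts = Counter(o[i] for o in objects)
            total += sum((c / len(objects)) ** 2 for c in counts.values())
        return total

    baseline = sum_sq_probs(all_objects)          # guesses without the partition
    cu = 0.0
    for c in clusters:
        p_c = len(c) / N                          # P(C_k)
        cu += p_c * (sum_sq_probs(c) - baseline)
    return cu / len(clusters)                     # divide by n (number of classes)

clusters = [[("red", "small"), ("red", "small")],
            [("blue", "large"), ("blue", "small")]]
print(category_utility(clusters))
```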
COBWEB (a hierarchical clustering method)
- Measures the expected number of attribute values that can be correctly guessed given the partition, compared with the number that can be guessed without that partition
- If the partition doesn’t help with this, then it isn’t a good partition
- As the proportion of elements of the class having an attribute-value grows, that attribute-value becomes more predictive of the class
COBWEB (a hierarchical clustering method)
- The algorithm descends the tree looking for the best place or node for each object
- It is based on placing the object in each existing node and in a new node, and measuring which of them yields the highest category utility gain
- At each iteration, it also considers merging the two best evaluated nodes and splitting the best evaluated node
- That is, each time it considers a place in a level for a new object, it takes the two best nodes (with the highest utility) and considers merging them
- Another case is when the best place for the new object is found
- But merging nodes doesn’t yield a benefit
- Then, it considers splitting that node
- COBWEB depends on the order of the objects
- It is convenient to try it with the objects in different orders
- The denominator \(n\) in the category utility equation was added to encourage clusters with more than one element
- COBWEB assumes that the probability distributions of the attributes are independent of each other
COBWEB (a hierarchical clustering method)
- The algorithm can be extended to deal with numerical values using Gaussian distributions (CLASSIT)
\(f(a) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(a-\mu)^2}{2\sigma^2}}\)
- The equivalent of the summation of probabilities is now:
\(\sum\limits_{j} P(A_i = v_{ij})^2 \sim \int f(a_i)^2\, da_i = \frac{1}{2\sigma_i\sqrt{\pi}}\)
COBWEB (a hierarchical clustering method)
- Now we estimate the standard deviation of the numerical attribute with the data in the cluster and with the data for all clusters
\(CU = \frac{1}{n} \sum\limits_{k=1}^{n}P(C_k)\, \frac{1}{2\sqrt{\pi}}\sum\limits_{i}\left(\frac{1}{\sigma_{ik}} - \frac{1}{\sigma_i}\right)\)
- If the standard deviation is zero, the utility becomes infinite
- Then a minimum variance for each attribute is imposed (acuity)
- Another parameter is the cutoff, which is used to stop the generation of new nodes
Clustering based on Probabilities
- From a bayesian point of view
- We look for the set of clusters that is most probable given the data
- Now the objects have some probability of belonging to a group or cluster
- The basis of probabilistic clustering is a statistical model called finite mixtures (a mixture of distributions)
- A mixture is a set of \(k\) distributions, representing \(k\) clusters
Clustering based on Probabilities
- Each distribution gives the probability that an object has a particular set of attribute-value pairs
- Given that the object is member of that cluster
- The simplest mixture has only numerical attributes with Gaussian distributions with different means and variances
Clustering based on Probabilities
- The idea is that, given a set of data
- Determine the \(k\) normal distributions (means and variances) and the prior probability of each distribution (they can be different)
- If we had two distributions \(A\) and \(B\) with \(\mu_A\), \(\sigma_A\), \(\mu_B\), \(\sigma_B\), and \(P_A\) and \(P_B\) (with \(P_A + P_B = 1\)), we could generate a dataset.
- If we knew which distribution gave rise to each data point, it would be easy to compute each mean and variance, as well as \(P_A\) and \(P_B\).
\(\mu=\frac{x_1+x_2+\dots+x_n}{n}\)
\(\sigma^2=\frac{(x_1-\mu)^2+(x_2-\mu)^2+\dots+(x_n-\mu)^2}{n-1}\)
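A minimal sketch of generating data from such a two-distribution mixture with assumed parameters, and of recovering the parameters when the source of each point is known:

```python
import numpy as np

rng = np.random.default_rng(0)
mu_A, sigma_A, mu_B, sigma_B, P_A = 0.0, 1.0, 5.0, 1.5, 0.4   # assumed parameters

n = 1000
from_A = rng.random(n) < P_A                      # choose the component of each point
x = np.where(from_A,
             rng.normal(mu_A, sigma_A, n),
             rng.normal(mu_B, sigma_B, n))

# if the component labels were known, the parameters would be easy to recover
print(x[from_A].mean(), x[from_A].std(ddof=1))    # estimates of mu_A, sigma_A
print(from_A.mean())                              # estimate of P_A
```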
Clustering based on Probabilities
- The probability that an object \(x\) belongs to a cluster (e.g. \(A\)) is:
\(P(A|x) = \frac{P(x|A)P(A)}{P(x)} = \frac{f(x;\mu_A,\sigma_A)P_A}{P(x)}\)
where \(f(x;\mu_A,\sigma_A)\) is a normal distribution
\(f(x;\mu_A,\sigma_A) = \frac{1}{\sqrt{2\pi\sigma_A^2}}e^{-\frac{(x-\mu_A)^2}{2\sigma_A^2}}\)
- We can ignore \(P(x)\) and normalize at the end
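A minimal sketch of this computation for a hypothetical two-cluster mixture, normalizing at the end instead of dividing by \(P(x)\):

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

def membership(x, params):
    """params: dict cluster -> (mu, sigma, prior). Returns normalized P(cluster|x)."""
    scores = {c: normal_pdf(x, mu, s) * p for c, (mu, s, p) in params.items()}
    z = sum(scores.values())                 # normalization replaces P(x)
    return {c: v / z for c, v in scores.items()}

params = {"A": (0.0, 1.0, 0.4), "B": (5.0, 1.5, 0.6)}
print(membership(2.0, params))
```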
Expectation Maximization Algorithm
- The problem is that we don’t know the distribution from which each data point comes
- We don’t know the parameters of their distributions
- EM (Expectation Maximization)
- Starts by guessing the parameters of the distributions
- It uses these distributions to compute the probabilities that each object belongs to a cluster
- Uses the objects in the clusters to re-estimate the parameters of the distributions representing each cluster
- Until convergence (alternatively, we can start by guessing the probabilities that each object belongs to a class)
Expectation Maximization Algorithm
- The computation of the probabilities of the classes or expected values of the classes (to assign objects to clusters) is the expectation step
- The step of computing the values of the parameters of the distributions (given the objects in the clusters) is the maximization step: it maximizes the likelihood of the distributions given the data
Expectation Maximization Algorithm
- How to estimate the parameters
- Consider that we only have the probabilities of belonging to each cluster but not the actual clusters
- These probabilities act as weights
\(\mu_A=\frac{w_1x_1+w_2x_2+\dots+w_nx_n}{w_1+w_2+\dots+w_n}\)
\(\sigma^2_A=\frac{w_1(x_1-\mu_A)^2+w_2(x_2-\mu_A)^2+\dots+w_n(x_n-\mu_A)^2}{w_1+w_2+\dots+w_n}\)
\(w_i\) is the probability that object \(i\) belongs to cluster \(A\), and we sum over all objects (not only those of \(A\))
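A minimal sketch of EM for a two-component, one-dimensional Gaussian mixture using these weighted estimates; the initial guesses and the synthetic data are illustrative assumptions:

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

def em_two_gaussians(x, n_iter=100):
    # initial guesses for the parameters
    mu = np.array([x.min(), x.max()])
    sigma = np.array([x.std(), x.std()])
    prior = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: probability (weight) that each point belongs to each cluster
        w = np.stack([prior[j] * normal_pdf(x, mu[j], sigma[j]) for j in range(2)])
        w /= w.sum(axis=0)
        # M-step: weighted re-estimation of the parameters
        for j in range(2):
            mu[j] = np.sum(w[j] * x) / w[j].sum()
            sigma[j] = np.sqrt(np.sum(w[j] * (x - mu[j]) ** 2) / w[j].sum())
        prior = w.sum(axis=1) / len(x)
    return mu, sigma, prior

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0, 1, 300), rng.normal(5, 1.5, 700)])
print(em_two_gaussians(x))
```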
Expectation Maximization Algorithm
- The algorithm tends to converge, but it never actually reaches a fixed point
- We can see how much it improves by computing the overall likelihood of the data given the current parameters, multiplying the probabilities of the individual objects \(x_i\):
\(\prod\limits_i(P_AP(x_i|A)+P_BP(x_i|B))\)
- This measure grows at each iteration; we iterate until the growth is negligible
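A minimal sketch of this convergence check; in practice the logarithm of the product is used to avoid numerical underflow:

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

def log_likelihood(x, mu, sigma, prior):
    # sum over clusters of P_j * P(x_i | j), then product over objects (in log form)
    p = sum(prior[j] * normal_pdf(x, mu[j], sigma[j]) for j in range(len(prior)))
    return np.sum(np.log(p))

# e.g. call log_likelihood(x, mu, sigma, prior) after each EM iteration
# and stop when the increase becomes negligible
```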
Expectation Maximization Algorithm
- Although EM is guaranteed to converge, it may converge to a local maximum; it is recommended to repeat the process several times with different initial values.
How many Clusters?
- In some applications it is difficult to determine the number of clusters \(k\)
- It may be set according to domain knowledge
- In most cases \(k\) is unknown
- It is estimated from the data
- Many clustering algorithms ask for \(k\) as input parameter
- The quality of the results is highly related to this value
How many Clusters?
- A partition with too many clusters complicates the results, making them difficult to interpret and analyze
- A partition with too few clusters makes us lose information and may lead to bad decisions
- The problem to determine the number of clusters is known as:
- The fundamental cluster validity problem
How many Clusters?
- Some methods to find \(k\)
- Visualization of the dataset
- Works well for 2 dimensions
- Usually, our datasets are more complicated
- Construction of indexes (or stop rules)
- Use indexes to emphasize
- Intra-cluster compactness
- Inter-cluster isolation
- Considers effects such as
- Squared error
- Geometric or statistical properties of the data
- Number of patterns
- Dissimilarity or similarity
- Number of clusters
- Optimization of some criterion under the framework of probability mixture models
- e.g., running the EM algorithm and finding the value of \(k\) that maximizes or minimizes the criterion defined as optimal
- Akaike Information Criterion (AIC)
- Bayesian Information Criterion (BIC)
- Other heuristic methods based on a variety of techniques and theory
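As an illustration of the criterion-based approach, a minimal sketch that selects \(k\) with the BIC of fitted Gaussian mixtures, assuming scikit-learn is available; the data is synthetic:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(6, 1, (100, 2))])

# fit a mixture for each candidate k and keep its BIC (lower is better)
bics = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
        for k in range(1, 6)}
best_k = min(bics, key=bics.get)
print(bics, best_k)
```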