Technology: Enterprise Software and Integration: Machine Learning: Which clustering algorithm should I use?

Clustering is the process of grouping data into classes, or clusters, so that objects within a cluster have high similarity in comparison to one another, but are very dissimilar to objects in other clusters. There are several clustering methods/algorithms that could be used and you are often confused when to use what. Here is a quick tip on how to categorize and decide which one is best for your data set.

There are mainly 7 types of basic clustering methods. They are:

1) Partitioning

Find mutually exclusive cluster of spherical shape
Based on distance
Represent cluster using mean or medioid
Good for small to medium sized clusters
Methods:

k-means
k-mediods
CLARA - Clustering LARge Applications

2) Hierarchical

Hierarchical decomposition method
based on agglomerate or divisive
cannot correct erroneous merges or splits
Methods:

BIRCH -Balanced Iterative Reducing and Clustering using Hierarchies
ROCK -RObust Clustering using linKs
CHAMELEON -Multiphase hierarchical clustering using dynamic modeling)
DBSCAN -Density Based Clustering on Connected Regions with High Density
OPTICS -Ordering Points To Identify the Clustering Structure
DENCLUE -DENsity CLUstEring
AGNES(AGglomerative NESting)
DIANA (DIvisive ANAlysis)

3) Density based mentjods

Good for arbitrary shapes
Based on density or neighborhood concept
Possible to filter outliers

4) Grid based methods

Uses multi-resolution grid data structure
Fast in processing time irrespective of the number of data objects

5) Model based methods

hypothesizes a model for each of the clusters and finds the best fit of the data to the given model.

It takes “noise” or outliers into account, therein contributing to the robustness of the approach.

6) High Dimensional data methods

As dimensionality increases, the data usually become increasingly sparse because the data points are likely located in different dimensional subspaces
Frequent pattern-based clustering is another clustering methodology, which extracts distinct frequent patterns among subsets of dimensions that occur frequently.
CLIQUE, PROCLUS, Frequent pattern, pCluster

7) Constraint based methods

User-specified or application- oriented constraints. A constraint can express a user’s expectation or describe “properties” of the desired clustering results, and provides an effective means for communicating with the clustering process.
Constraint-based methods are used in spatial clustering for clustering with obstacle objects (e.g., considering obstacles such as rivers and highways when planning the placement of automated banking machines) and user-constrained cluster analysis (e.g, considering specific constraints regarding customer groups when determining the best location for a new service station, such as ‘‘must serve at least 100 high-value customers”).
Semi-supervised clustering employs, for example, pair- wise constraints (such as pairs of instances labeled as belonging to the same or different clusters) in order to improve the quality of the resulting clustering

Based on the above types, there are a lot of commonly used methods that are used for clustering. Here are some of them:

(a) K-means: Based on centroid

Clustering Type: Partitioning
Detectable Shapes: Spherical-shaped clusters;
The number of clusters
Sensitive to noise and outliers.
Works well with small data sets only.
Running time - O(kni); where k is number of clusters, n is number of objects and i is number of iterations

(b) K-medoids: Representative object based

Clustering Type: Partitioning
Detectable Shapes: Spherical-shaped clusters;
The number of clusters;
Small data sets (not scalable).
Running time - O(k(n-k)^2)

(c) CLARA

Type- Partitioning
Detectable Shapes- Spherical-shaped clusters;
The number of clusters
Effective and sensitive to the selection of initial samples.
Running time - O (ks^2 + k(n-k)) ; where k=no: of samples and k is no: of clusters and s is the size of sample'.
Another version called CLARANS (Clustering LArge Applications based on RANdomized Search) based on a randomized algorithm

(d) BIRCH

Clustering Type: Hierarchical
Detectable Shapes: Spherical-shaped clusters

large N d-dimensional data points;
Uses a Clustering Feature (CF) tree can hold only a limited number of entries due to its size, a CF tree does not always correspond to what a user may consider a natural cluster.
Running time: O (n) ; where n is the number of objects to be clustered

(e) ROCK

Clustering Type: Hierarchical
Detectable Shapes: Arbitrary shape
N d-dimensional categorical data points
Designed for categorical data, emphasizes interconnectivity, ignores closeness between clusters.
Running Time: O(n * Mm * Ma), where Ma and Mm are the averages and maximum number of the neighbors for a point; Worst case O(Ma * N^2)

(f) CHAMELEON

Clustering Type: Hierarchical
Detectable Shapes: Arbitrary shape
N d-dimensional categorical points
Running time: O(n^2) worst case

(g) DBSCAN

Clustering Type: Density
Detectable Shape: Arbitrary shape
Maximum possible distance for a point to be considered density-reachable and minimum number of points in a cluster
Running time: With spatial index O(nlog n) ; O(n^2) worst case.

(h) OPTICS

Clustering Type: Density
Detectable Shape: Arbitrary shape
Outputs Cluster ordering which is a linear list of all objects under analysis. It does not require user to provide a specific density threshold like DBSCAN.
Running time: With spatial index O(nlog n) ; O(n^2) worst case.

(i) DENCLUE

Clustering Type: Density
Detectable Shape: Arbitrary shape
Kernel density estimation is used
Running time: O(n+ h*n) where h is the average hill climbing time for an object ; O(n^2) worst case.

Reference: Data Mining Concepts and Techniques (3rd Edition) by J Han, M Kamber and J Pei

Technology: Enterprise Software and Integration

Thursday, February 28, 2019

Machine Learning: Which clustering algorithm should I use?

2 comments: