Clustering is the process of grouping data into classes, or clusters, so that objects within a cluster have high similarity in comparison to one another, but are very dissimilar to objects in other clusters. There are several clustering methods/algorithms that could be used and you are often confused when to use what. Here is a quick tip on how to categorize and decide which one is best for your data set.
There are mainly 7 types of basic clustering methods. They are:
There are mainly 7 types of basic clustering methods. They are:
1) Partitioning
- Find mutually exclusive cluster of spherical shape
- Based on distance
- Represent cluster using mean or medioid
- Good for small to medium sized clusters
- Methods:
- k-means
- k-mediods
- CLARA - Clustering LARge Applications
- Hierarchical decomposition method
- based on agglomerate or divisive
- cannot correct erroneous merges or splits
- Methods:
- BIRCH -Balanced Iterative Reducing and Clustering using Hierarchies
- ROCK -RObust Clustering using linKs
- CHAMELEON -Multiphase hierarchical clustering using dynamic modeling)
- DBSCAN -Density Based Clustering on Connected Regions with High Density
- OPTICS -Ordering Points To Identify the Clustering Structure
- DENCLUE -DENsity CLUstEring
- AGNES(AGglomerative NESting)
- DIANA (DIvisive ANAlysis)
- Good for arbitrary shapes
- Based on density or neighborhood concept
- Possible to filter outliers
- Uses multi-resolution grid data structure
- Fast in processing time irrespective of the number of data objects
- hypothesizes a model for each of the clusters and finds the best fit of the data to the given model.
- It takes “noise” or outliers into account, therein contributing to the robustness of the approach.
- As dimensionality increases, the data usually become increasingly sparse because the data points are likely located in different dimensional subspaces
- Frequent pattern-based clustering is another clustering methodology, which extracts distinct frequent patterns among subsets of dimensions that occur frequently.
- CLIQUE, PROCLUS, Frequent pattern, pCluster
- User-specified or application- oriented constraints. A constraint can express a user’s expectation or describe “properties” of the desired clustering results, and provides an effective means for communicating with the clustering process.
- Constraint-based methods are used in spatial clustering for clustering with obstacle objects (e.g., considering obstacles such as rivers and highways when planning the placement of automated banking machines) and user-constrained cluster analysis (e.g, considering specific constraints regarding customer groups when determining the best location for a new service station, such as ‘‘must serve at least 100 high-value customers”).
- Semi-supervised clustering employs, for example, pair- wise constraints (such as pairs of instances labeled as belonging to the same or different clusters) in order to improve the quality of the resulting clustering
(a) K-means: Based on centroid
- Clustering Type: Partitioning
- Detectable Shapes: Spherical-shaped clusters;
- The number of clusters
- Sensitive to noise and outliers.
- Works well with small data sets only.
- Running time - O(kni); where k is number of clusters, n is number of objects and i is number of iterations
(b) K-medoids: Representative object based
- Clustering Type: Partitioning
- Detectable Shapes: Spherical-shaped clusters;
- The number of clusters;
- Small data sets (not scalable).
- Running time - O(k(n-k)^2)
- Type- Partitioning
- Detectable Shapes- Spherical-shaped clusters;
- The number of clusters
- Effective and sensitive to the selection of initial samples.
- Running time - O (ks^2 + k(n-k)) ; where k=no: of samples and k is no: of clusters and s is the size of sample'.
- Another version called CLARANS (Clustering LArge Applications based on RANdomized Search) based on a randomized algorithm
(d) BIRCH
- Clustering Type: Hierarchical
- Detectable Shapes: Spherical-shaped clusters
- large N d-dimensional data points;
- Uses a Clustering Feature (CF) tree can hold only a limited number of entries due to its size, a CF tree does not always correspond to what a user may consider a natural cluster.
- Running time: O (n) ; where n is the number of objects to be clustered
(e) ROCK
- Clustering Type: Hierarchical
- Detectable Shapes: Arbitrary shape
- N d-dimensional categorical data points
- Designed for categorical data, emphasizes interconnectivity, ignores closeness between clusters.
- Running Time: O(n * Mm * Ma), where Ma and Mm are the averages and maximum number of the neighbors for a point; Worst case O(Ma * N^2)
- Clustering Type: Hierarchical
- Detectable Shapes: Arbitrary shape
- N d-dimensional categorical points
- Running time: O(n^2) worst case
(g) DBSCAN
- Clustering Type: Density
- Detectable Shape: Arbitrary shape
- Maximum possible distance for a point to be considered density-reachable and minimum number of points in a cluster
- Running time: With spatial index O(nlog n) ; O(n^2) worst case.
- Clustering Type: Density
- Detectable Shape: Arbitrary shape
- Outputs Cluster ordering which is a linear list of all objects under analysis. It does not require user to provide a specific density threshold like DBSCAN.
- Running time: With spatial index O(nlog n) ; O(n^2) worst case.
- Clustering Type: Density
- Detectable Shape: Arbitrary shape
- Kernel density estimation is used
- Running time: O(n+ h*n) where h is the average hill climbing time for an object ; O(n^2) worst case.
Reference: Data Mining Concepts and Techniques (3rd Edition) by J Han, M Kamber and J Pei