Clustering: DBSCAN
DBSCAN Introduction
Density-based spatial clustering of applications with noise (DBSCAN)
Summary: DBSCAN is a density-based clustering method that discovers clusters of nonspherical shape.
Main Concept: Locate regions of high density that are separated from one another by regions of low density.
It also marks as outliers the points that are in low-density regions.
Implicit assumptions about the method:
- Densities across all the clusters are the same.
- Cluster sizes or standard deviations are the same.
Density of region:
Mainly defined by 2 parameters- Density at a point P: Number of points within a circle of Radius
from point P. - Dense Region: For each point in the cluster, a circle with radius
contains at least minimum number of points ( )
- Density at a point P: Number of points within a circle of Radius
The Epsilon neighborhood
of a point P in the database D is defined as- The
function is usually defined by Euclidean Distance
- The
3 classification of points:
Core point
: if the point hasBorder point
: if the point hasbut it lies in the neighborhood of another Core point
.Noise
: any data point that is neitherCore
norBorder
point
Density Reachable/Density Connected/Directly Density Reachable
Directly Density Reachable
: Data-pointis directly density reachable from a point if is a core point is in the epsilon neighborhood of
Density Reachable
: Data-pointis density reachable from a point if - For a chain of points
, , is directly density reachable from . Density reachable
is transitive in nature but, just likedirect density reachable
, it is not symmetric
- For a chain of points
Density Connected
: Data-pointis density connected to a point if - with respect to
and there is a point such that, both and are density reachable
fromw.r.t. to and
- with respect to
Procedure
- Starts with an arbitrary point which has not been visited and its neighborhood information is retrieved from the
parameter. - If this point contains
neighborhood points, cluster formation starts.
Otherwise the point is labeled asnoise
.
- This point can be later found within theneighborhood of a different point and, thus can be made a part of the cluster. - If a point is found to be a core point then the points within the
neighborhood is also part of the cluster. So all the points found within neighborhood are added, along with their own neighborhood, if they are also core points. - Continue the steps above (1-3) until the
density-connected
cluster is completely found. - The process restarts with a new point which can be a part of a new cluster or labeled as noise.
- Starts with an arbitrary point which has not been visited and its neighborhood information is retrieved from the
2. Pros & Cons
Pros
- Identifies randomly shaped clusters
- doesn’t necessitate to know the number of clusters in the data previously (as opposed to K-means)
- Handles noise
Cons
- If the database has data points that form clusters of varying density, then DBSCAN fails to cluster the data points well, since the clustering depends on ϵ and MinPts parameter, they cannot be chosen separately for all clusters
- May Overcome this issue by running additional rounds over large clusters
- If the data and features are not so well understood by a domain expert then, setting up
and could be tricky - Computational complexity — when the dimensionality is high, it takes
3. Code Implementation
1 | import pandas as pd |
Clustering: DBSCAN
https://criss-wang.github.io/post/blogs/unsupervised/clustering-3/