# Clustering: DBSCAN

## DBSCAN Introduction

Density-based spatial clustering of applications with noise (DBSCAN)

Summary: DBSCAN is a density-based clustering method that discovers clusters of nonspherical shape.

Main Concept: Locate regions of high density that are separated from one another by regions of low density.

It also marks as outliers the points that are in low-density regions.

Implicit assumptions about the method:

- Densities across all the clusters are the same.
- Cluster sizes or standard deviations are the same.

**Density of region**:

Mainly defined by 2 parameters- Density at a point P: Number of points within a circle of Radius
from point P. - Dense Region: For each point in the cluster, a circle with radius
contains at least minimum number of points ( )

- Density at a point P: Number of points within a circle of Radius
`The Epsilon neighborhood`

of a point P in the database D is defined as- The
function is usually defined by `Euclidean Distance`

- The
**3 classification of points**:`Core point`

: if the point has`Border point`

: if the point hasbut it lies in the neighborhood of another `Core point`

.`Noise`

: any data point that is**neither**`Core`

nor`Border`

point

**Density Reachable/Density Connected/Directly Density Reachable**`Directly Density Reachable`

: Data-pointis directly density reachable from a point if is a core point is in the epsilon neighborhood of

`Density Reachable`

: Data-pointis density reachable from a point if - For a chain of points
, , is directly density reachable from . `Density reachable`

is transitive in nature but, just like`direct density reachable`

, it is not symmetric

- For a chain of points
`Density Connected`

: Data-pointis density connected to a point if - with respect to
and there is a point such that, both and are `density reachable`

fromw.r.t. to and

- with respect to

**Procedure**- Starts with an arbitrary point which has not been visited and its neighborhood information is retrieved from the
parameter. - If this point contains
neighborhood points, cluster formation starts.

Otherwise the point is labeled as`noise`

.

- This point can be later found within theneighborhood of a different point and, thus can be made a part of the cluster. - If a point is found to be a core point then the points within the
neighborhood is also part of the cluster. So all the points found within neighborhood are added, along with their own neighborhood, if they are also core points. - Continue the steps above (1-3) until the
`density-connected`

cluster is completely found. - The process restarts with a new point which can be a part of a new cluster or labeled as noise.

- Starts with an arbitrary point which has not been visited and its neighborhood information is retrieved from the

#### 2. Pros & Cons

**Pros**

- Identifies randomly shaped clusters
- doesn’t necessitate to know the number of clusters in the data previously (as opposed to K-means)
- Handles noise

**Cons**

- If the database has data points that form clusters of
**varying density**, then DBSCAN fails to cluster the data points well, since the clustering depends on ϵ and MinPts parameter, they cannot be chosen separately for all clusters- May Overcome this issue by running additional rounds over large clusters

- If the data and features are not so well understood by a domain expert then, setting up
and could be tricky - Computational complexity — when the dimensionality is high, it takes

#### 3. Code Implementation

1 | import pandas as pd |

Clustering: DBSCAN

https://criss-wang.github.io/post/blogs/unsupervised/clustering-3/