Introduction
DBSCAN is a density-based clustering algorithm. It groups points that are packed closely together and marks isolated points as noise.
Unlike K-means, DBSCAN does not require choosing the number of clusters ahead of time.
Core Ideas
DBSCAN uses two main parameters:
- eps: neighborhood radius.
- min_samples: minimum points needed to form a dense region.
Point types:
- Core point: has at least min_samples points within eps.
- Border point: reachable from a core point but not dense enough itself.
- Noise point: neither a core point nor within eps of one.
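The point types above can be made concrete with a small sketch. This is a minimal, brute-force illustration using only the Python standard library, not a reference implementation. The function name dbscan, the sample points, and the choice to count a point as a member of its own neighborhood are all assumptions made for the example.

```python
from math import dist

def dbscan(points, eps, min_samples):
    """Return (labels, kinds). A label of -1 means noise."""
    n = len(points)
    # Brute-force neighborhoods. Assumption: a point counts as its own neighbor.
    neighbors = [
        [j for j in range(n) if dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    is_core = [len(nb) >= min_samples for nb in neighbors]

    labels = [-1] * n
    cluster = 0
    for i in range(n):
        if not is_core[i] or labels[i] != -1:
            continue
        # Grow a new cluster outward from an unassigned core point.
        labels[i] = cluster
        stack = [i]
        while stack:
            p = stack.pop()
            for q in neighbors[p]:
                if labels[q] == -1:
                    labels[q] = cluster
                    # Only core points keep expanding the cluster;
                    # border points are attached but not expanded.
                    if is_core[q]:
                        stack.append(q)
        cluster += 1

    kinds = [
        "core" if is_core[i] else ("border" if labels[i] != -1 else "noise")
        for i in range(n)
    ]
    return labels, kinds

# Hypothetical sample: a small square, one point hanging off it, one far outlier.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (8, 8)]
labels, kinds = dbscan(points, eps=1.0, min_samples=3)
print(labels)  # [0, 0, 0, 0, 0, -1]
print(kinds)   # ['core', 'core', 'core', 'core', 'border', 'noise']
```

The point (2, 1) lies within eps of the core point (1, 1) but has only two points in its own neighborhood, so it joins the cluster as a border point. The point (8, 8) has no core point nearby and is labeled noise.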
When DBSCAN Helps
Use DBSCAN when:
- Clusters have irregular shapes.
- Noise detection matters.
- Number of clusters is unknown.
- Density is meaningful in the feature space.
Examples:
- Geographic clusters.
- Sensor anomalies.
- Spatial event detection.
Parameter Choice
The hardest part is choosing eps.
Common approach:
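- For each point, compute the distance to its k-th nearest neighbor (k is often set close to min_samples).
- Sort those distances.
- Plot them and look for an elbow in the curve; the distance at the elbow is a candidate eps.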
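The steps above can be sketched in a few lines. This is a rough illustration under stated assumptions: the helper names k_distances and eps_before_largest_jump are hypothetical, the sample points are made up, and picking the value just before the largest jump is only a crude stand-in for reading the elbow off a plot.

```python
from math import dist

def k_distances(points, k):
    """Sorted distances from each point to its k-th nearest other point."""
    result = []
    for i, p in enumerate(points):
        others = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(others[k - 1])
    return sorted(result)

def eps_before_largest_jump(curve):
    """Crude elbow heuristic (an assumption): value just before the biggest jump."""
    jumps = [b - a for a, b in zip(curve, curve[1:])]
    i = max(range(len(jumps)), key=jumps.__getitem__)
    return curve[i]

# Hypothetical sample: a tight group plus one far outlier.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (8, 8)]
curve = k_distances(points, k=2)
candidate_eps = eps_before_largest_jump(curve)
print([round(d, 3) for d in curve])  # [1.0, 1.0, 1.0, 1.0, 1.414, 9.899]
print(round(candidate_eps, 3))       # 1.414
```

On real data the curve is rarely this clean, so it is usually worth plotting it and trying a few eps values around the apparent elbow rather than trusting a single automatic pick.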
Feature scaling matters. If one feature dominates distance, DBSCAN will cluster mostly by that feature.
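A small example shows the effect. The sketch below assumes z-score standardization, which is one common scaling choice among several; the helper standardize and the income/age rows are hypothetical.

```python
from math import dist
from statistics import mean, pstdev

def standardize(rows):
    """Rescale each column to zero mean and unit variance (z-scores)."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    # Fall back to 1.0 for constant columns to avoid division by zero.
    sds = [pstdev(c) or 1.0 for c in cols]
    return [tuple((v - m) / s for v, m, s in zip(r, mus, sds)) for r in rows]

# Hypothetical rows: (income in dollars, age in years).
rows = [(30000, 25), (30500, 60), (60000, 26)]

# Raw distances: the income column swamps the age column.
raw_age_gap = dist(rows[0], rows[1])     # rows differ mostly in age
raw_income_gap = dist(rows[0], rows[2])  # rows differ mostly in income

scaled = standardize(rows)
scaled_age_gap = dist(scaled[0], scaled[1])
scaled_income_gap = dist(scaled[0], scaled[2])

print(round(raw_age_gap, 1), round(raw_income_gap, 1))
print(round(scaled_age_gap, 2), round(scaled_income_gap, 2))
```

Before scaling, a 35-year age difference barely registers next to a 30,000-dollar income difference. After scaling, the two gaps are of comparable size, so a single eps can treat both features on similar terms.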
Limitations
DBSCAN struggles when:
- Clusters have very different densities.
- Data is high-dimensional.
- Distance is not meaningful.
- eps is difficult to choose.
In high dimensions, consider dimensionality reduction or another method.
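One way to see why high-dimensional data is a problem is to compare the farthest and nearest distances from a single point. The sketch below is only an illustration on uniformly random points, which is an assumption and not representative of every dataset; the helper distance_contrast is hypothetical.

```python
import random
from math import dist

def distance_contrast(dim, n=200, seed=0):
    """Ratio of farthest to nearest distance from one point to all others."""
    rng = random.Random(seed)
    pts = [tuple(rng.random() for _ in range(dim)) for _ in range(n)]
    query = pts[0]
    ds = [dist(query, p) for p in pts[1:]]
    return max(ds) / min(ds)

low = distance_contrast(dim=2)
high = distance_contrast(dim=500)
print(round(low, 2), round(high, 2))
```

In two dimensions the nearest neighbor is far closer than the farthest point, so a radius like eps can separate dense from sparse regions. In hundreds of dimensions the ratio moves toward 1 for data like this, meaning almost every point is roughly equally far away and no single eps separates neighbors from non-neighbors well.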
Closing
DBSCAN is valuable because it finds dense regions and labels noise. It is a strong choice for spatial or density-shaped clustering problems, but it depends heavily on meaningful distances and good parameter choices.