dislib.cluster.DBSCAN

class dislib.cluster.dbscan.base.DBSCAN(eps=0.5, min_samples=5, arrange_data=True, n_regions=1, dimensions=None, max_samples=None)

Bases: object

Perform DBSCAN clustering.

This algorithm requires data to be arranged in a multidimensional grid. The default behavior is to re-arrange input data before running the clustering algorithm. See fit() for more details.

Parameters:

eps (float, optional (default=0.5)) – The maximum distance between two samples for them to be considered as in the same neighborhood.
min_samples (int, optional (default=5)) – The number of samples (or total weight) in a neighborhood for a point to be considered as a core point. This includes the point itself.
arrange_data (boolean, optional (default=True)) – Whether to re-arrange input data before performing clustering. If arrange_data=False, n_regions and dimensions have no effect.
n_regions (int, optional (default=1)) – Number of regions per dimension in which to divide the feature space. The total number of regions generated is equal to n_regions ^ len(dimensions). If arrange_data=False, n_regions is ignored.
dimensions (iterable, optional (default=None)) – Integer indices of the dimensions of the feature space that should be divided. If None, all dimensions are divided. If arrange_data=False, dimensions is ignored.
max_samples (int, optional (default=None)) – Setting max_samples to an integer parallelizes the computation of distances inside each region of the grid. That is, each region is processed by several parallel tasks, where each task finds the neighbors of max_samples samples. This can be used to balance the load in scenarios where samples are not evenly distributed in the feature space.
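The effect of max_samples on the number of tasks per region can be pictured with a small sketch. This is not dislib's implementation: split_into_tasks is a hypothetical helper, and treating max_samples=None as "one task per region" is an assumption drawn from the description above.

```python
def split_into_tasks(n_samples, max_samples=None):
    """Return (start, stop) index ranges, one per hypothetical task.

    Assumption: with max_samples=None the whole region is handled by a
    single task; otherwise each task receives at most max_samples samples.
    """
    if max_samples is None:
        return [(0, n_samples)]
    return [(start, min(start + max_samples, n_samples))
            for start in range(0, n_samples, max_samples)]


# A region holding 10 samples, processed in chunks of at most 4 samples.
tasks = split_into_tasks(10, max_samples=4)
print(tasks)
```

Under this sketch, a densely populated region produces more tasks than a sparse one, which is how a fixed chunk size can even out the work.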

Variables:

n_clusters (int) – Number of clusters found.
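To make the roles of eps and min_samples concrete, the following minimal sketch flags core points by brute force using only the standard library. It is an illustration, not dislib code: core_point_flags is a hypothetical helper, and treating the eps boundary as inclusive (distance <= eps) is an assumption.

```python
from math import dist


def core_point_flags(samples, eps=0.5, min_samples=5):
    """Flag samples that have at least min_samples samples within eps.

    The neighborhood count includes the sample itself, matching the
    description of min_samples above. The inclusive comparison (<=) is
    an assumption of this sketch.
    """
    flags = []
    for p in samples:
        n_neighbors = sum(1 for q in samples if dist(p, q) <= eps)
        flags.append(n_neighbors >= min_samples)
    return flags


x = [[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]]
flags = core_point_flags(x, eps=3, min_samples=2)
print(flags)
```

With these values the first five samples are core points, while the isolated sample [25, 80] has only itself in its neighborhood and is not.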

Examples

>>> import numpy as np
>>> from dislib.cluster import DBSCAN
>>> from dislib.data import load_data
>>> x = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]])
>>> train_data = load_data(x, subset_size=2)
>>> dbscan = DBSCAN(eps=3, min_samples=2)
>>> dbscan.fit(train_data)
>>> print(train_data.labels)

See also

utils.as_grid

fit(dataset)

Perform DBSCAN clustering on dataset and set dataset.labels.

If arrange_data=True, data is initially rearranged in a multidimensional grid with n_regions regions per dimension in dimensions. All regions in a specific dimension have the same size.

For example, suppose that data contains N partitions of 2-dimensional samples (n_features=2), where the first feature ranges from 1 to 5 and the second feature ranges from 0 to 1. Then, n_regions=10 re-arranges data into 10^2=100 new partitions, where each partition contains the samples that lie in one region of the grid. numpy.linspace() is employed to divide the feature space into uniform regions.
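The grid arrangement described above can be sketched with NumPy as follows. This is only an illustration under stated assumptions: assign_regions is a hypothetical helper, the use of numpy.searchsorted and the handling of samples that fall exactly on a region boundary are choices made for this sketch, and dislib's actual code may differ.

```python
import numpy as np


def assign_regions(x, n_regions):
    """Return, for every sample, its region index along each feature."""
    mins, maxs = x.min(axis=0), x.max(axis=0)
    region_idx = np.empty(x.shape, dtype=int)
    for d in range(x.shape[1]):
        # n_regions uniform regions are delimited by n_regions + 1 edges.
        edges = np.linspace(mins[d], maxs[d], n_regions + 1)
        # Compare against the interior edges only, so indices stay in
        # 0..n_regions-1 and the maximum value lands in the last region.
        region_idx[:, d] = np.searchsorted(edges[1:-1], x[:, d], side="right")
    return region_idx


# First feature ranges from 1 to 5, second feature from 0 to 1.
x = np.array([[1.0, 0.0], [3.1, 0.55], [5.0, 1.0], [1.3, 0.95]])
idx = assign_regions(x, n_regions=10)
n_total = 10 ** x.shape[1]  # 10^2 = 100 regions in total
print(idx.tolist(), n_total)
```

Each row of idx holds the grid coordinates of one sample; samples sharing the same coordinates would end up in the same partition.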

If data is already arranged in a grid, then the number of partitions in data must be equal to n_regions ^ len(dimensions). The equivalence between partition and region index is computed using numpy.ravel_multi_index().
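As a small illustration of the mapping between regions and partitions, the snippet below converts grid coordinates to a flat index with numpy.ravel_multi_index() and back with numpy.unravel_index(). It assumes NumPy's default row-major order; the source does not state which arguments dislib passes. Note that the text writes the exponent as ^, whereas Python uses ** (in Python, ^ is bitwise XOR).

```python
import numpy as np

n_regions = 10
n_dims = 2  # plays the role of len(dimensions) in the text above
grid_shape = (n_regions,) * n_dims

# Region with coordinates (3, 7) maps to a single flat partition index.
partition = int(np.ravel_multi_index((3, 7), grid_shape))

# The inverse mapping recovers the region coordinates.
region = tuple(int(i) for i in np.unravel_index(partition, grid_shape))

# Number of partitions expected for data that is already arranged.
n_partitions = n_regions ** n_dims
print(partition, region, n_partitions)
```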

Parameters:	dataset (Dataset) – Input data.

fit_predict(dataset)

Perform DBSCAN clustering on dataset. This method does the same as fit(), and is provided for API standardization purposes.

Parameters:	dataset (Dataset) – Input data.