dislib.cluster.DBSCAN

class dislib.cluster.dbscan.base.DBSCAN(eps=0.5, min_samples=5, n_regions=1, dimensions=None, max_samples=None)

Bases: sklearn.base.BaseEstimator

Perform DBSCAN clustering.

This algorithm requires data to be arranged in a multidimensional grid. The fit method re-arranges input data before running the clustering algorithm. See fit() for more details.

Parameters:


eps (float, optional (default=0.5)) – The maximum distance between two samples for them to be considered part of the same neighborhood.
min_samples (int, optional (default=5)) – The number of samples (or total weight) in a neighborhood for a point to be considered as a core point. This includes the point itself.
n_regions (int, optional (default=1)) – Number of regions per dimension in which to divide the feature space. The total number of regions generated is n_regions ** len(dimensions), that is, n_regions raised to the number of divided dimensions.
dimensions (iterable, optional (default=None)) – Integer indices of the dimensions of the feature space that should be divided. If None, all dimensions are divided.
max_samples (int, optional (default=None)) – Setting max_samples to an integer parallelizes the computation of distances inside each region of the grid: each region is processed by several parallel tasks, and each task finds the neighbours of max_samples samples. This can be used to balance the load when samples are not evenly distributed in the feature space.

Variables:

n_clusters (int) – Number of clusters found. Accessing this member performs a synchronization.
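
As a rough illustration of how n_regions, dimensions and max_samples determine the amount of work, the sketch below computes the number of grid regions and an estimate of the tasks launched per region. The helper names count_regions and tasks_per_region are hypothetical and not part of dislib. The task estimate assumes that each task handles at most max_samples samples and that a region is processed as a whole when max_samples is None; this follows the parameter description above but is not a confirmed detail of the implementation.

```python
import math


def count_regions(n_regions, n_features, dimensions=None):
    """Total grid regions: n_regions raised to the number of divided dimensions."""
    # With dimensions=None, every feature dimension is divided.
    n_divided = n_features if dimensions is None else len(list(dimensions))
    return n_regions ** n_divided


def tasks_per_region(n_samples_in_region, max_samples=None):
    """Estimated number of neighbour-search tasks for one region (assumption)."""
    if max_samples is None:
        return 1  # assumed: the region is handled as a whole
    return math.ceil(n_samples_in_region / max_samples)


if __name__ == "__main__":
    print(count_regions(n_regions=3, n_features=4, dimensions=[0, 2]))  # 3 ** 2
    print(count_regions(n_regions=2, n_features=3))                     # 2 ** 3
    print(tasks_per_region(1000, max_samples=300))                      # ceil(1000 / 300)
```

Because the region count grows exponentially with the number of divided dimensions, restricting dimensions to a few indices keeps the grid manageable for high-dimensional data.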

Examples

>>> from dislib.cluster import DBSCAN
>>> import dislib as ds
>>> import numpy as np
>>>
>>> if __name__ == '__main__':
...     arr = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]])
...     x = ds.array(arr, block_size=(2, 2))
...     dbscan = DBSCAN(eps=3, min_samples=2)
...     y = dbscan.fit_predict(x)
...     print(y.collect())
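
To make the roles of eps and min_samples concrete, the following NumPy-only sketch finds the core points for the same six samples. It does not use dislib. It assumes Euclidean distance and that a neighbour is any sample at distance less than or equal to eps; both are assumptions made for illustration and not statements about the library's internals.

```python
import numpy as np

# Same samples and parameters as in the example above.
arr = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]])
eps, min_samples = 3, 2

# Pairwise Euclidean distances between all samples (shape: 6 x 6).
diff = arr[:, None, :] - arr[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))

# Neighbourhood sizes; the zero diagonal means each point counts itself.
# Assumption: a neighbour is any sample at distance <= eps.
neighbour_counts = (dist <= eps).sum(axis=1)
is_core = neighbour_counts >= min_samples

print(neighbour_counts.tolist())  # [3, 3, 3, 2, 2, 1]
print(is_core.tolist())           # [True, True, True, True, True, False]
```

The first three samples are mutual neighbours, the next two form a second group, and the last sample has no neighbour other than itself, so it is not a core point.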

fit(x, y=None)

Perform DBSCAN clustering on x.

Samples are initially rearranged in a multidimensional grid with n_regions regions per dimension in dimensions. All regions in a specific dimension have the same size.

Parameters:

x (ds-array) – Input data.
y (ignored) – Not used, present here for API consistency by convention.

Returns:	self
Return type:	DBSCAN
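
The grid arrangement described above can be pictured with a small NumPy sketch that maps each sample to a region index along every divided dimension, using equal-width regions. The function assign_regions is a hypothetical helper written for this illustration; it only shows the partitioning idea and is not dislib's implementation.

```python
import numpy as np


def assign_regions(samples, n_regions, dimensions=None):
    """Return, for each sample, its region index along each divided dimension."""
    samples = np.asarray(samples, dtype=float)
    # With dimensions=None, every feature dimension is divided.
    dims = list(range(samples.shape[1])) if dimensions is None else list(dimensions)
    sub = samples[:, dims]
    lo = sub.min(axis=0)
    hi = sub.max(axis=0)
    # Equal-width regions per dimension; guard against zero-width ranges.
    width = np.where(hi > lo, (hi - lo) / n_regions, 1.0)
    idx = np.floor((sub - lo) / width).astype(int)
    # Samples lying exactly on the upper bound belong to the last region.
    return np.clip(idx, 0, n_regions - 1)


if __name__ == "__main__":
    arr = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]])
    print(assign_regions(arr, n_regions=2).tolist())
```

With n_regions=2, five of the six samples fall into region (0, 0) and only the outlier falls into region (1, 1). This is the kind of uneven distribution that max_samples is meant to balance.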

fit_predict(x)

Perform DBSCAN clustering on x and return the cluster label of each sample.

Parameters:	x (ds-array) – Input data.
Returns:	y – Cluster labels.
Return type:	ds-array, shape=(n_samples, 1)

n_clusters