dislib.cluster.DBSCAN¶

class dislib.cluster.dbscan.base.DBSCAN(eps=0.5, min_samples=5, arrange_data=True, n_regions=1, dimensions=None, max_samples=None)[source]¶

Bases: object
Perform DBSCAN clustering.
This algorithm requires data to be arranged in a multidimensional grid. The default behavior is to re-arrange input data before running the clustering algorithm. See fit() for more details.

Parameters:
    eps (float, optional (default=0.5)) – The maximum distance between two samples for them to be considered as in the same neighborhood.
    min_samples (int, optional (default=5)) – The number of samples (or total weight) in a neighborhood for a point to be considered as a core point. This includes the point itself.
    arrange_data (boolean, optional (default=True)) – Whether to re-arrange input data before performing clustering. If arrange_data=False, n_regions and dimensions have no effect.
    n_regions (int, optional (default=1)) – Number of regions per dimension in which to divide the feature space. The total number of regions generated is equal to n_regions ^ len(dimensions). If arrange_data=False, n_regions is ignored.
    dimensions (iterable, optional (default=None)) – Integer indices of the dimensions of the feature space that should be divided. If None, all dimensions are divided. If arrange_data=False, dimensions is ignored.
    max_samples (int, optional (default=None)) – Setting max_samples to an integer enables the parallelization of the computation of distances inside each region of the grid. That is, each region is processed using several parallel tasks, where each task finds the neighbours of max_samples samples. This can be used to balance the load in scenarios where samples are not evenly distributed in the feature space.
Variables:
    n_clusters (int) – Number of clusters found.
Examples
>>> from dislib.cluster import DBSCAN
>>> import numpy as np
>>> x = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]])
>>> from dislib.data import load_data
>>> train_data = load_data(x, subset_size=2)
>>> dbscan = DBSCAN(eps=3, min_samples=2)
>>> dbscan.fit(train_data)
>>> print(train_data.labels)
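The dislib example above needs a PyCOMPSs runtime to execute, but the meaning of eps and min_samples can be cross-checked with a minimal plain-NumPy sketch of the DBSCAN algorithm itself on the same six points. This is an illustration of the clustering semantics, not dislib code; the dbscan helper below is hypothetical.

```python
import numpy as np

def dbscan(x, eps, min_samples):
    """Minimal DBSCAN sketch: -1 marks noise, clusters are numbered from 0."""
    n = len(x)
    labels = np.full(n, -1)
    # Pairwise distances; a point's neighbourhood includes the point itself,
    # matching the min_samples description above.
    dist = np.linalg.norm(x[:, None] - x[None, :], axis=-1)
    neighbours = [np.where(dist[i] <= eps)[0] for i in range(n)]
    core = [len(nb) >= min_samples for nb in neighbours]
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or not core[i]:
            continue
        # Expand a new cluster from this unvisited core point.
        stack = [i]
        while stack:
            j = stack.pop()
            if labels[j] != -1:
                continue
            labels[j] = cluster
            # Only core points propagate the cluster to their neighbours.
            if core[j]:
                stack.extend(k for k in neighbours[j] if labels[k] == -1)
        cluster += 1
    return labels

x = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]])
print(dbscan(x, eps=3, min_samples=2))  # prints [ 0  0  0  1  1 -1]
```

With eps=3 and min_samples=2, the first three points and the next two form two clusters, while the outlier at (25, 80) is labeled -1 (noise).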
See also
utils.as_grid
fit(dataset)[source]¶

Perform DBSCAN clustering on data and set dataset.labels.

If arrange_data=True, data is initially rearranged in a multidimensional grid with n_regions regions per dimension in dimensions. All regions in a specific dimension have the same size.

For example, suppose that data contains N partitions of 2-dimensional samples (n_features=2), where the first feature ranges from 1 to 5 and the second feature ranges from 0 to 1. Then, n_regions=10 re-arranges data into 10^2=100 new partitions, where each partition contains the samples that lie in one region of the grid. numpy.linspace() is employed to divide the feature space into uniform regions.

If data is already arranged in a grid, the number of partitions in data must be equal to n_regions ^ len(dimensions). The equivalence between partition and region index is computed using numpy.ravel_multi_index().

Parameters:
    dataset (Dataset) – Input data.
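The grid arrangement described above can be illustrated with a small standalone NumPy sketch, under assumed semantics rather than dislib's actual implementation (region_index is a hypothetical helper): numpy.linspace() provides the uniform bin edges per dimension, and numpy.ravel_multi_index() flattens the per-dimension region indices into a single partition index.

```python
import numpy as np

def region_index(sample, mins, maxs, n_regions):
    """Map a sample to the flat index of its grid region (hypothetical helper)."""
    idx = []
    for value, lo, hi in zip(sample, mins, maxs):
        # Bin edges for n_regions uniform regions in [lo, hi].
        edges = np.linspace(lo, hi, n_regions + 1)
        # searchsorted locates the region containing the value; the min() keeps
        # the maximum of the range inside the last region.
        i = min(np.searchsorted(edges, value, side="right") - 1, n_regions - 1)
        idx.append(i)
    # Flatten the per-dimension indices into one partition index.
    return np.ravel_multi_index(idx, (n_regions,) * len(sample))

# 2-dimensional feature space as in the example above: first feature in
# [1, 5], second in [0, 1], divided with n_regions=10 into 100 regions.
mins, maxs = [1, 0], [5, 1]
print(region_index([1.0, 0.0], mins, maxs, 10))  # first region: 0
print(region_index([5.0, 1.0], mins, maxs, 10))  # last region: 99
```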
fit_predict(dataset)[source]¶

Perform DBSCAN clustering on dataset. This method does the same as fit(), and is provided for API standardization purposes.

Parameters:
    dataset (Dataset) – Input data.

See also
fit()
n_clusters¶