dislib.cluster.K-Means

class dislib.cluster.kmeans.base.KMeans(n_clusters=8, init='random', max_iter=10, tol=0.0001, arity=50, random_state=None, verbose=False)

Bases: sklearn.base.BaseEstimator

Perform K-means clustering.

Parameters:

n_clusters (int, optional (default=8)) – The number of clusters to form as well as the number of centroids to generate.
init ({‘random’, nd-array or sparse matrix}, optional (default=‘random’)) – Method of initialization. The default, ‘random’, generates random centers at the beginning. If an nd-array or sparse matrix is passed, it must have shape (n_clusters, n_features) and is used as the initial centers.
max_iter (int, optional (default=10)) – Maximum number of iterations of the k-means algorithm for a single run.
tol (float, optional (default=1e-4)) – Tolerance for accepting convergence.
arity (int, optional (default=50)) – Arity of the reduction carried out during the computation of the new centroids.
random_state (int or RandomState, optional (default=None)) – Seed or numpy.random.RandomState instance to generate random numbers for centroid initialization.
verbose (boolean, optional (default=False)) – Whether to print progress information.
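The arity parameter can be pictured with a small, library-independent sketch. It assumes each data block contributes a partial (per-feature sum, sample count) pair and that these pairs are merged in groups of at most arity until one remains. The helper name reduce_with_arity and the block values are hypothetical; this is not dislib's actual implementation.

```python
def reduce_with_arity(partials, arity):
    """Merge (per-feature sum, count) pairs in groups of at most `arity`."""
    if arity < 2:
        raise ValueError("arity must be at least 2")
    level = list(partials)
    while len(level) > 1:
        next_level = []
        for start in range(0, len(level), arity):
            group = level[start:start + arity]
            # Add the feature sums column by column, and add the counts.
            total = [sum(vals) for vals in zip(*(s for s, _ in group))]
            count = sum(c for _, c in group)
            next_level.append((total, count))
        level = next_level
    return level[0]


# Hypothetical partial results from four blocks assigned to one cluster.
partials = [([1.0, 2.0], 1), ([1.0, 4.0], 1), ([1.0, 0.0], 1), ([3.0, 6.0], 2)]
total, count = reduce_with_arity(partials, arity=2)
centroid = [t / count for t in total]
print(centroid)
```

A larger arity merges more partial results per step and yields a shallower reduction tree; the final sums do not depend on the arity chosen.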

Variables:

centers (ndarray) – Computed centroids.
n_iter (int) – Number of iterations performed.
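How max_iter and tol bound the iteration, and how centers and n_iter result from it, can be sketched in plain Python. This is an illustration under stated assumptions, not dislib's code: the function kmeans_sketch is hypothetical, it draws the initial centers from the input samples, and it treats the run as converged when the largest centroid displacement falls below tol. The source does not state the exact convergence criterion dislib applies.

```python
import math
import random


def kmeans_sketch(points, n_clusters=8, max_iter=10, tol=1e-4, random_state=None):
    """Single-process K-means sketch returning (centers, n_iter)."""
    rng = random.Random(random_state)
    # Assumption: initial centers are sampled from the data itself.
    centers = [list(p) for p in rng.sample(points, n_clusters)]
    n_iter = 0
    for n_iter in range(1, max_iter + 1):
        # Assignment step: attach every sample to its nearest center.
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda k: math.dist(p, centers[k]))
            groups[nearest].append(p)
        # Update step: mean of each group; an empty group keeps its center.
        new_centers = [
            [sum(col) / len(g) for col in zip(*g)] if g else c
            for g, c in zip(groups, centers)
        ]
        # Assumed stopping rule: largest centroid displacement below tol.
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers, n_iter


points = [[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]]
centers, n_iter = kmeans_sketch(points, n_clusters=2, random_state=0)
print(centers, n_iter)
```

Raising tol or lowering max_iter ends the loop earlier at the cost of less settled centers. Like any K-means run, the result depends on the initial centers.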

Examples

>>> import dislib as ds
>>> from dislib.cluster import KMeans
>>> import numpy as np
>>>
>>> if __name__ == '__main__':
...     x = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])
...     x_train = ds.array(x, (2, 2))
...     kmeans = KMeans(n_clusters=2, random_state=0)
...     labels = kmeans.fit_predict(x_train)
...     print(labels.collect())  # gather the distributed labels locally
...     x_test = ds.array(np.array([[0, 0], [4, 4]]), (2, 2))
...     labels = kmeans.predict(x_test)
...     print(labels.collect())
...     print(kmeans.centers)

fit(x, y=None)

Compute K-means clustering.

Parameters:	x (ds-array) – Samples to cluster. y (ignored) – Not used, present here for API consistency by convention.
Returns:	self
Return type:	KMeans

fit_predict(x, y=None)

Compute cluster centers and predict cluster index for each sample.

Parameters:	x (ds-array) – Samples to cluster. y (ignored) – Not used, present here for API consistency by convention.
Returns:	labels – Index of the cluster each sample belongs to.
Return type:	ds-array, shape=(n_samples, 1)

load_model(filepath, load_format='json')

Loads a model from a file. The model is reinstantiated in the exact same state in which it was saved, without any of the code used for model definition or fitting.

Parameters:	filepath (str) – Path of the saved model. load_format (str, optional (default=‘json’)) – Format used to load the model.

Examples

>>> from dislib.cluster import KMeans
>>> import numpy as np
>>> import dislib as ds
>>> x = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])
>>> x_train = ds.array(x, (2, 2))
>>> model = KMeans(n_clusters=2, random_state=0)
>>> model.fit(x_train)
>>> model.save_model('/tmp/model')
>>> loaded_model = KMeans()
>>> loaded_model.load_model('/tmp/model')
>>> x_test = ds.array(np.array([[0, 0], [4, 4]]), (2, 2))
>>> model_pred = model.predict(x_test)
>>> loaded_model_pred = loaded_model.predict(x_test)
>>> assert np.allclose(model_pred.collect(),
...                    loaded_model_pred.collect())

predict(x)

Predict the closest cluster each sample in the data belongs to.

Parameters:	x (ds-array) – New data to predict.
Returns:	labels – Index of the cluster each sample belongs to.
Return type:	ds-array, shape=(n_samples, 1)
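The assignment rule behind predict can be illustrated without dislib. The sketch below assumes Euclidean distance and uses hypothetical center values chosen for illustration; predict_sketch is not a dislib function, and it returns a plain list rather than a ds-array of shape (n_samples, 1).

```python
import math


def predict_sketch(samples, centers):
    """Return, for each sample, the index of its nearest center."""
    return [
        min(range(len(centers)), key=lambda k: math.dist(s, centers[k]))
        for s in samples
    ]


# Hypothetical centers, not the output of an actual dislib run.
centers = [[1.0, 2.0], [4.0, 2.0]]
labels = predict_sketch([[0, 0], [4, 4]], centers)
print(labels)  # [0, 1]
```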

save_model(filepath, overwrite=True, save_format='json')

Saves a model to a file. The model is synchronized before saving and can be reinstantiated in the exact same state, without any of the code used for model definition or fitting.

Parameters:	filepath (str) – Path where the model is saved. overwrite (bool, optional (default=True)) – Whether any existing model at the target location should be overwritten. save_format (str, optional (default=‘json’)) – Format used to save the model.

Examples

>>> from dislib.cluster import KMeans
>>> import numpy as np
>>> import dislib as ds
>>> x = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])
>>> x_train = ds.array(x, (2, 2))
>>> model = KMeans(n_clusters=2, random_state=0)
>>> model.fit(x_train)
>>> model.save_model('/tmp/model')
>>> loaded_model = KMeans()
>>> loaded_model.load_model('/tmp/model')
>>> x_test = ds.array(np.array([[0, 0], [4, 4]]), (2, 2))
>>> model_pred = model.predict(x_test)
>>> loaded_model_pred = loaded_model.predict(x_test)
>>> assert np.allclose(model_pred.collect(),
...                    loaded_model_pred.collect())
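The save and load round trip can be pictured with a standard-library sketch. The file layout, the helper names save_state and load_state, and the choice to raise FileExistsError when overwrite is False are all assumptions made for illustration; the source does not describe dislib's on-disk format or what happens when overwriting is disabled and the file already exists.

```python
import json
import os
import tempfile


def save_state(state, filepath, overwrite=True):
    """Write a model state as JSON, optionally refusing to overwrite."""
    # Assumption: refusing to overwrite is signalled with an exception.
    if not overwrite and os.path.exists(filepath):
        raise FileExistsError(filepath)
    with open(filepath, "w") as f:
        json.dump(state, f)


def load_state(filepath):
    """Read back a model state written by save_state."""
    with open(filepath) as f:
        return json.load(f)


# Hypothetical fitted state: hyperparameters plus learned attributes.
state = {"n_clusters": 2, "centers": [[1.0, 2.0], [4.0, 2.0]], "n_iter": 3}
path = os.path.join(tempfile.mkdtemp(), "model.json")
save_state(state, path)
restored = load_state(path)
print(restored == state)  # True
```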

dislib.cluster.kmeans.base.add_mix_kernel(y_len)

dislib.cluster.kmeans.base.distance_gpu(a_gpu, b_gpu)

dislib.cluster.kmeans.base.get_sq_sum_kernel()