dislib.cluster.GaussianMixture

class dislib.cluster.gm.base.GaussianMixture(n_components=1, covariance_type='full', check_convergence=True, tol=0.001, reg_covar=1e-06, max_iter=100, init_params='kmeans', weights_init=None, means_init=None, precisions_init=None, arity=50, verbose=False, random_state=None)[source]

Bases: sklearn.base.BaseEstimator

Gaussian mixture model.

Estimates the parameters of a Gaussian mixture probability distribution that fits the data. Allows clustering.

Parameters:
  • n_components (int, optional (default=1)) – The number of components.

  • covariance_type (str, optional (default=’full’)) – String describing the type of covariance parameters to use. Must be one of:

    'full' (each component has its own general covariance matrix),
    'tied' (all components share the same general covariance matrix),
    'diag' (each component has its own diagonal covariance matrix),
    'spherical' (each component has its own single variance).
    
  • check_convergence (boolean, optional (default=True)) – Whether to test for convergence at the end of each iteration. Setting it to False removes control dependencies, allowing fitting this model in parallel with other tasks.

  • tol (float, defaults to 1e-3.) – The convergence threshold. If the absolute change of the lower bound with respect to the previous iteration is below this threshold, the iterations will stop. Ignored if check_convergence is False.

  • reg_covar (float, defaults to 1e-6.) – Non-negative regularization added to the diagonal of covariance. Ensures that the covariance matrices are all positive definite.

  • max_iter (int, defaults to 100.) – The number of EM iterations to perform.

  • init_params ({‘kmeans’, ‘random’}, defaults to ‘kmeans’.) – The method used to initialize the weights, the means and the precisions. The chosen method initializes the responsibilities, and a maximization step then derives the initial model parameters. This is not used if weights_init, means_init and precisions_init are all provided. Must be one of:

    'kmeans' : responsibilities are initialized using kmeans,
    'random' : responsibilities are initialized randomly.
    
  • weights_init (array-like, shape=(n_components, ), optional) – The user-provided initial weights, defaults to None. If None, weights are initialized using the init_params method.

  • means_init (array-like, shape=(n_components, n_features), optional) – The user-provided initial means, defaults to None. If None, means are initialized using the init_params method.

  • precisions_init (array-like, optional) – The user-provided initial precisions (inverse of the covariance matrices), defaults to None. If None, precisions are initialized using the init_params method. The shape depends on ‘covariance_type’ (a construction sketch follows this parameter list):

    (n_components,)                        if 'spherical',
    (n_features, n_features)               if 'tied',
    (n_components, n_features)             if 'diag',
    (n_components, n_features, n_features) if 'full'
    
  • random_state (int, RandomState or None, optional (default=None)) – If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.

  • arity (int, optional (default=50)) – Arity of the reductions, i.e., the number of partial results combined at each step of the tree reductions.

  • verbose (boolean, optional (default=False)) – Whether to print progress information.
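
Construction sketch (illustrative only, not part of the dislib reference; it assumes NumPy arrays are accepted for the array-like initialization parameters and uses covariance_type='full'):

    import numpy as np
    import dislib as ds
    from dislib.cluster import GaussianMixture

    if __name__ == '__main__':
        n_components, n_features = 2, 2
        x = ds.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]],
                     (3, 2))

        # With covariance_type='full', precisions_init must have shape
        # (n_components, n_features, n_features).
        weights_init = np.full(n_components, 1.0 / n_components)
        means_init = np.array([[1.0, 2.0], [4.0, 2.0]])
        precisions_init = np.array([np.eye(n_features)] * n_components)

        gm = GaussianMixture(n_components=n_components,
                             covariance_type='full',
                             weights_init=weights_init,
                             means_init=means_init,
                             precisions_init=precisions_init)
        gm.fit(x)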

Variables:
  • weights_ (array-like, shape=(n_components,)) – The weight of each mixture component.
  • means_ (array-like, shape=(n_components, n_features)) – The mean of each mixture component.
  • covariances_ (array-like) –

    The covariance of each mixture component. The shape depends on covariance_type:

    (n_components,)                        if 'spherical',
    (n_features, n_features)               if 'tied',
    (n_components, n_features)             if 'diag',
    (n_components, n_features, n_features) if 'full'
    
  • precisions_cholesky_ (array-like) –

    The Cholesky decomposition of the precision matrices of each mixture component. A precision matrix is the inverse of a covariance matrix. A covariance matrix is symmetric positive definite, so the mixture of Gaussians can be equivalently parameterized by the precision matrices. Storing the precision matrices instead of the covariance matrices makes it more efficient to compute the log-likelihood of new samples at test time. A consistency check between the two parameterizations is sketched after this list. The shape depends on covariance_type:

    (n_components,)                        if 'spherical',
    (n_features, n_features)               if 'tied',
    (n_components, n_features)             if 'diag',
    (n_components, n_features, n_features) if 'full'
    
  • converged_ (bool) – True if check_convergence is True and convergence was reached, False otherwise.
  • n_iter_ (int) – Number of EM iterations performed.
  • lower_bound_ (float) – Lower bound value on the log-likelihood of the training data with respect to the model.
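
Consistency check (illustrative only; it assumes covariance_type='full', that the fitted attributes are PyCOMPSs futures retrieved with compss_wait_on as in the example below, and that, as in scikit-learn, the precision of component k equals precisions_cholesky_[k] @ precisions_cholesky_[k].T):

    import numpy as np
    from pycompss.api.api import compss_wait_on

    # 'gm' is a GaussianMixture already fitted with covariance_type='full'.
    covariances = compss_wait_on(gm.covariances_)
    prec_chol = compss_wait_on(gm.precisions_cholesky_)

    for cov_k, chol_k in zip(covariances, prec_chol):
        precision_k = chol_k @ chol_k.T  # rebuild the precision matrix
        # The covariance should be (numerically) the inverse of the precision.
        np.testing.assert_allclose(np.linalg.inv(precision_k), cov_k,
                                   atol=1e-6)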

Examples

>>> import dislib as ds
>>> from dislib.cluster import GaussianMixture
>>> from pycompss.api.api import compss_wait_on
>>>
>>>
>>> if __name__ == '__main__':
>>>     x = ds.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]],
>>>                  (3, 2))
>>>     gm = GaussianMixture(n_components=2, random_state=0)
>>>     labels = gm.fit_predict(x).collect()
>>>     print(labels)
>>>     x_test = ds.array([[0, 0], [4, 4]], (2, 2))
>>>     labels_test = gm.predict(x_test).collect()
>>>     print(labels_test)
>>>     print(compss_wait_on(gm.means_))
fit(x, y=None)[source]

Estimate model parameters with the EM algorithm.

Iterates between E-steps and M-steps until convergence or until max_iter iterations are reached. It estimates the model parameters weights_, means_ and covariances_.

Parameters:
  • x (ds-array, shape=(n_samples, n_features)) – Data points.
  • y (ignored) – Not used, present here for API consistency by convention.
Warns:

ConvergenceWarning – If tol is not None and max_iter iterations are reached without convergence.
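
Usage sketch (illustrative only; it relies on the converged_ and n_iter_ attributes listed above and uses a deliberately small max_iter, so the EM loop may stop before the tolerance is met and emit the documented ConvergenceWarning):

    import warnings
    import dislib as ds
    from dislib.cluster import GaussianMixture

    if __name__ == '__main__':
        x = ds.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]],
                     (3, 2))

        gm = GaussianMixture(n_components=2, max_iter=1, random_state=0)
        with warnings.catch_warnings(record=True) as caught:
            warnings.simplefilter("always")
            gm.fit(x)  # may warn if max_iter is reached without convergence

        print(gm.converged_, gm.n_iter_)
        print([str(w.message) for w in caught])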

fit_predict(x)[source]

Estimate model parameters and predict clusters for the same data.

Fits the model and, after fitting, uses the model to predict cluster labels for the same training data.

Parameters: x (ds-array, shape=(n_samples, n_features)) – Data points.
Returns: y – Cluster labels for x.
Return type: ds-array, shape=(n_samples, 1)
Warns: ConvergenceWarning – If tol is not None and max_iter iterations are reached without convergence.

predict(x)[source]

Predict cluster labels for the given data using the trained model.

Parameters: x (ds-array, shape=(n_samples, n_features)) – Data points.
Returns: y – Cluster labels for x.
Return type: ds-array, shape=(n_samples, 1)