dislib.decomposition

dislib.decomposition.qr.base.qr(a: Array, mode='full', overwrite_a=False)[source]

QR Decomposition (blocked).

Parameters
  • a (ds-arrays) – Input ds-array.

  • mode (string) – Mode of the algorithm:

    ‘full’ – computes the full Q matrix of size m x m and R of size m x n.

    ‘economic’ – computes Q of size m x n and R of size n x n.

    ‘r’ – computes only R of size m x n.

  • overwrite_a (bool) – If True, the input matrix is overwritten with R.

Returns

  • q (ds-array) – only for modes ‘full’ and ‘economic’

  • r (ds-array) – for all modes

Raises

ValueError – If m < n for the provided m x n matrix, if the blocks are not square, if the shape of the top-left block differs from the regular block shape, or if the shape of the bottom-right block differs from the regular block shape.
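
The output shapes of the ‘full’ and ‘economic’ modes can be checked on a small dense matrix. The sketch below uses NumPy's `numpy.linalg.qr` as a stand-in, on the assumption that its ‘complete’ and ‘reduced’ modes match the shapes described above; it does not call dislib and says nothing about the blocked algorithm itself.

```python
import numpy as np

# NumPy stand-in for illustration only; this is not dislib's blocked QR.
rng = np.random.default_rng(0)
m, n = 6, 4                      # m >= n, as required by the function above
a = rng.standard_normal((m, n))

# 'full' above corresponds (by shape) to NumPy's 'complete' mode.
q_full, r_full = np.linalg.qr(a, mode="complete")   # Q: m x m, R: m x n

# 'economic' above corresponds (by shape) to NumPy's 'reduced' mode.
q_econ, r_econ = np.linalg.qr(a, mode="reduced")    # Q: m x n, R: n x n

print(q_full.shape, r_full.shape)
print(q_econ.shape, r_econ.shape)

# Both factorizations reconstruct the input matrix.
print(np.allclose(q_full @ r_full, a), np.allclose(q_econ @ r_econ, a))
```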

class dislib.decomposition.pca.base.PCA(n_components=None, arity=50, method='eig', eps=1e-09)[source]

Bases: BaseEstimator

Principal component analysis (PCA).

Parameters
  • n_components (int or None, optional (default=None)) – Number of components to keep. If None, all components are kept.

  • arity (int, optional (default=50)) – Arity of the reductions. Only used if method=’eig’.

  • method (str, optional (default=’eig’)) – Method to use in the decomposition. Can be ‘svd’ for singular value decomposition or ‘eig’ for eigendecomposition of the covariance matrix. ‘svd’ is recommended when the data has a large number of features. Falls back to ‘eig’ if the method is not recognized.

  • eps (float, optional (default=1e-9)) – Tolerance for the convergence criterion when method=’svd’.

Variables
  • components (ds-array, shape (n_components, n_features)) –

    Principal axes in feature space, representing the directions of maximum variance in the data. The components are sorted by explained_variance_.

    Equal to the n_components eigenvectors of the covariance matrix with the largest eigenvalues.

  • explained_variance (ds-array, shape (1, n_components)) –

    The amount of variance explained by each of the selected components.

    Equal to the first n_components largest eigenvalues of the covariance matrix.

  • mean (ds-array, shape (1, n_features)) – Per-feature empirical mean, estimated from the training set.
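
As a rough illustration of how these three attributes relate to the covariance matrix, the following NumPy sketch computes them for the small dataset used in the example below. It is not dislib's implementation: it runs on a plain in-memory array, and the unbiased (n_samples - 1) normalization of the covariance is an assumption.

```python
import numpy as np

x = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]], dtype=float)

# Per-feature empirical mean, shape (1, n_features).
mean = x.mean(axis=0, keepdims=True)

# Covariance of the centered data (assumption: divided by n_samples - 1).
cov = np.cov(x - mean, rowvar=False)

# eigh returns eigenvalues in ascending order; reorder to descending.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]

explained_variance = eigvals[order][np.newaxis, :]   # shape (1, n_components)
components = eigvecs[:, order].T                     # shape (n_components, n_features)

print(mean)
print(explained_variance)
print(components)
```

For this dataset the two features are uncorrelated, so the eigenvalues are approximately 3.2 and 2.7 and the principal axes coincide with the coordinate axes (up to sign).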

Examples

>>> import dislib as ds
>>> from dislib.decomposition import PCA
>>> import numpy as np
>>>
>>> if __name__ == '__main__':
...     x = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])
...     bn, bm = 2, 2
...     data = ds.array(x=x, block_size=(bn, bm))
...     pca = PCA()
...     transformed_data = pca.fit_transform(data)
...     print(transformed_data)
...     print(pca.components_.collect())
...     print(pca.explained_variance_.collect())
fit(x, y=None)[source]

Fit the model with the dataset.

Parameters
  • x (ds-array, shape (n_samples, n_features)) – Training data.

  • y (ignored) – Not used, present here for API consistency by convention.

Returns

self

Return type

PCA

fit_transform(x)[source]

Fit the model with the dataset and apply the dimensionality reduction to it.

Parameters

x (ds-array, shape (n_samples, n_features)) – Training data.

Returns

transformed_darray

Return type

ds-array, shape (n_samples, n_components)

load_model(filepath, load_format='json')[source]

Loads a model from a file. The model is reinstantiated in the exact same state in which it was saved, without any of the code used for model definition or fitting.

Parameters
  • filepath (str) – Path of the saved model.

  • load_format (str, optional (default=’json’)) – Format used to load the model.

Examples

>>> from dislib.decomposition import PCA
>>> import numpy as np
>>> import dislib as ds
>>> x = ds.random_array((1000, 100),
...                     block_size=(100, 50), random_state=0)
>>> pca = PCA()
>>> x_transformed = pca.fit_transform(x)
>>> pca.save_model('/tmp/model')
>>> load_pca = PCA()
>>> load_pca.load_model('/tmp/model')
>>> x_load_transform = load_pca.transform(x)
>>> assert np.allclose(x_transformed.collect(),
...                    x_load_transform.collect())
save_model(filepath, overwrite=True, save_format='json')[source]

Saves a model to a file. The model is synchronized before saving and can be reinstantiated in the exact same state, without any of the code used for model definition or fitting.

Parameters
  • filepath (str) – Path where the model is saved.

  • overwrite (bool, optional (default=True)) – Whether any existing model at the target location should be overwritten.

  • save_format (str, optional (default=’json’)) – Format used to save the model.

Examples

>>> from dislib.decomposition import PCA
>>> import numpy as np
>>> import dislib as ds
>>> x = ds.random_array((1000, 100),
...                     block_size=(100, 50), random_state=0)
>>> pca = PCA()
>>> x_transformed = pca.fit_transform(x)
>>> pca.save_model('/tmp/model')
>>> load_pca = PCA()
>>> load_pca.load_model('/tmp/model')
>>> x_load_transform = load_pca.transform(x)
>>> assert np.allclose(x_transformed.collect(),
...                    x_load_transform.collect())
transform(x)[source]

Apply dimensionality reduction to ds-array.

The given dataset is projected on the first principal components previously extracted from a training ds-array.

Parameters

x (ds-array, shape (n_samples, n_features)) – New ds-array, with the same n_features as the training dataset.

Returns

transformed_darray

Return type

ds-array, shape (n_samples, n_components)
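
The projection described above can be sketched in NumPy as centering the new data with the training mean and multiplying by the transposed principal axes. The helper names `fit_pca_sketch` and `transform_sketch` are hypothetical and exist only for this illustration; they are not part of dislib, and the sketch works on in-memory arrays rather than ds-arrays.

```python
import numpy as np

def fit_pca_sketch(x, n_components):
    """Hypothetical helper: mean and leading principal axes of x."""
    mean = x.mean(axis=0, keepdims=True)
    cov = np.cov(x - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)           # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]
    return mean, eigvecs[:, order].T                 # (n_components, n_features)

def transform_sketch(x, mean, components):
    """Hypothetical helper: project x onto the stored principal axes."""
    return (x - mean) @ components.T                 # (n_samples, n_components)

x_train = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]], dtype=float)
mean, components = fit_pca_sketch(x_train, n_components=1)

# New data must have the same number of features as the training data.
x_new = np.array([[1.0, 4.0], [4.0, 0.0], [2.5, 2.0]])
projected = transform_sketch(x_new, mean, components)
print(projected.shape)
```

The sign of each principal axis is arbitrary, so projected values are only defined up to a sign flip per component.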

dislib.decomposition.tsqr.base.small_rechunk(rechunk_array)[source]
dislib.decomposition.tsqr.base.tsqr(a: Array, mode='complete', indexes=None)[source]

QR Decomposition for vertically long arrays.

Parameters
  • a (ds-arrays) – Input ds-array.

  • mode (basestring) – Mode of execution of the tsqr. The options are:

    complete – q is m x m and r is m x n, computed from beginning to end.

    complete_inverse – q is m x m and r is m x n, computed from end to beginning.

    reduced – q is m x n and r is n x n, computed from beginning to end.

    reduced_inverse – q is m x n and r is n x n, computed from end to beginning.

    r_complete – returns only r, of size m x n.

    r_reduced – returns only r, of size n x n.

  • indexes (list) – Columns to return. Only takes effect with an inverse mode (complete_inverse or reduced_inverse); in any other mode it is ignored.

Returns

  • q (ds-array) – The Q factor of the matrix, an orthonormal matrix; multiplying it by r returns the initial matrix. Not returned in the r_complete and r_reduced modes.

  • r (ds-array) – The R factor of the matrix, an upper triangular matrix; multiplying q by it returns the initial matrix.

Raises

ValueError – If the shape of the top-left block differs from the regular block shape, if m < n, if the mode is reduced or reduced_inverse and the number of rows per block is smaller than the total number of columns of the matrix, or if the mode is complete_inverse or reduced_inverse and the number of blocks is not a power of 2.
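
The idea behind a tall-skinny QR can be sketched on an in-memory NumPy array: factor each row block independently, stack the small R factors, and factor the stack once more. This is a minimal single-level sketch of the reduced case, not dislib's implementation; the function name `tsqr_reduced_sketch` is hypothetical, and how dislib actually organizes the reduction (for example as a tree) is not shown here.

```python
import numpy as np

def tsqr_reduced_sketch(a, n_blocks):
    """Hypothetical single-level TSQR returning Q (m x n) and R (n x n)."""
    n = a.shape[1]
    blocks = np.array_split(a, n_blocks, axis=0)
    # Each row block needs at least n rows so that its local R is n x n.
    if any(b.shape[0] < n for b in blocks):
        raise ValueError("each row block needs at least as many rows as columns")

    # Step 1: independent QR of every row block.
    local = [np.linalg.qr(b, mode="reduced") for b in blocks]
    r_stack = np.vstack([r for _, r in local])        # (n_blocks * n, n)

    # Step 2: QR of the stacked R factors yields the final R.
    q2, r = np.linalg.qr(r_stack, mode="reduced")

    # Step 3: combine each local Q with its n x n slice of q2.
    q = np.vstack([q_loc @ q2[i * n:(i + 1) * n]
                   for i, (q_loc, _) in enumerate(local)])
    return q, r

rng = np.random.default_rng(0)
a = rng.standard_normal((16, 3))
q, r = tsqr_reduced_sketch(a, n_blocks=4)
print(q.shape, r.shape)
print(np.allclose(q @ r, a))
```

The check on rows per block mirrors the documented requirement that, in the reduced modes, the number of rows per block is not smaller than the number of columns.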

dislib.decomposition.lanczos.base.check_tolerance(m, n, nsv, S, epsilon)[source]
dislib.decomposition.lanczos.base.lanczos_svd(a: Array, k, bs, rank, num_sv, tolerance, epsilon, max_num_iterations)[source]

Lanczos SVD

Parameters
  • a (ds-arrays) – Input ds-array.

  • k (int) – Number of iterations of the Lanczos algorithm. The number of inner iterations is obtained by dividing this parameter by bs, so it must be a multiple of bs.

  • bs (int) – Block size (in the column axis)

  • rank (int) – Number of restarting vectors (must be a multiple of bs)

  • num_sv (int) – Number of desired singular values

  • tolerance (float64) – If the residual value of a singular value is less than the tolerance, that singular value is considered to have converged.

  • epsilon (float64) – Value that defines the number of singular values required: the smaller it is, the more singular values are required.

  • max_num_iterations (int) – Maximum number of iterations of the Lanczos algorithm. The desired singular values are expected to converge before this limit is reached; if they do not, the algorithm stops after this many iterations.

Returns

  • U (ds-array) – The U of the matrix. Unitary array returned as a ds-array; its shape is A.shape[0] x rank, and its block size is the block size of A in the row axis x bs.

  • S (ds-array) – The S of the matrix. It is represented as a 2-dimensional matrix whose diagonal is the vector of singular values. Its shape is rank x rank and its block size is bs x bs.

  • V (ds-array) – The V of the matrix. Unitary array returned as a ds-array; its shape is A.shape[1] x rank, and its block size is bs x bs.

Raises

ValueError – If num_sv is bigger than the number of columns, if rank < num_sv, or if k <= rank.
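
For orientation, the core of a Lanczos-type SVD is the Golub-Kahan bidiagonalization, which reduces A to a small bidiagonal matrix whose singular values approximate those of A. The NumPy sketch below is the plain single-vector variant with full reorthogonalization. It is not dislib's algorithm, which works on blocks of size bs with restarting and a convergence test; the function name `golub_kahan_sketch` is hypothetical.

```python
import numpy as np

def golub_kahan_sketch(a, k, seed=0):
    """Hypothetical helper: k steps of Golub-Kahan bidiagonalization.

    Returns Q (m x k), B (k x k, upper bidiagonal) and P (n x k)
    such that a @ P is approximately Q @ B.
    """
    m, n = a.shape
    rng = np.random.default_rng(seed)
    P = np.zeros((n, k))
    Q = np.zeros((m, k))
    alphas = np.zeros(k)
    betas = np.zeros(max(k - 1, 0))

    p = rng.standard_normal(n)
    p /= np.linalg.norm(p)
    for j in range(k):
        P[:, j] = p
        q = a @ p
        if j > 0:
            q -= betas[j - 1] * Q[:, j - 1]
        q -= Q[:, :j] @ (Q[:, :j].T @ q)              # full reorthogonalization
        alphas[j] = np.linalg.norm(q)
        q /= alphas[j]
        Q[:, j] = q
        if j < k - 1:
            p = a.T @ q - alphas[j] * p
            p -= P[:, :j + 1] @ (P[:, :j + 1].T @ p)  # full reorthogonalization
            betas[j] = np.linalg.norm(p)
            p /= betas[j]

    B = np.diag(alphas) + np.diag(betas, 1)
    return Q, B, P

rng = np.random.default_rng(1)
a = rng.standard_normal((8, 5))

# With k equal to the number of columns, B has the same singular values as a.
Q, B, P = golub_kahan_sketch(a, k=5)
sv_lanczos = np.linalg.svd(B, compute_uv=False)
sv_exact = np.linalg.svd(a, compute_uv=False)
print(np.allclose(sv_lanczos, sv_exact))
```

In practice k is much smaller than the number of columns, and the largest singular values of B converge first; the sketch does not guard against breakdown (a zero norm), which a production implementation would have to handle.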

dislib.decomposition.lanczos.base.svd_lanczos_t_conv_criteria(A, P, Q, k=None, b=1, rank=2, max_it=None, singular_values=1, tol=1e-08)[source]
dislib.decomposition.randomsvd.base.check_convergeces(res, tol, nsv)[source]
dislib.decomposition.randomsvd.base.my_norm(A)[source]
dislib.decomposition.randomsvd.base.nsv_tolerance(m, n, nsv, S)[source]
dislib.decomposition.randomsvd.base.random_svd(a, iters, epsilon, tol, nsv=None, k=None, verbose=False)[source]

Random SVD

Parameters
  • a (ds-arrays) – Input ds-array. Its blocksize will condition this

  • iters (int) – Number of inner iterations for the Random algorithm to converge.

  • epsilon (float64) – Tolerance that determines how many singular values are required: the smaller it is, the more singular values are required. The algorithm automatically tries to reach this level of tolerance.

  • tol (float64) – If the residual value of a singular value is smaller than this tolerance, that singular value is considered to be converged.

  • nsv (int) – Number of desired singular values

  • k (int) – Number of restarting vectors. Must be a multiple of the block size and greater than nsv.

  • verbose (bool) – Controls the verbosity for the algorithm convergence. Shows convergence accuracy and singular value criteria for each iteration.

Returns

  • U (ds-array) – The U of the matrix. Unitary array returned as a ds-array; its shape is A.shape[0] x rank, and its block size is the block size of A in the row axis x bs.

  • S (ds-array) – The S of the matrix. It is represented as a 2-dimensional matrix whose diagonal is the vector of singular values. Its shape is rank x rank and its block size is bs x bs.

  • V (ds-array) – The V of the matrix. Unitary array returned as a ds-array; its shape is A.shape[1] x rank, and its block size is bs x bs.

Raises

ValueError – If nsv is bigger than the number of columns, if k < nsv, or if k is not a multiple of the block size (k % b != 0).
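
The general randomized SVD technique can be sketched in a few lines of NumPy: multiply the matrix by a random test matrix, refine the resulting subspace with a few power iterations, and compute an exact SVD of the small projected matrix. This is a textbook-style sketch, not dislib's implementation; the function name `randomized_svd_sketch` and the parameters `sketch_size` and `n_power_iters` are hypothetical, and their relation to k and iters above is only an assumption.

```python
import numpy as np

def randomized_svd_sketch(a, sketch_size, n_power_iters, seed=0):
    """Hypothetical helper: approximate SVD via a random range finder."""
    rng = np.random.default_rng(seed)
    omega = rng.standard_normal((a.shape[1], sketch_size))

    # Sample the range of a, then sharpen it with power iterations.
    y = a @ omega
    for _ in range(n_power_iters):
        y, _ = np.linalg.qr(y)          # re-orthonormalize for stability
        y = a @ (a.T @ y)
    q, _ = np.linalg.qr(y)              # orthonormal basis, m x sketch_size

    # Exact SVD of the small projected matrix (sketch_size x n).
    u_small, s, vt = np.linalg.svd(q.T @ a, full_matrices=False)
    return q @ u_small, s, vt.T         # U: m x k, s: k values, V: n x k

# Rank-3 test matrix, so a sketch of width 6 captures its range.
rng = np.random.default_rng(1)
a = rng.standard_normal((50, 3)) @ rng.standard_normal((3, 20))

u, s, v = randomized_svd_sketch(a, sketch_size=6, n_power_iters=2)
s_exact = np.linalg.svd(a, compute_uv=False)
print(u.shape, s.shape, v.shape)
print(np.allclose(s[:3], s_exact[:3]))
```

Unlike the function documented above, this sketch returns the singular values as a vector rather than as a diagonal matrix, and it runs a fixed number of iterations instead of checking residuals against a tolerance.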