dislib package
Subpackages
- dislib.array
- dislib.classification.CascadeSVM
- dislib.cluster.Daura
- dislib.cluster.DBSCAN
- dislib.cluster.GaussianMixture
- dislib.cluster.K-Means
- dislib.decomposition
- dislib.model_selection
- dislib.neighbors.NearestNeighbors
- dislib.optimization.ADMM
- dislib.preprocessing
- dislib.recommendation.ALS
- dislib.regression.Lasso
- dislib.regression.LinearRegression
- dislib.trees
- dislib.utils
dislib.array(x, block_size)
Loads data into a Distributed Array.
Parameters: - x (spmatrix or array-like, shape=(n_samples, n_features)) – Array of samples.
- block_size ((int, int)) – Block sizes in number of samples.
Returns: dsarray – A distributed representation of the data divided in blocks.
Return type: ds-array
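As a rough illustration of what "divided in blocks" means, the sketch below partitions a small in-memory matrix into a grid of blocks of at most block_size. It is plain Python and does not use dislib; the helper name split_into_blocks is hypothetical, and how dislib stores or distributes its blocks internally is not shown here.

```python
def split_into_blocks(rows, block_size):
    """Partition a list-of-rows matrix into a grid of blocks.

    Hypothetical helper for illustration only. Blocks on the bottom and
    right edges may be smaller than block_size.
    """
    block_rows, block_cols = block_size
    n_rows, n_cols = len(rows), len(rows[0])
    return [
        [
            [row[c:c + block_cols] for row in rows[r:r + block_rows]]
            for c in range(0, n_cols, block_cols)
        ]
        for r in range(0, n_rows, block_rows)
    ]


# A 5 x 4 matrix whose entries are 0..19 in row-major order.
data = [[r * 4 + c for c in range(4)] for r in range(5)]
blocks = split_into_blocks(data, (2, 3))

print(len(blocks), len(blocks[0]))  # grid dimensions: 3 2
print(blocks[0][0])                 # [[0, 1, 2], [4, 5, 6]]
print(blocks[2][1])                 # [[19]]
```

With a 5 x 4 input and a block size of (2, 3), the grid has 3 x 2 blocks, and the last block holds the single leftover element.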
dislib.random_array(shape, block_size, random_state=None)
Returns a distributed array of random floats in the half-open interval [0.0, 1.0). Values are drawn from the continuous uniform distribution over that interval.
Parameters: - shape (tuple of two ints) – Shape of the output ds-array.
- block_size (tuple of two ints) – Size of the ds-array blocks.
- random_state (int or RandomState, optional (default=None)) – Seed or numpy.random.RandomState instance to generate the random numbers.
Returns: x – Distributed array of random floats.
Return type: ds-array
dislib.zeros(shape, block_size, dtype=None)
Returns a ds-array of given shape and block size, filled with zeros.
Parameters: - shape (tuple of two ints) – Shape of the output ds-array.
- block_size (tuple of two ints) – Size of the ds-array blocks.
- dtype (data type, optional (default=None)) – The desired type of the array. Defaults to float.
Returns: x – Distributed array filled with zeros.
Return type: ds-array
dislib.full(shape, block_size, fill_value, dtype=None)
Returns a ds-array of 'shape' filled with 'fill_value'.
Parameters: - shape (tuple of two ints) – Shape of the output ds-array.
- block_size (tuple of two ints) – Size of the ds-array blocks.
- fill_value (scalar) – Fill value.
- dtype (data type, optional (default=None)) – The desired type of the array. Defaults to float.
Returns: x – Distributed array filled with the fill value.
Return type: ds-array
dislib.identity(n, block_size, dtype=None)
Returns the identity matrix.
Parameters: - n (int) – Size of the matrix.
- block_size (tuple of two ints) – Block size.
- dtype (data type, optional (default=None)) – The desired type of the ds-array. Defaults to float.
Returns: x – Identity matrix of shape n x n.
Return type: ds-array
Raises: ValueError – If block_size is greater than n.
dislib.eye(n, m, block_size, dtype=None)
Returns a matrix filled with ones on the diagonal and zeros elsewhere.
Parameters: - n (int) – number of rows.
- m (int) – number of columns.
- block_size (tuple of two ints) – Block size.
- dtype (data type, optional (default=None)) – The desired type of the ds-array. Defaults to float.
Returns: x – Identity matrix of shape n x m.
Return type: ds-array
Raises: ValueError – If block_size is greater than n.
dislib.load_txt_file(path, block_size, delimiter=',')
Loads a text file into a distributed array.
Parameters: - path (string) – File path.
- block_size (tuple (int, int)) – Size of the blocks of the array.
- delimiter (string, optional (default=",")) – String that separates columns in the file.
Returns: x – A distributed representation of the data divided in blocks.
Return type: ds-array
dislib.load_svmlight_file(path, block_size, n_features, store_sparse)
Loads a SVMLight file into a distributed array.
Parameters: - path (string) – File path.
- block_size (tuple (int, int)) – Size of the blocks for the output ds-array.
- n_features (int) – Number of features.
- store_sparse (boolean) – Whether to use scipy.sparse data structures to store data. If False, numpy.array is used instead.
Returns: x, y – A distributed representation (ds-array) of the X and y.
Return type: (ds-array, ds-array)
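For readers unfamiliar with the SVMLight text format, the sketch below parses a single line of the form "label index:value ..." into a dense feature row and a label, which is why n_features has to be supplied up front. It is plain Python, not dislib code; the helper name parse_svmlight_line is hypothetical, and it assumes one-based feature indices and no qid fields.

```python
def parse_svmlight_line(line, n_features):
    """Parse one SVMLight line into (dense_row, label).

    Hypothetical helper for illustration. Assumes one-based feature
    indices and ignores anything after a '#' comment marker.
    """
    tokens = line.split("#", 1)[0].split()
    label = float(tokens[0])
    row = [0.0] * n_features  # features missing from the line stay at zero
    for token in tokens[1:]:
        index, value = token.split(":")
        row[int(index) - 1] = float(value)
    return row, label


row, label = parse_svmlight_line("1 1:0.5 3:2.0 # sample comment", n_features=4)
print(row, label)  # [0.5, 0.0, 2.0, 0.0] 1.0
```

Because only non-zero features appear in the file, the number of columns cannot be inferred reliably from a single line, and a sparse in-memory representation (store_sparse=True) is often the natural choice.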
dislib.load_npy_file(path, block_size)
Loads a file in npy format (must be 2-dimensional).
Parameters: - path (str) – Path to the npy file.
- block_size (tuple (int, int)) – Block size of the resulting ds-array.
Returns: x
Return type: ds-array
dislib.load_mdcrd_file(path, block_size, n_atoms, copy=False)
Loads a mdcrd trajectory file into a distributed array.
Parameters: - path (string) – File path.
- block_size (tuple (int, int)) – Size of the blocks of the array.
- n_atoms (int) – Number of atoms in the trajectory. Each frame in the mdcrd file has 3*n_atoms float values (corresponding to 3-dimensional position).
- copy (boolean, default=False) – Send the file to every task, as opposed to reading it once in the master program.
Returns: x – A distributed representation of the data divided in blocks.
Return type: ds-array
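The role of n_atoms can be seen with a little arithmetic: a flat stream of coordinates is cut into frames of 3*n_atoms values each. The sketch below does this in plain Python on a list of floats. The helper name split_frames is hypothetical, and treating each frame as one row of the resulting array is an assumption made for illustration, not something stated above.

```python
def split_frames(values, n_atoms):
    """Group a flat list of coordinates into frames of 3 * n_atoms values.

    Hypothetical helper for illustration; it does not read mdcrd files.
    """
    frame_len = 3 * n_atoms
    if len(values) % frame_len != 0:
        raise ValueError("number of values is not a multiple of 3 * n_atoms")
    return [values[i:i + frame_len] for i in range(0, len(values), frame_len)]


# Two atoms, two frames: 2 frames * 2 atoms * 3 coordinates = 12 values.
coords = [float(i) for i in range(12)]
frames = split_frames(coords, n_atoms=2)

print(len(frames), len(frames[0]))  # 2 6
```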
dislib.matmul(a: dislib.data.array.Array, b: dislib.data.array.Array, transpose_a=False, transpose_b=False)
Matrix multiplication with a possible transpose of the input.
Parameters: - a (ds-array) – First matrix.
- b (ds-array) – Second matrix.
- transpose_a (bool) – Whether to transpose the first matrix before multiplication.
- transpose_b (bool) – Whether to transpose the second matrix before multiplication.
Returns: out – The output array.
Return type: ds-array
Raises: - NotImplementedError – If _top_left shape does not match _reg_shape. This case will be implemented in the future.
- ValueError – If any of the block sizes does not match.
Examples
>>> import dislib as ds
>>>
>>>
>>> if __name__ == "__main__":
>>>     x = ds.random_array((8, 4), block_size=(2, 2))
>>>     y = ds.random_array((5, 8), block_size=(2, 2))
>>>     result = ds.matmul(x, y, transpose_a=True, transpose_b=True)
>>>     print(result.collect())
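The shapes in the example above can be checked by hand: transposing an 8 x 4 matrix gives 4 x 8, transposing a 5 x 8 matrix gives 8 x 5, so the product is 4 x 5. The sketch below encodes that rule in plain Python. The helper name matmul_shape is hypothetical, and it checks only overall shapes, not the block-size compatibility that the ValueError above refers to.

```python
def matmul_shape(a_shape, b_shape, transpose_a=False, transpose_b=False):
    """Return the shape of the product after the optional transposes.

    Hypothetical helper for illustration; it validates shapes only.
    """
    rows_a, cols_a = a_shape[::-1] if transpose_a else a_shape
    rows_b, cols_b = b_shape[::-1] if transpose_b else b_shape
    if cols_a != rows_b:
        raise ValueError(
            f"inner dimensions do not match: {cols_a} != {rows_b}"
        )
    return (rows_a, cols_b)


print(matmul_shape((8, 4), (5, 8), transpose_a=True, transpose_b=True))  # (4, 5)
```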
dislib.save_txt(arr, dir, merge_rows=False)
Save a ds-array by blocks to a directory in txt format.
Parameters: - arr (ds-array) – Array data to be saved.
- dir (str) – Directory into which the data is saved.
- merge_rows (boolean, default=False) – Merge blocks along rows before saving.
dislib.apply_along_axis(func, axis, x, *args, **kwargs)
Apply a function to slices along the given axis.
Execute func(a, *args, **kwargs) where func operates on nd-arrays and a is a slice of x along axis. The size of the slices is determined by the block shape of x.
func must meet the following conditions:
- Take an nd-array as argument
- Accept axis as a keyword argument
- Return an array-like structure
Parameters: - func (function) – This function should accept nd-arrays and an axis argument. It is applied to slices of x along the specified axis.
- axis (integer) – Axis along which x is sliced. Can be 0 or 1.
- x (ds-array) – Input distributed array.
- args (any) – Additional arguments to func.
- kwargs (any) – Additional named arguments to func.
Returns: out – The output array. The shape of out is identical to the shape of x, except along the axis dimension. The output ds-array is dense regardless of the type of the input array.
Return type: ds-array
Examples
>>> import dislib as ds
>>> import numpy as np
>>>
>>>
>>> if __name__ == '__main__':
>>>     x = ds.random_array((100, 100), block_size=(25, 25))
>>>     mean = ds.apply_along_axis(np.mean, 0, x)
>>>     print(mean.collect())
dislib.kron(a, b, block_size=None)
Kronecker product of two ds-arrays.
Parameters: - a, b (ds-arrays) – Input ds-arrays.
- block_size (tuple of two ints, optional) – Block size of the resulting array. Defaults to the block size of b.
Returns: out
Return type: ds-array
Raises: NotImplementedError – If a or b are sparse.
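As a reminder of what the Kronecker product computes, the sketch below builds it for two small dense matrices in plain Python: every element of a is multiplied by the whole of b, so an m x n and a p x q input give an (m*p) x (n*q) result. The helper name kron_dense is hypothetical and this is not how dislib distributes the computation.

```python
def kron_dense(a, b):
    """Kronecker product of two matrices given as lists of rows.

    Hypothetical helper for illustration. Entry (i*p + k, j*q + l) of the
    result equals a[i][j] * b[k][l], where b has shape p x q.
    """
    return [
        [x * y for x in row_a for y in row_b]
        for row_a in a
        for row_b in b
    ]


a = [[1, 2], [3, 4]]
b = [[0, 1], [1, 0]]
product = kron_dense(a, b)

for row in product:
    print(row)
# [0, 1, 0, 2]
# [1, 0, 2, 0]
# [0, 3, 0, 4]
# [3, 0, 4, 0]
```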
dislib.svd(a, compute_uv=True, sort=True, copy=True, eps=1e-09)
Performs singular value decomposition of a ds-array via the one-sided block Jacobi algorithm described in Arbenz and Slapnicar [1] and Dongarra et al. [2].
Singular value decomposition is a factorization of the form A = USV’, where U and V are unitary matrices and S is a rectangular diagonal matrix.
Parameters: - a (ds-array, shape=(m, n)) – Input matrix (m >= n). Needs to be partitioned in two column blocks at least due to the design of the block Jacobi algorithm.
- compute_uv (boolean, optional (default=True)) – Whether or not to compute u and v in addition to s.
- sort (boolean, optional (default=True)) – Whether to return sorted u, s and v. Sorting requires a significant amount of additional computation.
- copy (boolean, optional (default=True)) – Whether to create a copy of a or to apply transformations on a directly. Only valid if a is regular (i.e., top left block is of regular shape).
- eps (float, optional (default=1e-9)) – Tolerance for the convergence criterion.
Returns: - u (ds-array, shape=(m, n)) – U matrix. Only returned if compute_uv is True.
- s (ds-array, shape=(1, n)) – Diagonal entries of S.
- v (ds-array, shape=(n, n)) – V matrix. Only returned if compute_uv is True.
Raises: ValueError – If a has fewer than 2 column blocks or m < n.
References
[1] Arbenz, P. and Slapnicar, A. (1995). An Analysis of Parallel Implementations of the Block-Jacobi Algorithm for Computing the SVD. In Proceedings of the 17th International Conference on Information Technology Interfaces ITI (pp. 13-16).
[2] Dongarra, J., Gates, M., Haidar, A. et al. (2018). The singular value decomposition: Anatomy of optimizing an algorithm for extreme scale. In SIAM review, 60(4) (pp. 808-865).
Examples
>>> import dislib as ds
>>> import numpy as np
>>>
>>>
>>> if __name__ == '__main__':
>>>     x = ds.random_array((10, 6), (2, 2), random_state=7)
>>>     u, s, v = ds.svd(x)
>>>     u = u.collect()
>>>     s = np.diag(s.collect())
>>>     v = v.collect()
>>>     print(np.allclose(x.collect(), u @ s @ v.T))
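To give a feel for the one-sided Jacobi idea, the sketch below implements the scalar (column-by-column) Hestenes variant in plain Python: pairs of columns are rotated until they are mutually orthogonal, after which the column norms are the singular values. This is a simplified stand-in, not the block algorithm that dislib uses; the function name jacobi_svd is hypothetical, the result is unsorted, and the sketch assumes m >= n with full column rank.

```python
import math


def jacobi_svd(a, eps=1e-9, max_sweeps=50):
    """One-sided (Hestenes) Jacobi SVD of a list-of-rows matrix.

    Hypothetical, non-block illustration. Returns (u, s, v), where u and v
    are lists of columns and s is the list of singular values.
    """
    m, n = len(a), len(a[0])
    cols = [[a[i][j] for i in range(m)] for j in range(n)]  # columns of A*V
    v = [[1.0 if i == j else 0.0 for i in range(n)] for j in range(n)]
    for _ in range(max_sweeps):
        rotated = False
        for i in range(n - 1):
            for j in range(i + 1, n):
                alpha = sum(x * x for x in cols[i])
                beta = sum(x * x for x in cols[j])
                gamma = sum(x * y for x, y in zip(cols[i], cols[j]))
                if abs(gamma) <= eps * math.sqrt(alpha * beta):
                    continue  # columns i and j are already orthogonal enough
                rotated = True
                zeta = (beta - alpha) / (2.0 * gamma)
                sign = 1.0 if zeta >= 0 else -1.0
                t = sign / (abs(zeta) + math.sqrt(1.0 + zeta * zeta))
                c = 1.0 / math.sqrt(1.0 + t * t)
                s = c * t
                # Apply the same plane rotation to the columns of A*V and V.
                for mat in (cols, v):
                    ci, cj = mat[i], mat[j]
                    mat[i] = [c * x - s * y for x, y in zip(ci, cj)]
                    mat[j] = [s * x + c * y for x, y in zip(ci, cj)]
        if not rotated:
            break  # a full sweep without rotations means convergence
    sing = [math.sqrt(sum(x * x for x in col)) for col in cols]
    u = [[x / sk for x in col] for col, sk in zip(cols, sing)]
    return u, sing, v


a = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
u, s, v = jacobi_svd(a)

# Rebuild A = U S V' entry by entry and measure the largest deviation.
recon = [
    [sum(u[k][r] * s[k] * v[k][c] for k in range(2)) for c in range(2)]
    for r in range(3)
]
max_err = max(abs(recon[r][c] - a[r][c]) for r in range(3) for c in range(2))
print(max_err < 1e-9)
```

The eps argument plays the same role as in the signature above: a pair of columns is skipped once its inner product is small relative to the product of the column norms.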