dislib package

dislib.array(x, block_size)[source]

Loads data into a Distributed Array.

Parameters:
  • x (spmatrix or array-like, shape=(n_samples, n_features)) – Array of samples.
  • block_size ((int, int)) – Block size as (rows, columns), that is, the number of samples and the number of features per block.
Returns:

dsarray – A distributed representation of the data divided in blocks.

Return type:

ds-array
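
How block_size relates to the overall shape can be sketched in plain Python. The helper below is hypothetical (it is not part of dislib) and assumes that blocks are laid out on a regular grid, with the trailing blocks being smaller when the shape is not a multiple of the block size.

```python
import math


def block_grid(shape, block_size):
    """Return the number of blocks per axis and the shape of every block.

    Hypothetical helper: assumes a regular grid whose last row and column
    of blocks shrink to fit the remaining rows and columns.
    """
    n_rows, n_cols = shape
    b_rows, b_cols = block_size
    grid = (math.ceil(n_rows / b_rows), math.ceil(n_cols / b_cols))
    block_shapes = [
        [
            (min(b_rows, n_rows - i * b_rows), min(b_cols, n_cols - j * b_cols))
            for j in range(grid[1])
        ]
        for i in range(grid[0])
    ]
    return grid, block_shapes


grid, shapes = block_grid((10, 6), (4, 4))
print(grid)        # (3, 2)
print(shapes[-1])  # [(2, 4), (2, 2)]
```

Under this assumption, a 10 x 6 array with 4 x 4 blocks is stored as a 3 x 2 grid whose last block row holds only two samples.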

dislib.random_array(shape, block_size, random_state=None)[source]

Returns a distributed array of random floats in the half-open interval [0.0, 1.0). Values are drawn from the continuous uniform distribution over that interval.

Parameters:
  • shape (tuple of two ints) – Shape of the output ds-array.
  • block_size (tuple of two ints) – Size of the ds-array blocks.
  • random_state (int or RandomState, optional (default=None)) – Seed or numpy.random.RandomState instance to generate the random numbers.
Returns:

x – Distributed array of random floats.

Return type:

ds-array

dislib.zeros(shape, block_size, dtype=None)[source]

Returns a ds-array of given shape and block size, filled with zeros.

Parameters:
  • shape (tuple of two ints) – Shape of the output ds-array.
  • block_size (tuple of two ints) – Size of the ds-array blocks.
  • dtype (data type, optional (default=None)) – The desired type of the array. Defaults to float.
Returns:

x – Distributed array filled with zeros.

Return type:

ds-array

dislib.full(shape, block_size, fill_value, dtype=None)[source]

Returns a ds-array of ‘shape’ filled with ‘fill_value’.

Parameters:
  • shape (tuple of two ints) – Shape of the output ds-array.
  • block_size (tuple of two ints) – Size of the ds-array blocks.
  • fill_value (scalar) – Fill value.
  • dtype (data type, optional (default=None)) – The desired type of the array. Defaults to float.
Returns:

x – Distributed array filled with the fill value.

Return type:

ds-array

dislib.identity(n, block_size, dtype=None)[source]

Returns the identity matrix.

Parameters:
  • n (int) – Size of the matrix.
  • block_size (tuple of two ints) – Block size.
  • dtype (data type, optional (default=None)) – The desired type of the ds-array. Defaults to float.
Returns:

x – Identity matrix of shape n x n.

Return type:

ds-array

Raises:

ValueError – If block_size is greater than n.

dislib.eye(n, m, block_size, dtype=None)[source]

Returns a matrix filled with ones on the diagonal and zeros elsewhere.

Parameters:
  • n (int) – number of rows.
  • m (int) – number of columns.
  • block_size (tuple of two ints) – Block size.
  • dtype (data type, optional (default=None)) – The desired type of the ds-array. Defaults to float.
Returns:

x – Matrix of shape n x m with ones on the main diagonal and zeros elsewhere.

Return type:

ds-array

Raises:

ValueError – If block_size is greater than n.
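
The difference between identity and eye is only the shape: eye allows a rectangular result. The following plain-Python sketch (a hypothetical helper working on nested lists, not dislib code) shows the values such a matrix holds.

```python
def eye_sketch(n, m):
    """Build an n x m nested list with ones on the main diagonal."""
    return [[1.0 if i == j else 0.0 for j in range(m)] for i in range(n)]


# A rectangular case: 2 rows, 3 columns.
for row in eye_sketch(2, 3):
    print(row)
```

With n == m the same helper yields the values of an identity matrix of size n.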

dislib.load_txt_file(path, block_size, delimiter=',')[source]

Loads a text file into a distributed array.

Parameters:
  • path (string) – File path.
  • block_size (tuple (int, int)) – Size of the blocks of the array.
  • delimiter (string, optional (default=",")) – String that separates columns in the file.
Returns:

x – A distributed representation of the data divided in blocks.

Return type:

ds-array

dislib.load_svmlight_file(path, block_size, n_features, store_sparse)[source]

Loads an SVMLight file into a distributed array.

Parameters:
  • path (string) – File path.
  • block_size (tuple (int, int)) – Size of the blocks for the output ds-array.
  • n_features (int) – Number of features.
  • store_sparse (boolean) – Whether to use scipy.sparse data structures to store data. If False, numpy.array is used instead.
Returns:

x, y – Distributed representations (ds-arrays) of the samples X and the labels y.

Return type:

(ds-array, ds-array)
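
The SVMLight text format stores one sample per line as a label followed by index:value pairs for the non-zero features, which is why n_features has to be given explicitly: no single line reveals the full width. The sketch below parses one such line in plain Python. The function name is hypothetical, the 1-based feature indexing is an assumption (it is the common convention for this format), and extensions such as qid fields are not handled. It does not reflect how dislib reads the file internally.

```python
def parse_svmlight_line(line, n_features):
    """Parse one 'label index:value ...' line into (label, dense_row)."""
    line = line.split("#", 1)[0].strip()  # drop a trailing comment, if any
    tokens = line.split()
    label = float(tokens[0])
    row = [0.0] * n_features
    for token in tokens[1:]:
        index, value = token.split(":")
        row[int(index) - 1] = float(value)  # assumption: 1-based indices
    return label, row


label, row = parse_svmlight_line("1 1:0.5 4:2.0", n_features=5)
print(label, row)  # 1.0 [0.5, 0.0, 0.0, 2.0, 0.0]
```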

dislib.load_npy_file(path, block_size)[source]

Loads a file in npy format (must be 2-dimensional).

Parameters:
  • path (str) – Path to the npy file.
  • block_size (tuple (int, int)) – Block size of the resulting ds-array.
Returns:

x

Return type:

ds-array

dislib.load_mdcrd_file(path, block_size, n_atoms, copy=False)[source]

Loads an mdcrd trajectory file into a distributed array.

Parameters:
  • path (string) – File path.
  • block_size (tuple (int, int)) – Size of the blocks of the array.
  • n_atoms (int) – Number of atoms in the trajectory. Each frame in the mdcrd file has 3*n_atoms float values (corresponding to 3-dimensional position).
  • copy (boolean, default=False) – Send the file to every task, as opposed to reading it once in the master program.
Returns:

x – A distributed representation of the data divided in blocks.

Return type:

ds-array

dislib.matmul(a: dislib.data.array.Array, b: dislib.data.array.Array, transpose_a=False, transpose_b=False)[source]

Matrix multiplication with a possible transpose of the input.

Parameters:
  • a (ds-array) – First matrix.
  • b (ds-array) – Second matrix.
  • transpose_a (bool) – Whether to transpose the first matrix before multiplication.
  • transpose_b (bool) – Whether to transpose the second matrix before multiplication.
Returns:

out – The output array.

Return type:

ds-array

Raises:
  • NotImplementedError – If _top_left shape does not match _reg_shape. This case will be implemented in the future.
  • ValueError – If any of the block sizes does not match.
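
The effect of the transpose flags on the operand shapes can be checked with a small plain-Python helper. This is a hypothetical sketch of the shape arithmetic only. It ignores block sizes, and its ValueError is illustrative and is not the block-size check listed above.

```python
def matmul_shape(a_shape, b_shape, transpose_a=False, transpose_b=False):
    """Return the shape of the product after the optional transposes."""
    if transpose_a:
        a_shape = a_shape[::-1]
    if transpose_b:
        b_shape = b_shape[::-1]
    if a_shape[1] != b_shape[0]:
        raise ValueError(
            "inner dimensions differ: %d != %d" % (a_shape[1], b_shape[0])
        )
    return (a_shape[0], b_shape[1])


# An (8, 4) and a (5, 8) operand only line up once both are transposed.
print(matmul_shape((8, 4), (5, 8), transpose_a=True, transpose_b=True))  # (4, 5)
```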

Examples

>>> import dislib as ds
>>>
>>>
>>> if __name__ == "__main__":
>>>     x = ds.random_array((8, 4), block_size=(2, 2))
>>>     y = ds.random_array((5, 8), block_size=(2, 2))
>>>     result = ds.matmul(x, y, transpose_a=True, transpose_b=True)
>>>     print(result.collect())

dislib.save_txt(arr, dir, merge_rows=False)[source]

Save a ds-array by blocks to a directory in txt format.

Parameters:
  • arr (ds-array) – Array data to be saved.
  • dir (str) – Directory into which the data is saved.
  • merge_rows (boolean, default=False) – Merge blocks along rows before saving.

dislib.apply_along_axis(func, axis, x, *args, **kwargs)[source]

Apply a function to slices along the given axis.

Executes func(a, *args, **kwargs), where func operates on nd-arrays and a is a slice of x along axis. The size of the slices is determined by the block shape of x.

func must meet the following conditions:

  • Take an nd-array as argument
  • Accept axis as a keyword argument
  • Return an array-like structure
Parameters:
  • func (function) – This function should accept nd-arrays and an axis argument. It is applied to slices of x along the specified axis.
  • axis (integer) – Axis along which x is sliced. Can be 0 or 1.
  • x (ds-array) – Input distributed array.
  • args (any) – Additional arguments to func.
  • kwargs (any) – Additional named arguments to func.
Returns:

out – The output array. The shape of out is identical to the shape of x, except along the axis dimension. The output ds-array is dense regardless of the type of the input array.

Return type:

ds-array

Examples

>>> import dislib as ds
>>> import numpy as np
>>>
>>>
>>> if __name__ == '__main__':
>>>     x = ds.random_array((100, 100), block_size=(25, 25))
>>>     mean = ds.apply_along_axis(np.mean, 0, x)
>>>     print(mean.collect())

dislib.kron(a, b, block_size=None)[source]

Kronecker product of two ds-arrays.

Parameters:
  • a, b (ds-arrays) – Input ds-arrays.
  • block_size (tuple of two ints, optional) – Block size of the resulting array. Defaults to the block size of b.
Returns:

out

Return type:

ds-array

Raises:

NotImplementedError – If a or b is sparse.
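
For reference, the Kronecker product replaces every entry a[i][j] with the block a[i][j] * b, so a p x q input and an r x s input give a (p*r) x (q*s) result. The plain-Python sketch below (a hypothetical helper on nested lists, not dislib code and not distributed) computes it directly from that definition.

```python
def kron_sketch(a, b):
    """Kronecker product of two nested-list matrices."""
    rows_b, cols_b = len(b), len(b[0])
    return [
        [
            a[i // rows_b][j // cols_b] * b[i % rows_b][j % cols_b]
            for j in range(len(a[0]) * cols_b)
        ]
        for i in range(len(a) * rows_b)
    ]


a = [[1, 2], [3, 4]]
b = [[0, 1], [1, 0]]
for row in kron_sketch(a, b):
    print(row)
```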

dislib.svd(a, compute_uv=True, sort=True, copy=True, eps=1e-09)[source]

Performs singular value decomposition of a ds-array via the one-sided block Jacobi algorithm described in Arbenz and Slapnicar [1] and Dongarra et al. [2].

Singular value decomposition is a factorization of the form A = USV’, where U and V are unitary matrices and S is a rectangular diagonal matrix.

Parameters:
  • a (ds-array, shape=(m, n)) – Input matrix (m >= n). Needs to be partitioned in two column blocks at least due to the design of the block Jacobi algorithm.
  • compute_uv (boolean, optional (default=True)) – Whether or not to compute u and v in addition to s.
  • sort (boolean, optional (default=True)) – Whether to return sorted u, s and v. Sorting requires a significant amount of additional computation.
  • copy (boolean, optional (default=True)) – Whether to create a copy of a or to apply transformations on a directly. Only valid if a is regular (i.e., top left block is of regular shape).
  • eps (float, optional (default=1e-9)) – Tolerance for the convergence criterion.
Returns:

  • u (ds-array, shape=(m, n)) – U matrix. Only returned if compute_uv is True.
  • s (ds-array, shape=(1, n)) – Diagonal entries of S.
  • v (ds-array, shape=(n, n)) – V matrix. Only returned if compute_uv is True.

Raises:

ValueError – If a has fewer than 2 column blocks or m < n.

References

[1] Arbenz, P. and Slapnicar, A. (1995). An Analysis of Parallel Implementations of the Block-Jacobi Algorithm for Computing the SVD. In Proceedings of the 17th International Conference on Information Technology Interfaces ITI (pp. 13-16).
[2] Dongarra, J., Gates, M., Haidar, A. et al. (2018). The singular value decomposition: Anatomy of optimizing an algorithm for extreme scale. In SIAM Review, 60(4) (pp. 808-865).

Examples

>>> import dislib as ds
>>> import numpy as np
>>>
>>>
>>> if __name__ == '__main__':
>>>     x = ds.random_array((10, 6), (2, 2), random_state=7)
>>>     u, s, v = ds.svd(x)
>>>     u = u.collect()
>>>     s = np.diag(s.collect())
>>>     v = v.collect()
>>>     print(np.allclose(x.collect(), u @ s @ v.T))
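
To give an intuition for the Jacobi approach, here is a minimal pure-Python sketch of the plain one-sided Jacobi iteration on individual columns. It is not dislib's implementation: dislib works on column blocks and runs distributed, whereas this version is sequential, operates on nested lists, and does not sort the result. The function name and the max_sweeps parameter are hypothetical.

```python
import math


def one_sided_jacobi(a, eps=1e-9, max_sweeps=50):
    """Return (u, s, v) with a = u * diag(s) * v^T for an m x n nested list."""
    m, n = len(a), len(a[0])
    u = [row[:] for row in a]  # working copy; its columns converge to U * S
    v = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(max_sweeps):
        rotated = False
        for p in range(n - 1):
            for q in range(p + 1, n):
                alpha = sum(u[k][p] ** 2 for k in range(m))
                beta = sum(u[k][q] ** 2 for k in range(m))
                gamma = sum(u[k][p] * u[k][q] for k in range(m))
                # Columns p and q are already orthogonal within tolerance.
                if abs(gamma) <= eps * math.sqrt(alpha * beta):
                    continue
                rotated = True
                # Rotation angle that makes columns p and q orthogonal.
                zeta = (beta - alpha) / (2.0 * gamma)
                t = math.copysign(1.0, zeta) / (
                    abs(zeta) + math.sqrt(1.0 + zeta * zeta)
                )
                c = 1.0 / math.sqrt(1.0 + t * t)
                s = c * t
                # Apply the same rotation to the working copy and to V.
                for mat in (u, v):
                    for row in mat:
                        row[p], row[q] = (
                            c * row[p] - s * row[q],
                            s * row[p] + c * row[q],
                        )
        if not rotated:
            break
    # Singular values are the column norms; normalising the columns gives U.
    sing = [math.sqrt(sum(u[k][j] ** 2 for k in range(m))) for j in range(n)]
    for row in u:
        for j in range(n):
            if sing[j] > 0.0:
                row[j] /= sing[j]
    return u, sing, v


a = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
u, sing, v = one_sided_jacobi(a)
# Rebuild the input from the factors and report the largest deviation.
rebuilt = [
    [sum(u[i][k] * sing[k] * v[j][k] for k in range(2)) for j in range(2)]
    for i in range(3)
]
print(max(abs(rebuilt[i][j] - a[i][j]) for i in range(3) for j in range(2)))
```

Each rotation only touches two columns, which is what makes the method attractive for a block-partitioned, parallel setting such as the one described in the references.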