dislib package

dislib.apply_along_axis(func, axis, x, *args, **kwargs)[source]

Apply a function to slices along the given axis.

Execute func(a, *args, **kwargs) where func operates on nd-arrays and a is a slice of x along axis. The size of the slices is determined by the block shape of x.
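As a rough illustration of how the block shape drives the slicing, the NumPy-only sketch below models the axis=0 case on a plain array: it cuts the array into vertical strips one block wide and applies func to each strip. The helper name apply_by_column_blocks is hypothetical, and this is a model of the behaviour under that assumption, not dislib's implementation.

```python
import numpy as np


def apply_by_column_blocks(func, x, block_cols, *args, **kwargs):
    """Apply func along axis 0 to vertical strips of width block_cols."""
    pieces = []
    for start in range(0, x.shape[1], block_cols):
        # Each strip spans all rows and at most block_cols columns.
        strip = x[:, start:start + block_cols]
        pieces.append(np.atleast_1d(func(strip, *args, axis=0, **kwargs)))
    # Stitch the per-strip results back together as a single row.
    return np.concatenate(pieces).reshape(1, -1)


rng = np.random.RandomState(0)
data = rng.random_sample((100, 100))
col_means = apply_by_column_blocks(np.mean, data, block_cols=25)
```

Because np.mean reduces each strip independently, the stitched result matches the column means of the whole array.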

func must meet the following conditions:

  • Take an nd-array as argument

  • Accept axis as a keyword argument

  • Return an array-like structure

Parameters
  • func (function) – This function should accept nd-arrays and an axis argument. It is applied to slices of x along the specified axis.

  • axis (integer) – Axis along which x is sliced. Can be 0 or 1.

  • x (ds-array) – Input distributed array.

  • args (any) – Additional arguments to func.

  • kwargs (any) – Additional named arguments to func.

Returns

out – The output array. The shape of out is identical to the shape of x, except along the axis dimension. The output ds-array is dense regardless of the type of the input array.

Return type

ds-array

Examples

>>> import dislib as ds
>>> import numpy as np
>>>
>>>
>>> if __name__ == '__main__':
>>>     x = ds.random_array((100, 100), block_size=(25, 25))
>>>     mean = ds.apply_along_axis(np.mean, 0, x)
>>>     print(mean.collect())

dislib.array(x, block_size)[source]

Loads data into a Distributed Array.

Parameters
  • x (spmatrix or array-like, shape=(n_samples, n_features)) – Array of samples.

  • block_size ((int, int)) – Block sizes in number of samples.

Returns

dsarray – A distributed representation of the data divided in blocks.

Return type

ds-array
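To make the effect of block_size concrete, here is a small NumPy-only sketch of the partitioning arithmetic. The helper split_into_blocks is hypothetical and says nothing about how dislib stores blocks internally; it only shows how many blocks a given shape and block size produce, and that edge blocks can be smaller.

```python
import math

import numpy as np


def split_into_blocks(x, block_size):
    """Cut a 2-D array into a nested list of blocks of at most block_size."""
    rows, cols = block_size
    return [[x[i:i + rows, j:j + cols] for j in range(0, x.shape[1], cols)]
            for i in range(0, x.shape[0], rows)]


samples = np.arange(10 * 4).reshape(10, 4)
blocks = split_into_blocks(samples, (3, 2))

# The block grid is the ceiling of shape / block_size in each dimension.
n_row_blocks = math.ceil(samples.shape[0] / 3)
n_col_blocks = math.ceil(samples.shape[1] / 2)
```

With 10 rows and a block height of 3, the last row of blocks holds a single sample row.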

dislib.concat_columns(a: Array, b: Array)[source]

Matrix concatenation by columns.

Parameters
  • a (ds-array) – First matrix.

  • b (ds-array) – Second matrix.

Returns

out – The output array.

Return type

ds-array

Raises
  • NotImplementedError – If _top_left shape does not match _reg_shape. This case will be implemented in the future.

  • ValueError – If the arrays do not match in the number of rows.
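The row-count requirement can be pictured with plain NumPy arrays. The sketch below uses np.hstack as a stand-in for column-wise concatenation; it is an analogy on local arrays, not a call into dislib.

```python
import numpy as np

left = np.ones((8, 4))
right = np.zeros((8, 3))

# Same number of rows: the columns are placed side by side.
joined = np.hstack((left, right))

# Different number of rows: NumPy rejects the concatenation.
try:
    np.hstack((left, np.zeros((6, 3))))
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True
```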

Examples

>>> import dislib as ds
>>>
>>>
>>> if __name__ == "__main__":
>>>     x = ds.random_array((8, 4), block_size=(2, 2))
>>>     y = ds.random_array((8, 4), block_size=(2, 2))
>>>     result = ds.concat_columns(x, y)
>>>     print(result.collect())

dislib.concat_rows(a, b)[source]

Matrix concatenation by rows.

Parameters
  • a (ds-array) – First matrix.

  • b (ds-array) – Second matrix.

Returns

out – The output array.

Return type

ds-array

Raises

  • ValueError – If the arrays do not match in the number of columns.

  • ValueError – If the block size is different between the arrays.

Examples

>>> import dislib as ds
>>>
>>>
>>> if __name__ == "__main__":
>>>     x = ds.random_array((8, 4), block_size=(2, 2))
>>>     y = ds.random_array((8, 4), block_size=(2, 2))
>>>     result = ds.concat_rows(x, y)
>>>     print(result.collect())

dislib.eye(n, m, block_size, dtype=None)[source]

Returns a matrix filled with ones on the diagonal and zeros elsewhere.

Parameters
  • n (int) – number of rows.

  • m (int) – number of columns.

  • block_size (tuple of two ints) – Block size.

  • dtype (data type, optional (default=None)) – The desired type of the ds-array. Defaults to float.

Returns

x – Identity matrix of shape n x m.

Return type

ds-array

Raises

ValueError – If block_size is greater than n.
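For a non-square shape, "ones on the diagonal" means the main diagonal only. The NumPy sketch below shows the same layout on a local array; np.eye is used purely as an analogy and is not part of dislib.

```python
import numpy as np

# 3 rows, 5 columns: ones at positions (0, 0), (1, 1) and (2, 2).
rect_identity = np.eye(3, 5)
main_diagonal = np.diag(rect_identity)
```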

dislib.full(shape, block_size, fill_value, dtype=None)[source]

Returns a ds-array of ‘shape’ filled with ‘fill_value’.

Parameters
  • shape (tuple of two ints) – Shape of the output ds-array.

  • block_size (tuple of two ints) – Size of the ds-array blocks.

  • fill_value (scalar) – Fill value.

  • dtype (data type, optional (default=None)) – The desired type of the array. Defaults to float.

Returns

x – Distributed array filled with the fill value.

Return type

ds-array

dislib.identity(n, block_size, dtype=None)[source]

Returns the identity matrix.

Parameters
  • n (int) – Size of the matrix.

  • block_size (tuple of two ints) – Block size.

  • dtype (data type, optional (default=None)) – The desired type of the ds-array. Defaults to float.

Returns

x – Identity matrix of shape n x n.

Return type

ds-array

Raises

ValueError – If block_size is greater than n.

dislib.kron(a, b, block_size=None)[source]

Kronecker product of two ds-arrays.

Parameters
  • a, b (ds-arrays) – Input ds-arrays.

  • block_size (tuple of two ints, optional) – Block size of the resulting array. Defaults to the block size of b.

Returns

out

Return type

ds-array

Raises

NotImplementedError – If a or b are sparse.
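As a reminder of what the Kronecker product computes, the sketch below uses np.kron on small local arrays. It illustrates the mathematical result only (each entry of a scales a full copy of b) and does not involve block sizes or dislib itself.

```python
import numpy as np

a = np.array([[1, 2],
              [3, 4]])
b = np.ones((2, 3))

# The result has shape (rows_a * rows_b, cols_a * cols_b).
product = np.kron(a, b)

# The top-left tile is a[0, 0] * b, the bottom-right tile is a[1, 1] * b.
top_left = product[:2, :3]
bottom_right = product[2:, 3:]
```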

dislib.load_mdcrd_file(path, block_size, n_atoms, copy=False)[source]

Loads a mdcrd trajectory file into a distributed array.

Parameters
  • path (string) – File path.

  • block_size (tuple (int, int)) – Size of the blocks of the array.

  • n_atoms (int) – Number of atoms in the trajectory. Each frame in the mdcrd file has 3*n_atoms float values (corresponding to 3-dimensional position).

  • copy (boolean, default=False) – Send the file to every task, as opposed to reading it once in the master program.

Returns

x – A distributed representation of the data divided in blocks.

Return type

ds-array
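The 3*n_atoms relationship can be checked with a little arithmetic. The sketch below assumes, for illustration only, that each frame becomes one row of the resulting array; the values are made up and no mdcrd file is parsed.

```python
import numpy as np

n_atoms = 4
values_per_frame = 3 * n_atoms  # x, y and z for every atom

# Two made-up frames stored as one flat run of floats.
flat = np.arange(2 * values_per_frame, dtype=float)

# Assumption: one row per frame, 3 * n_atoms columns.
frames = flat.reshape(-1, values_per_frame)
first_atom_xyz = frames[0, :3]
```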

dislib.load_npy_file(path, block_size)[source]

Loads a file in npy format (must be 2-dimensional).

Parameters
  • path (str) – Path to the npy file.

  • block_size (tuple (int, int)) – Block size of the resulting ds-array.

Returns

x

Return type

ds-array

dislib.load_svmlight_file(path, block_size, n_features, store_sparse)[source]

Loads a SVMLight file into a distributed array.

Parameters
  • path (string) – File path.

  • block_size (tuple (int, int)) – Size of the blocks for the output ds-array.

  • n_features (int) – Number of features.

  • store_sparse (boolean) – Whether to use scipy.sparse data structures to store data. If False, numpy.array is used instead.

Returns

x, y – A distributed representation (ds-array) of the samples X and another of the labels y.

Return type

(ds-array, ds-array)
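For readers unfamiliar with the format, the sketch below parses a couple of SVMLight-style lines into a dense feature matrix and a label column. The helper parse_svmlight_lines is hypothetical, assumes 1-based feature indices, and ignores comments and qid fields; it is not how dislib reads the file.

```python
import numpy as np


def parse_svmlight_lines(lines, n_features):
    """Parse 'label index:value ...' lines into dense (X, y) arrays."""
    features = np.zeros((len(lines), n_features))
    labels = []
    for row, line in enumerate(lines):
        label, *pairs = line.split()
        labels.append(float(label))
        for pair in pairs:
            index, value = pair.split(":")
            # Assumption: feature indices in the file start at 1.
            features[row, int(index) - 1] = float(value)
    return features, np.array(labels).reshape(-1, 1)


text = ["1 1:0.5 3:2.0", "-1 2:1.5"]
features, targets = parse_svmlight_lines(text, n_features=3)
```

Features that do not appear on a line stay at zero, which is why a sparse representation is often a good fit for this format.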

dislib.load_txt_file(path, block_size, discard_first_row=False, col_of_index=False, delimiter=',')[source]

Loads a text file into a distributed array.

Parameters
  • path (string) – File path.

  • block_size (tuple (int, int)) – Size of the blocks of the array.

  • discard_first_row (bool) – Boolean that indicates if the first row should be discarded.

  • col_of_index (bool) – Boolean that indicates if the first column is a column of indexes and therefore it should be discarded.

  • delimiter (string, optional (default=",")) – String that separates columns in the file.

Returns

x – A distributed representation of the data divided in blocks.

Return type

ds-array
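The effect of discard_first_row and col_of_index can be mimicked on a local array with np.loadtxt. This is only an analogy for what the two flags drop, using an in-memory string instead of a file path; it is not dislib's loader.

```python
import io

import numpy as np

raw = "id,a,b\n0,1.0,2.0\n1,3.0,4.0\n"

# Skipping one row plays the role of discard_first_row=True.
table = np.loadtxt(io.StringIO(raw), delimiter=",", skiprows=1)

# Dropping the first column plays the role of col_of_index=True.
table = table[:, 1:]
```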

dislib.matadd(a: Array, b: Array)[source]

Addition of two matrices.

Parameters
  • a (ds-array) – First matrix.

  • b (ds-array) – Second matrix.

Returns

out – The output array.

Return type

ds-array

Raises
  • NotImplementedError – If _top_left shape does not match _reg_shape. This case will be implemented in the future.

  • ValueError – If any of the block sizes does not match.

  • ValueError – If the ds-arrays have different shape.

Examples

>>> import dislib as ds
>>>
>>>
>>> if __name__ == "__main__":
>>>     x = ds.random_array((8, 4), block_size=(2, 2))
>>>     y = ds.random_array((8, 4), block_size=(2, 2))
>>>     result = ds.matadd(x, y)
>>>     print(result.collect())

dislib.matmul(a: Array, b: Array, transpose_a=False, transpose_b=False)[source]

Matrix multiplication with a possible transpose of the input.

Parameters
  • a (ds-array) – First matrix.

  • b (ds-array) – Second matrix.

  • transpose_a (bool) – Whether to transpose the first matrix before multiplication.

  • transpose_b (bool) – Whether to transpose the second matrix before multiplication.

Returns

out – The output array.

Return type

ds-array

Raises
  • NotImplementedError – If _top_left shape does not match _reg_shape. This case will be implemented in the future.

  • ValueError – If any of the block sizes does not match.
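When both transpose flags are set, the shapes have to agree after transposition. The NumPy sketch below checks this for an (8, 4) and a (5, 8) matrix on local arrays; it only illustrates the shape rule and is not a call into dislib.

```python
import numpy as np

x = np.ones((8, 4))
y = np.ones((5, 8))

# Transposed shapes are (4, 8) and (8, 5), so the inner dimensions match.
product = x.T @ y.T
```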

Examples

>>> import dislib as ds
>>>
>>>
>>> if __name__ == "__main__":
>>>     x = ds.random_array((8, 4), block_size=(2, 2))
>>>     y = ds.random_array((5, 8), block_size=(2, 2))
>>>     result = ds.matmul(x, y, transpose_a=True, transpose_b=True)
>>>     print(result.collect())

dislib.matsubtract(a: Array, b: Array)[source]

Subtraction of two matrices.

Parameters
  • a (ds-array) – First matrix.

  • b (ds-array) – Second matrix.

Returns

out – The output array.

Return type

ds-array

Raises
  • NotImplementedError – If _top_left shape does not match _reg_shape. This case will be implemented in the future.

  • ValueError – If any of the block sizes does not match.

  • ValueError – If the ds-arrays have different shape.

Examples

>>> import dislib as ds
>>>
>>>
>>> if __name__ == "__main__":
>>>     x = ds.random_array((8, 4), block_size=(2, 2))
>>>     y = ds.random_array((8, 4), block_size=(2, 2))
>>>     result = ds.matsubtract(x, y)
>>>     print(result.collect())

dislib.random_array(shape, block_size, random_state=None)[source]

Returns a distributed array of random floats in the half-open interval [0.0, 1.0). Values are drawn from the continuous uniform distribution over that interval.

Parameters
  • shape (tuple of two ints) – Shape of the output ds-array.

  • block_size (tuple of two ints) – Size of the ds-array blocks.

  • random_state (int or RandomState, optional (default=None)) – Seed or numpy.random.RandomState instance to generate the random numbers.

Returns

x – Distributed array of random floats.

Return type

ds-array
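The interval and the role of a seed can be seen with NumPy's own RandomState on a local array. Whether dislib produces the same numbers for a given seed is not stated here, so the sketch makes no claim about that; it only shows the half-open range and seeded reproducibility in NumPy.

```python
import numpy as np

values = np.random.RandomState(7).random_sample((4, 3))

# A second generator with the same seed yields the same values.
again = np.random.RandomState(7).random_sample((4, 3))

lowest, highest = values.min(), values.max()
```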

dislib.save_txt(arr, dir, merge_rows=False)[source]

Save a ds-array by blocks to a directory in txt format.

Parameters
  • arr (ds-array) – Array data to be saved.

  • dir (str) – Directory into which the data is saved.

  • merge_rows (boolean, default=False) – Merge blocks along rows before saving.

dislib.svd(a, compute_uv=True, sort=True, copy=True, eps=1e-09)[source]

Performs singular value decomposition of a ds-array via the one-sided block Jacobi algorithm described in Arbenz and Slapnicar [1] and Dongarra et al. [2].

Singular value decomposition is a factorization of the form A = USV’, where U and V are unitary matrices and S is a rectangular diagonal matrix.

Parameters
  • a (ds-array, shape=(m, n)) – Input matrix (m >= n). Must be partitioned into at least two column blocks, as required by the block Jacobi algorithm.

  • compute_uv (boolean, optional (default=True)) – Whether or not to compute u and v in addition to s.

  • sort (boolean, optional (default=True)) – Whether to return sorted u, s and v. Sorting requires a significant amount of additional computation.

  • copy (boolean, optional (default=True)) – Whether to create a copy of a or to apply transformations on a directly. Only valid if a is regular (i.e., top left block is of regular shape).

  • eps (float, optional (default=1e-9)) – Tolerance for the convergence criterion.

Returns

  • u (ds-array, shape=(m, n)) – U matrix. Only returned if compute_uv is True.

  • s (ds-array, shape=(1, n)) – Diagonal entries of S.

  • v (ds-array, shape=(n, n)) – V matrix. Only returned if compute_uv is True.

Raises

ValueError – If a has fewer than 2 column blocks or m < n.
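To give a feel for the one-sided Jacobi idea, the sketch below implements the plain column-by-column (Hestenes) variant on a local NumPy array. It is deliberately not the block algorithm and not dislib's implementation; the function name one_sided_jacobi_svd and the max_sweeps parameter are hypothetical, and the code assumes a full-rank input with m >= n.

```python
import numpy as np


def one_sided_jacobi_svd(a, eps=1e-9, max_sweeps=60):
    """Non-block one-sided Jacobi SVD; assumes a is full rank with m >= n."""
    u = np.array(a, dtype=float)
    n = u.shape[1]
    v = np.eye(n)
    for _ in range(max_sweeps):
        rotated = False
        for i in range(n - 1):
            for j in range(i + 1, n):
                alpha = u[:, i] @ u[:, i]
                beta = u[:, j] @ u[:, j]
                gamma = u[:, i] @ u[:, j]
                # Columns i and j already count as orthogonal.
                if abs(gamma) <= eps * np.sqrt(alpha * beta):
                    continue
                rotated = True
                # Rotation angle that makes the two columns orthogonal.
                zeta = (beta - alpha) / (2.0 * gamma)
                if zeta == 0.0:
                    t = 1.0
                else:
                    t = np.sign(zeta) / (abs(zeta) + np.sqrt(1.0 + zeta ** 2))
                cs = 1.0 / np.sqrt(1.0 + t * t)
                sn = cs * t
                # Apply the same rotation to the columns of u and of v.
                for mat in (u, v):
                    col_i = mat[:, i].copy()
                    mat[:, i] = cs * col_i - sn * mat[:, j]
                    mat[:, j] = sn * col_i + cs * mat[:, j]
        if not rotated:
            break
    # Column norms are the singular values (unsorted).
    s = np.linalg.norm(u, axis=0)
    return u / s, s, v


a = np.random.RandomState(7).random_sample((10, 6))
u, s, v = one_sided_jacobi_svd(a)
```

The rotations keep the product of u and the transpose of v equal to the input throughout, so once all column pairs are orthogonal the factors reconstruct the matrix as u @ diag(s) @ v.T.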

References

[1] Arbenz, P. and Slapnicar, A. (1995). An Analysis of Parallel Implementations of the Block-Jacobi Algorithm for Computing the SVD. In Proceedings of the 17th International Conference on Information Technology Interfaces ITI (pp. 13-16).

[2] Dongarra, J., Gates, M., Haidar, A. et al. (2018). The singular value decomposition: Anatomy of optimizing an algorithm for extreme scale. In SIAM Review, 60(4) (pp. 808-865).

Examples

>>> import dislib as ds
>>> import numpy as np
>>>
>>>
>>> if __name__ == '__main__':
>>>     x = ds.random_array((10, 6), (2, 2), random_state=7)
>>>     u, s, v = ds.svd(x)
>>>     u = u.collect()
>>>     s = np.diag(s.collect())
>>>     v = v.collect()
>>>     print(np.allclose(x.collect(), u @ s @ v.T))

dislib.zeros(shape, block_size, dtype=None)[source]

Returns a ds-array of given shape and block size, filled with zeros.

Parameters
  • shape (tuple of two ints) – Shape of the output ds-array.

  • block_size (tuple of two ints) – Size of the ds-array blocks.

  • dtype (data type, optional (default=None)) – The desired type of the array. Defaults to float.

Returns

x – Distributed array filled with zeros.

Return type

ds-array