dislib.array

class dislib.data.array.Array(blocks, top_left_shape, reg_shape, shape, sparse, delete=True)[source]

Bases: object

A distributed 2-dimensional array divided in blocks.

Normally, this class should not be instantiated directly, but created using one of the array creation routines provided.

Apart from the different methods provided, this class also supports the following types of indexing:

  • A[i] : returns a single row

  • A[i, j] : returns a single element

  • A[i:j] : returns a set of rows (with i and j optional)

  • A[:, i:j] : returns a set of columns (with i and j optional)

  • A[[i,j,k]] : returns a set of non-consecutive rows. Rows are returned ordered by their index in the input array.

  • A[:, [i,j,k]] : returns a set of non-consecutive columns. Columns are returned ordered by their index in the input array.

  • A[i:j, k:m] : returns a set of elements (with i, j, k, and m optional)
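As a rough illustration of the slice-based forms listed above, the sketch below applies the same selections to a plain nested list. It is a stand-in for intuition only: `rows` and `cols` are hypothetical helpers that are not part of dislib, and a real ds-array returns ds-arrays rather than lists.

```python
# Plain-Python stand-in for a 3x4 array; this is not a ds-array.
A = [[0, 1, 2, 3],
     [4, 5, 6, 7],
     [8, 9, 10, 11]]


def rows(a, i=None, j=None):
    """Analogue of A[i:j]: a set of consecutive rows (i and j optional)."""
    return a[i:j]


def cols(a, i=None, j=None):
    """Analogue of A[:, i:j]: a set of consecutive columns (i and j optional)."""
    return [row[i:j] for row in a]


single_row = A[1]                       # analogue of A[i]
single_element = A[1][2]                # analogue of A[i, j]
sub_block = cols(rows(A, 0, 2), 1, 3)   # analogue of A[i:j, k:m]
```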

Parameters
  • blocks (list) – List of lists of nd-array or spmatrix.

  • top_left_shape (tuple) – A single tuple indicating the shape of the top-left block.

  • reg_shape (tuple) – A single tuple indicating the shape of the regular block.

  • shape (tuple (int, int)) – Shape of the array as (number of rows, number of columns).

  • sparse (boolean, optional (default=False)) – Whether this array stores sparse data.

  • delete (boolean, optional (default=True)) – Whether to call compss_delete_object on the blocks when the garbage collector deletes this ds-array.

Variables

shape (tuple (int, int)) – Shape of the array as (number of rows, number of columns).

property T

Returns the transpose of this ds-array

collect(squeeze=True)[source]

Collects the contents of this ds-array and returns the equivalent in-memory array that this ds-array represents. This method creates a synchronization point in the execution of the application.

Warning: This method may fail if the ds-array does not fit in memory.

Parameters

squeeze (boolean, optional (default=True)) – Whether to remove single-dimensional entries from the shape of the resulting ndarray.

Returns

array – The actual contents of the ds-array.

Return type

nd-array or spmatrix

conj()[source]

Returns the complex conjugate, element-wise.

Returns

x

Return type

ds-array

copy()[source]

Creates a copy of this ds-array.

Returns

x_copy

Return type

ds-array

max(axis=0)[source]

Returns the maximum along the given axis.

Parameters

axis (int, optional (default=0))

Returns

max – Maximum along axis.

Return type

ds-array

mean(axis=0)[source]

Returns the mean along the given axis.

Parameters

axis (int, optional (default=0))

Returns

mean – Mean along axis.

Return type

ds-array

median(axis=0)[source]

Returns the median along the given axis.

Parameters

axis (int, optional (default=0))

Returns

median – Median along axis.

Return type

ds-array

Raises

NotImplementedError – If the ds-array is sparse.

min(axis=0)[source]

Returns the minimum along the given axis.

Parameters

axis (int, optional (default=0))

Returns

min – Minimum along axis.

Return type

ds-array

norm(axis=0)[source]

Returns the Frobenius norm along an axis.

Parameters

axis (int, optional (default=0)) – Specifies the axis of the array along which to compute the vector norms.

Returns

norm – Norm along axis.

Return type

ds-array

Raises

NotImplementedError – If the ds-array is sparse.

rechunk(block_size)[source]

Re-partitions the ds-array into blocks of the given block size.

Parameters

block_size (tuple of two ints) – The desired block size.

Returns

x – Re-partitioned ds-array.

Return type

ds-array
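To make the effect of re-partitioning concrete, the following sketch re-blocks a small matrix held as nested Python lists. It only illustrates how the block grid changes; `merge_blocks`, `split_blocks`, and `rechunk_lists` are hypothetical helpers, and merging everything into one in-memory matrix is an assumption made for brevity, not a description of how dislib performs the operation.

```python
def merge_blocks(blocks):
    """Join a grid of blocks (list of lists of 2-D lists) into one matrix."""
    matrix = []
    for block_row in blocks:
        # Rows at the same height in neighbouring blocks form one full row.
        for parts in zip(*block_row):
            matrix.append([value for part in parts for value in part])
    return matrix


def split_blocks(matrix, block_size):
    """Cut a matrix into blocks of block_size; edge blocks may be smaller."""
    n_rows, n_cols = block_size
    return [
        [[row[c:c + n_cols] for row in matrix[r:r + n_rows]]
         for c in range(0, len(matrix[0]), n_cols)]
        for r in range(0, len(matrix), n_rows)
    ]


def rechunk_lists(blocks, block_size):
    """Re-partition a block grid by merging it and splitting it again."""
    return split_blocks(merge_blocks(blocks), block_size)


matrix = [[r * 4 + c for c in range(4)] for r in range(4)]
blocks_2x2 = split_blocks(matrix, (2, 2))      # 2 x 2 grid of 2x2 blocks
blocks_3x3 = rechunk_lists(blocks_2x2, (3, 3))  # edge blocks are smaller
```

The contents are unchanged by the operation; only the block boundaries move, which is why merging either grid gives back the same matrix.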

replace_block(i, j, new_block)[source]

property shape

Total shape of the ds-array

sqrt()[source]

Returns the element-wise square root of the elements in the ds-array

Returns

x

Return type

ds-array

sum(axis=0)[source]

Returns the sum along the given axis.

Parameters

axis (int, optional (default=0))

Returns

sum – Sum along axis.

Return type

ds-array
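The axis argument of the reductions in this class (sum, mean, max, min, and so on) is assumed here to follow the NumPy convention: axis=0 reduces over rows and yields one value per column, while axis=1 yields one value per row. The sketch below shows that convention on nested lists; `reduce_along_axis` is a hypothetical helper and returns a flat list, whereas the real methods return a ds-array.

```python
def reduce_along_axis(matrix, func, axis=0):
    """Apply func to every column (axis=0) or every row (axis=1)."""
    if axis not in (0, 1):
        raise ValueError("axis must be 0 or 1")
    vectors = zip(*matrix) if axis == 0 else matrix
    return [func(list(vec)) for vec in vectors]


data = [[1, 2, 3],
        [4, 5, 6]]
col_sums = reduce_along_axis(data, sum, axis=0)  # one value per column
row_sums = reduce_along_axis(data, sum, axis=1)  # one value per row
col_max = reduce_along_axis(data, max, axis=0)
```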

transpose(mode='rows')[source]

Returns the transpose of the ds-array following the method indicated by mode. ‘All’ uses a single task to transpose all the blocks (slow with high number of blocks). ‘rows’ and ‘columns’ transpose each block of rows or columns independently (i.e. a task per row/col block).

Parameters

mode (string, optional (default=rows)) – Transposition method to use, as described above.

Returns

dsarray – A transposed ds-array.

Return type

ds-array
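Every transposition mode relies on the same identity: the transpose of a blocked matrix is obtained by swapping the grid positions of the blocks and transposing each block. The sketch below checks that identity on nested lists. `transpose_block` and `transpose_blocked` are hypothetical names, and the code says nothing about how dislib distributes the work into tasks.

```python
def transpose_block(block):
    """Transpose one 2-D block stored as a list of rows."""
    return [list(col) for col in zip(*block)]


def transpose_blocked(blocks):
    """Transpose a block grid: swap grid positions, then transpose each block."""
    n_block_rows, n_block_cols = len(blocks), len(blocks[0])
    return [
        [transpose_block(blocks[i][j]) for i in range(n_block_rows)]
        for j in range(n_block_cols)
    ]


# The 2x3 matrix [[1, 2, 3], [4, 5, 6]] stored as a 1 x 2 grid of blocks.
grid = [[[[1, 2], [4, 5]], [[3], [6]]]]
transposed = transpose_blocked(grid)  # 2 x 1 grid holding the 3x2 transpose
```

Each block row of the output depends on exactly one block column of the input, which is consistent with a scheme that processes block rows or block columns independently.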

Array creation routines

dislib.array(x, block_size)[source]

Loads data into a Distributed Array.

Parameters
  • x (spmatrix or array-like, shape=(n_samples, n_features)) – Array of samples.

  • block_size ((int, int)) – Block sizes in number of samples.

Returns

dsarray – A distributed representation of the data divided in blocks.

Return type

ds-array

dislib.random_array(shape, block_size, random_state=None)[source]

Returns a distributed array of random floats in the half-open interval [0.0, 1.0). Values are from the “continuous uniform” distribution over the stated interval.

Parameters
  • shape (tuple of two ints) – Shape of the output ds-array.

  • block_size (tuple of two ints) – Size of the ds-array blocks.

  • random_state (int or RandomState, optional (default=None)) – Seed or numpy.random.RandomState instance to generate the random numbers.

Returns

x – Distributed array of random floats.

Return type

ds-array

dislib.zeros(shape, block_size, dtype=None)[source]

Returns a ds-array of given shape and block size, filled with zeros.

Parameters
  • shape (tuple of two ints) – Shape of the output ds-array.

  • block_size (tuple of two ints) – Size of the ds-array blocks.

  • dtype (data type, optional (default=None)) – The desired type of the array. Defaults to float.

Returns

x – Distributed array filled with zeros.

Return type

ds-array

dislib.full(shape, block_size, fill_value, dtype=None)[source]

Returns a ds-array of ‘shape’ filled with ‘fill_value’.

Parameters
  • shape (tuple of two ints) – Shape of the output ds-array.

  • block_size (tuple of two ints) – Size of the ds-array blocks.

  • fill_value (scalar) – Fill value.

  • dtype (data type, optional (default=None)) – The desired type of the array. Defaults to float.

Returns

x – Distributed array filled with the fill value.

Return type

ds-array

dislib.eye(n, m, block_size, dtype=None)[source]

Returns a matrix filled with ones on the diagonal and zeros elsewhere.

Parameters
  • n (int) – number of rows.

  • m (int) – number of columns.

  • block_size (tuple of two ints) – Block size.

  • dtype (data type, optional (default=None)) – The desired type of the ds-array. Defaults to float.

Returns

x – Identity matrix of shape n x m.

Return type

ds-array

Raises

ValueError – If block_size is greater than n.

dislib.identity(n, block_size, dtype=None)[source]

Returns the identity matrix.

Parameters
  • n (int) – Size of the matrix.

  • block_size (tuple of two ints) – Block size.

  • dtype (data type, optional (default=None)) – The desired type of the ds-array. Defaults to float.

Returns

x – Identity matrix of shape n x n.

Return type

ds-array

Raises

ValueError – If block_size is greater than n.

dislib.matmul(a: Array, b: Array, transpose_a=False, transpose_b=False)[source]

Matrix multiplication with a possible transpose of the input.

Parameters
  • a (ds-array) – First matrix.

  • b (ds-array) – Second matrix.

  • transpose_a (bool, optional (default=False)) – Whether to transpose the first matrix before multiplication.

  • transpose_b (bool, optional (default=False)) – Whether to transpose the second matrix before multiplication.

Returns

out – The output array.

Return type

ds-array

Raises
  • NotImplementedError – If _top_left shape does not match _reg_shape. This case will be implemented in the future.

  • ValueError – If any of the block sizes does not match.
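The transpose flags change which dimensions have to agree. The sketch below works out the result shape from the two input shapes; `matmul_shape` is a hypothetical helper, and it checks only the overall shapes, not the block-size conditions listed under Raises.

```python
def matmul_shape(a_shape, b_shape, transpose_a=False, transpose_b=False):
    """Return the result shape of a @ b after the optional transposes."""
    rows_a, cols_a = a_shape[::-1] if transpose_a else a_shape
    rows_b, cols_b = b_shape[::-1] if transpose_b else b_shape
    if cols_a != rows_b:
        raise ValueError(
            f"cannot multiply {rows_a}x{cols_a} by {rows_b}x{cols_b}")
    return rows_a, cols_b


# An (8, 4) array and a (5, 8) array, both transposed:
# (4, 8) times (8, 5) gives (4, 5).
result_shape = matmul_shape((8, 4), (5, 8), transpose_a=True, transpose_b=True)
```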

Examples

>>> import dislib as ds
>>>
>>>
>>> if __name__ == "__main__":
...     x = ds.random_array((8, 4), block_size=(2, 2))
...     y = ds.random_array((5, 8), block_size=(2, 2))
...     result = ds.matmul(x, y, transpose_a=True, transpose_b=True)
...     print(result.collect())

dislib.concat_columns(a: Array, b: Array)[source]

Matrix concatenation by columns.

Parameters
  • a (ds-array) – First matrix.

  • b (ds-array) – Second matrix.

Returns

out – The output array.

Return type

ds-array

Raises
  • NotImplementedError – If _top_left shape does not match _reg_shape. This case will be implemented in the future.

  • ValueError – If the arrays do not match in the number of rows.

Examples

>>> import dislib as ds
>>>
>>>
>>> if __name__ == "__main__":
...     x = ds.random_array((8, 4), block_size=(2, 2))
...     y = ds.random_array((8, 4), block_size=(2, 2))
...     result = ds.concat_columns(x, y)
...     print(result.collect())

dislib.concat_rows(a, b)[source]

Matrix concatenation by rows.

Parameters
  • a (ds-array) – First matrix.

  • b (ds-array) – Second matrix.

Returns

out – The output array.

Return type

ds-array

Raises

ValueError – If the arrays do not match in the number of columns, or if the block size is different between the arrays.
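The two concatenation routines differ in which dimension must agree. The sketch below uses nested lists and assumes that concatenation by rows stacks the second array below the first, while concatenation by columns places it to the right. `stack_rows` and `stack_columns` are hypothetical stand-ins, and block sizes are not modelled.

```python
def stack_rows(a, b):
    """Stack b below a; both must have the same number of columns."""
    if len(a[0]) != len(b[0]):
        raise ValueError("arrays differ in the number of columns")
    return [list(row) for row in a] + [list(row) for row in b]


def stack_columns(a, b):
    """Place b to the right of a; both must have the same number of rows."""
    if len(a) != len(b):
        raise ValueError("arrays differ in the number of rows")
    return [list(ra) + list(rb) for ra, rb in zip(a, b)]


x = [[1, 2], [3, 4]]
y = [[5, 6], [7, 8]]
stacked = stack_rows(x, y)           # shape (4, 2)
side_by_side = stack_columns(x, y)   # shape (2, 4)
```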

Examples

>>> import dislib as ds
>>>
>>>
>>> if __name__ == "__main__":
...     x = ds.random_array((8, 4), block_size=(2, 2))
...     y = ds.random_array((8, 4), block_size=(2, 2))
...     result = ds.concat_rows(x, y)
...     print(result.collect())

dislib.load_txt_file(path, block_size, discard_first_row=False, col_of_index=False, delimiter=',')[source]

Loads a text file into a distributed array.

Parameters
  • path (string) – File path.

  • block_size (tuple (int, int)) – Size of the blocks of the array.

  • discard_first_row (bool) – Boolean that indicates if the first row should be discarded.

  • col_of_index (bool) – Boolean that indicates if the first column is a column of indexes and therefore it should be discarded.

  • delimiter (string, optional (default=",")) – String that separates columns in the file.

Returns

x – A distributed representation of the data divided in blocks.

Return type

ds-array
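The effect of discard_first_row and col_of_index is easiest to see on a small file with a header line and an index column. The sketch below parses such text with the standard library. `parse_txt` is a hypothetical helper that mirrors the documented options; it does not read from disk or build a ds-array.

```python
import csv
import io


def parse_txt(text, discard_first_row=False, col_of_index=False,
              delimiter=","):
    """Parse delimited text into a list of float rows."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    if discard_first_row:
        rows = rows[1:]                    # drop the header line
    if col_of_index:
        rows = [row[1:] for row in rows]   # drop the leading index column
    return [[float(value) for value in row] for row in rows]


sample = "id,a,b\n0,1.5,2.5\n1,3.5,4.5\n"
data = parse_txt(sample, discard_first_row=True, col_of_index=True)
```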

dislib.load_svmlight_file(path, block_size, n_features, store_sparse)[source]

Loads a SVMLight file into a distributed array.

Parameters
  • path (string) – File path.

  • block_size (tuple (int, int)) – Size of the blocks for the output ds-array.

  • n_features (int) – Number of features.

  • store_sparse (boolean) – Whether to use scipy.sparse data structures to store data. If False, numpy.array is used instead.

Returns

x, y – A distributed representation (ds-array) of the X and y.

Return type

(ds-array, ds-array)
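An SVMLight file stores each sample as a label followed by index:value pairs for the non-zero features only, so the number of columns cannot always be inferred from the file itself. That is a plausible reason for n_features being a required argument. The sketch below parses one such line; `parse_svmlight_line` is a hypothetical helper, it assumes 1-based feature indices, and it ignores optional fields such as qid.

```python
def parse_svmlight_line(line, n_features):
    """Parse one 'label index:value ...' line into (label, dense row)."""
    line = line.split("#", 1)[0]            # strip a trailing comment
    label, *pairs = line.split()
    row = [0.0] * n_features
    for pair in pairs:
        index, value = pair.split(":")
        row[int(index) - 1] = float(value)  # assumes 1-based indices
    return float(label), row


label, features = parse_svmlight_line("1 1:0.5 4:2.0", n_features=4)
```

The labels and the feature rows end up in separate structures, which matches the (x, y) pair returned by the loader.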

dislib.load_npy_file(path, block_size)[source]

Loads a file in npy format (must be 2-dimensional).

Parameters
  • path (str) – Path to the npy file.

  • block_size (tuple (int, int)) – Block size of the resulting ds-array.

Returns

x

Return type

ds-array

dislib.load_mdcrd_file(path, block_size, n_atoms, copy=False)[source]

Loads a mdcrd trajectory file into a distributed array.

Parameters
  • path (string) – File path.

  • block_size (tuple (int, int)) – Size of the blocks of the array.

  • n_atoms (int) – Number of atoms in the trajectory. Each frame in the mdcrd file has 3*n_atoms float values (corresponding to 3-dimensional position).

  • copy (boolean, default=False) – Send the file to every task, as opposed to reading it once in the master program.

Returns

x – A distributed representation of the data divided in blocks.

Return type

ds-array
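Because each frame contributes 3*n_atoms values, the number of atoms fixes how a flat stream of coordinates is cut into frames. The sketch below shows that grouping on a plain list; `split_frames` is a hypothetical helper, and it skips details of the mdcrd format such as the title line or box information.

```python
def split_frames(values, n_atoms):
    """Group a flat coordinate list into frames of 3 * n_atoms floats."""
    frame_len = 3 * n_atoms
    if len(values) % frame_len != 0:
        raise ValueError("value count is not a multiple of 3 * n_atoms")
    return [values[i:i + frame_len] for i in range(0, len(values), frame_len)]


# Two atoms and two frames; each atom contributes x, y, z.
coords = [0.0, 0.1, 0.2, 1.0, 1.1, 1.2,
          0.5, 0.6, 0.7, 1.5, 1.6, 1.7]
frames = split_frames(coords, n_atoms=2)
```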

dislib.data.load_hstack_npy_files(path, cols_per_block=None)[source]

Loads the .npy files in a directory into a ds-array, stacking them horizontally, like (A|B|C). The order of concatenation is alphanumeric.

At least 1 valid .npy file must exist in the directory, and every .npy file must contain a valid array. Every array must have the same dtype, order, and number of rows.

The blocks of the returned ds-array will have the same number of rows as the input arrays, and cols_per_block columns, which defaults to the number of columns of the first array.

Parameters
  • path (string) – Folder path.

  • cols_per_block (int, optional (default=None)) – Number of columns of the blocks for the output ds-array. If None, the number of columns of the first array is used.

Returns

x – A distributed representation (ds-array) of the stacked arrays.

Return type

ds-array

Utility functions

dislib.data.util.compute_bottom_right_shape(a: Array)[source]

Computes the shape of the bottom-right block.

Parameters

a (ds-array) – The ds-array whose bottom-right block shape is computed.

Returns

  • size0 (int) – size of the first dimension

  • size1 (int) – size of the second dimension
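When the array shape is not a multiple of the regular block shape, the last block in each dimension holds the remainder. The sketch below computes that remainder from the two shapes. `bottom_right_shape` is a hypothetical helper, and it assumes the top-left block has the regular shape.

```python
def bottom_right_shape(shape, reg_shape):
    """Shape of the last block when every other block has reg_shape."""
    # A remainder of 0 means the last block is a full regular block.
    size0 = shape[0] % reg_shape[0] or reg_shape[0]
    size1 = shape[1] % reg_shape[1] or reg_shape[1]
    return size0, size1


partial = bottom_right_shape((10, 7), (4, 3))  # 10 = 4 + 4 + 2, 7 = 3 + 3 + 1
full = bottom_right_shape((8, 6), (4, 3))      # both dimensions divide evenly
```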

dislib.data.util.decoder_helper(class_name, obj)[source]

dislib.data.util.encoder_helper(obj)[source]

dislib.data.util.pad(a: Array, pad_width, **kwargs)[source]

Pad array blocks with the desired value.

Parameters
  • a (array_like of rank N) – The array to pad.

  • pad_width (((top, bottom), (left, right))) – Number of values padded to the edges of each axis.

  • constant_value (scalar, optional) – The value to set in the padded rows and columns. Default is 0.
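The nested pad_width tuple reads as ((top, bottom), (left, right)). The sketch below applies such a padding to a nested list to show which edge each number refers to. `pad_lists` is a hypothetical stand-in; it returns a new list, ignores blocks, and may accept paddings that the real routine does not support.

```python
def pad_lists(matrix, pad_width, constant_value=0):
    """Pad a 2-D list with constant_value on each side."""
    (top, bottom), (left, right) = pad_width
    width = len(matrix[0]) + left + right
    padded = [[constant_value] * width for _ in range(top)]
    for row in matrix:
        padded.append([constant_value] * left + list(row)
                      + [constant_value] * right)
    padded.extend([constant_value] * width for _ in range(bottom))
    return padded


# One extra row at the bottom and two extra columns on the right.
padded = pad_lists([[1, 2], [3, 4]], ((0, 1), (0, 2)))
```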

dislib.data.util.pad_last_blocks_with_zeros(a: Array)[source]

Pad array blocks with zeros.

Parameters

a (ds-array) – The array to pad.

dislib.data.util.remove_last_columns(a: Array, n_columns)[source]

Removes last columns from the right-most blocks of the ds-array.

Parameters
  • a (ds-array) – The array to remove columns from.

  • n_columns (int) – The number of columns to remove.

Raises

ValueError – If n_columns >= the width of the right-most blocks.
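The error condition refers to the width of the right-most blocks rather than the width of the whole array, because only those blocks are trimmed. The sketch below shows this on a grid of nested-list blocks. `trim_last_columns` is a hypothetical helper that returns a new grid; whether the real function modifies the ds-array in place is not stated here.

```python
def trim_last_columns(blocks, n_columns):
    """Drop n_columns from the right-most block of every block row."""
    last_width = len(blocks[0][-1][0])
    if n_columns >= last_width:
        raise ValueError("n_columns must be smaller than the last block width")
    keep = last_width - n_columns
    return [
        block_row[:-1] + [[row[:keep] for row in block_row[-1]]]
        for block_row in blocks
    ]


# One block row: a 2x2 block followed by a 2x3 block.
block_grid = [[[[1, 2], [3, 4]], [[5, 6, 7], [8, 9, 10]]]]
trimmed = trim_last_columns(block_grid, 2)
```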

dislib.data.util.remove_last_rows(a: Array, n_rows)[source]

Removes last rows from the bottom blocks of the ds-array.

Parameters
  • a (ds-array) – The array to remove rows from.

  • n_rows (int) – The number of rows to remove.

dislib.data.util.sync_obj(obj)[source]

Recursively synchronizes the Future objects of a list or dictionary by using compss_wait_on(obj).
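The recursion can be pictured with standard-library futures. In the sketch below, `concurrent.futures.Future` and its `result()` method stand in for COMPSs future objects and `compss_wait_on`. `resolve` is a hypothetical helper that returns a new structure; the real function may instead update the object in place.

```python
from concurrent.futures import Future


def resolve(obj):
    """Recursively replace Future objects in lists and dicts by their results."""
    if isinstance(obj, dict):
        return {key: resolve(value) for key, value in obj.items()}
    if isinstance(obj, list):
        return [resolve(item) for item in obj]
    if isinstance(obj, Future):
        return obj.result()   # stand-in for compss_wait_on(obj)
    return obj


done = Future()
done.set_result(42)
nested = {"a": [1, done], "b": {"c": done}}
resolved = resolve(nested)
```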

Other functions

dislib.save_txt(arr, dir, merge_rows=False)[source]

Save a ds-array by blocks to a directory in txt format.

Parameters
  • arr (ds-array) – Array data to be saved.

  • dir (str) – Directory into which the data is saved.

  • merge_rows (boolean, default=False) – Merge blocks along rows before saving.

dislib.data.array.apply_along_axis(func, axis, x, *args, **kwargs)[source]

Apply a function to slices along the given axis.

Execute func(a, *args, **kwargs) where func operates on nd-arrays and a is a slice of x along axis. The size of the slices is determined by the block shape of x.

func must meet the following conditions:

  • Take an nd-array as argument

  • Accept axis as a keyword argument

  • Return an array-like structure

Parameters
  • func (function) – This function should accept nd-arrays and an axis argument. It is applied to slices of x along the specified axis.

  • axis (integer) – Axis along which x is sliced. Can be 0 or 1.

  • x (ds-array) – Input distributed array.

  • args (any) – Additional arguments to func.

  • kwargs (any) – Additional named arguments to func.

Returns

out – The output array. The shape of out is identical to the shape of x, except along the axis dimension. The output ds-array is dense regardless of the type of the input array.

Return type

ds-array

Examples

>>> import dislib as ds
>>> import numpy as np
>>>
>>>
>>> if __name__ == '__main__':
...     x = ds.random_array((100, 100), block_size=(25, 25))
...     mean = ds.apply_along_axis(np.mean, 0, x)
...     print(mean.collect())
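The three conditions on func, and the idea that slices follow the block shape, can be sketched without dislib. The code below handles axis=0 only and cuts a nested-list matrix into column slabs of one block width. `apply_along_axis_0` and `column_means` are hypothetical names, and the slab-by-slab scheme is an assumption about the general approach rather than the library's implementation.

```python
def column_means(slab, axis=0):
    """Toy func: takes a 2-D list, accepts axis, returns a list."""
    vectors = zip(*slab) if axis == 0 else slab
    return [sum(vec) / len(vec) for vec in vectors]


def apply_along_axis_0(func, matrix, block_width, *args, **kwargs):
    """Apply func to column slabs of block_width columns and join the results."""
    out = []
    for start in range(0, len(matrix[0]), block_width):
        # Each slab holds all rows of block_width consecutive columns.
        slab = [row[start:start + block_width] for row in matrix]
        out.extend(func(slab, *args, axis=0, **kwargs))
    return out


matrix = [[1.0, 2.0, 3.0],
          [3.0, 4.0, 5.0]]
means = apply_along_axis_0(column_means, matrix, block_width=2)
```

Because each slab is processed independently, a func that needs the whole column at once (such as a mean) still works for axis=0: a column slab always contains every row.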