dislib.array

class dislib.data.array.Array(blocks, top_left_shape, reg_shape, shape, sparse, delete=True)[source]

Bases: object

A distributed 2-dimensional array divided into blocks.

Normally, this class should not be instantiated directly, but created using one of the array creation routines provided.

Apart from the different methods provided, this class also supports the following types of indexing:

  • A[i] : returns a single row
  • A[i, j] : returns a single element
  • A[i:j] : returns a set of rows (with i and j optional)
  • A[:, i:j] : returns a set of columns (with i and j optional)
  • A[[i,j,k]] : returns a set of non-consecutive rows. Rows are returned ordered by their index in the input array.
  • A[:, [i,j,k]] : returns a set of non-consecutive columns. Columns are returned ordered by their index in the input array.
  • A[i:j, k:m] : returns a set of elements (with i, j, k, and m optional)
Parameters:
  • blocks (list) – List of lists of nd-array or spmatrix.
  • top_left_shape (tuple) – A single tuple indicating the shape of the top-left block.
  • reg_shape (tuple) – A single tuple indicating the shape of the regular block.
  • shape (tuple (int, int)) – Shape of the array as (number of rows, number of columns).
  • sparse (boolean, optional (default=False)) – Whether this array stores sparse data.
  • delete (boolean, optional (default=True)) – Whether to call compss_delete_object on the blocks when the garbage collector deletes this ds-array.
Variables:
  • shape (tuple (int, int)) – Shape of the array as (number of rows, number of columns).
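
As a rough illustration of how top_left_shape, reg_shape, and shape relate, the following pure-Python sketch lists the block sizes along one axis. It assumes that only the first block may differ from the regular size and that the last block holds whatever remains; block_sizes is a hypothetical helper written for this example, not part of dislib.

```python
def block_sizes(total, top_left, regular):
    """Sizes of consecutive blocks along one axis (illustrative only)."""
    sizes = [min(top_left, total)]      # the first block may be irregular
    remaining = total - sizes[0]
    while remaining > 0:
        size = min(regular, remaining)  # the last block holds the remainder
        sizes.append(size)
        remaining -= size
    return sizes


# A 10 x 7 array with regular 4 x 3 blocks and a regular top-left block.
row_sizes = block_sizes(10, top_left=4, regular=4)
col_sizes = block_sizes(7, top_left=3, regular=3)
print(row_sizes, col_sizes)

# The same rows with a smaller top-left block of 2 rows.
print(block_sizes(10, top_left=2, regular=4))
```

Under these assumptions the block sizes along each axis always add up to the corresponding entry of shape.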

T

Returns the transpose of this ds-array.

collect(squeeze=True)[source]

Collects the contents of this ds-array and returns the equivalent in-memory array that this ds-array represents. This method creates a synchronization point in the execution of the application.

Warning: This method may fail if the ds-array does not fit in memory.

Parameters:squeeze (boolean, optional (default=True)) – Whether to remove single-dimensional entries from the shape of the resulting ndarray.
Returns:array – The actual contents of the ds-array.
Return type:nd-array or spmatrix
conj()[source]

Returns the complex conjugate, element-wise.

Returns:x
Return type:ds-array
copy()[source]

Creates a copy of this ds-array.

Returns:x_copy
Return type:ds-array
max(axis=0)[source]

Returns the maximum along the given axis.

Parameters:axis (int, optional (default=0))
Returns:max – Maximum along axis.
Return type:ds-array
mean(axis=0)[source]

Returns the mean along the given axis.

Parameters:axis (int, optional (default=0))
Returns:mean – Mean along axis.
Return type:ds-array
median(axis=0)[source]

Returns the median along the given axis.

Parameters:axis (int, optional (default=0))
Returns:median – Median along axis.
Return type:ds-array
Raises:NotImplementedError – If the ds-array is sparse.
min(axis=0)[source]

Returns the minimum along the given axis.

Parameters:axis (int, optional (default=0))
Returns:min – Minimum along axis.
Return type:ds-array
norm(axis=0)[source]

Returns the Frobenius norm along an axis.

Parameters:axis (int, optional (default=0)) – Specifies the axis of the array along which to compute the vector norms.
Returns:norm – Norm along axis.
Return type:ds-array
Raises:NotImplementedError – If the ds-array is sparse.
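
Taken along a single axis, the Frobenius norm reduces to the Euclidean norm of each column or row. The sketch below computes it per column on a plain list of rows, assuming that axis=0 reduces over rows and yields one value per column, as in NumPy. column_norms is a hypothetical helper, not a dislib function.

```python
import math


def column_norms(rows):
    """Euclidean norm of each column of a list-of-rows matrix."""
    n_cols = len(rows[0])
    return [math.sqrt(sum(row[j] ** 2 for row in rows)) for j in range(n_cols)]


data = [[3.0, 0.0],
        [4.0, 2.0]]
norms = column_norms(data)
print(norms)
```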
rechunk(block_size)[source]

Re-partitions the ds-array into blocks of the given block size.

Parameters:block_size (tuple of two ints) – The desired block size.
Returns:x – Re-partitioned ds-array.
Return type:ds-array
replace_block(i, j, new_block)[source]
shape

Total shape of the ds-array.

sqrt()[source]

Returns the element-wise square root of the elements in the ds-array.

Returns:x
Return type:ds-array
sum(axis=0)[source]

Returns the sum along the given axis.

Parameters:axis (int, optional (default=0))
Returns:sum – Sum along axis.
Return type:ds-array
transpose(mode='rows')[source]

Returns the transpose of the ds-array following the method indicated by mode. 'all' uses a single task to transpose all the blocks (slow with a high number of blocks). 'rows' and 'columns' transpose each block of rows or columns independently (i.e., one task per row or column block).

Parameters:mode (string, optional (default='rows')) – Transposition method: 'all', 'rows', or 'columns'.
Returns:dsarray – A transposed ds-array.
Return type:ds-array
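
The modes presumably differ only in how the work is split into tasks, not in the result. The sketch below shows the underlying identity on nested lists: transposing a blocked matrix amounts to transposing every block and swapping the block grid indices. All helper names are hypothetical and no dislib calls are involved.

```python
def transpose(matrix):
    """Transpose a list-of-rows matrix."""
    return [list(col) for col in zip(*matrix)]


def transpose_blocks(blocks):
    """Transpose each block and swap the block grid indices."""
    n_block_rows, n_block_cols = len(blocks), len(blocks[0])
    return [[transpose(blocks[i][j]) for i in range(n_block_rows)]
            for j in range(n_block_cols)]


def assemble(blocks):
    """Join a grid of blocks back into a single list-of-rows matrix."""
    rows = []
    for block_row in blocks:
        for r in range(len(block_row[0])):
            row = []
            for block in block_row:
                row.extend(block[r])
            rows.append(row)
    return rows


# One block row holding a 2 x 2 block and a 2 x 1 block.
blocks = [[[[1, 2], [4, 5]], [[3], [6]]]]
blockwise = assemble(transpose_blocks(blocks))
direct = transpose(assemble(blocks))
print(blockwise)
```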

Array creation routines

dislib.array(x, block_size)[source]

Loads data into a Distributed Array.

Parameters:
  • x (spmatrix or array-like, shape=(n_samples, n_features)) – Array of samples.
  • block_size ((int, int)) – Block size as (number of samples, number of features).
Returns:

dsarray – A distributed representation of the data divided in blocks.

Return type:

ds-array

dislib.random_array(shape, block_size, random_state=None)[source]

Returns a distributed array of random floats in the half-open interval [0.0, 1.0). Values are from the “continuous uniform” distribution over the stated interval.

Parameters:
  • shape (tuple of two ints) – Shape of the output ds-array.
  • block_size (tuple of two ints) – Size of the ds-array blocks.
  • random_state (int or RandomState, optional (default=None)) – Seed or numpy.random.RandomState instance to generate the random numbers.
Returns:

x – Distributed array of random floats.

Return type:

ds-array

dislib.zeros(shape, block_size, dtype=None)[source]

Returns a ds-array of given shape and block size, filled with zeros.

Parameters:
  • shape (tuple of two ints) – Shape of the output ds-array.
  • block_size (tuple of two ints) – Size of the ds-array blocks.
  • dtype (data type, optional (default=None)) – The desired type of the array. Defaults to float.
Returns:

x – Distributed array filled with zeros.

Return type:

ds-array

dislib.full(shape, block_size, fill_value, dtype=None)[source]

Returns a ds-array of ‘shape’ filled with ‘fill_value’.

Parameters:
  • shape (tuple of two ints) – Shape of the output ds-array.
  • block_size (tuple of two ints) – Size of the ds-array blocks.
  • fill_value (scalar) – Fill value.
  • dtype (data type, optional (default=None)) – The desired type of the array. Defaults to float.
Returns:

x – Distributed array filled with the fill value.

Return type:

ds-array

dislib.eye(n, m, block_size, dtype=None)[source]

Returns a matrix filled with ones on the diagonal and zeros elsewhere.

Parameters:
  • n (int) – number of rows.
  • m (int) – number of columns.
  • block_size (tuple of two ints) – Block size.
  • dtype (data type, optional (default=None)) – The desired type of the ds-array. Defaults to float.
Returns:

x – Identity matrix of shape n x m.

Return type:

ds-array

Raises:

ValueError – If block_size is greater than n.
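
For a non-square shape, the diagonal covers only the first min(n, m) rows and columns. A minimal nested-list sketch of the expected contents, using a hypothetical eye_rows helper rather than dislib itself:

```python
def eye_rows(n, m):
    """n x m list of rows with ones on the main diagonal."""
    return [[1.0 if i == j else 0.0 for j in range(m)] for i in range(n)]


wide = eye_rows(2, 3)
for row in wide:
    print(row)
```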

dislib.identity(n, block_size, dtype=None)[source]

Returns the identity matrix.

Parameters:
  • n (int) – Size of the matrix.
  • block_size (tuple of two ints) – Block size.
  • dtype (data type, optional (default=None)) – The desired type of the ds-array. Defaults to float.
Returns:

x – Identity matrix of shape n x n.

Return type:

ds-array

Raises:

ValueError – If block_size is greater than n.

dislib.load_txt_file(path, block_size, delimiter=',')[source]

Loads a text file into a distributed array.

Parameters:
  • path (string) – File path.
  • block_size (tuple (int, int)) – Size of the blocks of the array.
  • delimiter (string, optional (default=",")) – String that separates columns in the file.
Returns:

x – A distributed representation of the data divided in blocks.

Return type:

ds-array
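
The sketch below shows the kind of input this loader appears to expect, assuming one sample per line and numeric values separated by the delimiter. parse_delimited is a hypothetical stand-in that builds an in-memory list of rows; it does not create a ds-array.

```python
import io


def parse_delimited(lines, delimiter=","):
    """Parse delimiter-separated numeric lines into a list of rows."""
    return [[float(value) for value in line.strip().split(delimiter)]
            for line in lines if line.strip()]


sample = io.StringIO("1.0,2.0,3.0\n4.0,5.0,6.0\n")
rows = parse_delimited(sample)
print(rows)
```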

dislib.load_svmlight_file(path, block_size, n_features, store_sparse)[source]

Loads an SVMLight file into a distributed array.

Parameters:
  • path (string) – File path.
  • block_size (tuple (int, int)) – Size of the blocks for the output ds-array.
  • n_features (int) – Number of features.
  • store_sparse (boolean) – Whether to use scipy.sparse data structures to store data. If False, numpy.array is used instead.
Returns:

x, y – Distributed representations (ds-arrays) of the samples X and the labels y.

Return type:

(ds-array, ds-array)
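
An SVMLight line stores a label followed by index:value pairs for the non-zero features only, which is why n_features has to be given explicitly. The sketch below expands one such line into a dense row, assuming the conventional 1-based feature indices and ignoring comments and qid fields. parse_svmlight_line is a hypothetical helper, not part of dislib.

```python
def parse_svmlight_line(line, n_features):
    """Expand one 'label index:value ...' line into (label, dense row)."""
    parts = line.split()
    label = float(parts[0])
    row = [0.0] * n_features
    for item in parts[1:]:
        index, value = item.split(":")
        row[int(index) - 1] = float(value)  # assumption: indices start at 1
    return label, row


label, row = parse_svmlight_line("1 1:0.5 4:2.0", n_features=5)
print(label, row)
```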

dislib.load_npy_file(path, block_size)[source]

Loads a file in npy format (must be 2-dimensional).

Parameters:
  • path (str) – Path to the npy file.
  • block_size (tuple (int, int)) – Block size of the resulting ds-array.
Returns:

x

Return type:

ds-array

dislib.load_mdcrd_file(path, block_size, n_atoms, copy=False)[source]

Loads an mdcrd trajectory file into a distributed array.

Parameters:
  • path (string) – File path.
  • block_size (tuple (int, int)) – Size of the blocks of the array.
  • n_atoms (int) – Number of atoms in the trajectory. Each frame in the mdcrd file has 3*n_atoms float values (corresponding to 3-dimensional position).
  • copy (boolean, default=False) – Send the file to every task, as opposed to reading it once in the master program.
Returns:

x – A distributed representation of the data divided in blocks.

Return type:

ds-array
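
Since each frame holds 3*n_atoms values, a flat sequence of coordinates can be cut into frames as sketched below. The example assumes that each frame ends up as one row of the resulting array, which the description above does not state explicitly. split_frames is a hypothetical helper.

```python
def split_frames(values, n_atoms):
    """Cut a flat list of coordinates into frames of 3 * n_atoms values."""
    frame_len = 3 * n_atoms
    if len(values) % frame_len != 0:
        raise ValueError("number of values is not a multiple of 3 * n_atoms")
    return [values[i:i + frame_len] for i in range(0, len(values), frame_len)]


coords = [float(v) for v in range(12)]  # 2 frames of 2 atoms (x, y, z each)
frames = split_frames(coords, n_atoms=2)
print(len(frames), len(frames[0]))
```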

dislib.data.load_hstack_npy_files(path, cols_per_block=None)[source]

Loads the .npy files in a directory into a ds-array, stacking them horizontally, like (A|B|C). The order of concatenation is alphanumeric.

At least 1 valid .npy file must exist in the directory, and every .npy file must contain a valid array. Every array must have the same dtype, order, and number of rows.

The blocks of the returned ds-array will have the same number of rows as the input arrays, and cols_per_block columns, which defaults to the number of columns of the first array.

Parameters:
  • path (string) – Folder path.
  • cols_per_block (int, optional (default=None)) – Number of columns of the blocks for the output ds-array. If None, the number of columns of the first array is used.
Returns:

x – A distributed representation (ds-array) of the stacked arrays.

Return type:

ds-array
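
The stacking rules above can be sketched with nested lists standing in for the contents of the .npy files. The example assumes that plain sorted() on the file names matches the alphanumeric order mentioned above; hstack and column_blocks are hypothetical helpers, not dislib functions.

```python
def hstack(named_arrays):
    """Stack arrays side by side, ordered by file name."""
    names = sorted(named_arrays)
    stacked = [list(row) for row in named_arrays[names[0]]]
    for name in names[1:]:
        array = named_arrays[name]
        if len(array) != len(stacked):
            raise ValueError("all arrays must have the same number of rows")
        for out_row, row in zip(stacked, array):
            out_row.extend(row)
    return stacked


def column_blocks(n_cols, cols_per_block):
    """Column ranges (start, stop) covered by each block."""
    return [(start, min(start + cols_per_block, n_cols))
            for start in range(0, n_cols, cols_per_block)]


files = {"b.npy": [[3], [6]],
         "a.npy": [[1, 2], [4, 5]],
         "c.npy": [[7, 8], [9, 10]]}
stacked = hstack(files)
# Default: use the number of columns of the first array (a.npy).
cols_per_block = len(files["a.npy"][0])
blocks = column_blocks(len(stacked[0]), cols_per_block)
print(stacked, blocks)
```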

Utility functions

dislib.data.util.pad(a: dislib.data.array.Array, pad_width, **kwargs)[source]

Pad array blocks with the desired value.

Parameters:
  • a (array_like of rank N) – The array to pad.
  • pad_width (((top, bottom), (left, right))) – Number of values padded to the edges of each axis.
  • constant_value (scalar, optional) – The value to set in the padded rows and columns. Default is 0.
dislib.data.util.pad_last_blocks_with_zeros(a: dislib.data.array.Array)[source]

Pad array blocks with zeros.

Parameters:a (ds-array) – The array to pad.

dislib.data.util.compute_bottom_right_shape(a: dislib.data.array.Array)[source]

Computes the shape of the bottom-right block.

Parameters:a (ds-array) – The array whose bottom-right block shape is computed.

Returns:
  • size0 (int) – size of the first dimension
  • size1 (int) – size of the second dimension
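
The arithmetic behind the bottom-right block and the zero padding needed to make it regular can be sketched as follows. The example assumes a regular top-left block and a non-empty array; both functions are hypothetical helpers working on plain shape tuples, not on ds-arrays.

```python
def bottom_right_shape(shape, reg_shape):
    """Shape of the last block when all other blocks are regular."""
    rows = shape[0] % reg_shape[0] or reg_shape[0]
    cols = shape[1] % reg_shape[1] or reg_shape[1]
    return rows, cols


def padding_to_regular(shape, reg_shape):
    """Rows and columns of padding that make the last block regular."""
    rows, cols = bottom_right_shape(shape, reg_shape)
    return reg_shape[0] - rows, reg_shape[1] - cols


print(bottom_right_shape((10, 7), (4, 3)))
print(padding_to_regular((10, 7), (4, 3)))
```

When the array shape is an exact multiple of the block shape, the bottom-right block is already regular and no padding is needed.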
dislib.data.util.remove_last_columns(a: dislib.data.array.Array, n_columns)[source]

Removes last columns from the right-most blocks of the ds-array.

Parameters:
  • a (ds-array) – The array from which columns are removed.
  • n_columns (int) – The number of columns to remove.
Raises:ValueError – If n_columns >= the width of the right-most blocks.
dislib.data.util.remove_last_rows(a: dislib.data.array.Array, n_rows)[source]

Removes last rows from the bottom blocks of the ds-array.

Parameters:
  • a (ds-array) – The array from which rows are removed.
  • n_rows (int) – The number of rows to remove.

Other functions

dislib.save_txt(arr, dir, merge_rows=False)[source]

Save a ds-array by blocks to a directory in txt format.

Parameters:
  • arr (ds-array) – Array data to be saved.
  • dir (str) – Directory into which the data is saved.
  • merge_rows (boolean, default=False) – Merge blocks along rows before saving.
dislib.data.array.apply_along_axis(func, axis, x, *args, **kwargs)[source]

Apply a function to slices along the given axis.

Execute func(a, *args, **kwargs) where func operates on nd-arrays and a is a slice of x along axis. The size of the slices is determined by the block shape of x.

func must meet the following conditions:

  • Take an nd-array as argument
  • Accept axis as a keyword argument
  • Return an array-like structure
Parameters:
  • func (function) – This function should accept nd-arrays and an axis argument. It is applied to slices of x along the specified axis.
  • axis (integer) – Axis along which x is sliced. Can be 0 or 1.
  • x (ds-array) – Input distributed array.
  • args (any) – Additional arguments to func.
  • kwargs (any) – Additional named arguments to func.
Returns:

out – The output array. The shape of out is identical to the shape of x, except along the axis dimension. The output ds-array is dense regardless of the type of the input array.

Return type:

ds-array

Examples

>>> import dislib as ds
>>> import numpy as np
>>>
>>> if __name__ == '__main__':
...     x = ds.random_array((100, 100), block_size=(25, 25))
...     mean = ds.apply_along_axis(np.mean, 0, x)
...     print(mean.collect())