dislib.array
- class dislib.data.array.Array(blocks, top_left_shape, reg_shape, shape, sparse, delete=True)
Bases: object
A distributed 2-dimensional array divided in blocks.
Normally, this class should not be instantiated directly, but created using one of the array creation routines provided.
Apart from the different methods provided, this class also supports the following types of indexing:
A[i]: returns a single row
A[i, j]: returns a single element
A[i:j]: returns a set of rows (with i and j optional)
A[:, i:j]: returns a set of columns (with i and j optional)
A[[i,j,k]]: returns a set of non-consecutive rows. Rows are returned ordered by their index in the input array.
A[:, [i,j,k]]: returns a set of non-consecutive columns. Columns are returned ordered by their index in the input array.
A[i:j, k:m]: returns a set of elements (with i, j, k, and m optional)
- Parameters:
blocks (list) – List of lists of nd-array or spmatrix.
top_left_shape (tuple) – A single tuple indicating the shape of the top-left block.
reg_shape (tuple) – A single tuple indicating the shape of the regular block.
shape (tuple (int, int)) – Total number of elements along each dimension of the array (rows, columns).
sparse (boolean, optional (default=False)) – Whether this array stores sparse data.
delete (boolean, optional (default=True)) – Whether to call compss_delete_object on the blocks when the garbage collector deletes this ds-array.
- Variables:
shape (tuple (int, int)) – Total number of elements along each dimension of the array (rows, columns).
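The relation between shape, top_left_shape and reg_shape can be sketched in plain Python. The helper below, block_sizes, is a hypothetical name and not part of dislib. It assumes that the first block along an axis has the top-left size, the following blocks have the regular size, and the last block takes whatever remains.

```python
def block_sizes(total, top_left, regular):
    """Sizes of the blocks along one axis (illustrative helper, not dislib API)."""
    if not 0 < top_left <= total or regular <= 0:
        raise ValueError("invalid sizes")
    sizes = [top_left]
    remaining = total - top_left
    while remaining > 0:
        # Assumption: every block after the first uses the regular size,
        # except the last one, which may be smaller.
        size = min(regular, remaining)
        sizes.append(size)
        remaining -= size
    return sizes


# A 10 x 7 array with regular blocks of 4 x 3 and a top-left block of 2 x 3.
row_sizes = block_sizes(10, top_left=2, regular=4)
col_sizes = block_sizes(7, top_left=3, regular=3)
print(row_sizes, col_sizes)
```

When the top-left block equals the regular block, the same helper describes an evenly partitioned array.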
- property T
Returns the transpose of this ds-array.
- collect(squeeze=True)
Collects the contents of this ds-array and returns the equivalent in-memory array that this ds-array represents. This method creates a synchronization point in the execution of the application.
Warning: This method may fail if the ds-array does not fit in memory.
- Parameters:
squeeze (boolean, optional (default=True)) – Whether to remove single-dimensional entries from the shape of the resulting ndarray.
- Returns:
array – The actual contents of the ds-array.
- Return type:
nd-array or spmatrix
- delete(i=None, j=None)
Deletes the given rows and/or columns and returns the ds-array with the affected blocks adjusted.
- Parameters:
i (list of ints) – Row or rows to remove from the ds-array
j (list of ints) – Column or columns to remove from the ds-array
- Returns:
array – ds-array without the deleted elements
- Return type:
ds-array
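As a rough illustration of the intended result of delete, here is a minimal sketch on an in-memory list of rows. The function delete_rows_cols is hypothetical and ignores blocks entirely; it only shows which elements survive.

```python
def delete_rows_cols(matrix, i=None, j=None):
    """Return a copy of `matrix` (a list of rows) without rows `i` and columns `j`."""
    drop_rows = set(i or [])
    drop_cols = set(j or [])
    return [
        [value for c, value in enumerate(row) if c not in drop_cols]
        for r, row in enumerate(matrix)
        if r not in drop_rows
    ]


data = [[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]]
# Remove row 1 and columns 0 and 2.
trimmed = delete_rows_cols(data, i=[1], j=[0, 2])
print(trimmed)
```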
- max(axis=0)
Returns the maximum along the given axis.
- Parameters:
axis (int, optional (default=0))
- Returns:
max – Maximum along axis.
- Return type:
ds-array
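The axis argument of max, mean, median, min and sum is not described further here. The sketch below assumes the NumPy convention (axis=0 reduces over rows and yields one value per column, axis=1 yields one value per row) and uses a hypothetical helper on plain lists.

```python
def reduce_along_axis(matrix, func, axis=0):
    """Apply `func` to every column (axis=0) or every row (axis=1)."""
    if axis == 0:
        return [func(col) for col in zip(*matrix)]
    if axis == 1:
        return [func(row) for row in matrix]
    raise ValueError("axis must be 0 or 1")


data = [[1, 8, 3],
        [4, 2, 6]]
col_max = reduce_along_axis(data, max, axis=0)  # one value per column
row_max = reduce_along_axis(data, max, axis=1)  # one value per row
print(col_max, row_max)
```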
- mean(axis=0)
Returns the mean along the given axis.
- Parameters:
axis (int, optional (default=0))
- Returns:
mean – Mean along axis.
- Return type:
ds-array
- median(axis=0)
Returns the median along the given axis.
- Parameters:
axis (int, optional (default=0))
- Returns:
median – Median along axis.
- Return type:
ds-array
- Raises:
NotImplementedError – If the ds-array is sparse.
- min(axis=0)
Returns the minimum along the given axis.
- Parameters:
axis (int, optional (default=0))
- Returns:
min – Minimum along axis.
- Return type:
ds-array
- norm(axis=0)
Returns the Frobenius norm along an axis.
- Parameters:
axis (int, optional (default=0)) – Specifies the axis of the array along which to compute the vector norms.
- Returns:
norm – Norm along axis.
- Return type:
ds-array
- Raises:
NotImplementedError – If the ds-array is sparse.
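For a single vector, the Frobenius norm is the square root of the sum of squared entries. A minimal sketch of computing it per column or per row follows; it assumes axis=0 means one norm per column, and norm_along_axis is a hypothetical helper rather than dislib code.

```python
import math


def norm_along_axis(matrix, axis=0):
    """Square root of the sum of squares of each column (axis=0) or row (axis=1)."""
    vectors = zip(*matrix) if axis == 0 else matrix
    return [math.sqrt(sum(v * v for v in vec)) for vec in vectors]


data = [[3.0, 0.0],
        [4.0, 5.0]]
col_norms = norm_along_axis(data, axis=0)  # columns are (3, 4) and (0, 5)
row_norms = norm_along_axis(data, axis=1)  # rows are (3, 0) and (4, 5)
print(col_norms, row_norms)
```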
- rechunk(block_size)
Re-partitions the ds-array into blocks of the given block size.
- Parameters:
block_size (tuple of two ints) – The desired block size.
- Returns:
x – Re-partitioned ds-array.
- Return type:
ds-array
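What re-partitioning means can be sketched on an in-memory matrix: the data stays the same and only the block grid changes. The helper split_into_blocks is hypothetical, and it assumes that blocks at the bottom and right edges may be smaller than block_size.

```python
def split_into_blocks(matrix, block_size):
    """Partition a list-of-rows matrix into a grid (list of lists) of blocks."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    b_rows, b_cols = block_size
    return [
        [
            [row[c:c + b_cols] for row in matrix[r:r + b_rows]]
            for c in range(0, n_cols, b_cols)
        ]
        for r in range(0, n_rows, b_rows)
    ]


data = [[r * 5 + c for c in range(5)] for r in range(4)]  # 4 x 5 matrix
blocks = split_into_blocks(data, (3, 2))
grid_shape = (len(blocks), len(blocks[0]))  # number of block rows and block columns
last_block = blocks[-1][-1]                 # bottom-right block, smaller than (3, 2)
print(grid_shape, last_block)
```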
- property shape
Total shape of the ds-array.
- sqrt()
Returns the element-wise square root of the elements in the ds-array.
- Returns:
x
- Return type:
ds-array
- sum(axis=0)
Returns the sum along the given axis.
- Parameters:
axis (int, optional (default=0))
- Returns:
sum – Sum along axis.
- Return type:
ds-array
- transpose(mode='rows')
Returns the transpose of the ds-array following the method indicated by mode. The ‘all’ mode uses a single task to transpose all the blocks (slow with a high number of blocks). The ‘rows’ and ‘columns’ modes transpose each block of rows or columns independently (i.e. one task per row/column block).
- Parameters:
mode (string, optional (default=rows)) – Transposition method: ‘all’, ‘rows’ or ‘columns’.
- Returns:
dsarray – A transposed ds-array.
- Return type:
ds-array
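Transposing a blocked matrix amounts to swapping block positions in the grid and transposing each block. The sketch below shows this on nested lists. It is only an illustration of the idea, not the task structure dislib uses, and transpose_blocked is a hypothetical name.

```python
def transpose(matrix):
    """Transpose a list-of-rows matrix."""
    return [list(col) for col in zip(*matrix)]


def transpose_blocked(blocks):
    """Transpose a grid of blocks: swap grid positions and transpose each block."""
    return [[transpose(block) for block in grid_col] for grid_col in zip(*blocks)]


# 2 x 2 grid of blocks representing the 3 x 4 matrix
# [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]].
blocks = [
    [[[1, 2], [5, 6]], [[3, 4], [7, 8]]],
    [[[9, 10]], [[11, 12]]],
]
transposed = transpose_blocked(blocks)
print(transposed)
```

Each row of blocks in the input becomes a column of blocks in the output, which is why the work can be split per block row or per block column.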
Array creation routines
- dislib.array(x, block_size)
Loads data into a Distributed Array.
- Parameters:
x (spmatrix or array-like, shape=(n_samples, n_features)) – Array of samples.
block_size ((int, int)) – Block sizes in number of samples.
- Returns:
dsarray – A distributed representation of the data divided in blocks.
- Return type:
ds-array
- dislib.random_array(shape, block_size, random_state=None)
Returns a distributed array of random floats in the half-open interval [0.0, 1.0). Values are from the “continuous uniform” distribution over the stated interval.
- Parameters:
shape (tuple of two ints) – Shape of the output ds-array.
block_size (tuple of two ints) – Size of the ds-array blocks.
random_state (int or RandomState, optional (default=None)) – Seed or numpy.random.RandomState instance to generate the random numbers.
- Returns:
x – Distributed array of random floats.
- Return type:
ds-array
- dislib.zeros(shape, block_size, dtype=None)
Returns a ds-array of given shape and block size, filled with zeros.
- Parameters:
shape (tuple of two ints) – Shape of the output ds-array.
block_size (tuple of two ints) – Size of the ds-array blocks.
dtype (data type, optional (default=None)) – The desired type of the array. Defaults to float.
- Returns:
x – Distributed array filled with zeros.
- Return type:
ds-array
- dislib.full(shape, block_size, fill_value, dtype=None)
Returns a ds-array of ‘shape’ filled with ‘fill_value’.
- Parameters:
shape (tuple of two ints) – Shape of the output ds-array.
block_size (tuple of two ints) – Size of the ds-array blocks.
fill_value (scalar) – Fill value.
dtype (data type, optional (default=None)) – The desired type of the array. Defaults to float.
- Returns:
x – Distributed array filled with the fill value.
- Return type:
ds-array
- dislib.eye(n, m, block_size, dtype=None)
Returns a matrix filled with ones on the diagonal and zeros elsewhere.
- Parameters:
n (int) – number of rows.
m (int) – number of columns.
block_size (tuple of two ints) – Block size.
dtype (data type, optional (default=None)) – The desired type of the ds-array. Defaults to float.
- Returns:
x – Identity matrix of shape n x m.
- Return type:
ds-array
- Raises:
ValueError – If block_size is greater than n.
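For a non-square shape, "ones on the diagonal" means positions where the row index equals the column index. A minimal in-memory sketch, using a hypothetical helper eye_rows that does no blocking:

```python
def eye_rows(n, m):
    """n x m list of rows with 1.0 where row index == column index, 0.0 elsewhere."""
    return [[1.0 if r == c else 0.0 for c in range(m)] for r in range(n)]


rect = eye_rows(2, 3)
print(rect)
```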
- dislib.identity(n, block_size, dtype=None)
Returns the identity matrix.
- Parameters:
n (int) – Size of the matrix.
block_size (tuple of two ints) – Block size.
dtype (data type, optional (default=None)) – The desired type of the ds-array. Defaults to float.
- Returns:
x – Identity matrix of shape n x n.
- Return type:
ds-array
- Raises:
ValueError – If block_size is greater than n.
- dislib.matmul(a: Array, b: Array, transpose_a=False, transpose_b=False)
Matrix multiplication with a possible transpose of the input.
- Parameters:
a (ds-array) – First matrix.
b (ds-array) – Second matrix.
transpose_a (bool) – Whether to transpose the first matrix before multiplication.
transpose_b (bool) – Whether to transpose the second matrix before multiplication.
- Returns:
out – The output array.
- Return type:
ds-array
- Raises:
NotImplementedError – If _top_left shape does not match _reg_shape. This case will be implemented in the future.
ValueError – If any of the block sizes does not match.
Examples
>>> import dislib as ds
>>>
>>>
>>> if __name__ == "__main__":
>>>     x = ds.random_array((8, 4), block_size=(2, 2))
>>>     y = ds.random_array((5, 8), block_size=(2, 2))
>>>     result = ds.matmul(x, y, transpose_a=True, transpose_b=True)
>>>     print(result.collect())
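The shapes in the example above only line up once both transposes are applied. The following sketch checks this with ordinary matrix-product shape rules. The helper matmul_shape is hypothetical and does not call dislib.

```python
def matmul_shape(a_shape, b_shape, transpose_a=False, transpose_b=False):
    """Shape of the product after optionally transposing each operand."""
    a_rows, a_cols = a_shape[::-1] if transpose_a else a_shape
    b_rows, b_cols = b_shape[::-1] if transpose_b else b_shape
    if a_cols != b_rows:
        raise ValueError(
            f"incompatible shapes: {(a_rows, a_cols)} x {(b_rows, b_cols)}"
        )
    return (a_rows, b_cols)


# x is 8 x 4 and y is 5 x 8, so x^T is 4 x 8 and y^T is 8 x 5.
result_shape = matmul_shape((8, 4), (5, 8), transpose_a=True, transpose_b=True)
print(result_shape)
```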
- dislib.matadd(a: Array, b: Array)
Addition of two matrices.
- Parameters:
a (ds-array) – First matrix.
b (ds-array) – Second matrix.
- Returns:
out – The output array.
- Return type:
ds-array
- Raises:
NotImplementedError – If _top_left shape does not match _reg_shape. This case will be implemented in the future.
ValueError – If any of the block sizes does not match.
ValueError – If the ds-arrays have different shape.
Examples
>>> import dislib as ds
>>>
>>>
>>> if __name__ == "__main__":
>>>     x = ds.random_array((8, 4), block_size=(2, 2))
>>>     y = ds.random_array((8, 4), block_size=(2, 2))
>>>     result = ds.matadd(x, y)
>>>     print(result.collect())
- dislib.matsubtract(a: Array, b: Array)
Subtraction of two matrices.
- Parameters:
a (ds-array) – First matrix.
b (ds-array) – Second matrix.
- Returns:
out – The output array.
- Return type:
ds-array
- Raises:
NotImplementedError – If _top_left shape does not match _reg_shape. This case will be implemented in the future.
ValueError – If any of the block sizes does not match.
ValueError – If the ds-arrays have different shape.
Examples
>>> import dislib as ds
>>>
>>>
>>> if __name__ == "__main__":
>>>     x = ds.random_array((8, 4), block_size=(2, 2))
>>>     y = ds.random_array((8, 4), block_size=(2, 2))
>>>     result = ds.matsubtract(x, y)
>>>     print(result.collect())
- dislib.concat_columns(a: Array, b: Array)
Matrix concatenation by columns.
- Parameters:
a (ds-array) – First matrix.
b (ds-array) – Second matrix.
- Returns:
out – The output array.
- Return type:
ds-array
- Raises:
NotImplementedError – If _top_left shape does not match _reg_shape. This case will be implemented in the future.
ValueError – If the arrays do not match in the number of rows.
Examples
>>> import dislib as ds
>>>
>>>
>>> if __name__ == "__main__":
>>>     x = ds.random_array((8, 4), block_size=(2, 2))
>>>     y = ds.random_array((8, 4), block_size=(2, 2))
>>>     result = ds.concat_columns(x, y)
>>>     print(result.collect())
- dislib.concat_rows(a, b)
Matrix concatenation by rows.
- Parameters:
a (ds-array) – First matrix.
b (ds-array) – Second matrix.
- Returns:
out – The output array.
- Return type:
ds-array
- Raises:
ValueError – If the arrays do not match in the number of columns, or if the block size is different between the arrays.
Examples
>>> import dislib as ds
>>>
>>>
>>> if __name__ == "__main__":
>>>     x = ds.random_array((8, 4), block_size=(2, 2))
>>>     y = ds.random_array((8, 4), block_size=(2, 2))
>>>     result = ds.concat_rows(x, y)
>>>     print(result.collect())
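The two concatenation routines differ in which dimension grows. The sketch below computes the expected output shapes under the assumption that concat_columns places the arrays side by side (rows must match) and concat_rows stacks them vertically (columns must match). Both helpers are hypothetical.

```python
def concat_columns_shape(a_shape, b_shape):
    """Shape after placing b to the right of a."""
    if a_shape[0] != b_shape[0]:
        raise ValueError("number of rows differs")
    return (a_shape[0], a_shape[1] + b_shape[1])


def concat_rows_shape(a_shape, b_shape):
    """Shape after placing b below a."""
    if a_shape[1] != b_shape[1]:
        raise ValueError("number of columns differs")
    return (a_shape[0] + b_shape[0], a_shape[1])


# Two 8 x 4 arrays, as in the examples above.
by_columns = concat_columns_shape((8, 4), (8, 4))
by_rows = concat_rows_shape((8, 4), (8, 4))
print(by_columns, by_rows)
```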
- dislib.load_txt_file(path, block_size, discard_first_row=False, col_of_index=False, delimiter=',')
Loads a text file into a distributed array.
- Parameters:
path (string) – File path.
block_size (tuple (int, int)) – Size of the blocks of the array.
discard_first_row (bool) – Boolean that indicates if the first row should be discarded.
col_of_index (bool) – Boolean that indicates if the first column is a column of indexes and therefore it should be discarded.
delimiter (string, optional (default=",")) – String that separates columns in the file.
- Returns:
x – A distributed representation of the data divided in blocks.
- Return type:
ds-array
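The effect of discard_first_row, col_of_index and delimiter can be sketched with the standard csv module on an in-memory string. The function parse_text is a hypothetical stand-in that returns plain lists. It assumes every remaining value is numeric and does not reproduce how dislib reads or distributes the file.

```python
import csv
import io


def parse_text(text, discard_first_row=False, col_of_index=False, delimiter=","):
    """Parse delimited text into a list of rows of floats."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    if discard_first_row:
        rows = rows[1:]                    # drop the header line
    if col_of_index:
        rows = [row[1:] for row in rows]   # drop the leading index column
    return [[float(value) for value in row] for row in rows]


sample = "id;x;y\n0;1.5;2.0\n1;3.0;4.5\n"
values = parse_text(sample, discard_first_row=True, col_of_index=True, delimiter=";")
print(values)
```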
- dislib.load_svmlight_file(path, block_size, n_features, store_sparse)
Loads a SVMLight file into a distributed array.
- Parameters:
path (string) – File path.
block_size (tuple (int, int)) – Size of the blocks for the output ds-array.
n_features (int) – Number of features.
store_sparse (boolean) – Whether to use scipy.sparse data structures to store data. If False, numpy.array is used instead.
- Returns:
x, y – Distributed representations (ds-arrays) of the samples X and the labels y.
- Return type:
(ds-array, ds-array)
- dislib.load_npy_file(path, block_size)
Loads a file in npy format (must be 2-dimensional).
- Parameters:
path (str) – Path to the npy file.
block_size (tuple (int, int)) – Block size of the resulting ds-array.
- Returns:
x
- Return type:
ds-array
- dislib.load_mdcrd_file(path, block_size, n_atoms, copy=False)
Loads a mdcrd trajectory file into a distributed array.
- Parameters:
path (string) – File path.
block_size (tuple (int, int)) – Size of the blocks of the array.
n_atoms (int) – Number of atoms in the trajectory. Each frame in the mdcrd file has 3*n_atoms float values (corresponding to 3-dimensional position).
copy (boolean, default=False) – Send the file to every task, as opposed to reading it once in the master program.
- Returns:
x – A distributed representation of the data divided in blocks.
- Return type:
ds-array
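Since each frame holds 3*n_atoms values, a flat list of coordinates can be cut into frames as sketched below. The helper split_frames is hypothetical, and treating each frame as one row of the result is an assumption, not something stated above.

```python
def split_frames(values, n_atoms):
    """Cut a flat list of coordinates into frames of 3 * n_atoms values each."""
    frame_len = 3 * n_atoms
    if len(values) % frame_len != 0:
        raise ValueError("the last frame is incomplete")
    return [values[k:k + frame_len] for k in range(0, len(values), frame_len)]


coords = [float(v) for v in range(12)]   # 12 values = 2 frames of 2 atoms
frames = split_frames(coords, n_atoms=2)
print(len(frames), frames[1])
```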
- dislib.data.load_hstack_npy_files(path, cols_per_block=None)
Loads the .npy files in a directory into a ds-array, stacking them horizontally, like (A|B|C). The order of concatenation is alphanumeric.
At least 1 valid .npy file must exist in the directory, and every .npy file must contain a valid array. Every array must have the same dtype, order, and number of rows.
The blocks of the returned ds-array will have the same number of rows as the input arrays, and cols_per_block columns, which defaults to the number of columns of the first array.
- Parameters:
path (string) – Folder path.
cols_per_block (int, optional (default=None)) – Number of columns of the blocks for the output ds-array. If None, the number of columns of the first array is used.
- Returns:
x – A distributed representation (ds-array) of the stacked arrays.
- Return type:
ds-array
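The stacking order and the row-count requirement can be illustrated on in-memory data. In this sketch a dictionary stands in for the directory, the file names are made up, and Python's sorted() is assumed to approximate the alphanumeric order mentioned above.

```python
def hstack(arrays_by_name):
    """Stack matrices side by side, visiting the names in sorted order."""
    names = sorted(arrays_by_name)
    n_rows = len(arrays_by_name[names[0]])
    if any(len(arrays_by_name[name]) != n_rows for name in names):
        raise ValueError("all arrays must have the same number of rows")
    return [
        [value for name in names for value in arrays_by_name[name][r]]
        for r in range(n_rows)
    ]


# Hypothetical directory contents: a.npy is 2 x 2, b.npy is 2 x 1.
files = {"b.npy": [[3], [6]], "a.npy": [[1, 2], [4, 5]]}
stacked = hstack(files)
print(stacked)
```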
Utility functions
- dislib.data.util.compute_bottom_right_shape(a: Array)
Computes the shape of the bottom-right block.
- Parameters:
a (ds-array) – The array whose bottom-right block shape is computed.
- Returns:
size0 (int) – size of the first dimension
size1 (int) – size of the second dimension
- dislib.data.util.decoder_helper(class_name, obj)
- dislib.data.util.encoder_helper(obj)
- dislib.data.util.pad(a: Array, pad_width, **kwargs)
Pad array blocks with the desired value.
- Parameters:
a (array_like of rank N) – The array to pad.
pad_width (((top, bottom), (left, right))) – Number of values padded to the edges of each axis.
constant_value (scalar, optional) – The value to set in the padded rows and columns. Default is 0.
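The pad_width layout ((top, bottom), (left, right)) can be illustrated on a plain list of rows. The function pad_rows below is a hypothetical in-memory analogue. It does not reproduce how dislib distributes the padding across blocks.

```python
def pad_rows(matrix, pad_width, constant_value=0):
    """Pad a list-of-rows matrix with `constant_value` on each side."""
    (top, bottom), (left, right) = pad_width
    width = left + len(matrix[0]) + right
    body = [
        [constant_value] * left + list(row) + [constant_value] * right
        for row in matrix
    ]
    above = [[constant_value] * width for _ in range(top)]
    below = [[constant_value] * width for _ in range(bottom)]
    return above + body + below


# One extra row at the bottom and one extra column on the left, filled with 9.
padded = pad_rows([[1, 2], [3, 4]], ((0, 1), (1, 0)), constant_value=9)
print(padded)
```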
- dislib.data.util.pad_last_blocks_with_zeros(a: Array)
Pad array blocks with zeros.
- Parameters:
a (ds-array) – The array to pad.
- dislib.data.util.remove_last_columns(a: Array, n_columns)
Removes last columns from the right-most blocks of the ds-array.
- Parameters:
a (ds-array) – The array to remove columns from.
n_columns (int) – The number of columns to remove.
- Raises:
ValueError – if n_columns >= the width of the right-most blocks
- dislib.data.util.remove_last_rows(a: Array, n_rows)
Removes last rows from the bottom blocks of the ds-array.
- Parameters:
a (ds-array) – The array to remove rows from.
n_rows (int) – The number of rows to remove.
- dislib.data.util.sync_obj(obj)
Recursively synchronizes the Future objects of a list or dictionary by using compss_wait_on(obj).
Other functions
- dislib.save_txt(arr, dir, merge_rows=False)
Save a ds-array by blocks to a directory in txt format.
- Parameters:
arr (ds-array) – Array data to be saved.
dir (str) – Directory into which the data is saved.
merge_rows (boolean, default=False) – Merge blocks along rows before saving.
- dislib.apply_along_axis(func, axis, x, *args, **kwargs)
Apply a function to slices along the given axis.
Execute func(a, *args, **kwargs) where func operates on nd-arrays and a is a slice of x along axis. The size of the slices is determined by the block shape of x.
func must meet the following conditions:
Take an nd-array as argument
Accept axis as a keyword argument
Return an array-like structure
- Parameters:
func (function) – This function should accept nd-arrays and an axis argument. It is applied to slices of x along the specified axis.
axis (integer) – Axis along which x is sliced. Can be 0 or 1.
x (ds-array) – Input distributed array.
args (any) – Additional arguments to func.
kwargs (any) – Additional named arguments to func.
- Returns:
out – The output array. The shape of out is identical to the shape of x, except along the axis dimension. The output ds-array is dense regardless of the type of the input array.
- Return type:
ds-array
Examples
>>> import dislib as ds
>>> import numpy as np
>>>
>>>
>>> if __name__ == '__main__':
>>>     x = ds.random_array((100, 100), block_size=(25, 25))
>>>     mean = ds.apply_along_axis(np.mean, 0, x)
>>>     print(mean.collect())