dislib.array
- class dislib.data.array.Array(blocks, top_left_shape, reg_shape, shape, sparse, delete=True)
Bases: object
A distributed 2-dimensional array divided in blocks.
Normally, this class should not be instantiated directly, but created using one of the array creation routines provided.
Apart from the different methods provided, this class also supports the following types of indexing:
A[i] : returns a single row
A[i, j] : returns a single element
A[i:j] : returns a set of rows (with i and j optional)
A[:, i:j] : returns a set of columns (with i and j optional)
A[[i,j,k]] : returns a set of non-consecutive rows. Rows are returned ordered by their index in the input array.
A[:, [i,j,k]] : returns a set of non-consecutive columns. Columns are returned ordered by their index in the input array.
A[i:j, k:m] : returns a set of elements (with i, j, k, and m optional)
- Parameters
blocks (list) – List of lists of nd-array or spmatrix.
top_left_shape (tuple) – A single tuple indicating the shape of the top-left block.
reg_shape (tuple) – A single tuple indicating the shape of the regular block.
shape (tuple (int, int)) – Total shape of the array, as (number of rows, number of columns).
sparse (boolean, optional (default=False)) – Whether this array stores sparse data.
delete (boolean, optional (default=True)) – Whether to call compss_delete_object on the blocks when the garbage collector deletes this ds-array.
- Variables
shape (tuple (int, int)) – Total shape of the array, as (number of rows, number of columns).
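How top_left_shape and reg_shape relate to row indexing can be pictured with a small pure-Python sketch. It does not use dislib: the helper name locate_row is hypothetical, and the sketch assumes that only the first block along an axis may differ in size from the regular blocks.

```python
def locate_row(i, top_left_rows, reg_rows):
    """Map a global row index to (block index, offset inside that block).

    Hypothetical helper, not part of dislib. Assumes the first block holds
    `top_left_rows` rows and every later block holds `reg_rows` rows.
    """
    if i < 0:
        raise IndexError("negative indices are not handled in this sketch")
    if i < top_left_rows:
        return 0, i
    block, offset = divmod(i - top_left_rows, reg_rows)
    return block + 1, offset


# Rows 0-1 live in the irregular first block, rows 2-4 in block 1, and so on.
print(locate_row(1, top_left_rows=2, reg_rows=3))
print(locate_row(5, top_left_rows=2, reg_rows=3))
```

The same arithmetic would apply to columns, using the second entry of each shape tuple.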
- property T
Returns the transpose of this ds-array.
- collect(squeeze=True)
Collects the contents of this ds-array and returns the equivalent in-memory array that this ds-array represents. This method creates a synchronization point in the execution of the application.
Warning: This method may fail if the ds-array does not fit in memory.
- Parameters
squeeze (boolean, optional (default=True)) – Whether to remove single-dimensional entries from the shape of the resulting ndarray.
- Returns
array – The actual contents of the ds-array.
- Return type
nd-array or spmatrix
- max(axis=0)
Returns the maximum along the given axis.
- Parameters
axis (int, optional (default=0))
- Returns
max – Maximum along axis.
- Return type
ds-array
- mean(axis=0)
Returns the mean along the given axis.
- Parameters
axis (int, optional (default=0))
- Returns
mean – Mean along axis.
- Return type
ds-array
- median(axis=0)
Returns the median along the given axis.
- Parameters
axis (int, optional (default=0))
- Returns
median – Median along axis.
- Return type
ds-array
- Raises
NotImplementedError – If the ds-array is sparse.
- min(axis=0)
Returns the minimum along the given axis.
- Parameters
axis (int, optional (default=0))
- Returns
min – Minimum along axis.
- Return type
ds-array
- norm(axis=0)
Returns the Frobenius norm along an axis.
- Parameters
axis (int, optional (default=0)) – Specifies the axis of the array along which to compute the vector norms.
- Returns
norm – Norm along axis.
- Return type
ds-array
- Raises
NotImplementedError – If the ds-array is sparse.
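As a rough illustration of a norm taken along axis 0, the sketch below computes the Euclidean norm of each column of a small in-memory matrix. It is plain Python rather than dislib code, the name column_norms is hypothetical, and it assumes the NumPy convention that axis=0 reduces over rows and yields one value per column.

```python
import math


def column_norms(rows):
    """Euclidean norm of each column of a list-of-lists matrix (axis=0)."""
    return [math.sqrt(sum(v * v for v in col)) for col in zip(*rows)]


# Column 0 holds (3, 4) and column 1 holds (0, 5).
print(column_norms([[3.0, 0.0], [4.0, 5.0]]))
```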
- rechunk(block_size)
Re-partitions the ds-array into blocks of the given block size.
- Parameters
block_size (tuple of two ints) – The desired block size.
- Returns
x – Re-partitioned ds-array.
- Return type
ds-array
- property shape
Total shape of the ds-array.
- sqrt()
Returns the element-wise square root of the elements in the ds-array.
- Returns
x
- Return type
ds-array
- sum(axis=0)
Returns the sum along the given axis.
- Parameters
axis (int, optional (default=0))
- Returns
sum – Sum along axis.
- Return type
ds-array
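The axis argument shared by max, mean, median, min and sum can be sketched on a small in-memory matrix. This is plain Python, not dislib; the helper reduce_along_axis is hypothetical, and it assumes the NumPy convention in which axis=0 gives one result per column and axis=1 one result per row.

```python
def reduce_along_axis(rows, func, axis=0):
    """Apply `func` to each column (axis=0) or each row (axis=1)."""
    if axis == 0:
        return [func(col) for col in zip(*rows)]
    if axis == 1:
        return [func(row) for row in rows]
    raise ValueError("axis must be 0 or 1")


data = [[1, 2, 3], [4, 5, 6]]
print(reduce_along_axis(data, sum, axis=0))  # one value per column
print(reduce_along_axis(data, max, axis=1))  # one value per row
```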
- transpose(mode='rows')
Returns the transpose of the ds-array following the method indicated by mode. ‘all’ uses a single task to transpose all the blocks (slow with a high number of blocks). ‘rows’ and ‘columns’ transpose each block of rows or columns independently (i.e. one task per row or column block).
- Parameters
mode (string, optional (default=rows)) – Transposition method: ‘all’, ‘rows’ or ‘columns’.
- Returns
dsarray – A transposed ds-array.
- Return type
ds-array
Array creation routines
- dislib.array(x, block_size)
Loads data into a Distributed Array.
- Parameters
x (spmatrix or array-like, shape=(n_samples, n_features)) – Array of samples.
block_size (tuple (int, int)) – Block size, as (number of samples, number of features).
- Returns
dsarray – A distributed representation of the data divided in blocks.
- Return type
ds-array
- dislib.random_array(shape, block_size, random_state=None)
Returns a distributed array of random floats in the half-open interval [0.0, 1.0). Values are from the “continuous uniform” distribution over the stated interval.
- Parameters
shape (tuple of two ints) – Shape of the output ds-array.
block_size (tuple of two ints) – Size of the ds-array blocks.
random_state (int or RandomState, optional (default=None)) – Seed or numpy.random.RandomState instance to generate the random numbers.
- Returns
x – Distributed array of random floats.
- Return type
ds-array
- dislib.zeros(shape, block_size, dtype=None)
Returns a ds-array of given shape and block size, filled with zeros.
- Parameters
shape (tuple of two ints) – Shape of the output ds-array.
block_size (tuple of two ints) – Size of the ds-array blocks.
dtype (data type, optional (default=None)) – The desired type of the array. Defaults to float.
- Returns
x – Distributed array filled with zeros.
- Return type
ds-array
- dislib.full(shape, block_size, fill_value, dtype=None)
Returns a ds-array of ‘shape’ filled with ‘fill_value’.
- Parameters
shape (tuple of two ints) – Shape of the output ds-array.
block_size (tuple of two ints) – Size of the ds-array blocks.
fill_value (scalar) – Fill value.
dtype (data type, optional (default=None)) – The desired type of the array. Defaults to float.
- Returns
x – Distributed array filled with the fill value.
- Return type
ds-array
- dislib.eye(n, m, block_size, dtype=None)
Returns a matrix filled with ones on the diagonal and zeros elsewhere.
- Parameters
n (int) – number of rows.
m (int) – number of columns.
block_size (tuple of two ints) – Block size.
dtype (data type, optional (default=None)) – The desired type of the ds-array. Defaults to float.
- Returns
x – Identity matrix of shape n x m.
- Return type
ds-array
- Raises
ValueError – If block_size is greater than n.
- dislib.identity(n, block_size, dtype=None)
Returns the identity matrix.
- Parameters
n (int) – Size of the matrix.
block_size (tuple of two ints) – Block size.
dtype (data type, optional (default=None)) – The desired type of the ds-array. Defaults to float.
- Returns
x – Identity matrix of shape n x n.
- Return type
ds-array
- Raises
ValueError – If block_size is greater than n.
- dislib.matmul(a: Array, b: Array, transpose_a=False, transpose_b=False)
Matrix multiplication with a possible transpose of the input.
- Parameters
a (ds-array) – First matrix.
b (ds-array) – Second matrix.
transpose_a (bool, optional (default=False)) – Whether to transpose the first matrix before multiplication.
transpose_b (bool, optional (default=False)) – Whether to transpose the second matrix before multiplication.
- Returns
out – The output array.
- Return type
ds-array
- Raises
NotImplementedError – If _top_left shape does not match _reg_shape. This case will be implemented in the future.
ValueError – If any of the block sizes does not match.
Examples
>>> import dislib as ds
>>>
>>>
>>> if __name__ == "__main__":
>>>     x = ds.random_array((8, 4), block_size=(2, 2))
>>>     y = ds.random_array((5, 8), block_size=(2, 2))
>>>     result = ds.matmul(x, y, transpose_a=True, transpose_b=True)
>>>     print(result.collect())
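To see why the shapes in the example above are compatible, the following plain-Python sketch works out the result shape once the optional transposes are applied. The helper matmul_shape is hypothetical and models array shapes only; the block-size checks documented above are not modelled.

```python
def matmul_shape(a_shape, b_shape, transpose_a=False, transpose_b=False):
    """Shape of the product after the optional transposes are applied."""
    rows_a, cols_a = a_shape[::-1] if transpose_a else a_shape
    rows_b, cols_b = b_shape[::-1] if transpose_b else b_shape
    if cols_a != rows_b:
        raise ValueError("inner dimensions do not match")
    return rows_a, cols_b


# Shapes from the example above: x is (8, 4) and y is (5, 8).
print(matmul_shape((8, 4), (5, 8), transpose_a=True, transpose_b=True))
```

With both flags set, x is treated as 4 x 8 and y as 8 x 5, so the product is 4 x 5.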
- dislib.concat_columns(a: Array, b: Array)
Matrix concatenation by columns.
- Parameters
a (ds-array) – First matrix.
b (ds-array) – Second matrix.
- Returns
out – The output array.
- Return type
ds-array
- Raises
NotImplementedError – If _top_left shape does not match _reg_shape. This case will be implemented in the future.
ValueError – If the arrays do not match in the number of rows.
Examples
>>> import dislib as ds
>>>
>>>
>>> if __name__ == "__main__":
>>>     x = ds.random_array((8, 4), block_size=(2, 2))
>>>     y = ds.random_array((8, 4), block_size=(2, 2))
>>>     result = ds.concat_columns(x, y)
>>>     print(result.collect())
- dislib.concat_rows(a, b)
Matrix concatenation by rows.
- Parameters
a (ds-array) – First matrix.
b (ds-array) – Second matrix.
- Returns
out – The output array.
- Return type
ds-array
- Raises
ValueError – If the arrays do not match in the number of columns, or if the block size is different between the arrays.
Examples
>>> import dislib as ds
>>>
>>>
>>> if __name__ == "__main__":
>>>     x = ds.random_array((8, 4), block_size=(2, 2))
>>>     y = ds.random_array((8, 4), block_size=(2, 2))
>>>     result = ds.concat_rows(x, y)
>>>     print(result.collect())
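The two concatenation routines differ only in the axis that grows. The plain-Python sketch below computes the resulting shapes; concat_shape is a hypothetical helper, and it assumes that concat_rows stacks b below a (axis 0) while concat_columns places b to the right of a (axis 1).

```python
def concat_shape(a_shape, b_shape, axis):
    """Resulting shape when stacking by rows (axis=0) or by columns (axis=1)."""
    other = 1 - axis
    if a_shape[other] != b_shape[other]:
        raise ValueError("arrays do not match along the other axis")
    result = list(a_shape)
    result[axis] += b_shape[axis]
    return tuple(result)


print(concat_shape((8, 4), (8, 4), axis=0))  # as in concat_rows
print(concat_shape((8, 4), (8, 4), axis=1))  # as in concat_columns
```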
- dislib.load_txt_file(path, block_size, discard_first_row=False, col_of_index=False, delimiter=',')
Loads a text file into a distributed array.
- Parameters
path (string) – File path.
block_size (tuple (int, int)) – Size of the blocks of the array.
discard_first_row (bool) – Boolean that indicates if the first row should be discarded.
col_of_index (bool) – Boolean that indicates if the first column is a column of indexes and therefore it should be discarded.
delimiter (string, optional (default=”,”)) – String that separates columns in the file.
- Returns
x – A distributed representation of the data divided in blocks.
- Return type
ds-array
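The effect of discard_first_row, col_of_index and delimiter can be sketched on an in-memory string. This is not how dislib reads files; parse_txt is a hypothetical helper that only mirrors the meaning of the three flags as described above.

```python
def parse_txt(text, discard_first_row=False, col_of_index=False, delimiter=","):
    """Parse delimited text into rows of floats (a sketch, not dislib code)."""
    lines = [line for line in text.splitlines() if line.strip()]
    if discard_first_row:
        lines = lines[1:]  # e.g. skip a header line
    rows = []
    for line in lines:
        fields = line.split(delimiter)
        if col_of_index:
            fields = fields[1:]  # drop the leading index column
        rows.append([float(f) for f in fields])
    return rows


sample = "id,a,b\n0,1.5,2.5\n1,3.0,4.0\n"
print(parse_txt(sample, discard_first_row=True, col_of_index=True))
```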
- dislib.load_svmlight_file(path, block_size, n_features, store_sparse)
Loads a SVMLight file into a distributed array.
- Parameters
path (string) – File path.
block_size (tuple (int, int)) – Size of the blocks for the output ds-array.
n_features (int) – Number of features.
store_sparse (boolean) – Whether to use scipy.sparse data structures to store data. If False, numpy.array is used instead.
- Returns
x, y – A distributed representation (ds-array) of the X and y.
- Return type
(ds-array, ds-array)
- dislib.load_npy_file(path, block_size)
Loads a file in npy format (must be 2-dimensional).
- Parameters
path (str) – Path to the npy file.
block_size (tuple (int, int)) – Block size of the resulting ds-array.
- Returns
x
- Return type
ds-array
- dislib.load_mdcrd_file(path, block_size, n_atoms, copy=False)
Loads a mdcrd trajectory file into a distributed array.
- Parameters
path (string) – File path.
block_size (tuple (int, int)) – Size of the blocks of the array.
n_atoms (int) – Number of atoms in the trajectory. Each frame in the mdcrd file has 3*n_atoms float values (corresponding to 3-dimensional position).
copy (boolean, default=False) – Send the file to every task, as opposed to reading it once in the master program.
- Returns
x – A distributed representation of the data divided in blocks.
- Return type
ds-array
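The relation between n_atoms and the frame length can be sketched in plain Python. The helper split_frames is hypothetical and does not parse the mdcrd format; it only groups an already-parsed flat list of coordinates, on the assumption that each frame becomes one row of 3*n_atoms values.

```python
def split_frames(values, n_atoms):
    """Group a flat list of coordinates into frames of 3 * n_atoms values."""
    frame_len = 3 * n_atoms
    if len(values) % frame_len != 0:
        raise ValueError("number of values is not a multiple of 3 * n_atoms")
    return [values[i:i + frame_len] for i in range(0, len(values), frame_len)]


# Two atoms -> six coordinates per frame, so twelve values form two frames.
frames = split_frames([float(v) for v in range(12)], n_atoms=2)
print(len(frames), len(frames[0]))
```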
- dislib.data.load_hstack_npy_files(path, cols_per_block=None)
Loads the .npy files in a directory into a ds-array, stacking them horizontally, like (A|B|C). The order of concatenation is alphanumeric.
At least 1 valid .npy file must exist in the directory, and every .npy file must contain a valid array. Every array must have the same dtype, order, and number of rows.
The blocks of the returned ds-array will have the same number of rows as the input arrays, and cols_per_block columns, which defaults to the number of columns of the first array.
- Parameters
path (string) – Folder path.
cols_per_block (int, optional (default=None)) – Number of columns of the blocks for the output ds-array. If None, the number of columns of the first array is used.
- Returns
x – A distributed representation (ds-array) of the stacked arrays.
- Return type
ds-array
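The ordering and default block width described above can be sketched without touching the file system. The helper hstack_layout is hypothetical: it takes a mapping from file name to column count, uses a plain string sort as a stand-in for the alphanumeric order, and assumes the stacked array is cut into blocks of cols_per_block columns, with a possibly narrower last block.

```python
import math


def hstack_layout(file_cols, cols_per_block=None):
    """Return (files in load order, total columns, number of column blocks).

    `file_cols` maps a file name to the number of columns of its array.
    """
    order = sorted(file_cols)  # plain string sort stands in for "alphanumeric"
    if not order:
        raise ValueError("at least one .npy file is required")
    if cols_per_block is None:
        cols_per_block = file_cols[order[0]]  # default: width of first array
    total = sum(file_cols.values())
    return order, total, math.ceil(total / cols_per_block)


print(hstack_layout({"b.npy": 3, "a.npy": 4, "c.npy": 5}))
```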
Utility functions
- dislib.data.util.compute_bottom_right_shape(a: Array)
Computes the shape of the bottom-right block.
- Parameters
a (ds-array) – The array whose bottom-right block shape is computed.
- Returns
size0 (int) – size of the first dimension
size1 (int) – size of the second dimension
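As a rough picture of what the bottom-right shape means, the sketch below derives it from the total shape and the block size alone. The helper bottom_right_shape is hypothetical and simplified: it assumes every block is regular except possibly the last one on each axis, and it ignores an irregular top-left block.

```python
def bottom_right_shape(shape, block_size):
    """Shape of the last block when `shape` is cut into `block_size` blocks.

    Assumes every block is regular except possibly the last one on each axis.
    """
    size0 = shape[0] % block_size[0] or block_size[0]
    size1 = shape[1] % block_size[1] or block_size[1]
    return size0, size1


print(bottom_right_shape((10, 7), (4, 3)))  # partial last block
print(bottom_right_shape((8, 6), (4, 3)))   # shape divides evenly
```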
- dislib.data.util.pad(a: Array, pad_width, **kwargs)
Pad array blocks with the desired value.
- Parameters
a (array_like of rank N) – The array to pad.
pad_width (((top, bottom), (left, right))) – Number of values padded to the edges of each axis.
constant_value (scalar, optional) – The value to set in the padded rows and columns. Default is 0.
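The pad_width layout ((top, bottom), (left, right)) can be illustrated on a small list-of-lists matrix. This is a plain-Python sketch, not the dislib implementation: pad_rows is a hypothetical helper that returns a new padded matrix, whereas the behaviour of dislib's pad on the ds-array blocks is not specified here.

```python
def pad_rows(rows, pad_width, constant_value=0):
    """Pad a list-of-lists matrix; pad_width is ((top, bottom), (left, right))."""
    (top, bottom), (left, right) = pad_width
    width = left + len(rows[0]) + right
    fill = constant_value
    padded = [[fill] * width for _ in range(top)]
    for row in rows:
        padded.append([fill] * left + list(row) + [fill] * right)
    padded.extend([fill] * width for _ in range(bottom))
    return padded


# Add one row at the bottom and two columns on the right.
print(pad_rows([[1, 2], [3, 4]], ((0, 1), (0, 2))))
```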
- dislib.data.util.pad_last_blocks_with_zeros(a: Array)
Pad array blocks with zeros.
- Parameters
a (ds-array) – The array to pad.
- dislib.data.util.remove_last_columns(a: Array, n_columns)
Removes last columns from the right-most blocks of the ds-array.
- Parameters
a (ds-array) – The array to remove columns from.
n_columns (int) – The number of columns to remove.
- Raises
ValueError – if n_columns >= the width of the right-most blocks
Other functions
- dislib.save_txt(arr, dir, merge_rows=False)
Save a ds-array by blocks to a directory in txt format.
- Parameters
arr (ds-array) – Array data to be saved.
dir (str) – Directory into which the data is saved.
merge_rows (boolean, default=False) – Merge blocks along rows before saving.
- dislib.data.array.apply_along_axis(func, axis, x, *args, **kwargs)
Apply a function to slices along the given axis.
Execute func(a, *args, **kwargs) where func operates on nd-arrays and a is a slice of x along axis. The size of the slices is determined by the block shape of x.
func must meet the following conditions:
Take an nd-array as argument
Accept axis as a keyword argument
Return an array-like structure
- Parameters
func (function) – This function should accept nd-arrays and an axis argument. It is applied to slices of x along the specified axis.
axis (integer) – Axis along which x is sliced. Can be 0 or 1.
x (ds-array) – Input distributed array.
args (any) – Additional arguments to func.
kwargs (any) – Additional named arguments to func.
- Returns
out – The output array. The shape of out is identical to the shape of x, except along the axis dimension. The output ds-array is dense regardless of the type of the input array.
- Return type
ds-array
Examples
>>> import dislib as ds
>>> import numpy as np
>>>
>>>
>>> if __name__ == '__main__':
>>>     x = ds.random_array((100, 100), block_size=(25, 25))
>>>     mean = ds.apply_along_axis(np.mean, 0, x)
>>>     print(mean.collect())