dislib.Data

Classes

class dislib.data.classes.Dataset(n_features, sparse=False)

Bases: object

A dataset containing samples and, optionally, labels that can be stored in a distributed manner.

Dataset works as a list of Subset instances, which can be future objects stored remotely. Accessing Dataset.labels and Dataset.samples runs collect() and transfers all the data to the local machine.

Parameters:
    n_features (int) – Number of features of the samples.
    sparse (boolean, optional (default=False)) – Whether this dataset uses sparse data structures.
Variables:
    n_features (int) – Number of features of the samples.
    _samples (ndarray) – Samples of the dataset.
    _labels (ndarray) – Labels of the samples.
    sparse (boolean) – True if this dataset uses sparse data structures.

append(subset, n_samples=None)

Appends a Subset to this Dataset.

Parameters:
    subset (Subset) – Subset to add to this Dataset.
    n_samples (int, optional (default=None)) – Number of samples in subset.

collect()

extend(subsets)

Appends one or more Subset instances to this Dataset.

Parameters:	subsets (list) – A list of Subset instances.

labels

max_features()

Returns the maximum value of each feature in the dataset. This method might compute the maximum and perform a synchronization.

Returns:	max_features – Array representing the maximum value that each feature takes in the dataset.
Return type:	array, shape = [n_features,]

min_features()

Returns the minimum value of each feature in the dataset. This method might compute the minimum and perform a synchronization.

Returns:	min_features – Array representing the minimum value that each feature takes in the dataset.
Return type:	array, shape = [n_features,]
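As an illustration of what min_features() and max_features() return, the following is a minimal NumPy sketch that reduces per-partition results into one value per feature. It is not dislib's implementation: the helper name feature_min_max and the use of a plain list of local arrays in place of distributed Subset instances are assumptions made for the example.

```python
import numpy as np


def feature_min_max(partitions):
    """Hypothetical helper: reduce per-partition minima and maxima.

    `partitions` is a list of 2-D arrays that share the same number of
    columns (features). Each partition is reduced on its own first, which
    mirrors how a distributed reduction could proceed.
    """
    mins = np.min([p.min(axis=0) for p in partitions], axis=0)
    maxs = np.max([p.max(axis=0) for p in partitions], axis=0)
    return mins, maxs


# Two partitions with two features each.
partitions = [
    np.array([[1.0, 5.0], [3.0, 2.0]]),
    np.array([[0.0, 7.0]]),
]
min_features, max_features = feature_min_max(partitions)
```

Both results have shape (n_features,), matching the return type documented above.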

samples

sparse

subset_size(index)

Returns the number of samples in the Subset referenced by index. If the size is unknown, this method performs a synchronization on Subset.samples.shape[0].

Parameters:	index (int) – Index of the Subset.
Returns:	n_samples – Number of samples.
Return type:	int

subsets_sizes()

Returns the number of samples in each Subset. If the sizes are unknown, this method performs a synchronization on Subset.samples.shape[0] for all subsets.

Returns:	subsets_sizes – Number of samples in each subset.
Return type:	ndarray

transpose(n_subsets=None)

Transposes the Dataset.

Parameters:	n_subsets (int, optional (default=None)) – Number of subsets in the transposed dataset. If None, defaults to the original number of subsets.
Returns:	dataset_t – Transposed dataset divided by rows.
Return type:	Dataset
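The effect of transpose() on partitioned data can be sketched with plain NumPy. This is only a local illustration under stated assumptions: the helper transpose_partitions is hypothetical, it gathers all the data in one place (which a distributed implementation would presumably avoid), and the way rows are distributed among the resulting subsets is an assumption.

```python
import numpy as np


def transpose_partitions(partitions, n_subsets=None):
    """Hypothetical helper: transpose row-partitioned data.

    Stacks the partitions, transposes the full matrix, and splits the
    result by rows into `n_subsets` pieces. When `n_subsets` is None,
    the original number of partitions is reused.
    """
    if n_subsets is None:
        n_subsets = len(partitions)
    full_t = np.vstack(partitions).T
    return np.array_split(full_t, n_subsets, axis=0)


# A 3x3 matrix stored as two row partitions.
partitions = [np.array([[1, 2, 3], [4, 5, 6]]), np.array([[7, 8, 9]])]
transposed = transpose_partitions(partitions)
```

Here the features of the original data become the rows of the transposed partitions.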

class dislib.data.classes.Subset(samples, labels=None)

Bases: object

A subset of data for machine learning.

Parameters:
    samples (ndarray) – Array of shape (n_samples, n_features).
    labels (ndarray, optional) – Array of shape (n_samples).
Variables:
    samples (ndarray) – Samples.
    labels (ndarray) – Labels.

concatenate(subset)

Vertically concatenates this Subset to another.

Parameters:	subset (Subset) – Subset to concatenate.
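Vertical concatenation stacks the rows of one block of samples below another and joins the label arrays in the same order. The sketch below shows the idea on plain NumPy arrays. The function concatenate_arrays is hypothetical and returns new arrays. How Subset.concatenate stores its result is not stated here.

```python
import numpy as np


def concatenate_arrays(samples_a, labels_a, samples_b, labels_b):
    """Hypothetical helper: stack samples row-wise and join labels."""
    samples = np.vstack([samples_a, samples_b])
    labels = np.concatenate([labels_a, labels_b])
    return samples, labels


samples, labels = concatenate_arrays(
    np.array([[1.0, 2.0], [3.0, 4.0]]), np.array([0.0, 1.0]),
    np.array([[5.0, 6.0]]), np.array([1.0]),
)
```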

copy()

Returns a copy of this Subset.

Returns:	subset – A copy of this Subset.
Return type:	Subset

set_label(index, label)

Sets sample labels.

Parameters:
    index (int or sequence of ints) – Indices of the target samples.
    label (float) – Label value.

Notes

If the Subset does not contain labels, this method initializes the labels of all samples not selected by index to None.
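The behaviour described in this note can be sketched with plain NumPy. The function set_label_sketch is hypothetical, and storing the labels in an object array so that None can sit next to float values is an assumption of the example, not a statement about how Subset stores them.

```python
import numpy as np


def set_label_sketch(labels, n_samples, index, label):
    """Hypothetical helper mirroring the note on set_label.

    If there are no labels yet, every label starts as None, and only the
    positions selected by `index` receive `label`.
    """
    if labels is None:
        # An object array is assumed so that None and floats can coexist.
        labels = np.full(n_samples, None, dtype=object)
    labels[index] = label
    return labels


labels = set_label_sketch(None, 4, [0, 2], 1.0)
```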

Functions

dislib.data.base.load_data(x, subset_size, y=None)

Loads data into a Dataset.

Parameters:
    x (ndarray, shape=[n_samples, n_features]) – Array of samples.
    subset_size (int) – Subset size in number of samples.
    y (ndarray, optional, shape=[n_samples,]) – Array of labels.
Returns:	dataset – A distributed representation of the data divided in Subsets of subset_size.
Return type:	Dataset
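Dividing the data into Subsets of subset_size amounts to slicing x (and y, if given) into consecutive row blocks. The following sketch shows that partitioning on local arrays. The helper split_into_chunks is hypothetical, it returns plain tuples rather than Subset instances, and letting the last block be smaller when n_samples is not a multiple of subset_size is an assumption.

```python
import numpy as np


def split_into_chunks(x, subset_size, y=None):
    """Hypothetical helper: slice samples and labels into row blocks."""
    chunks = []
    for start in range(0, x.shape[0], subset_size):
        stop = start + subset_size
        labels = None if y is None else y[start:stop]
        chunks.append((x[start:stop], labels))
    return chunks


x = np.arange(10).reshape(5, 2)   # 5 samples, 2 features
y = np.array([0, 1, 0, 1, 1])     # one label per sample
chunks = split_into_chunks(x, subset_size=2, y=y)
```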

dislib.data.base.load_libsvm_file(path, subset_size, n_features, store_sparse=True)

Loads a LibSVM file into a Dataset.

Parameters:
    path (string) – File path.
    subset_size (int) – Subset size in lines.
    n_features (int) – Number of features.
    store_sparse (boolean, optional (default=True)) – Whether to use scipy.sparse data structures to store data. If False, numpy.array is used instead.

Returns:	dataset – A distributed representation of the data divided in Subsets of subset_size.
Return type:	Dataset
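For readers unfamiliar with the LibSVM format, each line holds a label followed by index:value pairs with 1-based feature indices, and features that are not listed are zero. The sketch below parses a few such lines into dense arrays, roughly the representation that store_sparse=False suggests. The function parse_libsvm_lines is hypothetical and is not part of dislib.

```python
import numpy as np


def parse_libsvm_lines(lines, n_features):
    """Hypothetical helper: parse LibSVM-formatted lines into dense arrays."""
    samples = np.zeros((len(lines), n_features))
    labels = np.empty(len(lines))
    for row, line in enumerate(lines):
        fields = line.split()
        labels[row] = float(fields[0])
        for pair in fields[1:]:
            col, value = pair.split(":")
            # LibSVM feature indices start at 1.
            samples[row, int(col) - 1] = float(value)
    return samples, labels


lines = ["1 1:0.5 3:2.0", "-1 2:1.5"]
samples, labels = parse_libsvm_lines(lines, n_features=3)
```

The n_features argument is needed because the file lists only nonzero features, so the width of the data cannot be read reliably from a single line.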

dislib.data.base.load_libsvm_files(path, n_features, store_sparse=True)

Loads a set of LibSVM files into a Dataset.

Parameters:
    path (string) – Path to a directory containing LibSVM files.
    n_features (int) – Number of features.
    store_sparse (boolean, optional (default=True)) – Whether to use scipy.sparse data structures to store data. If False, numpy.array is used instead.

Returns:	dataset – A distributed representation of the data divided in a Subset for each file in path.
Return type:	Dataset

dislib.data.base.load_txt_file(path, subset_size, n_features, delimiter=',', label_col=None)

Loads a text file into a Dataset.

Parameters:
    path (string) – File path.
    subset_size (int) – Subset size in lines.
    n_features (int) – Number of features.
    delimiter (string, optional (default=",")) – String that separates features in the file.
    label_col (string, optional (default=None)) – Column representing data labels. Can be 'first' or 'last'.

Returns:	dataset – A distributed representation of the data divided in Subsets of subset_size.
Return type:	Dataset
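The roles of delimiter and label_col can be illustrated with NumPy alone. The sketch below parses delimited text and separates the label column from the features. The function parse_txt is hypothetical, it reads from a string instead of a file path, and treating label_col=None as "no labels" is an assumption based on the default value.

```python
import io

import numpy as np


def parse_txt(text, delimiter=",", label_col=None):
    """Hypothetical helper: split delimited text into samples and labels."""
    data = np.loadtxt(io.StringIO(text), delimiter=delimiter, ndmin=2)
    if label_col == "first":
        return data[:, 1:], data[:, 0]
    if label_col == "last":
        return data[:, :-1], data[:, -1]
    # Assumption: without label_col, every column is a feature.
    return data, None


text = "1.0,2.0,0\n3.0,4.0,1\n"
samples, labels = parse_txt(text, label_col="last")
```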

dislib.data.base.load_txt_files(path, n_features, delimiter=',', label_col=None)

Loads a set of text files into a Dataset.

Parameters:
    path (string) – Path to a directory containing text files.
    n_features (int) – Number of features.
    delimiter (string, optional (default=",")) – String that separates features in the file.
    label_col (string, optional (default=None)) – Column representing data labels. Can be 'first' or 'last'.

Returns:	dataset – A distributed representation of the data divided in a Subset for each file in path.
Return type:	Dataset
