dislib.data

Classes

class dislib.data.classes.Dataset(n_features, sparse=False)[source]

Bases: object

A dataset containing samples and, optionally, labels that can be stored in a distributed manner.

Dataset works as a list of Subset instances, which can be future objects stored remotely. Accessing Dataset.labels and Dataset.samples runs collect() and transfers all the data to the local machine.

Parameters:
  • n_features (int) – Number of features of the samples.
  • sparse (boolean, optional (default=False)) – Whether this dataset uses sparse data structures.
Variables:
  • n_features (int) – Number of features of the samples.
  • _samples (ndarray) – Samples of the dataset.
  • _labels (ndarray) – Labels of the samples.
  • sparse (boolean) – True if this dataset uses sparse data structures.
append(subset, n_samples=None)[source]

Appends a Subset to this Dataset.

Parameters:
  • subset (Subset) – Subset to add to this Dataset.
  • n_samples (int, optional (default=None)) – Number of samples in subset.
collect()[source]
extend(subsets)[source]

Appends one or more Subset instances to this Dataset.

Parameters:subsets (list) – A list of Subset instances.
labels
max_features()[source]

Returns the maximum value of each feature in the dataset. This method might compute the maximum and perform a synchronization.

Returns:max_features – Array representing the maximum value that each feature takes in the dataset.
Return type:array, shape = [n_features,]
min_features()[source]

Returns the minimum value of each feature in the dataset. This method might compute the minimum and perform a synchronization.

Returns:min_features – Array representing the minimum value that each feature takes in the dataset.
Return type:array, shape = [n_features,]
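
The per-feature maximum and minimum can be pictured as a reduction over all subsets: every sample is compared against the running extremes, feature by feature. The following is a minimal pure-Python sketch of that idea, not dislib's implementation. Plain nested lists stand in for Subset.samples, and the helper name feature_extremes is hypothetical.

```python
def feature_extremes(subsets):
    """Return (min_features, max_features) over a list of subsets.

    Each subset is a list of samples and each sample is a list of
    n_features numbers. Plain lists replace Subset.samples (assumption).
    """
    mins, maxs = None, None
    for samples in subsets:
        for row in samples:
            if mins is None:
                # First sample seen: it is both the minimum and the maximum.
                mins, maxs = list(row), list(row)
            else:
                mins = [min(a, b) for a, b in zip(mins, row)]
                maxs = [max(a, b) for a, b in zip(maxs, row)]
    return mins, maxs


subsets = [
    [[1.0, 5.0], [3.0, 2.0]],  # first subset: two samples, two features
    [[0.5, 7.0]],              # second subset: one sample
]
mins, maxs = feature_extremes(subsets)
print(mins)  # [0.5, 2.0]
print(maxs)  # [3.0, 7.0]
```

Both results have one entry per feature, matching the documented shape [n_features,].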
samples
sparse
subset_size(index)[source]

Returns the number of samples in the Subset referenced by index. If the size is unknown, this method performs a synchronization on Subset.samples.shape[0].

Parameters:index (int) – Index of the Subset.
Returns:n_samples – Number of samples.
Return type:int
subsets_sizes()[source]

Returns the number of samples in each Subset. If any size is unknown, this method performs a synchronization on Subset.samples.shape[0] for all subsets.

Returns:subsets_sizes – Number of samples in each subset.
Return type:ndarray
transpose(n_subsets=None)[source]

Transposes the Dataset.

Parameters:n_subsets (int, optional (default=None)) – Number of subsets in the transposed dataset. If None, defaults to the original number of subsets.
Returns:dataset_t – Transposed dataset divided by rows.
Return type:Dataset
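
To make the effect of transposing a row-partitioned dataset concrete, here is a minimal pure-Python sketch under stated assumptions: plain nested lists replace Subset instances, the helper name transpose_subsets is hypothetical, and the even chunking of the transposed rows is an illustrative choice rather than dislib's documented behaviour.

```python
import math


def transpose_subsets(subsets, n_subsets=None):
    """Transpose row-partitioned data and re-partition the result by rows."""
    if n_subsets is None:
        # Mirror the documented default: keep the original number of subsets.
        n_subsets = len(subsets)
    # Gather all samples, then turn features (columns) into rows.
    rows = [row for samples in subsets for row in samples]
    transposed = [list(col) for col in zip(*rows)]
    # Split the transposed rows into at most n_subsets chunks (assumption).
    chunk = max(1, math.ceil(len(transposed) / n_subsets))
    return [transposed[i:i + chunk] for i in range(0, len(transposed), chunk)]


# Two subsets with one sample each; every sample has three features.
subsets = [[[1, 2, 3]], [[4, 5, 6]]]
result = transpose_subsets(subsets)
print(result)  # [[[1, 4], [2, 5]], [[3, 6]]]
```

The original data has 2 samples and 3 features, so the transposed data has 3 rows of 2 values each, again divided by rows.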
class dislib.data.classes.Subset(samples, labels=None)[source]

Bases: object

A subset of data for machine learning.

Parameters:
  • samples (ndarray) – Array of shape (n_samples, n_features).
  • labels (ndarray, optional) – Array of shape (n_samples,).
Variables:
concatenate(subset)[source]

Vertically concatenates this Subset to another.

Parameters:subset (Subset) – Subset to concatenate.
copy()[source]

Returns a copy of this Subset.

Returns:subset – A copy of this Subset.
Return type:Subset
set_label(index, label)[source]

Sets sample labels.

Parameters:
  • index (int or sequence of ints) – Indices of the target samples.
  • label (float) – Label value.

Notes

If the Subset does not contain labels, this method initializes the labels of all samples other than those at index to None.
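
The behaviour described in the note can be sketched in a few lines of pure Python. This is an illustration only: a plain list stands in for the label array, the function below is a hypothetical stand-alone helper rather than the Subset method, and n_samples is passed explicitly because no Subset object is involved.

```python
def set_label(labels, n_samples, index, label):
    """Set `label` at `index` (an int or a sequence of ints).

    `labels` is None when no labels exist yet; in that case every
    position is first initialized to None.
    """
    if labels is None:
        labels = [None] * n_samples
    indices = [index] if isinstance(index, int) else list(index)
    for i in indices:
        labels[i] = label
    return labels


labels = set_label(None, 4, [0, 2], 1.0)
print(labels)  # [1.0, None, 1.0, None]
labels = set_label(labels, 4, 3, 0.0)
print(labels)  # [1.0, None, 1.0, 0.0]
```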

Functions

dislib.data.base.load_data(x, subset_size, y=None)[source]

Loads data into a Dataset.

Parameters:
  • x (ndarray, shape=[n_samples, n_features]) – Array of samples.
  • y (ndarray, optional, shape=[n_samples,]) – Array of labels.
  • subset_size (int) – Subset size in number of samples.
Returns:dataset – A distributed representation of the data divided in Subsets of subset_size.
Return type:Dataset
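
The partitioning implied by subset_size can be sketched as follows. This is a minimal pure-Python illustration, not dislib's implementation: lists replace ndarrays, each subset is represented as a (samples, labels) tuple, and the helper name split_into_subsets is hypothetical.

```python
def split_into_subsets(x, subset_size, y=None):
    """Split samples (and labels, if given) into chunks of subset_size."""
    subsets = []
    for start in range(0, len(x), subset_size):
        samples = x[start:start + subset_size]
        labels = y[start:start + subset_size] if y is not None else None
        subsets.append((samples, labels))
    return subsets


x = [[0, 0], [1, 1], [2, 2], [3, 3], [4, 4]]  # 5 samples, 2 features
y = [0, 1, 0, 1, 0]                           # one label per sample
subsets = split_into_subsets(x, 2, y)
print([len(samples) for samples, _ in subsets])  # [2, 2, 1]
```

In this sketch the last chunk holds the remaining samples when the number of samples is not a multiple of subset_size.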

dislib.data.base.load_libsvm_file(path, subset_size, n_features, store_sparse=True)[source]

Loads a LibSVM file into a Dataset.

Parameters:
  • path (string) – File path.
  • subset_size (int) – Subset size in lines.
  • n_features (int) – Number of features.
  • store_sparse (boolean, optional (default=True)) – Whether to use scipy.sparse data structures to store data. If False, numpy.array is used instead.
Returns:dataset – A distributed representation of the data divided in Subsets of subset_size.
Return type:Dataset
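
For readers unfamiliar with the LibSVM format, each line holds a label followed by index:value pairs with 1-based feature indices; features that are not listed are zero. The sketch below parses one such line into a dense row, which is why n_features must be known in advance. The helper name parse_libsvm_line is hypothetical and this is not dislib's parser.

```python
def parse_libsvm_line(line, n_features):
    """Parse 'label idx:value idx:value ...' into (label, dense_features)."""
    parts = line.split()
    label = float(parts[0])
    features = [0.0] * n_features  # unlisted features default to zero
    for item in parts[1:]:
        idx, value = item.split(":")
        features[int(idx) - 1] = float(value)  # LibSVM indices start at 1
    return label, features


label, features = parse_libsvm_line("1 1:0.5 3:2.0", n_features=4)
print(label)     # 1.0
print(features)  # [0.5, 0.0, 2.0, 0.0]
```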
dislib.data.base.load_libsvm_files(path, n_features, store_sparse=True)[source]

Loads a set of LibSVM files into a Dataset.

Parameters:
  • path (string) – Path to a directory containing LibSVM files.
  • n_features (int) – Number of features.
  • store_sparse (boolean, optional (default=True)) – Whether to use scipy.sparse data structures to store data. If False, numpy.array is used instead.
Returns:dataset – A distributed representation of the data divided in a Subset for each file in path.
Return type:Dataset
dislib.data.base.load_txt_file(path, subset_size, n_features, delimiter=',', label_col=None)[source]

Loads a text file into a Dataset.

Parameters:
  • path (string) – File path.
  • subset_size (int) – Subset size in lines.
  • n_features (int) – Number of features.
  • delimiter (string, optional (default=",")) – String that separates features in the file.
  • label_col (string, optional (default=None)) – Column representing data labels. Can be 'first' or 'last'.
Returns:dataset – A distributed representation of the data divided in Subsets of subset_size.
Return type:Dataset
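
The roles of delimiter and label_col can be illustrated on a single line of text. The sketch below is a pure-Python approximation under stated assumptions, not dislib's parser: the helper name parse_txt_line is hypothetical, and all values are assumed to be numeric.

```python
def parse_txt_line(line, delimiter=",", label_col=None):
    """Split one line into (features, label); label is None if absent."""
    values = [float(v) for v in line.strip().split(delimiter)]
    if label_col == "first":
        return values[1:], values[0]
    if label_col == "last":
        return values[:-1], values[-1]
    # No label column: every value is a feature.
    return values, None


print(parse_txt_line("1.0,2.0,3.0", label_col="last"))   # ([1.0, 2.0], 3.0)
print(parse_txt_line("1.0,2.0,3.0", label_col="first"))  # ([2.0, 3.0], 1.0)
print(parse_txt_line("1.0,2.0,3.0"))                     # ([1.0, 2.0, 3.0], None)
```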
dislib.data.base.load_txt_files(path, n_features, delimiter=',', label_col=None)[source]

Loads a set of text files into a Dataset.

Parameters:
  • path (string) – Path to a directory containing text files.
  • n_features (int) – Number of features.
  • delimiter (string, optional (default=",")) – String that separates features in the file.
  • label_col (string, optional (default=None)) – Column representing data labels. Can be 'first' or 'last'.
Returns:dataset – A distributed representation of the data divided in a Subset for each file in path.
Return type:Dataset
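
The "one Subset per file" layout can be sketched with the standard library alone. This is an illustration under stated assumptions rather than dislib's implementation: nested lists replace Subset instances, the helper name load_dir_as_subsets is hypothetical, and processing files in sorted name order is an assumption.

```python
import os
import tempfile


def load_dir_as_subsets(path, delimiter=","):
    """Read every file in `path` and return one list of rows per file."""
    subsets = []
    for name in sorted(os.listdir(path)):  # sorted order is an assumption
        with open(os.path.join(path, name)) as f:
            rows = [[float(v) for v in line.split(delimiter)]
                    for line in f if line.strip()]
        subsets.append(rows)
    return subsets


# Build a throwaway directory with two small files to exercise the helper.
with tempfile.TemporaryDirectory() as tmp:
    for name, text in [("a.txt", "1,2\n3,4\n"), ("b.txt", "5,6\n")]:
        with open(os.path.join(tmp, name), "w") as f:
            f.write(text)
    subsets = load_dir_as_subsets(tmp)

print(len(subsets))               # 2
print([len(s) for s in subsets])  # [2, 1]
```

With this layout the subset sizes follow the file sizes, so no subset_size parameter is needed.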