dislib.Data¶
Classes¶
-
class
dislib.data.classes.
Dataset
(n_features, sparse=False)[source]¶ Bases:
object
A dataset containing samples and, optionally, labels that can be stored in a distributed manner.
A Dataset behaves like a list of Subset instances, each of which may be a future object stored remotely. Accessing Dataset.labels or Dataset.samples runs collect() and transfers all the data to the local machine.
Parameters: - n_features (int) – Number of features of the samples.
- sparse (boolean, optional (default=False)) – Whether this dataset uses sparse data structures.
Variables: -
append
(subset, n_samples=None)[source]¶ Appends a Subset to this Dataset.
Parameters: - subset (Subset) – Subset to add to this Dataset.
- n_samples (int, optional (default=None)) – Number of samples in subset.
-
extend
(subsets)[source]¶ Appends one or more Subset instances to this Dataset.
Parameters: subsets (list) – A list of Subset instances.
-
labels
¶
-
max_features
()[source]¶ Returns the maximum value of each feature in the dataset. Calling this method may trigger the computation of the maximum and a synchronization.
Returns: max_features – Array representing the maximum value that each feature takes in the dataset. Return type: array, shape = [n_features,]
-
min_features
()[source]¶ Returns the minimum value of each feature in the dataset. Calling this method may trigger the computation of the minimum and a synchronization.
Returns: min_features – Array representing the minimum value that each feature takes in the dataset. Return type: array, shape = [n_features,]
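The per-feature minimum and maximum described above can be illustrated with a minimal pure-Python sketch. It is not dislib's implementation: plain lists stand in for distributed Subsets, and the helper names subset_min_max and merge_min_max are hypothetical. The idea is that each subset yields partial results, which are then merged into dataset-wide values.

```python
# Illustrative sketch only: plain lists stand in for distributed Subsets.
# subset_min_max and merge_min_max are hypothetical names, not dislib API.

def subset_min_max(samples):
    """Return (mins, maxs) per feature for one subset (a list of rows)."""
    columns = list(zip(*samples))  # transpose rows into feature columns
    return [min(c) for c in columns], [max(c) for c in columns]


def merge_min_max(partials):
    """Combine per-subset (mins, maxs) pairs into dataset-wide values."""
    all_mins = [p[0] for p in partials]
    all_maxs = [p[1] for p in partials]
    mins = [min(col) for col in zip(*all_mins)]
    maxs = [max(col) for col in zip(*all_maxs)]
    return mins, maxs


subsets = [
    [[1.0, 5.0], [3.0, 2.0]],
    [[0.5, 7.0], [4.0, 6.0]],
]
partials = [subset_min_max(s) for s in subsets]
min_features, max_features = merge_min_max(partials)
print(min_features)  # [0.5, 2.0]
print(max_features)  # [4.0, 7.0]
```

Both results have one entry per feature, matching the documented shape [n_features,].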
-
samples
¶
-
sparse
¶
-
subset_size
(index)[source]¶ Returns the number of samples in the Subset referenced by index. If the size is unknown, this method performs a synchronization on Subset.samples.shape[0].
Parameters: index (int) – Index of the Subset. Returns: n_samples – Number of samples. Return type: int
-
class
dislib.data.classes.
Subset
(samples, labels=None)[source]¶ Bases:
object
A subset of data for machine learning.
Parameters: - samples (ndarray) – Array of shape (n_samples, n_features).
- labels (ndarray, optional) – Array of shape (n_samples,).
Variables: -
concatenate
(subset)[source]¶ Vertically concatenates this Subset to another.
Parameters: subset (Subset) – Subset to concatenate.
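As a rough illustration of vertical concatenation, the sketch below stacks the rows of one subset below the rows of another. TinySubset and concatenate are hypothetical stand-ins that use lists instead of ndarrays. The sketch returns a new object and drops labels when either input lacks them; both choices are assumptions, since this page does not say how Subset.concatenate handles those cases.

```python
# Illustrative sketch only: TinySubset is a hypothetical stand-in for Subset.

class TinySubset:
    """Holds samples (a list of rows) and optional labels (a list)."""

    def __init__(self, samples, labels=None):
        self.samples = samples
        self.labels = labels


def concatenate(first, second):
    """Stack the rows of `second` below the rows of `first`."""
    samples = first.samples + second.samples
    if first.labels is not None and second.labels is not None:
        labels = first.labels + second.labels
    else:
        labels = None  # assumption: labels are kept only if both have them
    return TinySubset(samples, labels)


a = TinySubset([[1, 2], [3, 4]], labels=[0, 1])
b = TinySubset([[5, 6]], labels=[1])
merged = concatenate(a, b)
print(merged.samples)  # [[1, 2], [3, 4], [5, 6]]
print(merged.labels)   # [0, 1, 1]
```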
Functions¶
-
dislib.data.base.
load_data
(x, subset_size, y=None)[source]¶ Loads data into a Dataset.
Parameters: - x (ndarray, shape=[n_samples, n_features]) – Array of samples.
- y (ndarray, optional, shape=[n_samples,]) – Array of labels.
- subset_size (int) – Subset size in number of samples.
Returns: dataset – A distributed representation of the data divided into Subsets of subset_size.
Return type: Dataset
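The partitioning that load_data describes can be sketched in plain Python. This is not dislib's code: split_into_subsets is a hypothetical helper, lists stand in for ndarrays, and letting the last subset be smaller when the number of samples is not a multiple of subset_size is an assumption.

```python
# Illustrative sketch only: split_into_subsets is a hypothetical helper.

def split_into_subsets(x, subset_size, y=None):
    """Split samples (and labels, if given) into chunks of subset_size."""
    subsets = []
    for start in range(0, len(x), subset_size):
        end = start + subset_size
        labels = y[start:end] if y is not None else None
        subsets.append((x[start:end], labels))
    return subsets


x = [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]
y = [0, 1, 0, 1, 0]
subsets = split_into_subsets(x, subset_size=2, y=y)
print(len(subsets))  # 3
print(subsets[-1])   # ([[8, 9]], [0])
```

With five samples and subset_size=2, the sketch produces two full subsets and one subset holding the remaining sample.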
-
dislib.data.base.
load_libsvm_file
(path, subset_size, n_features, store_sparse=True)[source]¶ Loads a LibSVM file into a Dataset.
Parameters: - path (string) – File path.
- subset_size (int) – Subset size in lines.
- n_features (int) – Number of features.
- store_sparse (boolean, optional (default=True)) – Whether to use scipy.sparse data structures to store data. If False, numpy.array is used instead.
Returns: dataset – A distributed representation of the data divided into Subsets of subset_size. Return type: Dataset
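For readers unfamiliar with the LibSVM text format, each line holds a label followed by index:value pairs, with indices conventionally starting at 1. The sketch below parses one such line into a dense row of n_features values, roughly the dense case that store_sparse=False refers to. parse_libsvm_line is a hypothetical helper, not part of dislib.

```python
# Illustrative sketch only: parse_libsvm_line is a hypothetical helper.

def parse_libsvm_line(line, n_features):
    """Parse 'label index:value ...' into (label, dense row)."""
    fields = line.split()
    label = float(fields[0])
    row = [0.0] * n_features  # features absent from the line stay at zero
    for field in fields[1:]:
        index, value = field.split(":")
        row[int(index) - 1] = float(value)  # LibSVM indices start at 1
    return label, row


label, row = parse_libsvm_line("1 1:0.5 3:2.0", n_features=4)
print(label)  # 1.0
print(row)    # [0.5, 0.0, 2.0, 0.0]
```

Because features that do not appear on a line are implicitly zero, n_features has to be supplied by the caller rather than inferred from a single line.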
-
dislib.data.base.
load_libsvm_files
(path, n_features, store_sparse=True)[source]¶ Loads a set of LibSVM files into a Dataset.
Parameters: - path (string) – Path to a directory containing LibSVM files.
- n_features (int) – Number of features.
- store_sparse (boolean, optional (default=True)) – Whether to use scipy.sparse data structures to store data. If False, numpy.array is used instead.
Returns: dataset – A distributed representation of the data divided into one Subset for each file in path. Return type: Dataset
-
dislib.data.base.
load_txt_file
(path, subset_size, n_features, delimiter=',', label_col=None)[source]¶ Loads a text file into a Dataset.
Parameters: - path (string) – File path.
- subset_size (int) – Subset size in lines.
- n_features (int) – Number of features.
- delimiter (string, optional (default=",")) – String that separates features in the file.
- label_col (string, optional (default=None)) – Column representing data labels. Can be 'first' or 'last'.
Returns: dataset – A distributed representation of the data divided into Subsets of subset_size. Return type: Dataset
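The effect of delimiter and label_col on a single line of text can be sketched as follows. parse_txt_line is a hypothetical helper written for illustration and is not dislib's parser. Treating every field as a float and raising an error for other label_col values are assumptions.

```python
# Illustrative sketch only: parse_txt_line is a hypothetical helper.

def parse_txt_line(line, delimiter=",", label_col=None):
    """Split one line into (features, label) according to label_col."""
    values = [float(v) for v in line.strip().split(delimiter)]
    if label_col == "first":
        return values[1:], values[0]
    if label_col == "last":
        return values[:-1], values[-1]
    if label_col is None:
        return values, None  # no label column: every field is a feature
    raise ValueError("label_col must be 'first', 'last' or None")


print(parse_txt_line("1.0,2.0,3.0", label_col="last"))   # ([1.0, 2.0], 3.0)
print(parse_txt_line("1.0,2.0,3.0", label_col="first"))  # ([2.0, 3.0], 1.0)
print(parse_txt_line("1.0,2.0,3.0"))                     # ([1.0, 2.0, 3.0], None)
```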
-
dislib.data.base.
load_txt_files
(path, n_features, delimiter=',', label_col=None)[source]¶ Loads a set of text files into a Dataset.
Parameters: - path (string) – Path to a directory containing text files.
- n_features (int) – Number of features.
- delimiter (string, optional (default=",")) – String that separates features in the file.
- label_col (string, optional (default=None)) – Column representing data labels. Can be 'first' or 'last'.
Returns: dataset – A distributed representation of the data divided into one Subset for each file in path. Return type: Dataset
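The one-Subset-per-file layout can be pictured with the following sketch, which reads every file in a directory into its own list of rows. read_rows and load_directory are hypothetical helpers, not dislib functions, and visiting the files in sorted name order is an assumption. The example writes two small files to a temporary directory so that it is self-contained.

```python
# Illustrative sketch only: read_rows and load_directory are hypothetical.
import os
import tempfile


def read_rows(path, delimiter=","):
    """Read one delimited text file into a list of float rows."""
    with open(path) as f:
        return [[float(v) for v in line.split(delimiter)]
                for line in f if line.strip()]


def load_directory(path, delimiter=","):
    """Build one list of rows (one 'subset') per file in the directory."""
    subsets = []
    for name in sorted(os.listdir(path)):  # assumption: sorted file order
        subsets.append(read_rows(os.path.join(path, name), delimiter))
    return subsets


with tempfile.TemporaryDirectory() as tmp:
    for name, text in [("a.txt", "1,2\n3,4\n"), ("b.txt", "5,6\n")]:
        with open(os.path.join(tmp, name), "w") as f:
            f.write(text)
    subsets = load_directory(tmp)

print(subsets)  # [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0]]]
```

Unlike the single-file loaders, subset sizes here follow the file sizes, which is why no subset_size parameter appears in the signature.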