dislib.utils

Functions

dislib.utils.base.as_grid(dataset, n_regions, dimensions=None, return_indices=False)

Arranges samples in an n-dimensional grid where each Subset contains the samples lying in one region of the feature space. The feature space is divided into n_regions equally sized regions along each dimension, based on the minimum and maximum values of each feature in the dataset.

Parameters:
  • dataset (Dataset) – Input data.
  • n_regions (int) – Number of regions per dimension in which to split the feature space.
  • dimensions (iterable, optional (default=None)) – Integer indices of the dimensions to split. If None, all dimensions are split.
  • return_indices (boolean, optional (default=False)) – Whether to return sorting indices.
Returns:

  • grid_data (Dataset) – A new Dataset with one Subset per region in the feature space.
  • index_array (array, shape = [n_samples, ]) – Array of indices that sort the samples in grid_data back into the order they have in the input Dataset. Returned only if return_indices is True.
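
The region assignment and the role of index_array can be illustrated with a minimal pure-Python sketch. It is not dislib's implementation: grid_sketch is a hypothetical helper, samples are plain tuples rather than a Dataset, and only non-empty regions are kept, which may differ from what as_grid returns.

```python
from collections import defaultdict


def grid_sketch(samples, n_regions, dimensions=None):
    """Hypothetical helper: group samples by grid region.

    Returns (grid, index_array), where grid maps a region key to its
    samples and index_array maps each original position to its position
    in the flattened grid.
    """
    n_features = len(samples[0])
    dims = list(range(n_features)) if dimensions is None else list(dimensions)
    mins = {d: min(s[d] for s in samples) for d in dims}
    maxs = {d: max(s[d] for s in samples) for d in dims}

    def region_of(sample):
        key = []
        for d in dims:
            width = (maxs[d] - mins[d]) / n_regions
            if width == 0:
                key.append(0)  # constant feature: a single region
            else:
                # Clamp so the maximum value falls in the last region.
                idx = int((sample[d] - mins[d]) / width)
                key.append(min(idx, n_regions - 1))
        return tuple(key)

    regions = defaultdict(list)
    for i, sample in enumerate(samples):
        regions[region_of(sample)].append(i)

    # Original indices in the order they appear in the grid.
    order = [i for key in sorted(regions) for i in regions[key]]

    # Invert the permutation: index_array[j] is the grid position of
    # the j-th input sample.
    index_array = [0] * len(samples)
    for pos, i in enumerate(order):
        index_array[i] = pos

    grid = {key: [samples[i] for i in regions[key]] for key in sorted(regions)}
    return grid, index_array
```

Flattening the grid and indexing it with index_array recovers the input order, which is the property described for index_array above.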

dislib.utils.base.resample(dataset, n_samples, random_state=None)

Resamples a dataset without replacement.

Parameters:
  • dataset (Dataset) – Input data.
  • n_samples (int) – Number of samples to generate.
  • random_state (int or RandomState, optional (default=None)) – Seed or numpy.random.RandomState instance to use in the generation of random numbers.
Returns:

resampled_data – Resampled dataset. The number of subsets in the returned dataset is less than or equal to the number of subsets in the input dataset.

Return type:

Dataset
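
As a rough illustration of sampling without replacement across subsets, the following sketch uses only the Python standard library. resample_sketch is a hypothetical helper operating on lists of lists, not on a Dataset, and dropping subsets that contribute no samples is an assumption about why the output may have fewer subsets than the input.

```python
import random


def resample_sketch(subsets, n_samples, seed=None):
    """Hypothetical helper: draw n_samples items without replacement."""
    rng = random.Random(seed)

    # Remember which subset each sample came from.
    pairs = [(i, s) for i, subset in enumerate(subsets) for s in subset]

    # random.sample never picks the same element twice.
    chosen = rng.sample(pairs, n_samples)

    # Regroup by source subset; subsets with no selected samples
    # are dropped (an assumption of this sketch).
    out = {}
    for i, s in chosen:
        out.setdefault(i, []).append(s)
    return [out[i] for i in sorted(out)]
```

Passing the same seed twice yields the same selection, mirroring the purpose of the random_state parameter.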

dislib.utils.base.shuffle(dataset_in, n_subsets_out=None, random_state=None)

Randomly shuffles a Dataset.

Parameters:
  • dataset_in (Dataset) – Input Dataset.
  • n_subsets_out (int, optional (default=None)) – Number of Subsets in the shuffled dataset. If None, it is the same as in the input Dataset.
  • random_state (int or RandomState, optional (default=None)) – Seed or numpy.random.RandomState instance to use in the generation of random numbers.
Returns:

shuffled_dataset – A new randomly shuffled Dataset with n_subsets_out balanced Subsets. If the samples cannot be split evenly, some Subsets contain one extra instance. These extra instances are distributed evenly so that k-fold splits (with k a divisor of the number of Subsets) are as balanced as possible.

Return type:

Dataset
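
The balancing rule can be sketched in plain Python. shuffle_sketch is a hypothetical helper working on lists of lists, not dislib's implementation. For simplicity it gives the extra instances to the first subsets, whereas the description above says dislib spreads them evenly across the output.

```python
import random


def shuffle_sketch(subsets, n_subsets_out=None, seed=None):
    """Hypothetical helper: shuffle samples into balanced subsets."""
    rng = random.Random(seed)
    samples = [s for subset in subsets for s in subset]
    rng.shuffle(samples)

    k = len(subsets) if n_subsets_out is None else n_subsets_out
    base, extra = divmod(len(samples), k)

    out, start = [], 0
    for i in range(k):
        # The first `extra` subsets receive one additional sample
        # (a simplification of the even distribution described above).
        size = base + (1 if i < extra else 0)
        out.append(samples[start:start + size])
        start += size
    return out
```

With 10 samples and n_subsets_out=3, the subset sizes are 4, 3 and 3, so no two subsets differ by more than one instance.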