dislib.model_selection

class dislib.model_selection.GridSearchCV(estimator, param_grid, scoring=None, cv=None, refit=True)

Bases: BaseSearchCV

Exhaustive search over specified parameter values for an estimator.

GridSearchCV implements a “fit” and a “score” method.

The parameters of the estimator used to apply these methods are optimized by cross-validated grid-search over a parameter grid.

Parameters
  • estimator (estimator object.) – This is assumed to implement the scikit-learn estimator interface. Either estimator needs to provide a score function, or scoring must be passed.

  • param_grid (dict or list of dictionaries) – Dictionary with parameter names (string) as keys and lists of parameter settings to try as values, or a list of such dictionaries, in which case the grids spanned by each dictionary in the list are explored. This enables searching over any sequence of parameter settings.

  • scoring (callable, dict or None, optional (default=None)) – A callable to evaluate the predictions on the test set. It should take 3 parameters, estimator, x and y, and return a score (higher meaning better). For evaluating multiple metrics, give a dict with names as keys and callables as values. If None, the estimator’s score method is used.

  • cv (int or cv generator, optional (default=None)) – Determines the cross-validation splitting strategy. Possible inputs for cv are:

      - None, to use the default 5-fold cross validation,
      - integer, to specify the number of folds in a KFold,
      - custom cv generator.

  • refit (boolean, string, or callable, optional (default=True)) – Refit an estimator using the best found parameters on the whole dataset. For multiple metric evaluation, this needs to be a string denoting the scorer that would be used to find the best parameters for refitting the estimator at the end. Where there are considerations other than maximum score in choosing a best estimator, refit can be set to a function which returns the selected best_index_ given cv_results_. The refitted estimator is made available at the best_estimator_ attribute and permits using predict directly on this GridSearchCV instance. Also for multiple metric evaluation, the attributes best_index_, best_score_ and best_params_ will only be available if refit is set, and all of them will be determined with respect to this specific scorer. best_score_ is not returned if refit is callable. See the scoring parameter to know more about multiple metric evaluation.
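
The way a param_grid is turned into candidate settings can be illustrated with a small, library-independent sketch. The helper name expand_grid is hypothetical and is not part of dislib; it only assumes the behaviour described for param_grid above (one dict spans a full grid, a list of dicts spans the union of their grids).

```python
from itertools import product


def expand_grid(param_grid):
    """Return every parameter combination described by param_grid.

    Hypothetical helper for illustration only, not a dislib function.
    """
    # Accept a single dict or a list of dicts, as the parameter allows.
    grids = [param_grid] if isinstance(param_grid, dict) else list(param_grid)
    candidates = []
    for grid in grids:
        names = sorted(grid)  # fixed key order keeps the result deterministic
        for values in product(*(grid[name] for name in names)):
            candidates.append(dict(zip(names, values)))
    return candidates


# One dict: the full cross product of both parameters.
single = expand_grid({'n_estimators': (2, 4), 'max_depth': range(3, 5)})

# A list of dicts: each grid is explored separately, so 'degree' is only
# combined with the 'poly' kernel.
multi = expand_grid([{'kernel': ['poly'], 'degree': [2, 3]},
                     {'kernel': ['rbf']}])
```

Using a list of dictionaries avoids evaluating combinations that make no sense, such as a polynomial degree for a kernel that ignores it.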

Examples

>>> import dislib as ds
>>> from dislib.model_selection import GridSearchCV
>>> from dislib.classification import RandomForestClassifier
>>> import numpy as np
>>> from sklearn import datasets
>>>
>>>
>>> if __name__ == '__main__':
>>>     x_np, y_np = datasets.load_iris(return_X_y=True)
>>>     x = ds.array(x_np, (30, 4))
>>>     y = ds.array(y_np[:, np.newaxis], (30, 1))
>>>     param_grid = {'n_estimators': (2, 4), 'max_depth': range(3, 5)}
>>>     rf = RandomForestClassifier()
>>>     searcher = GridSearchCV(rf, param_grid)
>>>     searcher.fit(x, y)
>>>     searcher.cv_results_
Variables
  • cv_results (dict of numpy (masked) ndarrays) –

    A dict with keys as column headers and values as columns, that can be imported into a pandas DataFrame. For instance the below given table:

    param_kernel  param_degree  split0_test_score  rank_t…
    'poly'        2             0.80               2
    'poly'        3             0.70               4
    'rbf'                       0.80               3
    'rbf'                       0.93               1

    will be represented by a cv_results_ dict of:

    {
    'param_kernel': masked_array(data = ['poly', 'poly', 'rbf', 'rbf'],
                                 mask = [False False False False]...),
    'param_degree': masked_array(data = [2.0 3.0 -- --],
                                 mask = [False False  True  True]...),
    'split0_test_score'  : [0.80, 0.70, 0.80, 0.93],
    'split1_test_score'  : [0.82, 0.50, 0.68, 0.78],
    'split2_test_score'  : [0.79, 0.55, 0.71, 0.93],
    ...
    'mean_test_score'    : [0.81, 0.60, 0.75, 0.85],
    'std_test_score'     : [0.01, 0.10, 0.05, 0.08],
    'rank_test_score'    : [2, 4, 3, 1],
    'params'             : [{'kernel': 'poly', 'degree': 2}, ...],
    }
    

    NOTES:

    The key 'params' is used to store a list of parameter settings dicts for all the parameter candidates.

    The mean_fit_time, std_fit_time, mean_score_time and std_score_time are all in seconds.

    For multi-metric evaluation, the scores for all the scorers are available in the cv_results_ dict at the keys ending with that scorer’s name ('_<scorer_name>') instead of '_score' shown above (‘split0_test_precision’, ‘mean_train_precision’ etc.).

  • best_estimator (estimator or dict) – Estimator that was chosen by the search, i.e. estimator which gave highest score (or smallest loss if specified) on the left out data. Not available if refit=False. See refit parameter for more information on allowed values.

  • best_score (float) – Mean cross-validated score of the best_estimator. For multi-metric evaluation, this is present only if refit is specified.

  • best_params (dict) – Parameter setting that gave the best results on the hold out data. For multi-metric evaluation, this is present only if refit is specified.

  • best_index (int) – The index (of the cv_results_ arrays) which corresponds to the best candidate parameter setting. The dict at search.cv_results_['params'][search.best_index_] gives the parameter setting for the best model, that gives the highest mean score (search.best_score_). For multi-metric evaluation, this is present only if refit is specified.

  • scorer (function or a dict) – Scorer function used on the held out data to choose the best parameters for the model. For multi-metric evaluation, this attribute holds the validated scoring dict which maps the scorer key to the scorer callable.

  • n_splits (int) – The number of cross-validation splits (folds/iterations).
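
As a sketch of the callable form of refit, the function below receives a cv_results_-like dict and returns the index to use as best_index_. The selection rule (lowest standard deviation among candidates within a tolerance of the best mean score), the function name and the tolerance parameter are all assumptions made for illustration; the dict reuses the example values shown above.

```python
def lowest_std_within_tolerance(cv_results, tolerance=0.05):
    """Pick the most stable candidate among the near-best ones.

    Hypothetical refit callable: it takes cv_results_ and returns an index.
    """
    means = cv_results['mean_test_score']
    stds = cv_results['std_test_score']
    best_mean = max(means)
    # Candidates whose mean score is close enough to the best mean score.
    eligible = [i for i, m in enumerate(means) if m >= best_mean - tolerance]
    # Among those, prefer the one with the smallest spread across folds.
    return min(eligible, key=lambda i: stds[i])


cv_results = {
    'mean_test_score': [0.81, 0.60, 0.75, 0.85],
    'std_test_score': [0.01, 0.10, 0.05, 0.08],
}
best_index = lowest_std_within_tolerance(cv_results)
```

Such a function would be passed as refit=lowest_std_within_tolerance. As noted for the refit parameter, best_score_ is not returned in that case.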

fit(x, y=None, **fit_params)

Run fit with all sets of parameters.

Parameters
  • x (ds-array) – Training data samples.

  • y (ds-array, optional (default = None)) – Training data labels or values.

  • **fit_params (dict of string -> object) – Parameters passed to the fit method of the estimator.

class dislib.model_selection.KFold(n_splits=5, shuffle=False, random_state=None)

Bases: object

K-fold splitter for cross-validation

Returns k partitions of the dataset into train and validation datasets. The dataset is split into k folds, after shuffling if shuffle is True; each fold is used once as validation dataset while the k - 1 remaining folds form the training dataset.

Each fold contains n//k or n//k + 1 samples, where n is the number of samples in the input dataset.
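
The fold-size rule can be checked with a few lines of plain Python. This is only a sketch of the arithmetic: the helper fold_sizes is hypothetical, and giving the extra samples to the first folds is an assumption, since the text above only states that each fold has n//k or n//k + 1 samples.

```python
def fold_sizes(n_samples, n_splits):
    """Sizes of the k folds for a dataset with n_samples samples.

    Hypothetical helper; which folds receive the extra sample is assumed.
    """
    base, remainder = divmod(n_samples, n_splits)
    # The first `remainder` folds get one extra sample each.
    return [base + 1 if i < remainder else base for i in range(n_splits)]


sizes = fold_sizes(150, 4)  # 150 samples do not divide evenly into 4 folds
```

For 150 samples and 4 folds this gives two folds of 38 samples and two of 37, which add up to the full dataset.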

Parameters
  • n_splits (int, optional (default=5)) – Number of folds. Must be at least 2.

  • shuffle (boolean, optional (default=False)) – Shuffles and balances the data before splitting into batches.

  • random_state (int, RandomState instance or None, optional, default=None) – If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random. Used when shuffle == True.

get_n_splits()

Get the number of CV splits that this splitter does.

Returns

n_splits – The number of splits performed by this CV splitter.

Return type

int

split(x, y=None)

Generates K-fold splits.

Parameters
  • x (ds-array) – Samples array.

  • y (ds-array, optional (default=None)) – Corresponding labels or values.

Yields
  • train_data (train_x, train_y) – The training ds-arrays for that split. If y is None, train_y is None.

  • test_data (test_x, test_y) – The testing ds-arrays for that split. If y is None, test_y is None.

class dislib.model_selection.RandomizedSearchCV(estimator, param_distributions, n_iter=10, scoring=None, cv=None, refit=True, random_state=None)

Bases: BaseSearchCV

Randomized search on hyper parameters.

RandomizedSearchCV implements a “fit” and a “score” method.

The parameters of the estimator used to apply these methods are optimized by cross-validated search over parameter settings.

In contrast to GridSearchCV, not all parameter values are tried out, but rather a fixed number of parameter settings is sampled from the specified distributions. The number of parameter settings that are tried is given by n_iter.

If all parameters are presented as a list, sampling without replacement is performed. If at least one parameter is given as a distribution, sampling with replacement is used.
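
The two sampling modes can be sketched without any library dependency. Everything below is illustrative: sample_settings and the Exponential class are hypothetical stand-ins (the latter only mimics the rvs method that real distributions provide), and capping n_iter at the grid size when sampling without replacement is an assumption.

```python
import random
from itertools import product


class Exponential:
    """Hypothetical stand-in for a distribution: it only needs rvs()."""

    def __init__(self, scale, seed=None):
        self.scale = scale
        self._rng = random.Random(seed)

    def rvs(self):
        return self._rng.expovariate(1.0 / self.scale)


def sample_settings(param_distributions, n_iter, seed=None):
    """Draw n_iter parameter settings (illustrative sketch)."""
    rng = random.Random(seed)
    names = sorted(param_distributions)
    is_dist = {n: hasattr(param_distributions[n], 'rvs') for n in names}
    if not any(is_dist.values()):
        # Only lists: build the finite grid and sample without replacement.
        grid = [dict(zip(names, values))
                for values in product(*(param_distributions[n] for n in names))]
        return rng.sample(grid, min(n_iter, len(grid)))
    # At least one distribution: draw each setting independently, so the
    # same list value may appear more than once (sampling with replacement).
    settings = []
    for _ in range(n_iter):
        settings.append({
            n: (param_distributions[n].rvs() if is_dist[n]
                else rng.choice(list(param_distributions[n])))
            for n in names
        })
    return settings


only_lists = sample_settings({'a': [1, 2], 'b': [3, 4]}, n_iter=3, seed=0)
with_dist = sample_settings({'c': Exponential(scale=0.5, seed=1),
                             'kernel': ['rbf', 'linear']}, n_iter=5, seed=0)
```

With lists only, no setting is repeated; once a distribution is involved, repeated list values across draws are expected.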

Parameters
  • estimator (estimator object.) – This is assumed to implement the scikit-learn estimator interface. Either estimator needs to provide a score function, or scoring must be passed.

  • param_distributions (dict) – Dictionary with parameter names (string) as keys and distributions or lists of parameters to try. Distributions must provide an rvs method for sampling (such as those from scipy.stats.distributions). If a list is given, it is sampled uniformly.

  • n_iter (int, optional (default=10)) – Number of parameter settings that are sampled.

  • scoring (callable, dict or None, optional (default=None)) – A callable to evaluate the predictions on the test set. It should take 3 parameters, estimator, x and y, and return a score (higher meaning better). For evaluating multiple metrics, give a dict with names as keys and callables as values. If None, the estimator’s score method is used.

  • cv (int or cv generator, optional (default=None)) – Determines the cross-validation splitting strategy. Possible inputs for cv are:

      - None, to use the default 5-fold cross validation,
      - integer, to specify the number of folds in a KFold,
      - custom cv generator.

  • refit (boolean, string, or callable, optional (default=True)) – Refit an estimator using the best found parameters on the whole dataset. For multiple metric evaluation, this needs to be a string denoting the scorer that would be used to find the best parameters for refitting the estimator at the end. Where there are considerations other than maximum score in choosing a best estimator, refit can be set to a function which returns the selected best_index_ given cv_results_. The refitted estimator is made available at the best_estimator_ attribute and permits using predict directly on this RandomizedSearchCV instance. Also for multiple metric evaluation, the attributes best_index_, best_score_ and best_params_ will only be available if refit is set, and all of them will be determined with respect to this specific scorer. best_score_ is not returned if refit is callable. See the scoring parameter to know more about multiple metric evaluation.

  • random_state (int, RandomState instance or None, optional, default=None) – Pseudo random number generator state used for random sampling of params in param_distributions. This is not passed to each estimator. If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.
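
The scoring parameter expects callables with the signature (estimator, x, y) that return a score where higher is better. The sketch below shows that shape with a dict of two scorers. It is deliberately independent of dislib: MajorityClassifier is a hypothetical toy estimator, and plain Python lists stand in for the ds-arrays that a real search would pass.

```python
class MajorityClassifier:
    """Hypothetical toy estimator used only to exercise the scorers."""

    def fit(self, x, y):
        # Remember the most frequent label seen during training.
        self.label_ = max(set(y), key=y.count)
        return self

    def predict(self, x):
        return [self.label_ for _ in x]


def accuracy(estimator, x, y):
    """Fraction of correct predictions (higher is better)."""
    predictions = estimator.predict(x)
    return sum(p == t for p, t in zip(predictions, y)) / len(y)


def negative_error_count(estimator, x, y):
    """Negated number of mistakes, so that higher still means better."""
    predictions = estimator.predict(x)
    return -sum(p != t for p, t in zip(predictions, y))


# Multi-metric evaluation: names as keys, callables as values.
scoring = {'accuracy': accuracy, 'neg_errors': negative_error_count}

x = [[0.0], [1.0], [2.0], [3.0]]
y = [1, 1, 1, 0]
model = MajorityClassifier().fit(x, y)
scores = {name: scorer(model, x, y) for name, scorer in scoring.items()}
```

With a dict like this, refit would need to name one of the keys (for example refit='accuracy') so that a single scorer decides the best parameters.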

Examples

>>> import dislib as ds
>>> from dislib.model_selection import RandomizedSearchCV
>>> from dislib.classification import CascadeSVM
>>> import numpy as np
>>> import scipy.stats as stats
>>> from sklearn import datasets
>>>
>>>
>>> if __name__ == '__main__':
>>>     x_np, y_np = datasets.load_iris(return_X_y=True)
>>>     # Pre-shuffling required for CSVM
>>>     p = np.random.permutation(len(x_np))
>>>     x = ds.array(x_np[p], (30, 4))
>>>     y = ds.array((y_np[p] == 0)[:, np.newaxis], (30, 1))
>>>     param_distributions = {'c': stats.expon(scale=0.5),
>>>                            'gamma': stats.expon(scale=10)}
>>>     csvm = CascadeSVM()
>>>     searcher = RandomizedSearchCV(csvm, param_distributions, n_iter=10)
>>>     searcher.fit(x, y)
>>>     searcher.cv_results_
Variables
  • cv_results (dict of numpy (masked) ndarrays) –

    A dict with keys as column headers and values as columns, that can be imported into a pandas DataFrame.

    For instance the below given table:

    param_c  param_gamma  split0_test_score  rank_test_score
    0.193    1.883        0.82               3
    1.452    0.327        0.81               2
    0.926    3.452        0.94               1

    will be represented by a cv_results_ dict of:

    {
    'param_c'      : masked_array(data = [0.193 1.452 0.926], mask = False),
    'param_gamma'  : masked_array(data = [1.883 0.327 3.452], mask = False),
    'split0_test_score'  : [0.82, 0.81, 0.94],
    'split1_test_score'  : [0.66, 0.75, 0.79],
    'split2_test_score'  : [0.82, 0.87, 0.84],
    ...
    'mean_test_score'    : [0.76, 0.84, 0.86],
    'std_test_score'     : [0.01, 0.20, 0.04],
    'rank_test_score'    : [3, 2, 1],
    'params'             : [{'c' : 0.193, 'gamma' : 1.883}, ...],
    }
    

    NOTE

    The key 'params' is used to store a list of parameter settings dicts for all the parameter candidates.

    The mean_fit_time, std_fit_time, mean_score_time and std_score_time are all in seconds.

    For multi-metric evaluation, the scores for all the scorers are available in the cv_results_ dict at the keys ending with that scorer’s name ('_<scorer_name>') instead of '_score' shown above. (‘split0_test_precision’, ‘mean_train_precision’ etc.)

  • best_estimator (estimator or dict) –

    Estimator that was chosen by the search, i.e. estimator which gave highest score (or smallest loss if specified) on the left out data. Not available if refit=False.

    For multi-metric evaluation, this attribute is present only if refit is specified.

    See refit parameter for more information on allowed values.

  • best_score (float) –

    Mean cross-validated score of the best_estimator.

    For multi-metric evaluation, this is not available if refit is False. See refit parameter for more information.

  • best_params (dict) –

    Parameter setting that gave the best results on the hold out data.

    For multi-metric evaluation, this is not available if refit is False. See refit parameter for more information.

  • best_index (int) –

    The index (of the cv_results_ arrays) which corresponds to the best candidate parameter setting.

    The dict at search.cv_results_['params'][search.best_index_] gives the parameter setting for the best model, that gives the highest mean score (search.best_score_).

    For multi-metric evaluation, this is not available if refit is False. See refit parameter for more information.

  • scorer (function or a dict) –

    Scorer function used on the held out data to choose the best parameters for the model.

    For multi-metric evaluation, this attribute holds the validated scoring dict which maps the scorer key to the scorer callable.

  • n_splits (int) – The number of cross-validation splits (folds/iterations).
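
The relationship between best_index_, best_params_ and best_score_ can be traced by hand on a cv_results_-like dict. The sketch below uses plain lists rather than numpy arrays, and the second and third params entries are filled in from the table rows above for illustration; none of this is dislib code.

```python
cv_results = {
    'mean_test_score': [0.76, 0.84, 0.86],
    'rank_test_score': [3, 2, 1],
    'params': [{'c': 0.193, 'gamma': 1.883},
               {'c': 1.452, 'gamma': 0.327},
               {'c': 0.926, 'gamma': 3.452}],
}

# The best candidate is the one ranked first.
best_index = cv_results['rank_test_score'].index(1)
# Its parameter setting and mean score are read from the same position.
best_params = cv_results['params'][best_index]
best_score = cv_results['mean_test_score'][best_index]
```

Every column of cv_results_ is indexed by candidate, so one index is enough to look up everything recorded for the winning setting.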

fit(x, y=None, **fit_params)

Run fit with all sets of parameters.

Parameters
  • x (ds-array) – Training data samples.

  • y (ds-array, optional (default = None)) – Training data labels or values.

  • **fit_params (dict of string -> object) – Parameters passed to the fit method of the estimator.

class dislib.model_selection.SimulationGridSearch(estimator, param_grid, sim_number=1, order='max')

Bases: object

Exhaustive execution of all combinations of specified parameters values in parallel simulations.

SimulationGridSearch implements a “fit” method.

Parameters
  • estimator (simulator object.) – This should receive the parameters specified in param_grid and use those parameters for the corresponding operation.

  • param_grid (dict or list of dictionaries) – Dictionary with parameter names (string) as keys and lists of parameter settings to try as values, or a list of such dictionaries, in which case the grids spanned by each dictionary in the list are explored. This enables searching over any sequence of parameter settings.

  • sim_number (Integer) – Number of simulations to execute for each parameter combination.

  • order (string “max” or “min”.) – String that specifies how to order the results obtained from the simulations: “max” places the highest values first and “min” places the lowest values first.
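
The roles of sim_number and order can be sketched in plain Python. The function run_grid below is a hypothetical, sequential stand-in written for illustration (the real class runs the simulations in parallel and exposes its results through attributes); it assumes that repeated runs of one combination are averaged, as described for best_score_.

```python
from itertools import product
from statistics import mean


def my_simulation(a, b):
    return (a * a) / (b * b) + a * (a + b) - b * (2 * b)


def run_grid(simulation, param_grid, sim_number=1, order='max'):
    """Run every combination sim_number times and sort by the mean result.

    Hypothetical sequential sketch, not the dislib implementation.
    """
    names = sorted(param_grid)
    results = []
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        runs = [simulation(**params) for _ in range(sim_number)]
        results.append((mean(runs), params))
    # 'max' puts the highest mean first, 'min' the lowest.
    results.sort(key=lambda item: item[0], reverse=(order == 'max'))
    return results


param_grid = {'a': [-1.1, -0.1, 1.5, 2.5], 'b': [0.1, 1.5, 2.5, 3.5]}
lowest_first = run_grid(my_simulation, param_grid, order='min')
highest_first = run_grid(my_simulation, param_grid, sim_number=3, order='max')
```

The first entry of each sorted list plays the role of the best result: with order='min' it is the combination with the smallest value, with order='max' the largest.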

Examples

>>> import dislib as ds
>>> from dislib.model_selection import SimulationGridSearch
>>> from dislib.classification import RandomForestClassifier
>>> import numpy as np
>>> from sklearn import datasets
>>>
>>>
>>> def my_simulation(a, b):
>>>     return (a*a)/(b*b)+a*(a+b)-b*(2*b)
>>>
>>>
>>> param_grid = {'a': [-1.1, -0.1, 1.5, 2.5], 'b': [0.1, 1.5, 2.5, 3.5]}
>>> searcher = SimulationGridSearch(my_simulation, param_grid, order="min")
>>> searcher.fit(None)
>>> best_params = searcher.best_params_
Variables
  • raw_results (list of objects) – List containing the results obtained from the different simulations. In the list the results are saved as returned from the simulation, with no changes in the format.

  • cv_results (dict of numpy (masked) ndarrays) –

    A dict with keys as column headers and values as columns, that can be imported into a pandas DataFrame. For instance, the results can be represented by a cv_results_ dict of:

    {
    'param_a': masked_array(data = [-1.1, -1.1, -0.1, -0.1],
                            mask = [False False False False]...),
    'param_b': masked_array(data = [0.1 1.5 0.1 1.5],
                            mask = [False False False False]...),
    ...
    'mean_test_simulation'    : [122.08, -4.40, 0.98, -4.63],
    'std_test_simulation'     : [0.0, --, --, --],
    'rank_test_score'    : [2, 4, 3, 1],
    'params'             : [{'a': -1.1, 'b': 0.1}, ...],
    }
    

    NOTES:

    The key 'params' is used to store a list of parameter settings dicts for all the parameter combinations used in the simulations.

  • best_score (float) – Best value obtained from a simulation. If each simulation is run several times, the best mean of the obtained values is used.

  • best_params (dict) – Parameter setting that gave the best result.

  • best_index (int) – The index (of the cv_results_ arrays) which corresponds to the best candidate parameter setting. The dict at search.cv_results_['params'][search.best_index_] gives the parameter setting for the best model, that gives the highest mean score (search.best_score_).

fit(x, y=None, **fit_params)

Run fit with all sets of parameters.

Parameters
  • x (ds-array) – Training data samples.

  • y (ds-array, optional (default = None)) – Training data labels or values.

  • **fit_params (dict of string -> object) – Parameters passed to the fit method of the estimator.