dislib.model_selection¶

class dislib.model_selection.GridSearchCV(estimator, param_grid, scoring=None, cv=None, refit=True)[source]¶

Bases: dislib.model_selection._search.BaseSearchCV
Exhaustive search over specified parameter values for an estimator.
GridSearchCV implements a “fit” and a “score” method.
The parameters of the estimator used to apply these methods are optimized by cross-validated grid-search over a parameter grid.
Parameters:
- estimator (estimator object) – This is assumed to implement the scikit-learn estimator interface. Either estimator needs to provide a score function, or scoring must be passed.
- param_grid (dict or list of dictionaries) – Dictionary with parameter names (strings) as keys and lists of parameter settings to try as values, or a list of such dictionaries, in which case the grids spanned by each dictionary in the list are explored. This enables searching over any sequence of parameter settings (see the sketch after this parameter list).
- scoring (callable, dict or None, optional (default=None)) – A callable to evaluate the predictions on the test set. It should take 3 parameters, estimator, x and y, and return a score (higher meaning better). For evaluating multiple metrics, give a dict with names as keys and callables as values. If None, the estimator’s score method is used.
- cv (int or cv generator, optional (default=None)) – Determines the cross-validation splitting strategy. Possible inputs for cv are: None, to use the default 5-fold cross-validation; an integer, to specify the number of folds in a KFold; or a custom cv generator.
- refit (boolean, string, or callable, optional (default=True)) – Refit an estimator using the best found parameters on the whole dataset.
For multiple metric evaluation, this needs to be a string denoting the scorer that would be used to find the best parameters for refitting the estimator at the end.
Where there are considerations other than maximum score in choosing a best estimator, refit can be set to a function which returns the selected best_index_ given cv_results_. The refitted estimator is made available at the best_estimator_ attribute and permits using predict directly on this GridSearchCV instance.
Also for multiple metric evaluation, the attributes best_index_, best_score_ and best_params_ will only be available if refit is set, and all of them will be determined w.r.t. this specific scorer. best_score_ is not returned if refit is callable. See the scoring parameter to know more about multiple metric evaluation.
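For illustration, here is a minimal sketch of a grid made of two sub-grids together with a multi-metric scoring dict, refitting on one of the scorers. It assumes dislib's RandomForestClassifier and that predictions can be gathered to NumPy with collect(); the scorer names ('accuracy', 'neg_error') and the helper functions are hypothetical, not part of the API.

import numpy as np
import dislib as ds
from dislib.classification import RandomForestClassifier
from dislib.model_selection import GridSearchCV
from sklearn import datasets


def accuracy(estimator, x, y):
    # Hypothetical scorer: fraction of correctly predicted samples,
    # comparing collected predictions against collected labels.
    y_pred = estimator.predict(x).collect().flatten()
    y_true = y.collect().flatten()
    return float(np.mean(y_pred == y_true))


def neg_error(estimator, x, y):
    # Higher is better, so the error rate is negated.
    return accuracy(estimator, x, y) - 1.0


if __name__ == '__main__':
    x_np, y_np = datasets.load_iris(return_X_y=True)
    x = ds.array(x_np, (30, 4))
    y = ds.array(y_np[:, np.newaxis], (30, 1))

    # A list of dicts spans two independent grids (2 + 4 candidates here).
    param_grid = [{'n_estimators': (2, 4)},
                  {'n_estimators': (2, 4), 'max_depth': range(3, 5)}]

    searcher = GridSearchCV(RandomForestClassifier(), param_grid,
                            scoring={'accuracy': accuracy,
                                     'neg_error': neg_error},
                            cv=3, refit='accuracy')
    searcher.fit(x, y)
    print(searcher.best_params_)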
Examples
>>> import dislib as ds
>>> from dislib.model_selection import GridSearchCV
>>> from dislib.classification import RandomForestClassifier
>>> import numpy as np
>>> from sklearn import datasets
>>>
>>>
>>> if __name__ == '__main__':
>>>     x_np, y_np = datasets.load_iris(return_X_y=True)
>>>     x = ds.array(x_np, (30, 4))
>>>     y = ds.array(y_np[:, np.newaxis], (30, 1))
>>>     param_grid = {'n_estimators': (2, 4), 'max_depth': range(3, 5)}
>>>     rf = RandomForestClassifier()
>>>     searcher = GridSearchCV(rf, param_grid)
>>>     searcher.fit(x, y)
>>>     searcher.cv_results_
Variables:
- cv_results (dict of numpy (masked) ndarrays) – A dict with keys as column headers and values as columns, that can be imported into a pandas DataFrame. For instance, the table below

param_kernel  param_degree  split0_test_score  …  rank_test_score
'poly'        2             0.80               …  2
'poly'        3             0.70               …  4
'rbf'         –             0.80               …  3
'rbf'         –             0.93               …  1

will be represented by a cv_results_ dict of:

{
    'param_kernel': masked_array(data = ['poly', 'poly', 'rbf', 'rbf'],
                                 mask = [False False False False]...),
    'param_degree': masked_array(data = [2.0 3.0 -- --],
                                 mask = [False False True True]...),
    'split0_test_score': [0.80, 0.70, 0.80, 0.93],
    'split1_test_score': [0.82, 0.50, 0.68, 0.78],
    'split2_test_score': [0.79, 0.55, 0.71, 0.93],
    ...
    'mean_test_score': [0.81, 0.60, 0.75, 0.85],
    'std_test_score': [0.01, 0.10, 0.05, 0.08],
    'rank_test_score': [2, 4, 3, 1],
    'params': [{'kernel': 'poly', 'degree': 2}, ...],
}
NOTES:
The key 'params' is used to store a list of parameter settings dicts for all the parameter candidates.
The mean_fit_time, std_fit_time, mean_score_time and std_score_time are all in seconds.
For multi-metric evaluation, the scores for all the scorers are available in the cv_results_ dict at the keys ending with that scorer's name ('_<scorer_name>') instead of the '_score' shown above ('split0_test_precision', 'mean_train_precision', etc.).
- best_estimator (estimator or dict) – Estimator that was chosen by the search, i.e. the estimator which gave the highest score (or smallest loss if specified) on the left-out data. Not available if refit=False. See the refit parameter for more information on allowed values.
- best_score (float) – Mean cross-validated score of the best_estimator. For multi-metric evaluation, this is present only if refit is specified.
- best_params (dict) – Parameter setting that gave the best results on the hold-out data. For multi-metric evaluation, this is present only if refit is specified.
- best_index (int) – The index (of the cv_results_ arrays) which corresponds to the best candidate parameter setting. The dict at search.cv_results_['params'][search.best_index_] gives the parameter setting for the best model, which gives the highest mean score (search.best_score_). For multi-metric evaluation, this is present only if refit is specified (see also the sketch after this list).
- scorer (function or a dict) – Scorer function used on the held-out data to choose the best parameters for the model. For multi-metric evaluation, this attribute holds the validated scoring dict which maps the scorer key to the scorer callable.
- n_splits (int) – The number of cross-validation splits (folds/iterations).
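Continuing the searcher fitted in the Examples block above, here is a short sketch of inspecting cv_results_ and the refitted estimator. It assumes pandas is available and that the search was run with the default single scorer and refit=True.

import pandas as pd

# Tabulate the per-candidate results (the dict keys become DataFrame columns).
results = pd.DataFrame(searcher.cv_results_)
print(results[['params', 'mean_test_score', 'rank_test_score']])

# best_index_ points at the winning row; best_estimator_ is already refitted.
print(searcher.cv_results_['params'][searcher.best_index_])
y_pred = searcher.best_estimator_.predict(x)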
class dislib.model_selection.RandomizedSearchCV(estimator, param_distributions, n_iter=10, scoring=None, cv=None, refit=True, random_state=None)[source]¶

Bases: dislib.model_selection._search.BaseSearchCV
Randomized search on hyper parameters.
RandomizedSearchCV implements a “fit” and a “score” method.
The parameters of the estimator used to apply these methods are optimized by cross-validated search over parameter settings.
In contrast to GridSearchCV, not all parameter values are tried out, but rather a fixed number of parameter settings is sampled from the specified distributions. The number of parameter settings that are tried is given by n_iter.
If all parameters are presented as a list, sampling without replacement is performed. If at least one parameter is given as a distribution, sampling with replacement is used.
Parameters:
- estimator (estimator object) – This is assumed to implement the scikit-learn estimator interface. Either estimator needs to provide a score function, or scoring must be passed.
- param_distributions (dict) – Dictionary with parameter names (strings) as keys and distributions or lists of parameters to try. Distributions must provide a rvs method for sampling (such as those from scipy.stats.distributions). If a list is given, it is sampled uniformly (see the sketch after this parameter list).
- n_iter (int, optional (default=10)) – Number of parameter settings that are sampled.
- scoring (callable, dict or None, optional (default=None)) – A callable to evaluate the predictions on the test set. It should take 3 parameters, estimator, x and y, and return a score (higher meaning better). For evaluating multiple metrics, give a dict with names as keys and callables as values. If None, the estimator’s score method is used.
- cv (int or cv generator, optional (default=None)) – Determines the cross-validation splitting strategy. Possible inputs for cv are: None, to use the default 5-fold cross-validation; an integer, to specify the number of folds in a KFold; or a custom cv generator.
- refit (boolean, string, or callable, optional (default=True)) – Refit an estimator using the best found parameters on the whole dataset.
For multiple metric evaluation, this needs to be a string denoting the scorer that would be used to find the best parameters for refitting the estimator at the end.
Where there are considerations other than maximum score in choosing a best estimator, refit can be set to a function which returns the selected best_index_ given cv_results_. The refitted estimator is made available at the best_estimator_ attribute and permits using predict directly on this RandomizedSearchCV instance.
Also for multiple metric evaluation, the attributes best_index_, best_score_ and best_params_ will only be available if refit is set, and all of them will be determined w.r.t. this specific scorer. best_score_ is not returned if refit is callable. See the scoring parameter to know more about multiple metric evaluation.
- random_state (int, RandomState instance or None, optional (default=None)) – Pseudo random number generator state used for random sampling of params in param_distributions. This is not passed to each estimator. If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.
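As a construction sketch (a full run is shown under Examples below), the dictionary here mixes a plain list with scipy.stats distributions, so sampling of the n_iter settings is done with replacement, and it passes a KFold instance as the custom cv generator. It assumes dislib's CascadeSVM accepts kernel, c and gamma parameters; treat those names as assumptions.

import scipy.stats as stats
from dislib.classification import CascadeSVM
from dislib.model_selection import KFold, RandomizedSearchCV

# 'kernel' is a plain list (sampled uniformly); 'c' and 'gamma' provide rvs,
# so the 10 sampled settings are drawn with replacement.
param_distributions = {'kernel': ['linear', 'rbf'],
                       'c': stats.expon(scale=0.5),
                       'gamma': stats.expon(scale=10)}

# A custom cv generator can be passed instead of an integer fold count.
cv = KFold(n_splits=3, shuffle=True, random_state=0)
searcher = RandomizedSearchCV(CascadeSVM(), param_distributions,
                              n_iter=10, cv=cv, random_state=0)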
Examples
>>> import dislib as ds
>>> from dislib.model_selection import RandomizedSearchCV
>>> from dislib.classification import CascadeSVM
>>> import numpy as np
>>> import scipy.stats as stats
>>> from sklearn import datasets
>>>
>>>
>>> if __name__ == '__main__':
>>>     x_np, y_np = datasets.load_iris(return_X_y=True)
>>>     # Pre-shuffling required for CSVM
>>>     p = np.random.permutation(len(x_np))
>>>     x = ds.array(x_np[p], (30, 4))
>>>     y = ds.array((y_np[p] == 0)[:, np.newaxis], (30, 1))
>>>     param_distributions = {'c': stats.expon(scale=0.5),
>>>                            'gamma': stats.expon(scale=10)}
>>>     csvm = CascadeSVM()
>>>     searcher = RandomizedSearchCV(csvm, param_distributions, n_iter=10)
>>>     searcher.fit(x, y)
>>>     searcher.cv_results_
Variables:
- cv_results (dict of numpy (masked) ndarrays) – A dict with keys as column headers and values as columns, that can be imported into a pandas DataFrame. For instance, the table below

param_c  param_gamma  split0_test_score  …  rank_test_score
0.193    1.883        0.82               …  3
1.452    0.327        0.81               …  2
0.926    3.452        0.94               …  1

will be represented by a cv_results_ dict of:

{
    'param_c': masked_array(data = [0.193 1.452 0.926], mask = False),
    'param_gamma': masked_array(data = [1.883 0.327 3.452], mask = False),
    'split0_test_score': [0.82, 0.81, 0.94],
    'split1_test_score': [0.66, 0.75, 0.79],
    'split2_test_score': [0.82, 0.87, 0.84],
    ...
    'mean_test_score': [0.76, 0.84, 0.86],
    'std_test_score': [0.01, 0.20, 0.04],
    'rank_test_score': [3, 2, 1],
    'params': [{'c': 0.193, 'gamma': 1.883}, ...],
}
NOTE:
The key 'params' is used to store a list of parameter settings dicts for all the parameter candidates.
The mean_fit_time, std_fit_time, mean_score_time and std_score_time are all in seconds.
For multi-metric evaluation, the scores for all the scorers are available in the cv_results_ dict at the keys ending with that scorer's name ('_<scorer_name>') instead of the '_score' shown above ('split0_test_precision', 'mean_train_precision', etc.).
- best_estimator (estimator or dict) – Estimator that was chosen by the search, i.e. the estimator which gave the highest score (or smallest loss if specified) on the left-out data. Not available if refit=False. For multi-metric evaluation, this attribute is present only if refit is specified. See the refit parameter for more information on allowed values.
- best_score (float) – Mean cross-validated score of the best_estimator. For multi-metric evaluation, this is not available if refit is False. See the refit parameter for more information.
- best_params (dict) – Parameter setting that gave the best results on the hold-out data. For multi-metric evaluation, this is not available if refit is False. See the refit parameter for more information.
- best_index (int) – The index (of the cv_results_ arrays) which corresponds to the best candidate parameter setting. The dict at search.cv_results_['params'][search.best_index_] gives the parameter setting for the best model, which gives the highest mean score (search.best_score_). For multi-metric evaluation, this is not available if refit is False. See the refit parameter for more information.
- scorer (function or a dict) – Scorer function used on the held-out data to choose the best parameters for the model. For multi-metric evaluation, this attribute holds the validated scoring dict which maps the scorer key to the scorer callable.
- n_splits (int) – The number of cross-validation splits (folds/iterations).
class dislib.model_selection.KFold(n_splits=5, shuffle=False, random_state=None)[source]¶

Bases: object
K-fold splitter for cross-validation
Returns k partitions of the dataset into train and validation datasets. The dataset is shuffled and split into k folds; each fold is used once as validation dataset while the k - 1 remaining folds form the training dataset.
Each fold contains n//k or n//k + 1 samples, where n is the number of samples in the input dataset.
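For example, with n = 150 samples and k = 4 folds, two folds contain 38 samples (n//k + 1) and the other two contain 37 (n//k).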
Parameters:
- n_splits (int, optional (default=5)) – Number of folds. Must be at least 2.
- shuffle (boolean, optional (default=False)) – Shuffles and balances the data before splitting into batches.
- random_state (int, RandomState instance or None, optional (default=None)) – If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random. Used when shuffle == True.
get_n_splits()[source]¶
Get the number of CV splits that this splitter does.

Returns: n_splits – The number of splits performed by this CV splitter.
Return type: int
split(x, y=None)[source]¶
Generates K-fold splits (see the usage sketch below).

Parameters:
- x (ds-array) – Samples array.
- y (ds-array, optional (default=None)) – Corresponding labels or values.

Yields:
- train_data (train_x, train_y) – The training ds-arrays for that split. If y is None, train_y is None.
- test_data (test_x, test_y) – The testing ds-arrays for that split. If y is None, test_y is None.
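A minimal usage sketch of driving cross-validation by hand with split(), assuming dislib's RandomForestClassifier and that predictions and labels can be gathered to NumPy with collect(); the per-fold accuracy computation is an illustration, not part of the splitter's API.

import numpy as np
import dislib as ds
from dislib.classification import RandomForestClassifier
from dislib.model_selection import KFold
from sklearn import datasets

if __name__ == '__main__':
    x_np, y_np = datasets.load_iris(return_X_y=True)
    x = ds.array(x_np, (30, 4))
    y = ds.array(y_np[:, np.newaxis], (30, 1))

    cv = KFold(n_splits=5, shuffle=True, random_state=0)
    scores = []
    # split() yields ((train_x, train_y), (test_x, test_y)) pairs.
    for (train_x, train_y), (test_x, test_y) in cv.split(x, y):
        rf = RandomForestClassifier()
        rf.fit(train_x, train_y)
        y_pred = rf.predict(test_x).collect().flatten()
        y_true = test_y.collect().flatten()
        scores.append(float(np.mean(y_pred == y_true)))
    print(cv.get_n_splits(), np.mean(scores))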