imblearn RandomUnder/OverSampler

objective

Explain and show how to use simple random resampling to balance classes.

minimal code

from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification

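# Imbalanced binary toy dataset: roughly 90% class 0, 10% class 1
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
# Over-sampling: draw minority samples at random (with replacement) until classes match
X_over, y_over = RandomOverSampler(random_state=0).fit_resample(X, y)
# Under-sampling: keep a random subset of the majority class until classes match
X_under, y_under = RandomUnderSampler(random_state=0).fit_resample(X, y)
print(X.shape, X_over.shape, X_under.shape)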

usage

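# Under-sample partway (minority/majority = 0.5), then over-sample to 1:1.
# With the default "auto", the under-sampler alone would already balance the classes.
from imblearn.pipeline import Pipeline
pipe = Pipeline([("under", RandomUnderSampler(sampling_strategy=0.5, random_state=0)),
                 ("over", RandomOverSampler(random_state=0))])
X_mix, y_mix = pipe.fit_resample(X, y)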

useful variant(s)

# sampling_strategy: a float ratio (binary problems only) or a dict of per-class sample counts
RandomOverSampler(sampling_strategy=0.5, random_state=0)

notes

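Under-sampling discards majority-class samples, so it loses information, but the smaller training set speeds up training.
Over-sampling repeats minority-class samples, so a model without regularization can overfit them.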