sklearn: grid search over a text pipeline

Tune TF-IDF + classifier with Pipeline + GridSearchCV.

goal

Tune the TF-IDF vectorizer and the classifier jointly: wrap both in a Pipeline and let GridSearchCV cross-validate every combination of their hyperparameters.

minimal code

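from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Step names ("tfidf", "clf") become the prefixes used in the parameter grid.
pipe = Pipeline([("tfidf", TfidfVectorizer()), ("clf", LogisticRegression(max_iter=1000))])
# Parameters are addressed as <step>__<parameter>.
param_grid = {"tfidf__ngram_range": [(1, 1), (1, 2)], "clf__C": [0.1, 1]}
grid = GridSearchCV(pipe, param_grid, cv=3)
# Toy data: the default token_pattern drops 1-character tokens, so use real words;
# 3 samples per class so that each of the 3 stratified folds sees both classes.
X = ["great movie", "awful plot", "great acting", "awful pacing", "great fun", "awful mess"]
y = [1, 0, 1, 0, 1, 0]
grid.fit(X, y)
print(list(grid.best_params_.keys()))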

usage

print(isinstance(grid.best_score_, float))
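
Once fitted, the search object can be used like a regular estimator. The following is a minimal sketch, assuming the default refit=True (the best pipeline is refitted on all the data); the toy corpus and the new_docs list are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Illustrative toy corpus (not from a real dataset): 3 documents per class.
X = ["great movie", "awful plot", "great acting", "awful pacing", "great fun", "awful mess"]
y = [1, 0, 1, 0, 1, 0]

pipe = Pipeline([("tfidf", TfidfVectorizer()), ("clf", LogisticRegression(max_iter=1000))])
grid = GridSearchCV(pipe, {"clf__C": [0.1, 1]}, cv=3)
grid.fit(X, y)

# With refit=True, predict() delegates to the refitted best pipeline,
# so raw strings can be passed directly.
new_docs = ["great story", "awful ending"]  # hypothetical unseen texts
predictions = grid.predict(new_docs)
print(predictions.tolist())
```

The refitted pipeline itself is available as grid.best_estimator_, for example to persist it or to inspect the fitted vectorizer.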

useful variant(s)

print(grid.cv_results_["mean_test_score"][:2].tolist())
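
The raw mean_test_score array is easier to read when paired with the parameter combination that produced each score. The following is a rough sketch of one way to do that using only cv_results_ keys documented by scikit-learn (rank_test_score, mean_test_score, params); the toy corpus is invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Illustrative toy corpus: 3 documents per class.
X = ["great movie", "awful plot", "great acting", "awful pacing", "great fun", "awful mess"]
y = [1, 0, 1, 0, 1, 0]

pipe = Pipeline([("tfidf", TfidfVectorizer()), ("clf", LogisticRegression(max_iter=1000))])
param_grid = {"tfidf__ngram_range": [(1, 1), (1, 2)], "clf__C": [0.1, 1]}
grid = GridSearchCV(pipe, param_grid, cv=3)
grid.fit(X, y)

results = grid.cv_results_
# One (rank, mean score, params) tuple per candidate, best rank first.
ranked = sorted(
    zip(results["rank_test_score"], results["mean_test_score"], results["params"]),
    key=lambda row: row[0],
)
for rank, score, params in ranked:
    print(int(rank), round(float(score), 3), params)
```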

notes

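  • Do text preprocessing (stop words, lowercasing) inside TfidfVectorizer through its stop_words and lowercase parameters (lowercase=True is the default), so the same preprocessing is refitted within each cross-validation fold.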
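
Because preprocessing lives in the vectorizer, it can be searched like any other hyperparameter. The following is a minimal sketch, assuming English text and scikit-learn's built-in "english" stop-word list; the docs and labels below are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Illustrative toy corpus with mixed case and common stop words.
docs = ["The movie was great", "The plot was awful", "Great acting overall",
        "Awful pacing overall", "Great fun for everyone", "An awful mess"]
labels = [1, 0, 1, 0, 1, 0]

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True)),  # lowercase=True is the default
    ("clf", LogisticRegression(max_iter=1000)),
])

# Stop-word removal as a hyperparameter: None keeps every token.
param_grid = {"tfidf__stop_words": [None, "english"], "clf__C": [0.1, 1]}
search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(docs, labels)

# Vocabulary learned by the refitted best pipeline (all keys are lowercased).
vocab = search.best_estimator_.named_steps["tfidf"].vocabulary_
print(search.best_params_)
print(sorted(vocab))
```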