NLTK: stopwords and stemming

objective

Explain and show how to filter out stopwords and apply stemming to French text.

minimal code

import re

from nltk.corpus import stopwords
from nltk.stem.snowball import FrenchStemmer

# import nltk; nltk.download('stopwords')  # run once
stops = set(stopwords.words('french'))
stemmer = FrenchStemmer()

text = "Les développeurs développaient des applications rapidement."
# \w+ drops punctuation, so "rapidement." becomes "rapidement"
tokens = re.findall(r"\w+", text.lower())
tokens = [t for t in tokens if t not in stops]  # remove stopwords
stems = [stemmer.stem(t) for t in tokens]
print(stems)

usage

import re

# plug into a scikit-learn vectorizer through its preprocessor argument
def stem_preprocess(s: str) -> str:
    words = re.findall(r"\w+", s.lower())  # lowercase and drop punctuation
    return " ".join(stemmer.stem(w) for w in words if w not in stops)
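
A minimal sketch of how such a function might be passed to scikit-learn's CountVectorizer through its preprocessor argument. The sample documents and the load_french_stops helper are assumptions made for illustration; the helper falls back to a tiny hand-written stop list only so the snippet still runs when the NLTK corpus has not been downloaded.

```python
import re

from nltk.corpus import stopwords
from nltk.stem.snowball import FrenchStemmer
from sklearn.feature_extraction.text import CountVectorizer


def load_french_stops():
    """Hypothetical helper: NLTK French stopwords, or a tiny fallback set."""
    try:
        return set(stopwords.words("french"))
    except LookupError:
        # Corpus not downloaded: illustrative fallback only (assumption).
        return {"les", "des", "le", "la", "de", "et", "une"}


stops = load_french_stops()
stemmer = FrenchStemmer()


def stem_preprocess(s: str) -> str:
    # Lowercase, drop punctuation, remove stopwords, then stem each word.
    words = re.findall(r"\w+", s.lower())
    return " ".join(stemmer.stem(w) for w in words if w not in stops)


# Illustrative documents, not taken from any real dataset.
docs = [
    "Les développeurs développaient des applications rapidement.",
    "Une application développée rapidement.",
]

# Passing preprocessor replaces the built-in lowercasing and accent stripping;
# the vectorizer's own tokenization still runs on the returned string.
vectorizer = CountVectorizer(preprocessor=stem_preprocess)
X = vectorizer.fit_transform(docs)

print(sorted(vectorizer.vocabulary_))
print(X.toarray())
```

Because the preprocessor returns a single string, the vectorizer's default token pattern still applies afterwards, so stems shorter than two characters would be dropped.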

useful variant(s)

# alternative: lemmatization with spaCy instead of stemming
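
One possible shape for the spaCy variant, as a sketch only. The function name lemmatize and the choice of the fr_core_news_sm model are assumptions, not something the notes above prescribe; the model has to be installed separately before the function is called.

```python
from typing import List


def lemmatize(text: str, model_name: str = "fr_core_news_sm") -> List[str]:
    """Hypothetical helper: lemmas of the non-stopword tokens in text."""
    # Imported lazily so that defining the function does not require spaCy.
    import spacy

    # Loading a model is slow; real code would load it once and reuse it.
    nlp = spacy.load(model_name)
    doc = nlp(text)
    return [
        tok.lemma_.lower()
        for tok in doc
        if not (tok.is_stop or tok.is_punct or tok.is_space)
    ]
```

Calling lemmatize("Les développeurs développaient des applications rapidement.") requires the model to be present, for example after running python -m spacy download fr_core_news_sm. Unlike a stemmer, the lemmatizer aims to return dictionary forms rather than truncated stems.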

notes

Download the stopwords corpus beforehand with nltk.download('stopwords').
Stemming truncates a word to its stem, so the result is not always a real word.
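
To see the truncation in practice, a short sketch that prints each word next to its stem. The word list is an arbitrary illustration; only FrenchStemmer is used, which does not need the stopwords corpus.

```python
from nltk.stem.snowball import FrenchStemmer

stemmer = FrenchStemmer()

# Arbitrary sample words chosen for illustration.
words = ["développeurs", "développaient", "applications", "rapidement"]
stems = [stemmer.stem(w) for w in words]

# Print each word beside its stem to compare them by eye.
for word, stem in zip(words, stems):
    print(f"{word} -> {stem}")
```

Comparing the two columns shows which inflected forms collapse to a shared stem and which stems no longer look like French words.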