sklearn TruncatedSVD: dimensionality reduction for sparse data
objective
Explain and demonstrate how to reduce the dimensionality of sparse matrices (such as TF-IDF) with TruncatedSVD.
minimal code
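from sklearn.datasets import fetch_20newsgroups
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Two newsgroups, metadata stripped, at most 2000 documents (downloaded on first use)
corpus = fetch_20newsgroups(
    subset="train",
    categories=["sci.space", "rec.autos"],
    remove=("headers", "footers", "quotes"),
).data[:2000]
# Sparse TF-IDF matrix of shape (n_documents, n_terms), capped at 20000 terms
X = TfidfVectorizer(max_features=20000).fit_transform(corpus)
# Project onto 100 latent components; the result is a dense array
svd = TruncatedSVD(n_components=100, random_state=0)
Z = svd.fit_transform(X)
print(Z.shape)  # (n_documents, 100)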
usage
# approximate share of the variance retained by the 100 components
print(svd.explained_variance_ratio_.sum())
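
The summed ratio above can also be accumulated component by component to see how many components a given target needs. The following is a minimal sketch, not a tuning recipe: it assumes a synthetic sparse matrix in place of real TF-IDF features (so it runs without downloading data), and the names X_demo, svd_demo and the 10% target are purely illustrative.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

# Assumption: a random sparse matrix stands in for a TF-IDF matrix
# (500 "documents", 2000 "terms", 1% non-zero entries).
X_demo = sparse.random(500, 2000, density=0.01, format="csr", random_state=0)

svd_demo = TruncatedSVD(n_components=50, random_state=0)
svd_demo.fit(X_demo)

# Cumulative share of variance retained by the first k components.
cumulative = np.cumsum(svd_demo.explained_variance_ratio_)

# Smallest k whose cumulative ratio reaches the (hypothetical) target, if any.
target = 0.10
reached = cumulative >= target
k = int(np.argmax(reached)) + 1 if reached.any() else None
print(f"total retained: {cumulative[-1]:.3f}, k for {target:.0%}: {k}")
```

If k comes back as None, the target is not reached with the components that were fitted, and n_components would have to be raised before drawing any conclusion.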
useful variant(s)
# pipeline: TF-IDF -> SVD -> classifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
pipe = make_pipeline(TfidfVectorizer(max_features=5000), TruncatedSVD(n_components=100), LogisticRegression(max_iter=1000))
# pipe.fit(texts, y)  # texts: list of raw strings, y: one label per text
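
To make the commented-out fit call concrete, here is a small sketch of the same pipeline trained end to end. The toy texts, the labels and the name toy_pipe are hypothetical, and n_components is lowered to 3 because it has to stay well below the vocabulary size of such a tiny corpus.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: label 0 = space, label 1 = cars.
texts = [
    "the rocket reached orbit after launch",
    "nasa plans a new mission to the moon",
    "the satellite orbit decays over time",
    "astronauts trained for the space station mission",
    "the car engine needs an oil change",
    "new tires improve braking on the car",
    "the dealer repaired the engine and brakes",
    "this sedan has a manual transmission",
]
y = [0, 0, 0, 0, 1, 1, 1, 1]

# Same three stages as above, sized for a tiny vocabulary.
toy_pipe = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=3, random_state=0),
    LogisticRegression(max_iter=1000),
)
toy_pipe.fit(texts, y)

# New documents go through the fitted TF-IDF and SVD steps before classification.
preds = toy_pipe.predict(["the shuttle launch was delayed", "my car needs new brakes"])
print(preds)

# The reduced features seen by the classifier: one row per text, 3 columns.
reduced = toy_pipe[:-1].transform(texts)
print(reduced.shape)
```

Keeping the vectorizer and the SVD inside one pipeline means both are fitted on the training texts only and reused unchanged at prediction time.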
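notes
- TruncatedSVD works directly on sparse matrices: unlike PCA, it does not center the data, so the input is never densified. Applied to TF-IDF or term-count matrices, it is known as latent semantic analysis (LSA).
- Useful for compressing and denoising text features: each document is summarized by n_components values instead of one value per term.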
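
Both notes can be checked with a short sketch. It assumes a synthetic sparse matrix rather than real TF-IDF features, and X_sp, svd_sp and the choice of 20 components are illustrative only. The matrix is passed in sparse form, reduced to a dense low-dimensional array, then mapped back with inverse_transform to measure how much the compression loses.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

# Assumption: a random sparse matrix stands in for TF-IDF features.
X_sp = sparse.random(300, 1000, density=0.02, format="csr", random_state=0)

svd_sp = TruncatedSVD(n_components=20, random_state=0)
Z_sp = svd_sp.fit_transform(X_sp)  # sparse input, dense (300, 20) output

# Compression: map the 20-dimensional codes back to the original 1000 columns.
X_approx = svd_sp.inverse_transform(Z_sp)  # dense (300, 1000) low-rank approximation

# Relative Frobenius reconstruction error (0 = lossless, 1 = nothing retained).
X_dense = X_sp.toarray()
rel_err = np.linalg.norm(X_dense - X_approx) / np.linalg.norm(X_dense)
print(sparse.issparse(X_sp), Z_sp.shape, f"{rel_err:.3f}")
```

Random data has no low-rank structure, so the error here will be high; on real text features the leading components usually capture far more of the signal.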