objective
Stateless vectorization via feature hashing (low memory cost, since no vocabulary is stored).
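minimal code
from sklearn.feature_extraction.text import HashingVectorizer
# alternate_sign=False keeps every feature value non-negative
vec = HashingVectorizer(n_features=2**10, alternate_sign=False)
# No fit step; words of 2+ characters are used because the default token_pattern drops single-character tokens
Xv = vec.transform(["red green blue", "green blue gold"])
print(Xv.shape[1] == 1024)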
usage
from sklearn.linear_model import SGDClassifier
# Xv is the sparse matrix from the minimal code above; one label per document
clf = SGDClassifier().fit(Xv, [0, 1])
print(hasattr(clf, "predict"))
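Because the vectorizer is stateless, it pairs naturally with incremental learning. A minimal sketch, assuming the documents arrive in small batches (the `batches` list below is hypothetical toy data, not from the original card) and that `SGDClassifier.partial_fit` is an acceptable training loop for the task:

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vec = HashingVectorizer(n_features=2**10, alternate_sign=False)
clf = SGDClassifier(random_state=0)

# Hypothetical mini-batches standing in for a stream of labeled documents.
batches = [
    (["red green blue", "green blue gold"], [0, 1]),
    (["red green", "blue gold"], [0, 1]),
]

for texts, labels in batches:
    # The vectorizer is never fitted: transform works directly on each batch.
    X_batch = vec.transform(texts)
    # classes lists every possible label; it is required on the first call.
    clf.partial_fit(X_batch, labels, classes=[0, 1])

pred = clf.predict(vec.transform(["red green blue"]))
print(pred.shape)
```

The column count stays fixed at `n_features` for every batch, so the model never has to be resized when new words appear.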
useful variant(s)
from sklearn.feature_extraction.text import HashingVectorizer
print(HashingVectorizer(n_features=2**8).transform(["x"]).shape[1] == 256)
notes
- Features cannot be inspected: there is no inverse mapping from a column index back to the token that produced it.