objective
Train a CatBoostClassifier that encodes categorical features automatically.
minimal code
import pandas as pd
from catboost import CatBoostClassifier, Pool
df = pd.DataFrame({
    "cat1": ["a", "b", "a", "c"],
    "cat2": ["x", "x", "y", "y"],
    "num": [1.0, 2.0, 3.0, 4.0],
    "y": [0, 1, 0, 1],
})
X = df[["cat1","cat2","num"]]
y = df["y"]
# Positional indices of the categorical columns in X (cat1, cat2)
cat_idx = [0, 1]
train_pool = Pool(X, y, cat_features=cat_idx)
clf = CatBoostClassifier(
    depth=6,
    learning_rate=0.1,
    iterations=500,
    loss_function="Logloss",
    random_seed=0,
    verbose=False,
)
clf.fit(train_pool)
print(clf.predict_proba(train_pool).shape[1] == 2)
usage
# Prediction on the raw DataFrame (same columns, same order as in training)
proba = clf.predict_proba(X)[:,1]
print(len(proba) == len(X))
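Because cat_features was given as positional indices, predicting on a raw DataFrame only works if its columns match the training layout. A minimal sketch of a guard follows; `align_features` is a hypothetical helper (not part of CatBoost), and the expected names are assumed to be saved from `X.columns` at training time (the fitted model's `feature_names_` attribute could serve the same purpose).

```python
import pandas as pd

def align_features(df, expected):
    """Return df restricted to the expected columns, in the expected order.

    Hypothetical helper for illustration; raises if a column is missing.
    """
    missing = [c for c in expected if c not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return df[list(expected)]

# Columns arrive in a different order, with one extra column.
raw = pd.DataFrame({"num": [5.0], "cat2": ["x"], "extra": [1], "cat1": ["a"]})
aligned = align_features(raw, ["cat1", "cat2", "num"])
print(list(aligned.columns))
```

The aligned frame can then be passed to clf.predict_proba in place of the raw one.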
useful variant(s)
# Feature importance (PredictionValuesChange)
imp = clf.get_feature_importance(train_pool, type="PredictionValuesChange")
print(len(imp) == X.shape[1])
notes
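- CatBoost handles categorical features natively: string columns are passed as-is and encoded internally with regularized (prior-smoothed) target statistics, so no manual one-hot or label encoding is needed.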
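To make the note above concrete, here is a simplified sketch of an ordered target statistic for a single categorical column. It is an illustration, not CatBoost's actual implementation: the function name, the `prior` default, and the single random permutation are assumptions. Each row is encoded only from the targets of rows that precede it in the permutation, with the prior acting as regularization.

```python
import random

def ordered_target_stats(categories, targets, prior=0.5, seed=0):
    """Simplified ordered target statistic (illustrative, not CatBoost's code).

    Each row gets (sum of earlier targets in its category + prior)
    / (count of earlier rows in its category + 1).
    """
    order = list(range(len(categories)))
    random.Random(seed).shuffle(order)  # random permutation of the rows
    sums, counts = {}, {}
    encoded = [0.0] * len(categories)
    for i in order:
        c = categories[i]
        s = sums.get(c, 0.0)
        n = counts.get(c, 0)
        # Use only rows already visited, so a row never sees its own target.
        encoded[i] = (s + prior) / (n + 1)
        sums[c] = s + targets[i]
        counts[c] = n + 1
    return encoded

# Same values as the cat1 and y columns in the minimal code.
enc = ordered_target_stats(["a", "b", "a", "c"], [0, 1, 0, 1])
print(enc)
```

A category seen for the first time falls back to the prior, which is why rare categories do not leak their own label into the encoding.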