How I Reached Top 8% on Kaggle with a Ridge-XGBoost N-gram Pipeline

by Fais Azis Wibowo, via Dev.to Webdev

Kaggle Playground Series S6E3: Predict Customer Churn | ROC-AUC 0.91685 | Rank 286 / 3,718

The Problem

Customer churn prediction sounds straightforward: given a telecom customer's usage history and contract details, predict whether they'll leave. But the Kaggle Playground S6E3 dataset had 594,000 rows of heavily categorical data where the signal was buried inside combinations of features, not individual columns. Standard approaches plateau quickly here.

My starting point was a single LightGBM model. It was decent, but decent doesn't crack the top 10%. Getting there required rethinking how the model saw the categorical features entirely.

The Core Insight: Treat Categories Like Text

The breakthrough came from an unconventional direction: NLP. In text classification, n-grams capture phrase-level patterns that individual words miss. The same logic applies to categorical feature combinations. A customer with Contract: Month-to-month is one signal. A customer with Contract: Month-to-month …
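Here is a minimal sketch of that n-gram trick, assuming a scikit-learn setup; the column names (Contract, PaymentMethod, InternetService) are illustrative placeholders, not the actual S6E3 schema or the exact pipeline from the article. Each row's categorical values become one "sentence", and CountVectorizer's word bigrams become feature combinations.

```python
# Sketch: categorical rows as "sentences", n-grams as feature combos.
# Column names are illustrative, not the real S6E3 schema.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import RidgeClassifier

df = pd.DataFrame({
    "Contract": ["Month-to-month", "Two year", "Month-to-month"],
    "PaymentMethod": ["Electronic check", "Mailed check", "Electronic check"],
    "InternetService": ["Fiber optic", "DSL", "Fiber optic"],
    "churn": [1, 0, 1],
})
cat_cols = ["Contract", "PaymentMethod", "InternetService"]

# Turn each row into one token per column, e.g. "contract=month-to-month".
# Underscores keep multi-word values as single tokens.
def row_to_sentence(row):
    return " ".join(
        f"{col}={str(row[col]).replace(' ', '_')}" for col in cat_cols
    )

sentences = df.apply(row_to_sentence, axis=1)

# ngram_range=(1, 2): unigrams are individual features, bigrams are
# adjacent feature *combinations* -- the "phrase-level" signal.
vec = CountVectorizer(ngram_range=(1, 2), token_pattern=r"\S+")
X = vec.fit_transform(sentences)

model = RidgeClassifier(alpha=1.0)
model.fit(X, df["churn"])
```

Note that word bigrams only pair tokens that sit next to each other in the sentence, so the column order decides which combinations surface; widening ngram_range, or enumerating column pairs explicitly, captures more of them.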

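As for the Ridge-XGBoost pairing in the title: the excerpt doesn't show how the two models are combined, so the rank-averaged blend below is only an assumed setup with placeholder weights and hyperparameters, with Ridge handling the sparse n-gram matrix and XGBoost the conventional tabular features.

```python
# Hedged sketch of one way to blend the two halves; the rank-averaging,
# the 0.4/0.6 weights, and all hyperparameters are assumptions, not the
# article's actual recipe.
import xgboost as xgb
from scipy.stats import rankdata
from sklearn.linear_model import Ridge
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def blended_auc(X_ngrams, X_tabular, y):
    """Fit Ridge on the sparse n-gram matrix and XGBoost on the
    tabular features, then rank-average their validation scores."""
    Xs_tr, Xs_va, Xt_tr, Xt_va, y_tr, y_va = train_test_split(
        X_ngrams, X_tabular, y, test_size=0.2, random_state=42
    )

    # Ridge regression on 0/1 targets: its raw outputs work as
    # ranking scores, which is all ROC-AUC cares about.
    ridge = Ridge(alpha=1.0)
    ridge.fit(Xs_tr, y_tr)
    scores_ridge = ridge.predict(Xs_va)

    booster = xgb.XGBClassifier(
        n_estimators=500, learning_rate=0.05, eval_metric="auc"
    )
    booster.fit(Xt_tr, y_tr)
    scores_xgb = booster.predict_proba(Xt_va)[:, 1]

    # Rank-averaging makes the two score scales comparable before
    # weighting; ROC-AUC is invariant to monotone transforms.
    blend = 0.4 * rankdata(scores_ridge) + 0.6 * rankdata(scores_xgb)
    return roc_auc_score(y_va, blend)
```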
Continue reading on Dev.to Webdev
