B2B Churn Prediction Engine for Small-Dataset Environments
Chief Revenue Officer
Designed a churn prediction system purpose-built for the small-dataset reality of B2B services: researching foundation models, validating TabPFN against conventional ML on sub-300 partner datasets, and deploying on SvelteKit with Cloudflare Workers and D1.
The Problem
Churn prediction in B2B managed services faces a fundamental constraint that most ML tutorials ignore: sample size. A typical MSP has 50 to 300 active partners, with perhaps 10 to 30 known churn events in the historical record. The conventional machine learning algorithms usually reached for first (logistic regression, random forest, XGBoost) need sample sizes roughly 2x to 33x larger than even a 300-partner dataset provides.
The academic literature is unambiguous on this point. A 2024 JMIR study on minimum sample sizes for prediction models found that logistic regression requires a minimum of 696 samples for stable performance, random forest needs 3,404, and XGBoost needs 9,960. A 2019 PLOS ONE study by Vabalas et al. found that 49% of variance in reported ML accuracy is explained by sample size alone: smaller datasets produce systematically inflated accuracy estimates.
Training a conventional ML model on 200 partners and claiming 92% accuracy isn't a prediction system. It's an overfitting artifact.
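The gap between those published minimums and a real partner base is simple arithmetic. The sketch below reproduces it using only the three thresholds quoted above; the function name and the partner counts passed to it are illustrative.

```python
# Minimum sample sizes as quoted above from the 2024 JMIR study.
MIN_SAMPLES = {
    "logistic regression": 696,
    "random forest": 3404,
    "XGBoost": 9960,
}


def shortfall(partners: int) -> dict[str, float]:
    """Return how many times larger each minimum is than the partner count."""
    return {name: round(n / partners, 1) for name, n in MIN_SAMPLES.items()}


if __name__ == "__main__":
    # Illustrative partner counts within the 50 to 300 range described above.
    for partners in (50, 200, 300):
        print(partners, shortfall(partners))
```

Even at the top of the range (300 partners), the ratios run from about 2.3x for logistic regression to about 33x for XGBoost.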
The Research Process
I conducted a systematic review of the academic literature on small-sample prediction, focusing on algorithms designed for the sub-1,000 sample regime. The research drew on primary sources from Nature, arXiv, PLOS ONE, JMIR, and Scientific Reports, and involved scraping data tables, extracting performance benchmarks, and building a quantitative decision matrix across three candidate approaches.
The breakthrough came from TabPFN, a transformer-based foundation model pre-trained on millions of synthetic tabular datasets. Unlike conventional ML, which learns only from your training data, TabPFN performs in-context learning: it has already learned the patterns of tabular data from its pre-training, and applies that knowledge to new datasets without gradient updates or hyperparameter tuning.
The performance data is compelling. The Nature 2025 paper (Hollmann et al.) reports a normalized ROC AUC of 0.939 for TabPFN versus 0.752 for CatBoost on benchmarks with 10,000 or fewer samples. TabPFN achieves CatBoost-level accuracy with 50% of the training data. A follow-up paper reports a 100% win rate against default XGBoost on classification datasets under 10,000 samples. A drift-resilient variant also handles temporal distribution shifts, which matter for churn because partner behavior changes over time, improving AUC from 0.786 to 0.832 relative to the strongest baselines.
The Architecture
The prediction engine is built on SvelteKit deployed to Cloudflare Workers, with a D1 database for partner data and prediction history, and KV for hot-path caching. The choice of SvelteKit over React was deliberate: server-side rendering on Workers means the dashboard loads fast even on slow connections, and the Cloudflare adapter handles the edge deployment natively.
The prediction pipeline follows three steps. First, partner features are assembled from operational data: ticket volume trends, service utilization changes, contract tenure, revenue trajectory, support satisfaction signals, and engagement frequency. Second, the feature set is passed to the TabPFN model together with the labeled historical partners as context, and the model returns calibrated churn probabilities without any gradient-based training step. Third, SHAP (SHapley Additive exPlanations) values are computed for each prediction, providing per-partner explanations of which factors are driving the churn signal.
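The first step can be sketched in a few lines. This is a minimal illustration, not the production feature code: the `MonthlySnapshot` record, the three-month comparison window, and the feature names are all assumptions made for the example, and only three of the features listed above are shown.

```python
from dataclasses import dataclass


@dataclass
class MonthlySnapshot:
    """Hypothetical per-partner monthly record."""
    tickets: int      # tickets opened that month
    revenue: float    # billed revenue that month


def pct_change(old: float, new: float) -> float:
    """Relative change from old to new; 0.0 when the old value is zero."""
    return 0.0 if old == 0 else (new - old) / old


def assemble_features(history: list[MonthlySnapshot], tenure_months: int) -> dict[str, float]:
    """Compare the most recent 3 months with the 3 months before them."""
    if len(history) < 6:
        raise ValueError("need at least 6 months of history")
    prev, recent = history[-6:-3], history[-3:]
    return {
        "ticket_trend": pct_change(sum(m.tickets for m in prev),
                                   sum(m.tickets for m in recent)),
        "revenue_trend": pct_change(sum(m.revenue for m in prev),
                                    sum(m.revenue for m in recent)),
        "tenure_months": float(tenure_months),
    }
```

A partner whose tickets fall from 10 per month to 6 per month gets a `ticket_trend` of -0.4, the kind of 40% drop an account manager would want surfaced.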
The SHAP layer was non-negotiable. A model that says "this partner has a 73% churn probability" is useful. A model that says "this partner has a 73% churn probability, driven primarily by a 40% drop in ticket volume and a missed QBR" is actionable. Account managers need to know what to fix, not just that something is wrong.
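To make the idea behind SHAP concrete, the sketch below computes exact Shapley values by enumerating every feature coalition, which is feasible only for a handful of features. It is an illustration of the underlying math, not the SHAP library and not the deployed explanation layer. The scoring function is a hypothetical stand-in for the model, and "missing" features are replaced with a baseline partner's values, which is one common convention among several.

```python
from itertools import combinations
from math import factorial
from typing import Callable


def shapley_values(
    predict: Callable[[dict[str, float]], float],
    x: dict[str, float],
    baseline: dict[str, float],
) -> dict[str, float]:
    """Exact Shapley attribution of predict(x) - predict(baseline) to each feature."""
    names = list(x)
    n = len(names)

    def value(coalition: frozenset) -> float:
        # Features in the coalition take the partner's values; the rest take the baseline.
        point = {k: (x[k] if k in coalition else baseline[k]) for k in names}
        return predict(point)

    phi = {}
    for name in names:
        others = [k for k in names if k != name]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                s = frozenset(coalition)
                total += weight * (value(s | {name}) - value(s))
        phi[name] = total
    return phi


def toy_churn_score(f: dict[str, float]) -> float:
    """Hypothetical stand-in for the real model, used only for this example."""
    return 0.10 + 0.5 * max(0.0, -f["ticket_trend"]) + 0.2 * f["missed_qbr"]


if __name__ == "__main__":
    partner = {"ticket_trend": -0.4, "missed_qbr": 1.0, "tenure_months": 18.0}
    typical = {"ticket_trend": 0.0, "missed_qbr": 0.0, "tenure_months": 24.0}
    print(shapley_values(toy_churn_score, partner, typical))
```

The attributions always sum to the difference between the partner's score and the baseline score, which is what lets an explanation say how much of the risk each factor accounts for.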
Handling the Known Risks
The research identified one critical risk: class imbalance. A 2025 arXiv paper evaluating TabPFN in open environments found it performs best on class-balanced tasks, and B2B churn datasets are heavily imbalanced (3-10% positive rate). The mitigation strategy uses three approaches in combination: SMOTE oversampling to balance the training set, cost-sensitive weighting in the prediction wrapper, and threshold tuning on calibrated probabilities rather than using a naive 50% cutoff.
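The threshold-tuning and cost-weighting parts of that mitigation can be illustrated together. The sketch below picks the cutoff that minimizes total misclassification cost, under the assumption that a missed churn costs ten times as much as a false alarm; the cost ratio, the function name, and the sample probabilities are all hypothetical. In practice the cutoff would be chosen on training folds only, never on the data used to report performance.

```python
def tune_threshold(
    probs: list[float],
    labels: list[int],
    fn_cost: float = 10.0,   # assumed cost of missing a churning partner
    fp_cost: float = 1.0,    # assumed cost of flagging a healthy partner
) -> tuple[float, float]:
    """Return the (threshold, cost) pair with the lowest total misclassification cost.

    A partner is flagged when its probability is >= the threshold. Candidate
    thresholds are the distinct predicted probabilities themselves.
    """
    best = (0.5, float("inf"))
    for t in sorted(set(probs)):
        fn = sum(1 for p, y in zip(probs, labels) if y == 1 and p < t)
        fp = sum(1 for p, y in zip(probs, labels) if y == 0 and p >= t)
        cost = fn_cost * fn + fp_cost * fp
        if cost < best[1]:
            best = (t, cost)
    return best


if __name__ == "__main__":
    # Illustrative data: 2 churners out of 10 partners.
    probs = [0.05, 0.10, 0.12, 0.20, 0.30, 0.35, 0.60, 0.08, 0.15, 0.03]
    labels = [0, 0, 0, 0, 1, 0, 1, 0, 0, 0]
    print(tune_threshold(probs, labels))
```

On this toy data the tuned cutoff is 0.30 at a total cost of 1.0 (one false alarm), whereas a 50% cutoff would miss the churner scored at 0.30 and cost 10.0.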
Validation follows the methodology prescribed by Vabalas et al.: nested cross-validation with feature selection inside the CV loop, stratified folds to preserve class ratios, and confidence intervals on all reported metrics. This guards against the systematic optimistic bias that afflicts small-sample ML evaluations.
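Stratification matters most when there are only a couple of dozen churn events to spread across folds. The sketch below shows only the fold-assignment piece, using the standard library; the function name and the 200-partner example are illustrative. A full nested setup would additionally run feature selection and threshold tuning on each outer fold's training portion alone.

```python
import random


def stratified_folds(labels: list[int], k: int, seed: int = 0) -> list[list[int]]:
    """Split row indices into k folds, spreading each class evenly across folds."""
    rng = random.Random(seed)
    folds: list[list[int]] = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal this class's rows round-robin so fold counts differ by at most one.
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds


if __name__ == "__main__":
    # Illustrative dataset: 200 partners, 20 of which churned.
    labels = [1] * 20 + [0] * 180
    for fold in stratified_folds(labels, k=5):
        print(len(fold), sum(labels[i] for i in fold))
```

With 20 churners and five folds, every fold receives exactly four positive cases, so no fold is evaluated with zero or one churn event by accident.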
The Fallback Strategy
The system includes a rule-based scoring complement that runs alongside the ML predictions. Expert-defined rules (contract approaching expiration + declining ticket volume + missed QBR = high risk) provide a baseline that requires no training data and captures domain knowledge the model can't learn from 20 examples. The final churn score blends the ML probability with the rule-based score, weighted by the model's calibration confidence.
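A minimal sketch of how such a blend could work follows. The rule thresholds, the point values, and the way the ML weight is supplied are all assumptions for illustration; how calibration confidence maps to a weight is not specified here, so the weight is simply passed in as a number between 0 and 1.

```python
def rule_score(days_to_renewal: int, ticket_trend: float, missed_qbr: bool) -> float:
    """Hypothetical expert rules; each triggered rule adds a fixed amount of risk."""
    score = 0.0
    if days_to_renewal <= 90:      # contract approaching expiration
        score += 0.4
    if ticket_trend <= -0.25:      # ticket volume down 25% or more
        score += 0.3
    if missed_qbr:                 # missed quarterly business review
        score += 0.3
    return min(score, 1.0)


def blended_score(ml_prob: float, rules: float, ml_weight: float) -> float:
    """Weighted average of the ML probability and the rule-based score."""
    if not 0.0 <= ml_weight <= 1.0:
        raise ValueError("ml_weight must be between 0 and 1")
    return ml_weight * ml_prob + (1.0 - ml_weight) * rules


if __name__ == "__main__":
    rules = rule_score(days_to_renewal=60, ticket_trend=-0.4, missed_qbr=True)
    print(blended_score(ml_prob=0.73, rules=rules, ml_weight=0.6))
```

When the weight is low, the rules dominate and the system behaves like an expert checklist; as the weight rises, the model's probability takes over.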
Impact
The engine provides account managers with weekly churn risk assessments for every partner, ranked by probability and annotated with the specific factors driving each score. Early identification of at-risk partners (weeks or months before traditional indicators like cancellation notices) enables proactive retention intervention. The system runs on edge infrastructure with sub-second response times and minimal operational cost.