Tabular data and tabular prediction
Tabular data is perhaps the most common data shape you’ll find in the business world. (Three terms get used loosely in industry: tabular is a single matrix with (mostly) i.i.d. rows; relational is the multi-table case, e.g. a SQL database where rows in one table reference rows in another via foreign keys; structured is the umbrella term for any data with a fixed schema and typed fields, so tabular and relational both qualify, as do time series and various graph types. This post stays in the strict tabular setting; we will cover extensions to relational data in future posts.) Banks store customers and transactions as rows; hospitals do the same for patient records, visits, and doctor schedules; logistics companies, insurance firms, and payroll systems ground their business processes in stacks of typed columns.
Increasingly, this data is also used to make predictions: will this customer default, will this patient be readmitted, how much will this house sell for, is this transaction fraudulent. This post is about predictions made from tabular data, and why a particular family of methods dominated the space for a very long time. In a tabular prediction problem we’re trying to predict the target column; the remaining columns are the features used for inference. We can write the features as a matrix $X \in \mathbb{R}^{n \times d}$ with $n$ rows and $d$ columns, and the target as a length-$n$ vector $y$. A tabular prediction problem seeks a function $f$ that maps a feature row $x \in \mathbb{R}^d$ to a prediction $\hat{y} = f(x)$, including and especially for rows the model has never seen. When the target is a category (fraud or not, defaulter or not), the problem is classification. When the target is a number (sale price, expected loss), it’s regression. The mechanics differ; the shape of the input is the same. (Tabular ML covers more than binary classification and regression. Forecasting (predict the next value from a window of past rows), multi-class classification ($k > 2$ classes), multi-label classification (multiple binary labels per row), time-to-event prediction (i.e. survival analysis; common in medical and reliability data), ranking, and various unsupervised tasks all run on the same matrix shape.)
Three things describe a column. It has a type — a number, a category, an ordered grade, a date, occasionally a chunk of text. It has a scale — ages span tens, incomes span millions, dates span years. And it has a meaning — “age” is a quantity a person can read directly off the value, while a pixel in an image carries almost no information on its own. A tabular dataset is a stack of such columns, and a model’s job is to find the ones that actually carry information about the target and use them well.
A heart-attack prediction problem in tabular form. The training set pairs rows of features with binary labels. At prediction time we observe new rows and want to fill in the missing labels $y$. Columns mix continuous (age, chol, bp), categorical (sex, angina), and binary (heart_attack) types.
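To make the $X$/$y$ framing concrete, here is a minimal sketch of such a table in pandas. The specific rows and values are hypothetical, but the split into a feature matrix and a target vector is exactly the setup above.

```python
import pandas as pd

# A hypothetical mini version of the heart-attack table:
# continuous (age, chol, bp), categorical (sex, angina), binary target.
train = pd.DataFrame({
    "age":          [63, 45, 58, 39],
    "sex":          ["M", "F", "M", "F"],
    "chol":         [233, 180, 311, 199],
    "bp":           [145, 120, 160, 110],
    "angina":       ["yes", "no", "yes", "no"],
    "heart_attack": [1, 0, 1, 0],   # the target column
})

X_train = train.drop(columns="heart_attack")  # feature matrix X (n rows, d columns)
y_train = train["heart_attack"]               # length-n target vector y
```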
By volume, tabular prediction is perhaps the most common machine-learning problem in the world — isn’t it weird that deep learning took more than a decade to become competitive in this space?
Decision tree models - a primer
Tree-based models were in use long before deep tabular models truly gained steam, and purely as baselines they are extremely hard (but not impossible, as we will see!) to beat. Broadly, we can understand the why behind their success on three levels: a) they are designed for and fit the problem space of tabular prediction extremely well, b) they provide some qualitative advantages which we have to work harder for with deep tabular models, and finally c) organisational dynamics make it hard to switch to something new if it’s only marginally better.
Tree methods and why they fit tabular problems
A decision tree is a particularly transparent kind of model. It splits the feature space into rectangular regions and predicts a constant value in each. Concretely, given $M$ regions $R_1, \dots, R_M$ that partition $\mathbb{R}^d$, the tree’s prediction is

$$f(x) = \sum_{m=1}^{M} c_m \, \mathbb{1}[x \in R_m].$$
Each region $R_m$ is a conjunction of axis-aligned constraints, e.g. $\{x : x_2 \le 145,\ x_5 = \text{"yes"}\}$, and each leaf value $c_m$ is just the average of the training labels that land there. The CART recipe (Breiman et al., 1984) grows the partition greedily: at each node it scans every candidate (feature $j$, threshold $t$) pair, picks the split that most reduces an impurity criterion (Gini for classification, MSE for regression), and recurses on the two resulting halves until a stopping rule fires. The greedy split is locally myopic; finding the globally optimal tree of fixed size is NP-hard. The greedy heuristic works exceedingly well in practice anyway, in conjunction with other tricks like bagging and boosting.
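A minimal sketch of that inner loop for regression on a single continuous feature. Real implementations use running sums rather than this quadratic re-scan, but the greedy scan-and-pick logic is the same.

```python
import numpy as np

def best_split(x: np.ndarray, y: np.ndarray):
    """Scan every candidate threshold of one continuous feature and return
    the (threshold, gain) pair with the largest MSE reduction — the inner
    loop of CART's greedy recursion for regression."""
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    best_gain, best_t = 0.0, None
    parent_sse = np.sum((y - y.mean()) ** 2)          # impurity before splitting
    for i in range(1, len(x_sorted)):
        if x_sorted[i] == x_sorted[i - 1]:
            continue                                   # no threshold between ties
        left, right = y_sorted[:i], y_sorted[i:]
        sse = (np.sum((left - left.mean()) ** 2)
               + np.sum((right - right.mean()) ** 2))  # impurity after splitting
        gain = parent_sse - sse
        if gain > best_gain:
            best_gain = gain
            best_t = (x_sorted[i - 1] + x_sorted[i]) / 2
    return best_t, best_gain
```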
The shape of this answer is already a good match for the data. A tabular row is a vector of column values, each with its own units, scale, and meaning. A split asks one question about one column at a time. There is no representation-learning bottleneck because the representation is the input. (Well, in practice the data scientist is responsible for the representation through feature engineering, but those engineered features are precisely what the splits act upon; trees don’t use an alternative internal/latent representation.)
A depth-3 tree partitions $\mathbb{R}^2$ into five axis-aligned regions, each with a constant prediction $c_m$. Class 1 points are filled, class 0 are hollow; shaded backgrounds show the predicted class. An MLP must approximate the same target with a smooth $f$, a curve (or rather a jagged “curve” in the case of ReLU MLPs), not a partition, and pays for the mismatch on points near the steps (dashed circles), which it can only resolve by pushing the boundary around.
A single tree, trained to convergence, is an exceedingly low-bias model (given enough depth it can carve any partition fine enough) but a high-variance one. Two small data perturbations can produce two visibly different partitions in the decision space. In order to make trees actually useful we need to fix this tendency to overfit, and we can do so on either end of the bias-variance dial. Bagging reduces variance by averaging decorrelated estimators that share roughly the same bias (Breiman, 1996). Boosting reduces bias by sequentially fitting estimators to the residuals of the current ensemble (Friedman, 2001). Random Forests bag; XGBoost, LightGBM, and CatBoost all boost.
The variance recipe is bagging: train many trees on bootstrap resamples of the data and average their predictions (Breiman, 1996). Random Forests sharpen the trick by also subsampling features at each split, which decorrelates the trees further (Breiman, 2001). The mean of decorrelated estimators has lower variance than any one of them, while the bias is left roughly unchanged.
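A minimal sketch of bagging from scratch; Random Forests additionally subsample features at each split (e.g. via max_features in scikit-learn’s tree constructors).

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

def fit_bagged_trees(X, y, n_trees=100):
    """Bagging in a few lines: each tree sees a bootstrap resample of the
    rows, and predictions are averaged across the ensemble."""
    trees, n = [], len(X)
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)          # bootstrap: sample with replacement
        trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
    return trees

def predict_bagged(trees, X):
    # Averaging decorrelated low-bias trees cuts variance, leaves bias alone.
    return np.mean([t.predict(X) for t in trees], axis=0)
```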
The bias recipe is boosting. Build trees sequentially; each new tree fits the residual error of the current ensemble. The seminal version is AdaBoost (Freund & Schapire, 1997), which reweights misclassified points after each round. Friedman generalized it into the gradient boosting machine (Friedman, 2001) by reading the update as a step in function space:

$$F_t(x) = F_{t-1}(x) + \eta\, h_t(x), \qquad h_t \approx -\nabla_F L(F)\big|_{F = F_{t-1}}.$$

At round $t$ the loss is treated as a functional of $F$; the next tree $h_t$ is fit to the negative gradient of that functional at the current ensemble¹; a small learning rate $\eta$ controls the step size. Stochastic GBM (Friedman, 2002) adds row subsampling for stability and speed.
Boosting as residual fitting on a 1D regression problem. Each panel shows the data (filled circles), the cumulative ensemble $F_t$ (solid step), the previous iteration $F_{t-1}$ ghosted (dashed gray), and the new tree’s contribution in each region (dashed red; magnitude and direction of the correction). Each $h_t$ is fit to the residuals of $F_{t-1}$, so $F_t$ progressively refines the answer.
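For squared error the negative gradient is exactly the residual $y - F(x)$, so the whole recipe fits in a few lines. A minimal sketch:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gbm(X, y, n_rounds=100, lr=0.1, depth=3):
    """Gradient boosting for squared error: each round fits a small tree
    to the residuals of the current ensemble, i.e. the negative gradient
    of L = 1/2 (y - F)^2 with respect to F."""
    F = np.full(len(y), y.mean())              # F_0: constant baseline
    trees = []
    for _ in range(n_rounds):
        residuals = y - F                       # -dL/dF at the current ensemble
        h = DecisionTreeRegressor(max_depth=depth).fit(X, residuals)
        F += lr * h.predict(X)                  # F_t = F_{t-1} + eta * h_t
        trees.append(h)
    return y.mean(), trees

def predict_gbm(base, trees, X, lr=0.1):
    return base + lr * sum(t.predict(X) for t in trees)
```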
The 2010s industrialized the gradient boosting machine. Three specific implementations dominated the decade, and are still extremely prevalent today:
Three growth strategies, same model capacity (four leaves each). XGBoost grows level-wise: every leaf at the current depth is split before going deeper, and each node picks its split independently. LightGBM grows leaf-wise: only the leaf with the highest gain gets split next, producing asymmetric trees that spend depth where the loss demands it. CatBoost grows oblivious trees: every node at the same depth shares one (feature, threshold) pair — the dashed connector marks the shared split — trading per-node flexibility for branchless inference.
XGBoost (Chen & Guestrin, 2016) folds an L1/L2 regularizer directly into the boosting objective and uses a second-order Taylor expansion of the loss at each step, which makes the optimal leaf value closed-form and converges faster than first-order GBM. Its sparsity-aware splits learn a default direction per node: when the splitting feature is missing for a row, the tree just sends the row down the side that minimizes the loss on training rows with that feature missing. Missingness becomes a routing decision, not something we have to deal with during pre-processing.
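Concretely, in the XGBoost paper’s notation (with $g_i, h_i$ the first and second derivatives of the loss at the current prediction, $G_j, H_j$ their sums over the rows in leaf $j$, and $\lambda, \gamma$ the regularizers), the closed-form leaf value and split gain are:

```latex
% Minimizing the second-order objective per leaf gives the optimal leaf weight:
w_j^{\ast} = -\frac{G_j}{H_j + \lambda}
% and the gain of a candidate split into left/right children (L, R):
\qquad
\text{gain} = \frac{1}{2}\left[
    \frac{G_L^2}{H_L + \lambda}
  + \frac{G_R^2}{H_R + \lambda}
  - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}
\right] - \gamma
```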
XGBoost’s default direction. Each split stores not only a (feature, threshold) pair but also a side (left or right) that missing-value rows follow. The choice is learned at training time: XGBoost evaluates both options and picks whichever minimizes the loss on training rows where the splitting feature is missing.
LightGBM (Ke et al., 2017) takes the same scaffold and pushes it for speed. The “histogram trick”: pre-bin each continuous feature into ~256 buckets and accumulate gradient statistics per bucket. Split-search drops from $O(n \log n)$ per feature (sort-based) to one $O(n)$ binning pass plus an $O(B)$ scan over $B \approx 256$ bins, with negligible accuracy cost. This is the central performance kernel of every modern GBDT. It grows trees leaf-wise: split whichever leaf yields the largest gain, rather than the level-wise growth of XGBoost; the asymmetric tree fits the loss better for the same number of leaves. Two more tricks make it especially well-suited to high-cardinality industrial tables: Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB).
LightGBM’s two cost-cutting tricks. GOSS (left): rows are sorted by gradient magnitude $|g_i|$; the top $a \times 100\%$ are kept entirely (their loss-correction signal is most informative), a random fraction $b \times 100\%$ is sampled from the rest, and the remaining low-gradient rows are dropped. Split-search runs on a small subset with little accuracy cost. EFB (right): mutually exclusive sparse features (rarely both nonzero on the same row) are bundled into one feature via a value-offset trick; they hold the same information but leave far fewer columns to scan.
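A sketch of the histogram trick in isolation. For brevity this uses a variance-reduction-style gain with unit hessians (real GBDTs also accumulate hessians per bin); the binning pass and the short scan over bins are the point.

```python
import numpy as np

def histogram_best_split(x, g, n_bins=256):
    """Histogram-style split search: bin the feature once, accumulate
    gradient sums per bin, then scan ~256 bin boundaries instead of
    every distinct feature value."""
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])
    bins = np.searchsorted(edges, x)                 # one binning pass over n rows
    grad_sum = np.bincount(bins, weights=g, minlength=n_bins)
    count = np.bincount(bins, minlength=n_bins)

    best_gain, best_bin = 0.0, None
    G, N = grad_sum.sum(), count.sum()
    G_left, n_left = 0.0, 0
    for b in range(n_bins - 1):                      # O(B) scan over boundaries
        G_left += grad_sum[b]; n_left += count[b]
        n_right = N - n_left
        if n_left == 0 or n_right == 0:
            continue
        # gain proxy from per-side gradient sums (unit hessians)
        gain = G_left**2 / n_left + (G - G_left)**2 / n_right - G**2 / N
        if gain > best_gain:
            best_gain, best_bin = gain, b
    return best_bin, best_gain
```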
CatBoost (Prokhorenkova et al., 2018) addresses two problems the other two approaches disregard. Naive target encoding of a categorical feature uses the same labels both to encode and to fit, which leaks information from the target into the inputs in a way that’s hard to spot. CatBoost’s ordered target statistics compute the encoding for row $i$ using only the labels of rows that came before it in a synthetic permutation, breaking the leak. The same idea applied to gradients gives ordered boosting, which avoids analogous leakage in the boosting update itself. CatBoost adds native categorical handling and uses symmetric (oblivious) trees, where every node at a given depth splits on the same feature and threshold, which compiles to branchless and consequently very efficient inference.
CatBoost’s ordered target statistics. After a synthetic permutation of the training rows, the encoding for row $i$ uses only preceding rows that share the same category. Row 4 (highlighted, category RED) sees rows 1 and 3 as its only RED predecessors; its encoding is the smoothed mean of their targets. Critically, row 4’s own target never feeds back into the input — naive target encoding would let it, which leaks signal from the target into the features.
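A minimal sketch of ordered target statistics, assuming the permutation is already applied and using CatBoost-style prior smoothing $(\sum y + a \cdot p)/(n + a)$; the function and variable names here are mine.

```python
import numpy as np

def ordered_target_stats(cats, y, prior=0.5, a=1.0):
    """Encode row i's category using only the targets of *earlier* rows
    with the same category, smoothed toward a prior. Row i's own label
    never leaks into its own encoding."""
    sums, counts = {}, {}
    enc = np.empty(len(y))
    for i, (c, t) in enumerate(zip(cats, y)):
        s, n = sums.get(c, 0.0), counts.get(c, 0)
        enc[i] = (s + a * prior) / (n + a)   # predecessors only
        sums[c] = s + t                       # row i now becomes a predecessor
        counts[c] = n + 1
    return enc

# Row 4 (index 3, RED) is encoded from rows 1 and 3 only: (1 + 0.5) / (2 + 1) = 0.5
cats = ["RED", "BLUE", "RED", "RED", "BLUE"]
y    = [1,      0,      0,     1,     1]
print(ordered_target_stats(cats, y))
```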
All of these frameworks are engineering victories on top of a core statistical idea: use decision trees to fit functions, and combat their inherent low-bias, high-variance nature with various tricks. Chiefly, all of the approaches share the same set of inductive biases. (An inductive bias is the set of assumptions a machine learning algorithm uses to predict outputs for unseen data. It enables models to generalize beyond training examples by prioritizing certain solutions over others.)
What do these biases look like, and why do they match tabular data so well? To see the answer we need a concrete picture of what real tabular data contains.
Production tables are wide and messy. A churn table at a telecom carrier sits in the hundred-column range. A few continuous fields (tenure, monthly charge, total charge) live on very different scales: months vs. a few dollars to thousands of dollars. A handful of low-cardinality categoricals (region, plan tier, contract type) sit beside high-cardinality ones (handset model, last-agent ID, marketing-campaign code) with thousands of distinct values, a long tail, and an "Other" bucket hiding a dozen different things. Sprinkle in Boolean flags (autopay, paperless billing) and dates encoded inconsistently across upstream systems. Many rows have a quarter of their fields missing, and the missingness is rarely random: a customer without a credit_score is often a customer who never had one, and that itself predicts churn.
The target is a step function. A customer either cancels within 30 days or does not, and the underlying probability surface has cliffs that depend on contract-end dates, price-tier crossings, recent service incidents. Most columns carry some information, but the bulk of the signal lives in maybe ten of them; the rest are weak or near-noise. Trees were built to fit exactly this. Here are five properties to illustrate the fit:
- Axis-aligned splits respect feature heterogeneity. A height column in centimetres and a price column in dollars are not commensurable. Trees never combine them linearly, only conditionally — if height > 170 then if price > 50000 then …. An MLP’s first layer is a linear map of the input, which implicitly treats the columns as living in a shared coordinate system.
- Piecewise-constant predictions handle non-smooth targets natively. Real tabular targets are very jagged. Credit risk has thresholds. Medical diagnoses have cutoffs. Pricing has regimes. A tree captures a step function with one split; an MLP is forced to approximate the same step with many smooth basis functions stacked in a precise manner.
- Scale and monotone transformations are free. Trees only care about feature ordering. Standardization, log-transforms, quantile mapping: none of these change the model (see the sketch after this list). An MLP’s optimization geometry depends on every one of those choices.
- Missingness is a routing decision. XGBoost’s default direction; CART’s surrogate splits. The tree learns where to send a row whose feature is absent, and often this missingness pattern itself carries a lot of signal (Twala et al., 2008).
- Calibration is structural. A leaf’s prediction is the empirical mean (or class frequency) of training labels in its region. The output reads as a probability without any post-hoc (e.g. Platt) scaling.
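A quick way to see the free-transformation property in code: fit the same tree on raw and log-transformed features and compare predictions. They should match, up to tie-breaking among equal-gain splits, because a monotone transform preserves the ordering the splits depend on.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(1, 100, size=(500, 3))            # strictly positive features
y = (X[:, 0] > 40).astype(float) + 0.1 * rng.standard_normal(500)

tree_raw = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)
tree_log = DecisionTreeRegressor(max_depth=4, random_state=0).fit(np.log(X), y)

X_test = rng.uniform(1, 100, size=(100, 3))
# Same predictions: splits depend only on feature *order*, which log preserves.
print(np.allclose(tree_raw.predict(X_test), tree_log.predict(np.log(X_test))))
```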
Each of these is an inductive bias (a bit loosely speaking, if I may!) the tree gets for free from its structure. None is a theoretical limit on what neural networks can do — the universal approximation theorem still says an MLP can approximate the same step functions, the same conditional rules, the same calibrated probabilities. The MLP just has to learn every one of them from finite data, while the tree gets them as the starting prior.
Qualitative axes where GBDTs are hard to beat
Decision trees also provide some additional qualitative advantages over MLPs, which make them a very sticky technology.
Calibration. Modern deep networks are badly miscalibrated out of the box (Guo et al., 2017), and the standard fixes — temperature scaling, deep ensembles (Lakshminarayanan et al., 2017), conformal wrappers (Angelopoulos & Bates, 2023) — add cost and another moving part to the pipeline. NGBoost (Duan et al., 2020) extends the GBDT recipe to full predictive distributions when a point estimate isn’t enough.
Missingness. Imputation choice rarely has a clean answer (Le Morvan et al., 2021); purpose-built deep architectures like NeuMiss (Le Morvan et al., 2020) close part of the gap but not all of it (Josse et al., 2024; Perez-Lebel et al., 2022).
Interpretability. TreeSHAP (Lundberg et al., 2020) computes exact Shapley values in polynomial time on trees, so every prediction is cheaply explainable. Deep-model attribution falls back on KernelSHAP (Lundberg & Lee, 2017), DeepLIFT, or LIME (Ribeiro et al., 2016), each with known failure modes. Moreover, the manual construction of features further reinforces this, as time spent carefully designing features makes for more interpretable SHAP scores.
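A sketch of how cheap this is in practice, assuming the xgboost and shap packages are installed; the data and model here are stand-ins.

```python
import numpy as np
import shap
import xgboost

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

model = xgboost.XGBClassifier(n_estimators=50).fit(X, y)

explainer = shap.TreeExplainer(model)        # exact, polynomial-time on trees
shap_values = explainer.shap_values(X[:10])
# One additive attribution per (row, feature):
# prediction = base value + sum of that row's SHAP values.
print(shap_values.shape)                      # (10, 5)
```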
Engineering-culture and org dynamics
Decades of work made experimenting with and running tree-based models in production a standard and straightforward engineering practice.
Inference economics and operational maturity. A GBDT prediction is $O(\text{depth})$ comparisons per tree, embarrassingly parallel across the ensemble, no matrix multiplies — sub-millisecond on a CPU at production volumes. Deep tabular models typically need ~10× more compute and GPU inference to meet the same throughput targets. Around this cheap inference grew a stable ops stack: scikit-learn API (Pedregosa et al., 2011), joblib serialization, ONNX / Treelite / m2cgen compilation targets, feature-store integration, lineage tracking — all of it predating deep tabular methods (and also years before running deep learning models in production was standard practice!).
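The comparisons-per-tree point is easiest to see for the oblivious trees mentioned earlier, where inference is literally a bit-index computation. A toy sketch (the layout and names are mine, not CatBoost’s internals):

```python
import numpy as np

def oblivious_tree_predict(X, features, thresholds, leaf_values):
    """Branchless inference for one oblivious tree of depth d: all nodes at
    depth k share (features[k], thresholds[k]), so d vectorized comparisons
    build a bit-index straight into the 2^d leaf array — no branching."""
    idx = np.zeros(len(X), dtype=np.int64)
    for f, t in zip(features, thresholds):
        idx = (idx << 1) | (X[:, f] > t)      # one comparison per depth level
    return leaf_values[idx]

# Depth-2 example: 4 leaves, each level's split shared across the level.
X = np.array([[1.0, 5.0], [3.0, 0.5]])
print(oblivious_tree_predict(X, features=[0, 1], thresholds=[2.0, 1.0],
                             leaf_values=np.array([0.1, 0.2, 0.3, 0.4])))
```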
Cultural memory. A decade of Kaggle leaderboards (Bojer & Meldgaard, 2021) documents overwhelming GBDT dominance among top finishers. And back in the 2010s Kaggle formed the basic education of most data scientists. Each XGBoost-shaped winning solution signals the approach to thousands of newcomers and compounds path-dependence.
Regulatory floors and the “it’s not broken” effect. Finance (ECOA, GDPR Article 22), healthcare (FDA SaMD guidance), and credit scoring (SR 11-7) impose effective floors on explainability that TreeSHAP satisfies and a 50-million-parameter transformer typically does not. Fairness auditing frameworks (Bellamy et al., 2018; Chouldechova, 2017; Ding et al., 2021; Friedler et al., 2019) are built around tabular inputs and work best with tree-based models (Han et al., 2024). On top of all this, a one-line XGBRegressor() with a sufficiently powerful auto-ML search achieves a baseline that is hard to beat, and it doesn’t really make operational sense for smaller ML teams to invest heavily in alternatives, particularly at small data scales.
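The one-line baseline, spelled out; the dataset loader below is just a stand-in.

```python
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = XGBRegressor().fit(X_train, y_train)   # defaults only — the whole baseline
print(model.score(X_test, y_test))             # R² on held-out rows
```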
The structural and operational case is reinforced by the empirical record. Shwartz-Ziv and Armon (Shwartz-Ziv & Armon, 2022) revisited 11 datasets that recent deep-tabular papers had themselves chosen to showcase their methods:
| Dataset | Metric | XGBoost | NODE | DNF-Net | TabNet | 1D-CNN |
|---|---|---|---|---|---|---|
| Rossmann | RMSE | 490.18 | 488.59 | 503.83 | **485.12** | 493.81 |
| Forest Cover | cross-entropy | 3.13 | 4.15 | 3.96 | **3.01** | 3.51 |
| Higgs | cross-entropy | 21.62 | 21.19 | 23.68 | **21.14** | 22.33 |
| Gas Concentrations | cross-entropy | 2.18 | 2.17 | **1.44** | 1.92 | 1.79 |
| Eye Movements | cross-entropy | **56.07** | 68.35 | 68.38 | 67.13 | 67.90 |
| Gesture Phase | cross-entropy | **80.64** | 92.12 | 86.98 | 96.42 | 97.89 |
| Year Prediction | RMSE | 77.98 | **76.39** | 81.21 | 83.19 | 78.94 |
| MSLR | cross-entropy | **55.43** | 55.72 | 56.83 | 56.04 | 55.97 |
| Epsilon | cross-entropy | 11.12 | **10.39** | 12.23 | 11.92 | 11.08 |
| Shrutime | cross-entropy | **13.82** | 14.61 | 16.80 | 14.94 | 15.31 |
| Blastchar | cross-entropy | **20.39** | 21.40 | 27.91 | 23.72 | 24.68 |
Lower is better in all cases. Cross-entropy values are reported ×100 (Shwartz-Ziv & Armon’s convention). Means over 4 runs, ± standard errors omitted for brevity. Bold marks the row winner. Reproduced from Shwartz-Ziv & Armon.
Each deep model wins on its own benchmark datasets and loses elsewhere: TabNet on Rossmann / Forest Cover / Higgs, DNF-Net on Gas, NODE on Epsilon and Year. On the datasets unseen to a given deep model (i.e. the ones not included in the original paper), XGBoost outperforms it in 8 of 11 cases, with statistical significance. Two complementary studies extended the pattern: Grinsztajn et al. (Grinsztajn et al., 2022) ran a controlled 45-task benchmark with a 20,000-compute-hour random search and found tree-based models superior at every random-search budget; TabZilla (McElfresh et al., 2023) mapped the regime more carefully and found a narrow band of task structures where neural networks win, and a broad complement where they do not.
What’s to come
During the 2010s we saw a monumental shift in machine learning: a convergence across many of its sub-domains to the same deep learning paradigm. AlexNet (Krizhevsky et al., 2012) redefined vision in 2012 with ConvNets. The LeCun–Bengio–Hinton Nature review (LeCun et al., 2015) codified the expectation that the same playbook would generalize — and they were right! ResNet (He et al., 2016) pushed network depth to hundreds of layers. The Transformer (Vaswani et al., 2017) reframed sequence modelling and brought forth the age of LLMs. By the early 2020s every modality had converged on a transformer-shaped backbone: decoder-only LLMs in language (Brown et al., 2020), ViT in vision (Dosovitskiy et al., 2021), and Whisper in speech (Radford et al., 2023).
The story is very different when speaking strictly about tabular data. If you were to randomly sample an ML system running in production during the 2010s, you would struggle to find neural networks making predictions on structured inputs. (The 2010s saw wide adoption of deep learning across the board, including various forms of DL embeddings as inputs to downstream tabular models, but it never quite replaced boosted decision trees en masse.) The convergence-on-transformers pattern that swept every other modality fell flat on its face at structured data. This leaves a real question hanging: why didn’t deep tabular models achieve success as widespread as their deep learning counterparts in other ML domains? This is the puzzle we turn to in the next chapter: Revisiting the MLP prior.
Footnotes
- Friedman’s reframing is extremely elegant and the paper is still a great read. The training loss $L(F) = \sum_i \ell(y_i, F(x_i))$ is a functional of the prediction function $F$; each weak learner $h_t$ is a basis function in the space of functions $F$ might live in; the boosting update $F_t = F_{t-1} + \eta h_t$, where $h_t \approx -\nabla_F L(F_{t-1})$, is a gradient step in that function space. Different losses (squared error, log loss, Huber) correspond to fitting trees to different residual targets, all from one recipe (Friedman, 2001). ↩