This post presents a technique that's served me well on a past project involving a lot of ill-behaved data: using Decision Trees for data exploration, and more specifically for investigating pathological properties in datasets, such as missing data attributes.
This can be especially useful for maintaining complex databases, obtained by aggregating many external sources. Decision Trees are an old and well-known technique, but I don't know that they're often considered for this use case. From what I've seen, Decision Trees are good at "reverse-engineering" the superposition of factors that result in incomplete data.
Let me know what you think! I'd love to know how other people approach this sort of problem. I might pull this out into Open-Source tooling someday, although making it generic enough could be challenging. In my experience, implementing this atop an existing database is one or two days' work.
Suppose you're working on a software project that consists of leveraging data you don't understand very well (typically because the data comes from external and disparate sources).
Here are some examples of such data, from my own experience:
We'll use this last example (academic works) as an illustration for this article.
So, picture yourself developing a search engine for academic content (exciting, isn't it?). One morning, your Product Manager colleague bursts into your office:
"- Hey, I just used the search engine and noticed some articles are lacking a PDF link, what's going on?"
Your colleague just pointed out some pathological property of the data, which you must now investigate. Here are a few readily available investigation strategies:
This article describes a technique for automating approach 3: an algorithm that will run many database queries for you, more thoroughly than your patience would allow, and present you with some interesting statistical correlations to guide your investigation.
You fill in the property "lacks a PDF link" in a script, which runs for a few seconds or minutes, then produces an output such as this:
(This is displayed as coloured terminal output, but you could imagine other media e.g a web browser UI.)
This sort of visualization is rather dense, but with some training it becomes quite readable. It gives us a finer view of how the pathological property is distributed in our dataset, partitioning the data into regions where it's particularly present or rare. Here's how you would interpret it:
- At the top of the tree, the pathological property under investigation is recalled, expressed as a query: `{:bool {:must_not {:exists {:field :article_url_for_pdf}}}}`.
- The root node tells us that the dataset contains about 100 million articles (`n=1.0e+08`), 80.2% of which lack a PDF link (`p=8.02e-01`).
- The first split (`>> SPLIT BY :article_is_oa`) breaks the articles down by Open-Access status: 25 million are Open-Access (`OA`, `n=2.5e+07`), 77 million are not (`not OA`, `n=7.7e+07`).
- Essentially all of the non-Open-Access articles lack a PDF link (`p=1.0e+00`). That's our "educated guess" from above, and indeed explains most of the phenomenon. However, that's not the end of the story.
- Among the Open-Access articles, 19.7% still lack a PDF link (`p=1.97e-01`).
- A further split singles out 2 publishers (`∈ #{"Elsevier BV" "Public Library of Science (PLoS)"}`) with an unusually high rate of 78.2% of missing PDF links, whereas the 20 million articles not published by these 2 publishers average 6.72% missing PDF links. So you might say that `Elsevier BV` and `PLoS` are exceptionally bad at providing PDF links for their Open-Access content, and you might want to reach out to them to fix the issue.
- Further down, other publishers are singled out as either exceptionally good (`"American Physical Society (APS)"` etc.) or exceptionally bad (`"Medknow"`) at providing PDF links.
- Another split (`>> SPLIT BY [:article_year 2000]`) breaks a subset down by publication date: of 14 million documents, 3.2 million were published before 2000 with 2.45% of them lacking a PDF link, 10 million were published after 2000 with 10.3% of them lacking a PDF link, and 600,000 have no known publication date with 12.5% of them lacking a PDF link. What happened after 2000?

In summary, this visualization leads us to "explain" the lack of PDF links by a superposition of various factors: predominantly the non-Open-Accessibility, but also differences in publisher data, and finally chronology. It's unlikely we would have spotted all of these trends through manual analysis, whereas obtaining this visualization required no more effort than writing a single database query.
The rest of the article will be devoted to how these trees are produced, but I want to emphasize the following: you don't need to understand the algorithm to benefit from it. The point of the algorithm is to exhibit some interesting statistics, which are useful and understandable in and of themselves, and as a user you don't need to know how the algorithm came up with them.
The above visualization relies on an established Machine Learning algorithm: Decision Trees, which I'll re-introduce in this section. Note that whereas the conventional use of Decision Trees is prediction on future inputs, here we use them for data visualization and exploration of existing data.
The idea of Decision Trees is rather simple: they break down the data in the 'most informative' way for investigating some property, splitting it into several subsets, then apply the same procedure to each resulting subset.
What does a 'most informative' breakdown consist of? To illustrate, let's first work from intuition.
Suppose I give you a breakdown of the articles by both geographic area and publication era (numbers are deliberately made up and inconsistent with the previous section):
Total articles: 108.2 million, 33.8 million of them lacking a PDF link (31.2%).
By Geographic Area:
| Geographic Area ($s$) | # articles ($N_s$) | lacking PDF link ($N_s^+$) |
|---|---|---|
| Asia | 45.7 million | 14.9 million (32.7%) |
| Europe & Africa | 26.9 million | 8.18 million (30.4%) |
| Americas | 35.6 million | 10.6 million (29.8%) |
By Publication Era:
| Publication Era ($s$) | # articles ($N_s$) | lacking PDF link ($N_s^+$) |
|---|---|---|
| 1900-1999 | 42.2 million | 5.53 million (13.1%) |
| 2000-2020 | 66.0 million | 28.1 million (42.7%) |
Which of these two tables helps you the most in investigating the lack of PDF links? If that's not clear, maybe a picture will help:
In this case, I think it's intuitively obvious that Publication Era helps us more than Geographic Area, as the former heavily influences the proportion of missing PDF links. The essence of our algorithm is that it will automatically choose to show you the more helpful of the two charts (and then dig deeper).
Now all we need is a bit of "Artificial Intelligence" to translate this intuition to a computation. One insight behind Decision Trees is to use notions from the field of Information Theory to do that.
Concretely, we choose the split criterion $S$ (such as 'Publication Era' or 'Geographic Area') that yields the highest Total Information Gain $G[S]$, a quantity measured in bits × datapoints, and computed as:
$$ G[S] := N . H_2\left(\frac{N^+}{N}\right) - \sum_s{N_s . H_2\left(\frac{N_s^+}{N_s}\right)} $$

... in which:

- $N$ is the number of datapoints at the current node, and $N^+$ the number of those exhibiting the pathological property;
- $s$ ranges over the subsets into which $S$ splits the data, with $N_s$ the number of datapoints falling into subset $s$ and $N_s^+$ the number of those exhibiting the property;
- $H_2$ is the binary entropy function (defined in the appendix).
A theoretical justification for this formula can be found in the appendix. Sorry if the notation is a bit intimidating to some readers - what's important for practical problem-solving is not understanding the math, but being able to program it.
In our example, the $N_s$ and $N_s^+$ for each candidate Split Criterion can be directly read from the frequency tables above; also $N^+ = \left( \sum_s N_s^+ \right)$, and $N = \left( \sum_s N_s \right)$. So in order to compute $G[S]$, all the data querying we have to do consists of computing the frequency table for $S$, which we can then feed into the above formula.
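For concreteness, here is a minimal sketch (in Python, my own illustration rather than the original tooling) of that computation, applied to the two made-up frequency tables above; the `h2` helper is the binary entropy function discussed in the appendix:

```python
import math

def h2(p):
    """Binary entropy in bits; by convention, 0 * log2(1/0) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return p * math.log2(1.0 / p) + (1.0 - p) * math.log2(1.0 / (1.0 - p))

def total_information_gain(freq_table):
    """freq_table: list of (N_s, N_s_pos) pairs, one per subset produced by the split."""
    N = sum(n for n, _ in freq_table)
    N_pos = sum(n_pos for _, n_pos in freq_table)
    return N * h2(N_pos / N) - sum(n * h2(n_pos / n) for n, n_pos in freq_table)

# The frequency tables above, in millions of articles:
by_geographic_area = [(45.7, 14.9), (26.9, 8.18), (35.6, 10.6)]  # Asia, Europe & Africa, Americas
by_publication_era = [(42.2, 5.53), (66.0, 28.1)]                # 1900-1999, 2000-2020

print(total_information_gain(by_geographic_area))  # tiny gain
print(total_information_gain(by_publication_era))  # much higher gain: Publication Era wins
```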
Once we have computed the best Split Criterion, we can use it to split the dataset into several subsets, creating the first children nodes in our tree. Then we can apply the same procedure to each child node... thus growing our tree from the top.
Here are some practical considerations:
With that in mind, growing the tree can be viewed as a search algorithm: at each iteration, you're looking for an optimal leaf node and Split Criterion to expand it. The Total Information Gain can be compared across nodes, making it suitable to guide that search.
One option is to search exhaustively: always consider all leaf nodes, displaying only those that have contributed the highest information gain. If that's too slow, you might want to narrow your search with a heuristic. In both cases, you will want to have some priority ordering of leaf nodes: for instance, you might give precedence to the leaf nodes of highest residual entropy $N . H_2\left(\frac{N^+}{N}\right)$; you could also refine that expression to penalize nodes of high depth, e.g $K^d . N . H_2\left(\frac{N^+}{N}\right)$ where $d$ is the depth of the node and $K < 1$.
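To make that search loop concrete, here is a minimal best-first sketch (my own illustration, not the article's actual implementation). It assumes two hypothetical helpers, `candidate_split_criteria(leaf)` and `evaluate_split(leaf, criterion)` (the latter querying the datastore and returning a `(total_information_gain, child_nodes)` pair), leaf objects carrying `depth`, `n` and `n_pos` attributes, and the `h2` helper from the earlier sketch:

```python
import heapq
import itertools

K = 0.7  # depth penalty, K < 1

_tiebreak = itertools.count()  # avoids comparing leaf objects when priorities are equal

def priority(leaf):
    # Most promising leaves first: high residual entropy, weighted by size, penalized by depth.
    return -(K ** leaf.depth) * leaf.n * h2(leaf.n_pos / leaf.n)

def grow_tree(root, max_expansions=20):
    frontier = [(priority(root), next(_tiebreak), root)]
    for _ in range(max_expansions):
        if not frontier:
            break
        _, _, leaf = heapq.heappop(frontier)
        # Evaluate every candidate Split Criterion at this leaf, keep the most informative one.
        candidates = [evaluate_split(leaf, s) for s in candidate_split_criteria(leaf)]
        if not candidates:
            continue
        best_gain, children = max(candidates, key=lambda pair: pair[0])
        leaf.children = children
        for child in children:
            heapq.heappush(frontier, (priority(child), next(_tiebreak), child))
    return root
```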
As we've seen, to evaluate Split Criteria, you need to compute count aggregations at the leaf nodes of the tree. If the size of the dataset is small enough, you can implement the aggregations in-memory. Otherwise, you will probably want to do that by sending queries to a datastore such as ElasticSearch or a SQL database.
In a typical SQL implementation, the path to a leaf node will translate to a `WHERE` clause, each element of that path adding an `AND`; and the Split Criterion being evaluated will correspond to a `GROUP BY` clause.
For instance, in our example tree from above, the `>> SPLIT BY :top_publishers` below `NODE<3>` (in blue) could be evaluated by the following query:
SELECT article_publisher, COUNT(*) AS N_s, SUM(one_if_pathological) AS N_s_pos FROM (
  SELECT article_id, article_publisher, (CASE WHEN article_url_for_pdf IS NULL THEN 1 ELSE 0 END) AS one_if_pathological
  FROM articles
  WHERE article_is_oa
  AND article_publisher NOT IN ('Elsevier BV', 'Public Library of Science (PLoS)')
) AS t
GROUP BY article_publisher
Another strategy is to compute the aggregation for several leaf nodes at the same time, using a conditional expression that computes the 'id' of the leaf nodes. The advantage is to run one query for several leaf nodes; the downside is that you scan more data and might not leverage indexes optimally. For example, here's a query that would evaluate the 'publishers' Split Criterion for nodes `<02>`, `<04>` and `<05>` in one fell swoop:
SELECT node_id, article_publisher, COUNT(*) AS N_s, SUM(one_if_pathological) AS N_s_pos FROM (
  SELECT
    article_id,
    article_publisher,
    (CASE WHEN article_url_for_pdf IS NULL THEN 1 ELSE 0 END) AS one_if_pathological,
    (CASE WHEN article_is_oa THEN
       (CASE WHEN article_publisher IN ('Elsevier BV', 'Public Library of Science (PLoS)') THEN
          4
        WHEN article_publisher IN ('American Physical Society (APS)', 'American Society for Biochemistry & Molecular Biology (ASBMB)' /* ... */) THEN
          5
        ELSE -1 END)
     ELSE 2 END
    ) AS node_id
  FROM articles
) AS t
WHERE node_id <> -1
GROUP BY node_id, article_publisher
I do not know of a satisfactory way to evaluate several Split Criteria in one query in SQL. This is something that's straightforward to do with ElasticSearch aggregations. In addition, ElasticSearch has the advantage of enabling you to trade accuracy for speed, making it well suited for such use cases.
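As an illustration of evaluating several Split Criteria in one request (a sketch of my own using standard ElasticSearch bucket aggregations, with the field names from the running example, not the article's actual queries), the request body could look like this:

```python
# The node's path in the tree translates to a filter; each Split Criterion becomes
# one aggregation, with a sub-aggregation counting the pathological documents.
node_filter = {"bool": {"must": [{"term": {"article_is_oa": True}}]}}

request_body = {
    "size": 0,  # we only need the aggregation counts, not the documents themselves
    "query": node_filter,
    "aggs": {
        "by_publisher": {  # Split Criterion 1: publisher
            "terms": {"field": "article_publisher", "size": 100},
            "aggs": {"lacks_pdf": {"missing": {"field": "article_url_for_pdf"}}},
        },
        "by_year": {  # Split Criterion 2: publication year, bucketed by decade
            "histogram": {"field": "article_year", "interval": 10},
            "aggs": {"lacks_pdf": {"missing": {"field": "article_url_for_pdf"}}},
        },
    },
}
# Each returned bucket then yields N_s (its doc_count) and N_s^+ (the doc_count
# of its "lacks_pdf" sub-aggregation).
```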
In this section we examine some special cases of Split Criteria that demand special processing.
When your data has a temporal feature, you can't directly break it down by that feature - for example, if the publication date is millisecond-encoded, all articles may have a distinct publication date, and your dataset would be split into as many subsets as there are articles, which would make your Decision Tree unreadable.
One strategy is to bucket the data, for example by decade. Even that may leave too many subsets for good readability.
Therefore, one strategy to consider is to make a binary temporal split: split the data into "before" and "after" a specific "pivot" date. The question then becomes: how to find the most informative pivot date?
The strategy I recommend is to bucket the data into intervals (e.g decades or years, even finer depending on your needs), then search for the optimal pivot date amongst interval bounds, i.e the one that yields the highest Information Gain. By using cumulative counts, this search can be done in time linear in the number of buckets.
Of course, this logic can be applied to any real-valued feature, not just dates.
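Here is a minimal sketch of that pivot search (my own illustration, reusing the `h2` helper from the earlier sketch): given a pre-computed frequency table of consecutive intervals, it returns the interval bound whose before/after split yields the highest Total Information Gain.

```python
def best_pivot(bucketed_counts):
    """bucketed_counts: ordered list of (upper_bound, N_s, N_s_pos) tuples, one per interval."""
    N = sum(n for _, n, _ in bucketed_counts)
    N_pos = sum(n_pos for _, _, n_pos in bucketed_counts)
    base_entropy = N * h2(N_pos / N)
    best_bound, best_gain = None, 0.0
    before_n = before_pos = 0
    for bound, n, n_pos in bucketed_counts[:-1]:  # splitting after the last bucket splits nothing
        before_n += n        # cumulative counts: the 'before the pivot' side
        before_pos += n_pos
        after_n, after_pos = N - before_n, N_pos - before_pos
        gain = base_entropy - before_n * h2(before_pos / before_n) - after_n * h2(after_pos / after_n)
        if gain > best_gain:
            best_bound, best_gain = bound, gain
    return best_bound, best_gain

# e.g. with yearly buckets of (year, #articles, #articles lacking a PDF link):
# best_pivot([(1990, 1200, 150), (2000, 3100, 400), (2010, 5200, 2100), (2020, 2500, 1300)])
```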
Some categorical features, such as 'publisher' in our example, can take many values - hundreds or thousands. That's too many to make a useful split, as the Decision Tree would become unreadable (even though the Information Gain might be very high).
Again, one mitigation strategy is to make a binary split: split the data points as belonging or not belonging to a small subset of the categorical values. The question then becomes one of choosing that subset. One practical approach is to only consider the most frequent values, grouping all the less frequent ones under a catch-all `"OTHER"` value, and then to look among those values for the subset that yields the highest Information Gain.
I've been careful to use the (unconventional) term "Split Criterion" rather than "feature", for good reasons: there needn't be a one-to-one correspondence between split criteria and features. There can be many ways to split a dataset based on one feature, and a particular split may involve more than one feature.
I don't know if this fact is well-known amongst Machine Learning practitioners: all of the Decision Tree implementations I've seen out there limit themselves to axis-aligned splits on pre-computed features, but this limitation is by no means essential. One reason for this might be preventing over-fitting, but I suspect another one is simply programming convenience.
In particular, the output of any classification or clustering algorithm can be used as a split criterion. For example, on a geospatial feature, you could use a clustering algorithm to bucket the data, and use that as the Split Criterion.
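For instance, here is a minimal sketch of that idea (my own example, assuming scikit-learn is available and a hypothetical array of per-article latitude/longitude coordinates, which is not a field from the running example):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical (lat, lon) coordinates, one row per article.
articles_latlon = np.array([[48.85, 2.35], [40.71, -74.0], [35.68, 139.69], [51.51, -0.13]])

# Bucket the articles into k geographic clusters; the resulting cluster id is a
# categorical value we can split on, just like 'publisher' or Open-Access status.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(articles_latlon)
cluster_ids = kmeans.labels_
```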
For our use case, an important issue to keep in mind is that of interpretability: since we use them as clues to investigate our pathological property, Split Criteria have to be understandable. While you could in principle train a neural network as a way to propose a Split Criterion, you probably couldn't make sense of the output.
For pattern recognition tasks (classification or regression), a well-known shortcoming of Decision Trees is their lack of robustness: Decision Trees tend to badly over-fit the training data. For this reason, practitioners will tend to use other techniques, usually at the cost of interpretability.
However, for our use case, I think this issue is much less of a problem. What we expect of Decision Trees here is much less than pattern recognition: we merely want a big picture of how a pathological property is distributed in our data to guide our investigation. In particular:
In this section I'll motivate from information-theoretic notions the formula for the Total Information Gain of a Split Criterion $S$ at a given tree node:
$$ G[S] := N . H_2\left(\frac{N^+}{N}\right) - \sum_s{N_s . H_2\left(\frac{N_s^+}{N_s}\right)} $$

In summary: the Total Information Gain consists of the Mutual Information between the Split Criterion $S$ and the Pathological Property $P$, scaled by the number of datapoints $N$ at the node, i.e $G[S] := N . I[P,S]$. The reason for scaling the Mutual Information by $N$ is to make the computed scores comparable across tree nodes.
The information content of a random event $A$, measured in bits, is defined as $\log_2\left(\frac{1}{\mathbb{P}(A)}\right)$. It can be understood as the amount of "surprise" we get from observing $A$. If $A$ is certain ($\mathbb{P}(A) = 1$), there is no surprise in observing $A$, and the information content is zero. If $A$ is impossible ($\mathbb{P}(A) = 0$), then the information content is infinite, but that's not an eventuality we need to consider, as it will never happen.
In particular, if $X$ is a discrete random variable and $x$ a particular value that $X$ might take, the information content of $X=x$ is $\log_2\left(\frac{1}{\mathbb{P}(X=x)}\right)$.
The entropy $H[X]$ of $X$ is then defined as the expected information content of observing $X$:
$$ H[X] := \mathbb{E}_X\left[\log_2\left(\frac{1}{\mathbb{P}(X)}\right)\right] = \sum_x{\mathbb{P}(X=x) \log_2\left(\frac{1}{\mathbb{P}(X=x)}\right)} $$

Equivalently, we can define the entropy of a probability distribution rather than a random variable:
$$ H(x \mapsto p(x)) := \sum_x{p(x) \log_2\left(\frac{1}{p(x)}\right)}$$

The entropy $H[X]$ can be thought of as an additive measure of the information we lack about $X$, of how unpredictable it is. In particular, the entropy is never negative, and is zero if and only if $X$ is fully deterministic, i.e if $X$ takes a particular value with probability 1.
In the special case of a binary variable $X$ (i.e one that follows a Bernoulli distribution), the entropy is expressed using the binary entropy function: $H[X] = H_2(\mathbb{P}[X=x])$, where $x$ is any one of the two values taken by $X$, and
$$ H_2(p) := p \log_2\left(\frac{1}{p}\right) + (1-p) \log_2\left(\frac{1}{1-p}\right) $$

Given two discrete random variables $X$ and $Y$, $(X, Y)$ is itself a random variable, whose entropy we denote $H[X,Y]$.
If a particular outcome $X=x$ is observed, then we can consider the entropy of the conditioned probability distribution for $Y$:
$$ H[Y|X=x] := H(y \mapsto \mathbb{P}(Y=y | X=x)) = \sum_y{\mathbb{P}(Y=y|X=x) \log_2\left(\frac{1}{\mathbb{P}(Y=y|X=x)}\right)} $$

This can be thought of as the amount of uncertainty that remains about $Y$ when we know that $X=x$.
The conditional entropy $H[Y|X]$ of $Y$ given $X$ is the expected value of the above quantity:
$$ H[Y|X] := \mathbb{E}_{x \sim X}\left[H[Y|X=x]\right] = \sum_x \mathbb{P}(X = x) H[Y|X=x]$$

It can be thought of as the (average) remaining uncertainty we have about $Y$ when we can observe $X$. In particular, $H[Y|X] \leq H[Y]$ (observing $X$ can only reduce our uncertainty about $Y$), and $H[Y|X] = 0$ when $Y$ is fully determined by $X$.
The Mutual Information $I[X,Y]$ between $X$ and $Y$ is defined as:
$$ I[X,Y] := H[Y] - H[Y|X] $$

The mutual information can be understood as the amount of information we gain (or equivalently, the amount of uncertainty we lose, hence the formula) about $Y$ by having access to $X$.
Mutual Information has several intuitive properties: it is symmetric ($I[X,Y] = I[Y,X]$), it is never negative, it is zero if and only if $X$ and $Y$ are independent, and it can never exceed $H[X]$ or $H[Y]$.
The magic of Information Theory is to translate informal notions of uncertainty to numbers that add up. The quantities we've defined enjoy many additive relations to each other, summarized in this figure (courtesy of MacKay):
The properties of Mutual Information make it an attractive heuristic for finding the most helpful Split Criterion: to find a Split Criterion $S$ that "tells" us as much as possible about the pathological property $P$, we choose the one that has the highest Mutual Information with $P$ in the sense of Information Theory.
This leads us to score Split Criteria using the Total Information Gain defined as:
$$ G_\nu[S] := N(\nu) . I_\nu[P, S] $$

Here $\nu$ represents the tree node we're considering: the notation makes explicit the fact that all counts and probabilities are taken within the subset of the data represented by node $\nu$.
You may find it strange that we multiplied by $N(\nu)$. We will justify that in the next section - for now, observe that this does not change the order in which we rank Split Criteria at a given node.
For now, let's derive the formula for $G[S]$ in terms of counts:
$$ G[S] := N . I[P, S] = N . (H[P] - H[P|S]) = N . \left(H_2(\mathbb{P}(P)) - \sum_s{\mathbb{P}(S=s) \, H_2(\mathbb{P}(P|S=s))}\right) $$

We now approximate probabilities by empirical frequencies, yielding the formula as promised:
$$ G[S] = N . \left(H_2\left(\frac{N^+}{N}\right) - \sum_s{\frac{N_s}{N} H_2\left(\frac{N_s^+}{N_s}\right)}\right) $$

$$ G[S] = N . H_2\left(\frac{N^+}{N}\right) - \sum_s{N_s . H_2\left(\frac{N_s^+}{N_s}\right)} $$

At this point, it can be insightful to plot the $H_2$ function, to understand how the above formula strikes a balance between purity and cardinality:
import my_article_support
my_article_support.plot_h2()
What remains to be explained is why we have scaled the Mutual Information by a factor of $N(\nu)$. The reason is that it makes the computed score $G_\nu[S]$ comparable not only across Split Criteria, but also across tree nodes. This makes intuitive sense - you don't care about gaining predictability when very few datapoints are concerned. We'll now provide a more principled justification for that choice.
First, observe that any Decision Tree $T$ can be viewed as a (sophisticated) way of splitting the data, partitioning it into as many subsets as it has leaf nodes.
It therefore makes sense to rank Decision Trees by their (global) Mutual Information $I[P,T]$. Tree-growing is now viewed as searching for Decision Trees with a high $I[P,T]$ score.
Now, for our iterative tree-growing algorithm, the question becomes: given a tree $T$, how to expand it into a tree $T'$ in a way that achieves the highest possible $I[P,T']$ score?
Denote $T_{\nu \prec S}$ the tree obtained by expanding $T$ at leaf node $\nu$ with Split Criterion $S$. It is easily verified that:
$$I[P,T_{\nu \prec S}] = I[P,T] + \mathbb{P}(\nu) . I_\nu[P, S]$$

Therefore, the "progress" in Mutual Information that we made by expanding $\nu$ with $S$ amounts to $\mathbb{P}(\nu) . I_\nu[P, S]$ bits.
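For readers who want to check this identity, here is a sketch of the verification: viewing a tree as a partition, $H[P|T] = \sum_{\lambda} \mathbb{P}(\lambda) H[P|\lambda]$ where $\lambda$ ranges over the leaf nodes of $T$, and expanding $\nu$ with $S$ only replaces the term for $\nu$, so that

$$ H[P|T] - H[P|T_{\nu \prec S}] = \mathbb{P}(\nu) H[P|\nu] - \sum_s \mathbb{P}(\nu, S{=}s) \, H[P|\nu, S{=}s] = \mathbb{P}(\nu) \left( H_\nu[P] - H_\nu[P|S] \right) = \mathbb{P}(\nu) . I_\nu[P, S] $$

and the identity follows from $I[P,T] = H[P] - H[P|T]$.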
Denoting $N^{(\text{all})}$ the size of the global dataset and equating probabilities to empirical frequencies, this can be rewritten to:
$$\mathbb{P}(\nu) . I_\nu[P, S] = \frac{N(\nu)}{N^{(\text{all})}} . I_\nu[P, S] = \frac{1}{N^{(\text{all})}} . G_\nu[S]$$

Since $\frac{1}{N^{(\text{all})}}$ is a positive constant, this justifies $G_\nu[S]$ as a global measure of information gain, with the added benefit of not requiring us to carry $N^{(\text{all})}$ around in computations.
(By the way, $\mathbb{P}(\nu) . I_\nu[P, S]$ is what is conventionally called "Information Gain" in the literature; that's why I've called $G[S]$ the Total Information Gain, to make the distinction more explicit.)
Our definition of Information Gain uses binary entropy as an impurity metric. For completeness, I should mention that other impurity metrics are sometimes used for choosing the Split Criterion: the most frequent is Gini Impurity, which concretely results in replacing the $H_2$ function with $(p \mapsto 4 p (1 - p))$.
my_article_support.plot_impurity_metrics(0., 1.)
In most cases, I doubt the choice matters much: numerically both functions are similar except in extreme purity regions.