Random tricks for computing costly sums

De la mesure en toute chose : ni trop, ni trop peu !1
  — Lieutenant-Colonel Dulac

1. Introduction

It amazes me how many analytical tasks reduce to estimating some big sum or integral. That may mean calculating probabilities or expectations, aggregating zonal statistics across a landscape, measuring carbon stocks in a project area, averaging telemetry data, etc. Another frequent use case in scientific computing is estimating how long some computation would take if carried out exhaustively. Summation is particularly challenging when the terms of the sum are costly to compute and/or when the set of terms is large or infinite, as in the case for continuous integrals and probabilistic calculations.

This article aims to show you techniques for estimating costly sums approximately, through the use of random sub-samples of the terms. We will start with the basic case of uniform sub-samples of a finite discrete set of terms, which we will summarize here by:

\begin{equation} \label{org7d4e772} \sum_{x \in \mathcal{X}}{f(x)} \overset{\mathbb{E}}{\approx} \frac{|\mathcal{X}|}{n} \sum_{j=1}^n{f(X_j)} \end{equation}

We will then study more refined sub-sampling schemes, yielding estimators of increasing efficiency and applicability, thus adapting to situations in which the cost structure of the computation is more and more challenging. This process will walk us through typical methods like Importance Sampling and stratification, and culminate at viewing the sub-sample \((X_j)_j\) as a random measure and considering its expectation measure \(\nu\), at which point equation \eqref{org7d4e772} will have been generalized to:

\begin{equation} \label{org6c09faa} \int_{x \in \mathcal{X}}{f(x) d\mu(x)} \overset{\mathbb{E}}{\approx} \sum_j{\frac{d\nu}{d\mu}(X_j)^{-1} f(X_j)} \end{equation}

in which the weight \(\frac{d\nu}{d\mu}(X_j)^{-1}\) compensates for how the sampling procedure over-represents \(X_j\) compared to the "natural" measure \(\mu\). Equation \eqref{org6c09faa} is the key to mechanically deriving the appropriate estimator for essentially any sub-sampling scheme.

Readers uninterested in this level of abstraction are encouraged to pick and choose among the more concrete embodiments of equation \eqref{org6c09faa}. Equations \eqref{org31d1d44}, \eqref{org5661861}, \eqref{org95735c6}, \eqref{org0e22f27} and \eqref{org4d0939e} are useful checkpoints.

My main goal in this article is reducing computational costs and getting answers faster. However, some readers may also enjoy the little tour of measure theory and probability theory that we will take along the way.

Mathematical notation: \(\mathbb{E}[Z]\) denotes the expectation of random variable \(Z\); \(\text{Var}[Z]\) is its variance. \(A \overset{\mathbb{E}}{\approx} B\) is a shorthand for saying that \(A\) and \(B\) are equal in expectation, i.e. \(\mathbb{E}[A] = \mathbb{E}[B]\). \(|\mathcal{X}|\) denotes the number of elements in set \(\mathcal{X}\). \(A := B\) means that \(A\) is defined as \(B\). \((X_j)_{j=1}^n\) - sometimes abbreviated to \((X_j)_j\) - represents a collection (in this article, a random sample) of size \(n\) indexed by \(j\) - in plain English, \(X_j\) is the \(j-th\) element of \((X_j)_j\).

(Note: if the math formulas display poorly in your web browser, try right-clicking a formula > Math Settings > Math Renderer > Common HTML).

2. Goal: estimating a big costly sum

I will assume that our goal is to estimate approximately the value \(S\) of summing \(f(x)\) for \(x\) ranging over a large set \(\mathcal{X}\):

\begin{equation} \label{org2fba7dc} S := \sum_{x \in \mathcal{X}}{f(x)} \end{equation}

The assumptions are that:

  1. \(f(x)\) is costly to evaluate, so that it would be prohibitively expensive to compute \(S\) exactly.
  2. \(\mathcal{X}\) is a large discrete set that we can enumerate quickly enough: \(\mathcal{X} = \left\{x_1, x_2, \dots \right\}\). (We'll relax this assumption later, see section 7.)
  3. It's fine to do an approximate estimate of \(S\) - typically, \(f(x)\) itself is an approximation.

3. Using a random sub-sample

The main idea of this article is: we can estimate \(S\) by drawing a random sub-sample, yielding a cheaper computation. In the most basic form, we will use a uniform i.i.d. sub-sample: that is, we will randomly sample \(n\) values \(\left(X_j\right)_{j=1}^{n}\), by drawing \(n\) times with replacement from \(\mathcal{X}\): each \(x \in \mathcal{X}\) has the same (hence "uniform") probability \(1/|\mathcal{X}|\) of being drawn at each iteration, and the iterations are performed indenpendently (hence "i.i.d.", for Independent and Identically Distributed).

Having drawn the sample \(\left(X_{j}\right)_j\), we can estimate \(S\) by using formula:

\begin{equation} \label{org31d1d44} S \overset{\mathbb{E}}{\approx} \hat{S} := \frac{|\mathcal{X}|}{n} \sum_{j}{f(X_j)} \end{equation}

Equation \eqref{org31d1d44} is akin to estimating an expectation using the sample mean. \(\hat{S} \overset{\mathbb{E}}{\approx} S\) means that \(\hat{S}\) is an unbiased estimator of \(S\), that is \(\mathbb{E}\left[\hat{S}\right] = S\). What's more, the Law of Large Numbers tells us that this estimate becomes more precise as \(n\) is made large. More specifically, the Central Limit Theorem tells us that the estimation error approximately follows a Gaussian distribution, and has standard deviation proportional to \(n^{- 1/2}\):

\begin{equation} \label{org77b9cae} \text{Var}\left[\hat{S}\right] = \frac{|\mathcal{X}|^2}{n}\text{Var}\left[f(X_1)\right] \end{equation}

4. Other uniform sampling schemes

In the previous sections, we drew the sample \(\left(X_j\right)_{j=1}^{n}\) uniformly with replacement. However, we could also have used other sampling strategies, for example:

  1. sampling without replacement, which can be achieved by shuffling \(\mathcal{X} = \left\{x_1, x_2, \dots \right\}\) to random order and picking the first \(n\) elements.
  2. Bernoulli sampling: we iterate over \(x \in \mathcal{X}\) and independently pick \(x\) with probability \(n/|\mathcal{X}|\). In this sampling scheme, the sample size becomes a random variable \(\hat{n}\) (in fact, it is binomial-distributed).
  3. Poisson sampling: in this sampling scheme, the sample size is a random variable \(\hat{n}\) drawn as a Poisson distribution \(\hat{n} \sim \text{Poisson}(\lambda = n)\), and given \(\hat{n}\) the \((X_j)_{j=1}^{\hat{n}}\) are drawn i.i.d. with replacement.
  4. Poisson sampling (elementwise version): non-trivially, the above sampling scheme is equivalent to iterating over \(x \in \mathcal{X}\) and picking \(x\) zero or more times, the number of times \(\hat{n}_x\) being Poisson-distributed: \(\hat{n}_x \sim \text{Poisson}(\lambda = n/|\mathcal{X}|)\).

To make things concrete, here is an example of SQL code implementing Bernoulli sampling:

WITH agg AS (
	1e3 AS exp_sample_size, -- This is where you configure the desired sample size.
	COUNT(*) AS n_elements
    FROM my_table
FROM my_table AS x -- This table represents the entire space from which we sample.
	(exp_sample_size / n_elements) AS selection_prob
    FROM agg
) AS t_selprob
    WITH random_draw AS (
	    x.*, -- In some database engines, this is necessary to force one independent random trial per element.
	    random() AS rand
    SELECT WHERE (rand < selection_prob) -- This is where random sampling happens.
) AS t_draw

For now, I will leave you to marvel at the following fact: even though they are quite different in their sampling procedure, all of these sampling schemes share the same estimator formula, given by equation \eqref{org31d1d44}. By the end of this article, we will have explained this apparent coincidence.

Aside from having the same expected value, each these estimator has its own strengths, differing slightly in variance and computational convenience:

  1. Sampling without replacement has the lowest variance, although the difference is insignificant when \(n \ll |\mathcal{X}|\).
  2. Bernoulli sampling can be convenient to implement.
  3. Poisson sampling tends to be the simplest to reason about. It preserves its property through many transformations. For example, merging two independent Poisson samples of expected sizes \(N_1\) and \(N_2\) yields a Poisson sample of expected size \(N_1 + N_2\).
  4. Both Bernoulli and Poisson sampling can be easily parallelized over a partition of \(\mathcal{X}\). They can also be thinned into smaller sub-samples while preserving their properties.

The precision characteristics of these estimators are summarized in table 1:

Table 1: Precision characteristics of uniform sampling schemes. \(\hat{S}\) is the estimator defined in equation \eqref{org31d1d44}. The Coefficient of Variation (CV) provides a measure of relative error.
Sampling scheme \(\text{Var}\left[\hat{S}\right]^{1/2}/\lvert S \rvert:\) Coefficient of Variation (CV)
With replacement \(n^{- 1/2} \left( \frac{\overline{f^2}}{\overline{f}^2} - 1 \right)^{1/2}\)
Without replacement \(n^{- 1/2} \left( \frac{\overline{f^2}}{\overline{f}^2} - 1 \right)^{1/2} \left(1 - \frac{n - 1}{\lvert\mathcal{X}\rvert - 1}\right)^{1/2}\)
Bernoulli \(n^{- 1/2} \left( \frac{\overline{f^2}}{\overline{f}^2} \right)^{1/2} \left(1 - \frac{n}{\lvert\mathcal{X}\rvert}\right)^{1/2}\)
Poisson \(n^{- 1/2} \left( \frac{\overline{f^2}}{\overline{f}^2} \right)^{1/2}\)

in which \(\overline{f}\) and \(\overline{f^2}\) respectively denote the population averages of \(f(x)\) and \(f(x)^2\):

\begin{align*} \overline{f} &:= \frac{1}{\lvert\mathcal{X}\rvert} \sum_{x}{f(x)} \\ \overline{f^2} &:= \frac{1}{\lvert\mathcal{X}\rvert} \sum_{x}{f(x)^2} \end{align*}

5. Weighted samples

5.1. Importance Sampling

The uniform estimators discussed above are a good start, but they can fall short of being precise. This happens in particular in "needle in a haystack" situations where the sum is highly imbalanced, the vast majority of the terms \(f(x)\) contributing negligibly to the sum \(S\). Then the variance is high, and a very high sample size is needed to achieve good precision.

In such cases, it pays to oversample the terms \(x\) that are most likely to contribute significantly to the sum. This is the principle of Importance Sampling, in which the \(x\) are drawn with different selection probabilities, and the sum is weighted to compensate for the bias this introduces.

Concretely, we draw the sample \(\left(X_j\right)_j\) i.i.d. but not uniformly: each \(x \in \mathcal{X}\) has a probability \(p(x)\) of being selected (\(p(\cdot)\) is called the proposal distribution). The estimator is then:

\begin{equation} \label{org5661861} \hat{S}_p := \frac{1}{n} \sum_{j}{\frac{1}{p(X_j)} f(X_j)} \end{equation}

This is also an unbiased estimator: \(\hat{S}_p \overset{\mathbb{E}}{\approx} S\), which is easily verified as follows:

\begin{align*} \mathbb{E}\left[\hat{S}_p\right] = \frac{n}{n} \mathbb{E}\left[\frac{1}{p(X_1)} f(X_1)\right] = \sum_{x}{p(x) \frac{1}{p(x)} f(x)} = S \end{align*}

What can make \(\hat{S}_p\) more interesting than \(\hat{S}\) is its precision characteristics. The lowest possible variance is achieved when \(p(x)\) is proportional to \(|f(x)|\). This optimum is typically impossible to achieve (you'd have to compute \(\sum_x{|f(x)|}\)) but usually you can rely on some heuristic pointing you to a small subset of \(\mathcal{X}\) where most of the sum is concentrated.

As before, other sampling schemes can be used to draw the sample \(\left(X_j\right)_j\). The proposal distribution \(p(x)\) can be used to draw without replacement. Bernoulli sampling can also be used with some non-constant selection probability function \(q(x) \in (0; 1]\), and the formula becomes:

\begin{equation} \label{orgffa4bf4} \hat{S}_B := \sum_{j}{\frac{1}{q(X_j)} f(X_j)} \end{equation}

Equation \eqref{orgffa4bf4} also works in an Poisson sampling scheme, in which the number of repetitions of \(x\) in the sample follows \(\text{Poisson}(\lambda = q(x))\), in which case \(q(x) > 1\) is allowed.

Anticipating the next sections, we encourage readers to think of \(q(x)\) and \(n p(x)\) as the expected number of times that \(x\) is selected.

Table 2 describe the variance of these estimators, and how it can be estimated from the sample:

Table 2: Variance of estimators from various weighted sampling schemes. The rightmost column is an unbiased estimator of \(\text{Var}\left[\hat{S}\right]\).
Sampling scheme \(\text{Var}\left[\hat{S}\right]\) Variance estimator
With replacement \(\frac{1}{n} \text{Var}\left[ \frac{1}{p(X_1)}f(X_1) \right]\) \(\frac{1}{n-1} \left( \left(\sum_j{\frac{1}{p(X_j)}f(X_j)^2} \right) - \hat{S}^2 \right)\)
Without replacement \(\frac{1}{n} \text{Var}\left[ \frac{1}{p(X_1)}f(X_1) \right] \left(1 - \frac{n-1}{\lvert X \rvert - 1}\right)\) \(\frac{1}{n-1} \left( \left(\sum_j{\frac{1}{p(X_j)}f(X_j)^2} \right) - \hat{S}^2 \right) \left(1 - \frac{n-1}{\lvert X \rvert - 1}\right)\)
Bernoulli \(\sum_x{\frac{1 - q(x)}{q(x)} f(x)^2}\) \(\sum_j{\frac{1 - q(X_j)}{q(X_j)^2} f(X_j)^2}\)
Poisson \(\sum_x{\frac{1}{q(x)} f(x)^2}\) \(\sum_j{\frac{1}{q(X_j)^2} f(X_j)^2}\)

5.2. Stratification

Stratified estimation is an estimation technique with similar characteristics to Importance Sampling: we sample more densely in some parts of the space than others. The idea is to partition \(\mathcal{X}\) into \(K\) disjoint strata \(\mathcal{X}_1, \mathcal{X}_2, \dots, \mathcal{X}_K\):

\begin{equation} \label{org71824a1} \mathcal{X} = \mathcal{X}_1 \uplus \mathcal{X}_2 \uplus \dots \uplus \mathcal{X}_K \end{equation}

In each stratum \(\mathcal{X}_k\), we draw a sample \(\left(X_{kj}\right)_{j=1}^{n_k}\) of size \(n_k\). Typically, we'll have at least one stratum \(k\) which we expect to contain most of the large \(f(x)\) terms, and sample abundantly from it even though it makes from a small subset of \(\mathcal{X}\).

The corresponding estimator is then:

\begin{equation} \label{org95735c6} \hat{S}_\text{strat} := \sum_{k=1}^{K}{\frac{|\mathcal{X}_k|}{n_k} \sum_{j=1}^{n_k}{f(X_{kj})}} \end{equation}

Because strata are sampled independently, the variance of \(\hat{S}_\text{strat}\) is simply the sum of the per-stratum variances:

\begin{equation} \label{org7266da9} \text{Var}\left[\hat{S}_\text{strat}\right] = \sum_{k=1}^{K}{\text{Var}\left[\hat{S}_k\right]} \end{equation}

Stratified estimation can be further refined by using the ideas of previous sections within each stratum. At which point, our list of estimators is starting to look like a zoo. It's time for some unification.

6. Unification: the expected frequency of the sampling procedure

6.1. Example 1: merging weighted samples

We have seen how to estimate \(S\) using a small sub-sample of terms, and how to make the most of the sample size using weighted sampling schemes. How could we possibly want more? Well, here's one situation where we could want more. Suppose that you used Importance Sampling with sample size \(N_1\) and proposal distribution \(p_1(x)\). Then your colleague Meryem tells you that she used Bernoulli sampling with selection probability \(q_2(x)\). It would be a shame not to combine your samples to get better precision, wouldn't it? Spoiler alert, here's how to do it:

\begin{equation} \hat{S}_{1 \cup 2} := \sum_j{\frac{1}{N_1 p_1(X_{j}) + q_2(X_{j})}f(X_{j})} \end{equation}

in which the sum iterates over your merged samples. I will not bother to prove that this works in this particular use case - we will soon see a general result that will make this sort of formula obvious.

6.2. Example 2: tiled geospatial data

As another example, imagine that you are dealing with high-resolution geospatial data spanning all of India. Your data consists of rasters with 30m pixels: \(\mathcal{X}\) is the set of pixels and \(f(x)\) consists of computing the average distance of \(x\) to an urban area. No single file can contain so many pixels, and so the data is split in 60km square tiles, one raster file per tile.

This is an example where the cost structure of your computation is too complicated for the previous estimators to be useful. You cannot just scatter the sample all over the landscape. Instead, you'll want to use a multi-stage sampling scheme: first sample a subset of tiles, and then maybe sub-sample the pixels within each tile.

This time I won't show you the formula - can you come up with an estimator that works?

6.3. The expected frequency mass function

The first step towards unifying the previous results is to notice that all the estimators we studied above can be written in the form:

\begin{equation} \label{org0e22f27} S \overset{\mathbb{E}}{\approx} \hat{S} = \sum_{j}{\frac{1}{\nu(X_j)} f(X_j)} \end{equation}

in which \(\nu(x)\) is the expected number of times that \(x\) is selected in the sample:

\begin{equation} \label{orgfbd389f} \nu(x) := \mathbb{E}\left[\hat{n}_x\right] \end{equation}

\(\nu(\cdot)\) might be described as the "expected count" function; I will call it the expected frequency mass function.

The proof to equation \eqref{org0e22f27} is elementary:

\begin{align*} \mathbb{E}\left[\hat{S}\right] &= \mathbb{E}\left[\sum_{x}{\sum_{j | X_j = x}{\frac{1}{\nu(X_j)} f(X_j)}}\right] \\ &= \mathbb{E}\left[\sum_{x}{\frac{1}{\nu(x)} f(x) \sum_{j | X_j = x}{1}}\right] \\ &= \mathbb{E}\left[\sum_{x}{\frac{1}{\nu(x)} f(x) \hat{n}_x}\right] \\ &= \sum_{x}{\frac{1}{\nu(x)} f(x) \mathbb{E}\left[\hat{n}_x\right]} \\ &= \sum_{x}{\frac{1}{\nu(x)} f(x) \nu(x)} \\ &= \sum_{x}{f(x)} \\ \mathbb{E}\left[\hat{S}\right] &= S \end{align*}

Equation \eqref{org0e22f27} gives us a general and versatile method for estimating sums:

  1. Design a sub-sampling scheme that fits your problem. Your intuition probably has good approaches to suggest, and you can also use some of the techniques listed above, like Importance Sampling or Stratified Estimation.
  2. Find the expected frequency mass function \(\nu(x)\) that corresponds to your sub-sampling scheme.
  3. Draw a random sample \(\left(X_j\right)_j\) from your sub-sampling scheme.
  4. Compute and record \(\left(f(X_j)\right)_j\).
  5. Estimate the sum by applying equation \eqref{org0e22f27}.

6.4. What about variance?

Unfortunately, there is no general formula for the variance of the estimator defined in equation \eqref{org0e22f27}. The variance will depend on the specifics of the sampling procedure; it's possible to contrive sampling procedures of arbitrarily bad variance, by not sampling densely in the important parts of the space and/or by making samples highly correlated. The variance will have to be worked out specifically for each sampling procedure, which can usually be done by applying basic mathematical properties of variance: variance responds quadratically to scaling (\(\text{Var}[\lambda Y] = \lambda^2 \text{Var}[Y]\)), is additive on independent random variables (\(\text{Var}[Y_1 + Y_2] = \text{Var}[Y_1] + \text{Var}[Y_2]\)), can be expressed from moments (\(\text{Var}[Y] = \mathbb{E}[Y^2] - \mathbb{E}[Y]^2 = \mathbb{E}[(Y - \mathbb{E}[Y])^2]\)).

In addition, the conditional variance formula is useful for multi-stage sampling schemes:

\begin{equation} \label{org093c7b6} \text{Var}[Y] = \mathbb{E}\left[\text{Var}[Y | Z]\right] + \text{Var}\left[\mathbb{E}[Y | Z]\right] \end{equation}

Finally, the variance of a sum with correlated terms can be worked out from the covariances of the terms:

\begin{equation} \label{org4943892} \text{Var}\left[ \sum_{i=1}^m{Y_i} \right] = \sum_{i,j}{\text{Cov}\left[ Y_i, Y_j \right]} \end{equation}

7. General integration and expectation measure

7.1. General integrals and the reweighting trick

We will now make the above results applicable to a wider range of problems, like estimating continuous integrals and probabilistic expectations.

We will need some notions from Measure Theory; this is unsurpring, since Measure Theory is essentially the mathematical theory of sums. Said informally, a measure is something that distributes of positive mass over a space \(\mathcal{X}\). Concretely, a measure \(\mu\) is a function that assigns a non-negative "mass" \(\mu(A)\) to each set \(A \subseteq \mathcal{X}\) (with some constraints of what \(A\) and \(\mu(A)\) are allowed to be, such as non-negativity, addivity, etc.). Furthermore, by definition of integrals, we have:

\begin{equation} \label{orgadf7ccf} \int_{x \in A}{d\mu(x)} = \mu(A) \end{equation}

For example, if \(\mathcal{X} = \left\{x_1, x_2, \dots \right\}\) is a discrete set, the counting measure \(\kappa\) over \(\mathcal{X}\) maps each set \(A \subseteq \mathcal{X}\) to the number of elements \(x_i \in A\):

\begin{equation} \label{org6d0d23b} \kappa(A) := |A| \end{equation}

As such, our discrete sum \(S\) (defined by equation \eqref{org2fba7dc}) can be rewritten as an abstract integral:

\begin{equation} \label{org7317578} S := \sum_{x \in \mathcal{X}}{f(x)} = \int_{x \in \mathcal{X}}{f(x) d\kappa(x)} \end{equation}

This more general framing will prove useful when we define the estimand \(S\) to be something else than a discrete sum.

Another example of measure is the Dirac distribution \(\delta_x\) at a point \(x \in \mathcal{X}\), which maps each set \(A \subseteq \mathcal{X}\) to \(1\) if \(x \in A\) and \(0\) otherwise:

\begin{equation} \label{orga49ab20} \delta_{x}(A) := \begin{cases} 1 & \mbox{if } x \in A \\ 0 & \mbox{if } x \notin A \end{cases} \end{equation}

Measure theory tells us that an integral over a measure \(\mu\) can be rewritten as an integral over another measure \(\nu\), via the integral reweighting trick:

\begin{equation} \label{org5db7389} \int_{x \in \mathcal{X}}{g(x) d\mu(x)} = \int_{x \in \mathcal{X}}{g(x) \frac{d\mu}{d\nu}(x) d\nu(x)} \end{equation}

in which \(\frac{d\mu}{d\nu}(x)\) is the Radon-Nikodym derivative of \(\mu\) with respect to \(\nu\). No, don't run, there's an intuitive concept behind this scary name. Said informally, \(\frac{d\mu}{d\nu}(x)\) is a concentration ratio: it says by what factor \(x\) is more weighted by \(\mu\) than by \(\nu\). Equation \eqref{org5db7389} only works if \(\nu\) dominates \(\mu\), which essentially means that \(\frac{d\mu}{d\nu}\) is well defined: formally, it means that \(\mu(A) = 0\) whenever \(\nu(A) = 0\).

7.2. The sub-sample as a random measure

We now apply this theory to the use of random samples. So far, we had seen our random sample \((X_j)_j\) as a point process over \(\mathcal{X}\), that is, a random set of points in \(\mathcal{X}\) (more rigorously, a random collection of points, since duplicates are allowed). We now see our random sample as a random measure \(\hat{n}_X\) over \(\mathcal{X}\), which maps each set \(A \subseteq \mathcal{X}\) to the number of sample points \(X_j\) "landing" inside \(A\):

\begin{equation} \label{org2a2a3fd} \hat{n}_X(A) := \sum_{j | X_j \in A}{1} = \sum_{j}{\delta_{X_j}(A)} \end{equation}

Again, this measure-theoretic framing lets us rewrite discrete sums over the sample as abstract integrals over measure \(\hat{n}_X\):

\begin{equation} \label{org83b36a9} \sum_j{g(X_j)} = \int_{x \in \mathcal{X}}{g(x) d\hat{n}_X(x)} \end{equation}

The notion of probabilistic expectation can be applied to random measures. The expectation measure of \(\hat{n}_X\) is the measure \(\nu\) defined as:

\begin{equation} \label{org2da55f9} \nu(A) := \mathbb{E}\left[\hat{n}_X(A)\right] \end{equation}

Then, probability theory allows us to exchange the integral and expectation operators:

\begin{equation} \label{orgdfab63d} \mathbb{E}\left[\int_{x \in \mathcal{X}}{g(x) d\hat{n}_X(x)}\right] = \int_{x \in \mathcal{X}}{g(x) d\nu(x)} \end{equation}

By the way, this "point process viewpoint" can make the Radon-Nikodym derivative \(\frac{d\mu}{d\nu}\) more intuitive. If \((X_{j})_j\) and \((X'_{j})_j\) are two point processes with expectation measures \(\nu\) and \(\nu'\), then the Radon-Nikodym derivative \(\frac{d\nu'}{d\nu}(x)\) tells us how much more often \(x\) gets drawn in \(X'\) than in \(X\); or said differently, it tells us by what factor \(X'\) over-represents \(x\) compared to \(X\).

7.3. Estimating general integrals

We now have all the tools we need to estimate general integrals using random samples. We no longer assume that \(\mathcal{X}\) is a discrete space. Instead, we assume that our estimand \(S\) is now an integral over a measure \(\mu\):

\begin{equation} \label{org12799a5} S := \int_{x \in \mathcal{X}}{f(x) d\mu(x)} \end{equation}

We now draw a sample \((X_j)_j\) with an expectation measure \(\nu\) that dominates \(\mu\). By combining equations \eqref{org83b36a9}, \eqref{orgdfab63d} and \eqref{org5db7389}, we obtain an unbiased estimator of \(S\):

\begin{align*} S &= \int_{x \in \mathcal{X}}{f(x) d\mu(x)} \\ &= \int_{x \in \mathcal{X}}{f(x) \frac{d\mu}{d\nu}(x) d\nu(x)} \\ &= \mathbb{E}\left[ \int_{x \in \mathcal{X}}{\frac{d\mu}{d\nu}(x) f(x) d\hat{n}_X(x)} \right] \\ S &= \mathbb{E}\left[ \sum_j{\frac{d\mu}{d\nu}(X_j) f(X_j)} \right] \end{align*}

In summary, we estimate \(S\) through the sample-reweighting estimator \(\hat{S}_X\):

\begin{equation} \label{org77807e8} S \overset{\mathbb{E}}{\approx} \hat{S}_X := \sum_j{\frac{d\mu}{d\nu}(X_j) f(X_j)} \end{equation}

Taking a step back, we can say that we are essentially approximating the measure \(\mu\) using a weighted mixture of Dirac distributions:

\begin{equation} \label{org594a7de} \mu \overset{\mathbb{E}}{\approx} \sum_j{\frac{d\mu}{d\nu}(X_j) \delta_{X_j}} \end{equation}

Let us now revisit the discrete case (equation \eqref{org0e22f27}) from this more general perspective. As we saw in equation \eqref{org7317578}, \(\mu\) is simply the counting measure \(\kappa\) over \(\mathcal{X}\), with mass function \(\kappa(x) = 1\); the expectation measure \(\nu\) has mass function \(\nu(x)\), so the Radon-Nikodym derivative \(\frac{d\mu}{d\nu}\) is concretely expressed as:

\begin{equation} \label{orgd429edf} \frac{d\mu}{d\nu}(x) = \frac{1}{\nu(x)} \end{equation}

from which equation \eqref{org0e22f27} follows directly as a special case of equation \eqref{org77807e8}.

8. Application: estimating continuous integrals

Let us make equation \eqref{org77807e8} more concrete, by studying the case of estimating a continuous integral:

\begin{equation} S := \int_{\mathcal{X}}{f(x) dx} \end{equation}

Assume that we sample \((X_j)_j\) from a point process which expectation distribution has a density function \(\nu_\text{DF}(x) := \frac{d\nu}{dx}(x)\). For example, we might sample \(n\) times from some probability density function \(p(x)\), in which case \(\nu_\text{DF}(x) = n p(x)\). Then the Radon-Nikodym derivative can be expressed concretely:

\begin{equation} \frac{d\mu}{d\nu}(x) = \frac{dx}{d\nu}(x) = \frac{1}{\nu_\text{DF}(x)} \end{equation}

from which can compute the sample-reweighting estimator:

\begin{equation} \label{org9344aeb} \hat{S}_X := \sum_j{\frac{1}{\nu_\text{DF}(X_j)} f(X_j)} \end{equation}

9. Application: expected value of (other) probability distributions

9.1. A new look at the sample mean

A common use case is to estimate some expected value. We now assume that the estimand \(S\) is the expected value of \(f(X)\) in which \(X\) follows a probability distribution with density \(p(x)\):

\begin{equation} S := \mathbb{E}\left[f(X)\right] = \int_{x \in \mathcal{X}}{f(x) p(x) dx} \end{equation}

The most common to estimate \(S\) is to draw an i.i.d. sample \((X_j)_{j=1}^n\) from \(p\), and compute the sample mean:

\begin{equation} \label{org4c9e3f1} S \overset{\mathbb{E}}{\approx} \hat{S}_\frac{1}{n} := \frac{1}{n} \sum_{j=1}^n{f(X_j)} \end{equation}

It's interesting to revisit the sample mean as a special case of our more general framework. Our i.i.d. sampling procedure can be viewed as a point process; its expectation measure is \(\nu = n p\), with density function \(n p(x)\). The Radon-Nikodym derivative is:

\begin{align*} \frac{dp}{d\nu}(x) = \frac{dp/dx}{n dp/dx} = \frac{p(x)}{n p(x)} = \frac{1}{n} \end{align*}

so that the sample mean estimator (equation \eqref{org4c9e3f1}) follows directly as a special case of the sample-reweighting estimator (equation \eqref{org77807e8}):

\begin{equation} \hat{S}_X := \sum_j{\frac{1}{n} f(X_j)} = \frac{1}{n} \sum_{j=1}^n{f(X_j)} = \hat{S}_\frac{1}{n} \end{equation}

Some sampling procedures like Markov Chain Monte Carlo (MCMC) do not draw independent samples (the chains have autocorrelation). However, once ergodicity is achieved, a chain of length \(n\) does have \(n p\) as its expectation measure, so that equation \eqref{org4c9e3f1} still applies (with somewhat degraded variance).

9.2. Importance Sampling and Likelihood Ratio Estimator

Now assume that the sample \((X_j)_{j=1}^n\) was drawn from some other probability distribution \(p'\), with density function \(p'(x)\). Then applying equation \eqref{org77807e8} gives us what has been called in the literature the likelihood ratio estimator:

\begin{equation} \label{org4d0939e} S \overset{\mathbb{E}}{\approx} \hat{S}_\frac{p}{n p'} := \frac{1}{n} \sum_{j=1}^n{\frac{p(x)}{p'(x)} f(X_j)} \end{equation}

The above is how Importance Sampling is generally framed in the literature. Equation \eqref{org4d0939e} also works in the discrete case, i.e. when \(p(x)\) and \(p'(x)\) are probability mass functions instead of probability density functions.

10. Beyond point samples

10.1. Random integrals

Recall that the proof of equation \eqref{org77807e8} involved the expectation of an integral over the sample viewed as a random measure \(\hat{n}_X\):

\begin{equation} \label{org3cb0833} S \overset{\mathbb{E}}{\approx} \int_{x \in \mathcal{X}}{\frac{d\mu}{d\nu}(x) f(x) d\hat{n}_X(x)} \end{equation}

We will now simply forget that \(\hat{n}_X\) represents a point sample, thus making equation \eqref{org3cb0833} more broadly applicable. This is interesting because sometimes, we have more efficient ways of computing integrals than by drawing a point sample.

10.2. Example: damage from firebrands in wildfire simulations

In the United States, most of the damage caused by wildfires to structures is due to firebrands - burning embers that get carried by the wind and ignite buildings. The risk caused by firebrands can be estimated by running Monte Carlo simulations that model hypothetical fires and their firebrands emissions.

The most basic way in which this is done it to simulate a large number of fires \((F_j)_j\), and for each fire to simulate the trajectories and structure ignitions of firebrands emitted by burning fuels, recording which structures end up ignited. Effectively, we're estimating structure ignition probability by drawing a point process \((B_j(s))_j\) of structure ignitions, \(B_j(s)\) being a Bernoulli random variable corresponding to \(j\) burning \(s\), and the probability2 \(\beta(s)\) of structure \(s\) burning across all potential fires is estimated as:

\begin{equation} \beta(s) = \int_{F}{\mathbb{P}\left[s \text{ burns} | F\right] d\mu(F)} \overset{\mathbb{E}}{\approx} \hat{\beta}(s) := \sum_j{\frac{d\mu}{d\nu}(F_j) B_j(s)} \end{equation}

in which \(\mu\) and \(\nu\) are respectively the expectation measures of real-world and simulated fire occurrence (which are known by design of the simulations).

This estimation could be done more efficiently, however. Oftentimes, the stochastic model for firebrand trajectories is simple enough to be analytically tractable: typically, each burning pixel emits a random number of firebrands based on its burning conditions, then each firebrand "jumps" independently, with a log-normal jump in the wind direction and a normal jump orthogonal to it, so that we can compute the spatial probability density of the jump. Then we might apply a probability of successful structure ignition based on things like travelled distance and structure characteristics. The upshot is: for each structure \(s\) and each burned pixel \(z\), we know how to compute the expected number \(\phi_j(s;z)\) of firebrands originating from \(z\) that burn \(s\) conditional on fire \(j\).

Then the total expected number \(\phi_j(s)\) of firebrands igniting structure \(s\) during fire \(j\) is simply the sum over burned pixels:

\begin{equation} \phi_j(s) = \sum_{z}{\phi_j(s;z)} \end{equation}

It's sensible to assume that the number of firebrands actually burning \(s\) is Poisson-distributed, and therefore the probability that the structure burns during fire \(j\) is \(\beta_j(s) = \left(1 - e^{- \phi_j(s)}\right)\). Then the aggregated risk estimate becomes:

\begin{equation} \beta(s) \overset{\mathbb{E}}{\approx} \sum_j{\frac{d\mu}{d\nu}(F_j) \beta_j(s)} \end{equation}

Of course, all of the techniques that we have seen in previous sections are applicable to sampling the simulated fires. For example, we might use Importance Sampling to oversample the fires that are most likely to emit firebrands, for example because they happen in firebrand-prone vegetation like eucalyptus.

11. Summary

Let's recap the ground we have covered. We have seen how to estimate discrete sums (section 2), continuous integrals (section 8), probabilistic expectations (section 9) and general integrals (7) using random samples. To address issues of computational efficiency, we have studied more or less sophisticated sampling schemes, from uniform i.i.d. sampling (section 3) to weighted designs like Bernoulli sampling Importance Sampling (section 5), and even more irregular use cases like merging samples (section 6.1) and multi-stage designs (section 6.2). We used Measure Theory to give a unified treatment of all these use cases (7.3), thus deriving a general estimation formula:

\begin{equation} \label{org9e0ef4d} S := \int_{x \in \mathcal{X}}{f(x)d\mu(x)} \overset{\mathbb{E}}{\approx} \sum_j{\frac{d\mu}{d\nu}(X_j) f(X_j)} =: \hat{S} \end{equation}

Table 3 references the concrete embodiments of equation \eqref{org9e0ef4d} that we have covered:

Table 3: List of estimation techniques.
Estimand \(S\) Target measure \(\mu\) Sampling scheme \(\frac{d\mu}{d\nu}(x)\) Equation
Discrete sum \(\sum_x{f(x)}\) \(\mu := \kappa = \sum_{x}{\delta_x}\) Uniform sampling (with or without replacement, sample size \(n\)) \(\frac{1}{n / \lvert \mathcal{X} \rvert} = \frac{\lvert \mathcal{X} \rvert}{n}\) \eqref{org31d1d44}
    Importance sampling (sample size \(n\), proposal distr \(p'_\text{MF}(x)\)) \(\frac{1}{n p'_\text{MF}(x)}\) \eqref{org5661861}
    Bernoulli sampling (selection prob \(q(x)\)) \(\frac{1}{q(x)}\) \eqref{orgffa4bf4}
    Poisson sampling (selection frequency \(q(x)\)) \(\frac{1}{q(x)}\) \eqref{orgffa4bf4}
    General case (expected frequency mass function \(\nu(x)\)) \(\frac{1}{\nu(x)}\) \eqref{org0e22f27}
Continuous integral \(\int_{\mathcal{X}}{f(x)dx}\) \(d\mu := dx\) Importance sampling (sample size \(n\), proposal distr \(p'_\text{DF}(x)\)) \(\frac{1}{n p'_\text{DF}(x)}\) \eqref{org9344aeb}
Expectation \(\mathbb{E}[f(X)]\), \(X \sim p(x)\) \(\mu := p\) Monte-Carlo + sample mean (sample size \(n\)) \(\frac{p(x)}{n p(x)} = \frac{1}{n}\) \eqref{org4c9e3f1}
    Importance sampling (sample size \(n\), proposal distr \(p'(x)\)) \(\frac{p(x)}{n p'(x)}\) \eqref{org4d0939e}

Table 4 summarizes the cast of characters:

Table 4: Table of symbols
Symbol(s) Meaning   Main formulas
\(S\) Estimand: the sum to be estimated. \eqref{org2fba7dc} \(S := \sum_x{f(x)}\) (discrete case)
\(\mathcal{X}\) Set indexing the terms of \(S\) (from which samples are drawn).    
\((X_j)_j\) Random sample drawn from \(\mathcal{X}\).   \(X_j \in \mathcal{X}\)
\(n\) (Expected) size of the sample. (The size itself may be random.)    
\(\hat{n}_X\) Random measure representing \((X_j)_j\). \eqref{org2a2a3fd} \(\hat{n}_X := \sum_j{\delta_{X_j}}\)
    \eqref{org2a2a3fd} \(\hat{n}_X(A) = \sum_{j \lvert X_j \in A}{1}\)
    \eqref{org83b36a9} \(\int_{x \in \mathcal{X}}{g(x)d\hat{n}_X(x)} = \sum_j{g(X_j)}\)
\(\nu\) Expectation measure of the sampling procedure. \eqref{org2da55f9} \(\nu := \mathbb{E}\left[ \hat{n}_X \right]\)
\(\mu\) Estimand measure - from the definition of \(S\).    
\(\hat{S}\) Estimator of \(S\). \eqref{org0e22f27} \(\hat{S} := \sum_j{\frac{1}{\nu(X_j)}f(X_j)}\) (discrete case)
    \eqref{org77807e8} \(\hat{S} := \sum_j{\frac{d\mu}{d\nu}(X_j) f(X_j)}\) (general case)
      \(\hat{S} \overset{\mathbb{E}}{\approx} S\)
\(\frac{d\mu}{d\nu}(x)\) Concentration ratio¬†of estimand over expectation measure at point \(x \in \mathcal{X}\) (Radon-Nikodym derivative).    

12. Parting thoughts

I suspect that a lot of resources are wasted on computing sums exhaustively, for lack of thinking to use a random sub-sample, even with a basic sampling procedure like Bernoulli sampling (see sections 3 and 4). If this general approach is a new tool in your toolbox, my main goal for this article is achieved.

In my current work (environmental data science and stochastic modeling of natural hazards), uniform subsamples often fall short of being precise enough. In such cases, weighted sampling schemes (see section 5) like stratification and Importance Sampling can help significantly. I have found that the most basic weighted sampling procedures are often too simplistic to be practical: this is when the more general "sample-reweighting" estimators (equations \eqref{org0e22f27} and \eqref{org77807e8}) come in handy, as they provide a unified treatment of all sampling procedures. Reweighting samples is useful not only for estimating one sum more efficiently, but also for repurposing a sample to target other distributions.

I used to think of sub-sampling solely in terms of random points. I have found the point process and random measure perspectives to be empowering. I initially thought that it wasn't worth the abstraction effort in practice; I'm glad I reconsidered. These notions are useful not only for approximating sums, but for conceiving of sums.

13. Further reading

A good textbook on Monte Carlo methods is (Rubinstein, Reuven Y and Kroese, Dirk P, 2016).

I haven't seen exactly the angle of this article being used in the statistical literature, which is why I felt the need to write it. Still, I would welcome references from commenters. There is much more to using sub-samples: for example, this tutorial hasn't covered the use of cheaper auxiliary functions for improving the approximation - see e.g. (McConville, Kelly S and Moisen, Gretchen G and Frescino, Tracey S, 2020). I'm no specialist, but I suspect the survey design literature is a good logical next step: I hear (Kalton, Graham, 2020) is a good reference.

A more rigorous treatment of random measures may be found in (Kallenberg, Olav and others, 2017). For mathematical introductions to Measure Theory, see (Tao, Terence, 2011) and (Halmos, Paul R, 2013).

14. References

Halmos, Paul R (2013). Measure theory, Springer.

Kallenberg, Olav and others (2017). Random measures, theory and applications, Springer.

Kalton, Graham (2020). Introduction to survey sampling, Sage Publications.

McConville, Kelly S and Moisen, Gretchen G and Frescino, Tracey S (2020). A tutorial on model-assisted estimation with application to forest inventory, MDPI.

Rubinstein, Reuven Y and Kroese, Dirk P (2016). Simulation and the Monte Carlo method, John Wiley \& Sons.

Tao, Terence (2011). An introduction to measure theory, American Mathematical Soc..



"Measure in everything, neither too much nor too little!" The commanding officer of the 2011 Polytechnique class probably did not have Measure Theory in mind when he was telling us this, but I expect my applying his mantra to approximate sums will leave us in good terms. Puns intended.


Rigorously speaking, this is the expected burn count of the structure. It is a reasonable approximation as long as the probability of "collision" between several potential fires is negligible.

Author: Valentin Waeselynck

Created: 2024-02-15 Thu 09:02