The Ubiquitous Normal Law

When students are first taught (as undergraduate math majors or as graduate students) about the Central Limit Theorem (CLT), they are often in awe of how all-encompassing this remarkable result is.

They have up to this point been introduced to the concepts of discrete and continuous random variables, distribution functions, independence and conditionality, expectations, convergence in probability and the weak Law of Large Numbers, among other topics.

More often than not they become acquainted with the binomial distribution and apply it to finding probabilities of outcomes associated with coin-tossing experiments. For a large number of trials (which, with today’s powerful math software, would be trivial), the instructor will introduce Stirling’s Theorem, which for our purposes states that

\lim_{n \rightarrow \infty} \frac{n!}{\sqrt{2\pi}e^{-n}n^{n+\frac{1}{2}}}=1

and use it to prove the de Moivre-Laplace approximation to the binomial distribution: if X is a binomial random variable with parameters n and \rho, then for a positive integer k

P\{X\leq k\} = P\{\frac{X-n\rho}{\sqrt{n\rho(1-\rho)}} \leq \frac{k-n\rho}{\sqrt{n\rho(1-\rho)}}\} \simeq \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\frac{k-n\rho}{\sqrt{n\rho(1-\rho)}}}e^{-\frac{t^2}{2}}\,dt.

What this says is that for large n, the number of successes in n binomial trials with constant probability of success \rho can be approximated using a normal distribution.
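To see the approximation at work, here is a minimal Python sketch that compares the exact binomial probability P\{X \leq k\} with the corresponding standard normal probability. The function names and the values n = 400, \rho = 1/2, k = 210 are illustrative assumptions, and no continuity correction is applied, so the two numbers should agree only roughly.

```python
import math


def binomial_cdf(k: int, n: int, rho: float) -> float:
    """Exact P{X <= k} for X binomial with parameters n and rho."""
    return sum(
        math.comb(n, j) * rho**j * (1.0 - rho) ** (n - j)
        for j in range(k + 1)
    )


def normal_cdf(x: float) -> float:
    """Standard normal distribution function, written via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def normal_approximation(k: int, n: int, rho: float) -> float:
    """de Moivre-Laplace value: normal CDF at the standardized k."""
    z = (k - n * rho) / math.sqrt(n * rho * (1.0 - rho))
    return normal_cdf(z)


if __name__ == "__main__":
    n, rho, k = 400, 0.5, 210  # illustrative values only
    print("exact binomial :", binomial_cdf(k, n, rho))
    print("normal approx. :", normal_approximation(k, n, rho))
```

Increasing n while keeping the standardized value of k fixed should bring the two printed numbers closer together.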

I have heard students say, “That is really cool,” which of course pleases me greatly. By now the students are ready to be introduced to the CLT. We present the classical version:

Suppose that \{X_i\} is a sequence of independent and identically distributed random variables with finite mean and finite, positive variance. Let Z_n = \sum_{i=1}^n X_i, and let E(Z_n) and \sigma^2(Z_n) be the mean and variance, respectively, of Z_n. Then, with Z^*_n = \frac{Z_n-E(Z_n)}{\sigma(Z_n)},

\lim_{n \rightarrow \infty} P\{Z^*_n \leq a\} = \int_{-\infty}^a \frac{1}{\sqrt{2\pi}} e^{-\frac{u^2}{2}}\, du

for any real a.

That is, Z^*_n converges in distribution to a normal random variable with mean 0 and variance 1, often designated simply as N(0,1).
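A small simulation can make the statement concrete. The sketch below is only an illustration under assumptions of my choosing: the X_i are taken to be exponential with mean 1 (so E(Z_n) = n and \sigma(Z_n) = \sqrt{n}), and the helper names, sample sizes, and seed are hypothetical. It estimates P\{Z^*_n \leq a\} by Monte Carlo and prints it next to the N(0,1) value.

```python
import math
import random


def normal_cdf(x: float) -> float:
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def standardized_sum(n: int, rng: random.Random) -> float:
    """Draw Z*_n for X_i ~ Exponential(1), which has mean 1 and variance 1."""
    z_n = sum(rng.expovariate(1.0) for _ in range(n))
    return (z_n - n) / math.sqrt(n)


def empirical_cdf(a: float, n: int, trials: int, seed: int = 0) -> float:
    """Monte Carlo estimate of P{Z*_n <= a} from `trials` independent draws."""
    rng = random.Random(seed)
    hits = sum(standardized_sum(n, rng) <= a for _ in range(trials))
    return hits / trials


if __name__ == "__main__":
    a = 1.0
    print("simulated P{Z*_n <= a}:", empirical_cdf(a, n=50, trials=10_000))
    print("N(0,1) value          :", normal_cdf(a))
```

The exponential distribution is quite skewed, which is the point of the choice: the standardized sums still look approximately normal once n is moderately large.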

The instructor states that the CLT greatly generalizes the de Moivre-Laplace result: it too yields a normal approximation to the binomial distribution, but it does the same for any number of other distributions, as long as its conditions are satisfied. In fact, social scientists and other researchers analyze their data using the normal law as the vehicle for estimation and for testing hypotheses. (Unfortunately, it is somewhat cavalier to approach such problems by invoking the normal law as if it were a universal truth. But of course, that is another story.)

Unfortunately, in most courses, at least at the elementary or intermediate level, the story ends here, when it should not. The brief formulation described above is typical of what is covered in an introductory probability course. In fact, the normal law is deeply connected to “sums of independent quantities.” Very early work on stochastically independent functions, which generalizes the notion of sums of independent quantities, frees our thinking from the constraints of games of chance.

Nowhere is this more elegantly discussed for the non-expert than in a wonderful autobiography by Mark Kac entitled Enigmas of Chance, part of the Sloan Foundation series of books by or about prominent scientists. In Enigmas of Chance, Kac gives two examples which illustrate this point vividly and accessibly.

I will only briefly sketch Kac’s first example, which will amply make the point. The more curious reader may turn to Enigmas of Chance as a starting point.

To start, recall that for every 0 \leq t \leq 1 there is a unique non-terminating decimal expansion. For example,

\frac{2}{7} = 0.285714285714\ldots = \frac{2}{10}+\frac{8}{10^2}+\frac{5}{10^3}+\frac{7}{10^4}+\cdots

More generally, for any 0 \leq t \leq 1 there exists a unique sequence d_1,d_2,d_3,\ldots of digits (where each d_i can only assume the values 0,1,2,\ldots,9) such that

t = \frac{d_1}{10}+\frac{d_2}{10^2}+\frac{d_3}{10^3}+\cdots

Of course there is nothing sacred about base 10; we can, for example, use base 2. In this case the digits, which we now write as b_1,b_2,b_3,\ldots, can only assume the values 0 or 1; for example,

\frac{2}{7} = \frac{0}{2}+\frac{1}{2^2}+\frac{0}{2^3}+\frac{0}{2^4}+\frac{1}{2^5}+\frac{0}{2^6}+\cdots

so that b_1=0, b_2 = 1, b_3=0, b_4 = 0, b_5 = 1, etc.
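The digits can be generated mechanically by repeated doubling. The following is a minimal sketch (the function name is hypothetical) that uses exact rational arithmetic to avoid floating-point rounding. One caveat: for a dyadic rational such as \frac{5}{8} this procedure returns the terminating expansion rather than the non-terminating one used in the text.

```python
from fractions import Fraction


def binary_digits(t: Fraction, count: int) -> list[int]:
    """Return the first `count` binary digits b_1, b_2, ... of t in [0, 1)."""
    digits = []
    for _ in range(count):
        t *= 2                # shift the binary point one place to the right
        bit = int(t >= 1)     # the integer part is the next digit
        digits.append(bit)
        t -= bit              # keep only the fractional part
    return digits


if __name__ == "__main__":
    print(binary_digits(Fraction(2, 7), 6))  # [0, 1, 0, 0, 1, 0]
```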

Consider now those numbers for which b_1(t) = 1, b_2(t)=0, b_3(t)=1.

The smallest such t is therefore \frac{1}{2}+\frac{0}{2^2}+\frac{1}{2^3} = \frac{5}{8} and, recalling the sum of a geometric series, the largest is \frac{5}{8}+\frac{1}{2^4}+\frac{1}{2^5}+\frac{1}{2^6} +\cdots = \frac{6}{8}

so the set of t with b_1(t) = 1, b_2(t)=0, and b_3(t)=1 forms the interval (\frac{5}{8},\frac{6}{8}), which has length L = \frac{1}{8}.

Denote this as L\{b_1(t) = 1, b_2(t)=0, b_3(t)=1\} = \frac{1}{8}.

Now we can readily see that b_2(t) = 0 can occur two ways:

\frac{0}{2}+\frac{0}{2^2} or \frac{1}{2}+\frac{0}{2^2}.

Reasoning as above for each respective possibility, we obtain two intervals (0,\frac{1}{4}) and (\frac{1}{2},\frac{3}{4}). By summing these lengths we arrive at L(b_2=0)=\frac{1}{2}.

Similar reasoning yields L(b_1=1) = \frac{1}{2} and L(b_3=1) = \frac{1}{2}.

Thus L(b_1=1,b_2=0,b_3=1) = \frac{1}{8} = L(b_1=1)\cdot L(b_2=0) \cdot L(b_3=1) = \frac{1}{2}\cdot\frac{1}{2}\cdot\frac{1}{2}=\frac{1}{8}.
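This product rule can be checked by brute force. In the sketch below (the function name and the dictionary encoding of the constraints are my own illustrative choices), every pattern of the first `depth` binary digits corresponds to a dyadic interval of length 2^{-depth}, and the length of an event is obtained by adding up the intervals whose patterns satisfy the constraints.

```python
from fractions import Fraction
from itertools import product


def length(constraints: dict[int, int], depth: int) -> Fraction:
    """Total length of {t : b_i(t) = v for every (i, v) in constraints}.

    `constraints` maps a 1-based digit index i <= depth to a required value v.
    """
    cell = Fraction(1, 2**depth)  # length of one dyadic interval
    total = Fraction(0)
    for pattern in product((0, 1), repeat=depth):
        if all(pattern[i - 1] == v for i, v in constraints.items()):
            total += cell
    return total


if __name__ == "__main__":
    joint = length({1: 1, 2: 0, 3: 1}, depth=3)
    factors = length({1: 1}, 3) * length({2: 0}, 3) * length({3: 1}, 3)
    print(joint, factors)  # both are 1/8
```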

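And so the reasoning continues. Each b_i takes each of the values 0 and 1 on a set of length \frac{1}{2}, so it has mean \frac{1}{2} and variance \frac{1}{4}. Hence, with the random variables X_i replaced by the digits b_i and P\{\} replaced by L\{\}, it follows that with B_n(t) = \frac{\sum_{i=1}^n b_i(t)-n/2}{\sqrt{n/4}}, for any real a < b,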

\lim_{n \rightarrow \infty} L\{a<B_n(t)<b\} = \int_a^b \frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}x^2}\, dx.
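No simulation is needed to test this limit numerically, because the length on the left can be computed exactly for finite n: each pattern of the first n binary digits occupies an interval of length 2^{-n}, and \binom{n}{k} patterns contain exactly k ones. The sketch below (function names and the choice n = 1000 are illustrative assumptions) compares that exact length with the standard normal probability of the interval (a, b).

```python
import math


def normal_cdf(x: float) -> float:
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def normal_integral(lower: float, upper: float) -> float:
    """Standard normal probability of the interval (lower, upper)."""
    return normal_cdf(upper) - normal_cdf(lower)


def dyadic_length(n: int, lower: float, upper: float) -> float:
    """Exact L{lower < B_n(t) < upper}, counting digit patterns by their sum."""
    patterns = 0
    for k in range(n + 1):  # k = b_1(t) + ... + b_n(t)
        b_n = (k - n / 2) / math.sqrt(n / 4)
        if lower < b_n < upper:
            patterns += math.comb(n, k)
    return patterns / 2**n  # each pattern has length 2^-n


if __name__ == "__main__":
    print("L{-1 < B_n < 1}, n = 1000:", dyadic_length(1000, -1.0, 1.0))
    print("normal integral          :", normal_integral(-1.0, 1.0))
```

Nothing random happens anywhere in this computation; it is pure counting of binary digit patterns, which is exactly the point of Kac's example.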

Notice what has been achieved: by the simple arithmetic demonstration that the \{b_i\} are indeed independent, in the sense that L\{b_1=j_1,\ldots,b_n=j_n\}=L\{b_1=j_1\}\times L\{b_2=j_2\} \times \cdots \times L\{b_n=j_n\} for every choice of j_1,\ldots,j_n from 0 and 1, we can simply apply the CLT to demonstrate convergence to N(0,1) without invoking games of chance like coin tossing or underlying probability distributions assumed for random variables. The normal law can apply as well under conditions that have nothing to do with what the student has thus far encountered in a standard course in probability but, as Kac illustrates, “could be part of everyday mathematics.”

As Kac points out, this kind of thinking was introduced by Hugo Steinhaus in a 1923 paper dealing with the arithmetization of probability theory and resulted in bringing the “normal law closer to the mainstream of mathematics.” It is a useful and important piece in the history of mathematics, but also a worthy subject for showing how ubiquitous the normal law is.


7 Responses to The Ubiquitous Normal Law

  3. This very interesting mathematical discussion also tempts us to want to hear that parenthetical “another story” on whether the normal law is a universal truth. That is, how ubiquitous is the normal law in experimental data? Those words in the CLT assumptions “independent … finite mean and finite variance” stand out. The answer may be sometimes “yes” and sometimes “no”. The distributions of experimental data from many scientific fields (such as the pieces of coastlines, the fluctuations in economic markets, and the timing between heart attacks) seem to have “fat tails” (power-law distributions) that may not be well described by the normal distribution. (Was it those “fat-tails” that did in LTCM and many other financial firms as well?) See, for example, the non-technical review : Two Lessons from Fractals and Chaos.

  4. Doug Howard says:

    Yes. When I teach Stochastic Processes our underlying probability space is always the unit interval [0,1]. We think of a number “picked at random” as an infinite sequence of independent and uniformly distributed digits. These digits can be “repackaged” to form a sequence U_1, U_2,… of iid uniform(0,1) random variables.
    (E Unum Pluribus, as I tell my students.) Then, via inverse transform, we generate iid sequences of any desired distribution. From there, all manner of constructions are possible. For example, Brownian motion can be “built” from an iid sequence of standard normals. It’s a very constructive approach and leads nicely to Monte Carlo methods, where a computer generates U_1, U_2,… for you.

  5. Jonas Reitz says:

    What resonates for me is the journey of discovery (and wonder!) that comes from learning about normal distribution, the CLT, and its corollaries. This is one of my recent favorite mathematical ideas, and something I’ve been mulling over lately with regards to my teaching. As a student I somehow managed to avoid much exposure to the normal distribution, and it wasn’t until I started teaching Introduction to Statistics that I got an inkling of just how fundamental it is. The more it sinks in (I have taught the course five years running, and each time I come away with a little more understanding and many more questions) the more surprised (and happy!) I am. But the journey as described is definitely that of the mathematically sophisticated student, quite a small proportion of the students that pass through my care. For my introductory students, I struggle to convey this sense of wonder — I can describe to them why they should be filled with amazement, and provide a laundry list of examples demonstrating the universality of these principles, but I’m working to find a way to really get them to “feel” it, to show rather than tell.

  6. Warren B. Gordon says:

    This was a very nice article, in particular Kac’s example. This would be a nice addition to a probability class, especially when discussing the CLT.
