Probability Theory:
A Summary
zweistein
Desu-Cartes
Contents
1 Measure Theory
  1.1 Introduction
  1.2 Measures and algebras
2 Integration Theory
  2.1 Introduction
  2.2 Integration of real-valued functions
  2.3 Multiple Integrals and Derivatives
  2.4 Change of Variables
3 Function Spaces
  3.1 Introduction
  3.2 L^p spaces
4 Probability Theory
  4.1 Introduction
  4.2 Expectation and Variance
  4.3 A Few Inequalities
  4.4 Independence of Random Variables
5 Modes of Convergence
  5.1 Introduction
  5.2 Various notions of convergence
  5.3 The Law of Large Numbers
  5.4 The Central Limit Theorem
Appendices
A Common Distributions
  A.1 Discrete Random Variables
  A.2 Continuous Random Variables
1 Measure Theory
1.1 Introduction
In order to define modern probability, we will have to start by making a neces-
sary detour in measure theory and integration. In general, we will not define a
probability measure on all subsets of our space. Instead, perhaps in analogy to
topology and open sets, we define which sets we decide to work with. Those sets
will form a particular algebraic structure, and we will define a probability on it.
After that, we define integration in general and consider the important question
of interchanging limits between our integral and other analytical objects such
as sequences, series and other integrals. With this out of the way, we can define
function spaces and look at what kind of structure our functions form. Finally,
we can define probability theory using all the tools we've built so far, and analyze the various modes of convergence that exist between probabilistic objects. These notes culminate in the strong law of large numbers, which reinforces our intuition about what probability is really about.
1.2 Measures and algebras
Definition 1.1. We consider a universe of events (a set, really) Ω. Let F ⊆ P(Ω) be a family of sets over Ω. We say that F is a σ-algebra if the following properties are verified:
1. For any countable family of subsets in F, their union is in F.
2. For any set A in F, its complement Aᶜ is in F.
3. The universe Ω is in F.
It is clear from this definition that countable intersections of elements of F are in F and that the empty set is in F. We call elements A ∈ F events or, more generally, measurable sets. The pair (Ω, F) is called a measurable space.
In some cases, we can take P(Ω) as our σ-algebra. We will usually consider this σ-algebra when working over countable spaces such as N. Since P(Ω) is always a σ-algebra for any universe Ω, we can use this fact to make sense of the notion of the σ-algebra generated by a family of subsets of Ω.
Definition 1.2. Let X ⊆ P(Ω). Then the σ-algebra generated by X is the intersection of all σ-algebras over Ω that contain X. This intersection is taken over a nonempty collection by the discussion above, and an intersection of σ-algebras is again a σ-algebra. We will denote this σ-algebra by σ(X).
It is not so clear why we need σ-algebras initially. Surely, if P(Ω) always works, why not just take that at all times? Choosing which sets are measurable fine-tunes the notion of "measurability" itself, just like specifying a topology on a space X restricts topological notions such as convergence or continuity. Just as in topology, we also have special maps that preserve the notion of being measurable.
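To make the generated σ-algebra concrete, here is a minimal computational sketch (in Python; the function name and set representation are our own, not part of any standard library). On a finite universe, σ(X) can be obtained by brute force: close the family under complements and pairwise unions until nothing new appears, since on a finite set countable unions reduce to finite ones.

from itertools import combinations

def generated_sigma_algebra(omega, X):
    # Brute-force sigma-algebra generated by a family X of subsets
    # of a *finite* universe omega (sets represented as frozensets).
    omega = frozenset(omega)
    family = {frozenset(), omega} | {frozenset(A) for A in X}
    changed = True
    while changed:
        changed = False
        for A in list(family):
            if omega - A not in family:          # closure under complement
                family.add(omega - A); changed = True
        for A, B in combinations(list(family), 2):
            if A | B not in family:              # closure under (finite) union
                family.add(A | B); changed = True
    return family

# sigma({{1}}) over {1,2,3,4} is {{}, {1}, {2,3,4}, {1,2,3,4}}:
print(sorted(map(sorted, generated_sigma_algebra({1, 2, 3, 4}, [{1}]))))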
Definition 1.3. Let (Ω, F) and (Ω′, F′) be two measurable spaces. We say a function f : (Ω, F) → (Ω′, F′) is measurable if it respects the σ-algebra structure. In other words, for any B ∈ F′, we have that f⁻¹(B) ∈ F. If F′ = σ(X), then it is sufficient to check this for the elements of X instead. From now on, we write f : Ω → Ω′ instead of the more cumbersome notation above.
This makes clearer the notion of "measurability" and how similar it is to topology. In topology, the open sets specify which functions are continuous; in measurable spaces, the elements of the σ-algebra specify which functions are measurable.
Proposition 1.4. Measurable functions respect the usual algebraic operations. Sums, (scalar and pointwise) products, quotients and compositions of measurable functions are measurable wherever these make sense. Furthermore, if the target space of the functions is a metric space, pointwise limits of measurable functions are measurable. Suprema, infima, limits superior and limits inferior of sequences of real-valued measurable functions are measurable as well.
Proof. The proofs are either not very illuminating or extremely easy. We just show composition. Suppose f : Ω₁ → Ω₂ and g : Ω₂ → Ω₃ are measurable functions whose domains and codomains match as one would want. Then the preimage of a set A under g ∘ f is (g ∘ f)⁻¹(A) = f⁻¹(g⁻¹(A)). Suppose A ∈ F₃. Then B = g⁻¹(A) ∈ F₂. In turn, C = f⁻¹(B) ∈ F₁. This holds because both f and g are measurable. Thus for A ∈ F₃, (g ∘ f)⁻¹(A) ∈ F₁, which shows g ∘ f is measurable.
As one can see, the class of measurable functions is quite rich. This is a good thing: since we will define integration on those functions, we want as many functions as possible to be at least measurable. Continuous functions are measurable, provided the σ-algebras on the domain and codomain are the ones generated by the open (or closed) sets. We will call this σ-algebra the Borel algebra. In the case of the real line, we will write it as B(R). We now define the main object of study in measure theory.
Definition 1.5. A measure is a function µ : F → [0, +∞] such that the following holds:
1. µ(∅) = 0;
2. µ(⋃_{i∈I} X_i) = Σ_{i∈I} µ(X_i), for any countable family of pairwise disjoint sets (X_i)_{i∈I} in F.
The second property is called σ-additivity (sometimes also known as countable
additivity). If µ(Ω) = 1, we say that µ is a probability measure, or more simply,
a probability. The triple (Ω, F, µ) is called a measure space. If µ = P is a
probability, then we call the space (Ω, F, P) a probability space, and a measurable
function on it a random variable.
Proposition 1.6 (Properties of measures). Let (Ω, F, µ) be a measure space. Let A, B ∈ F. Then:
1. µ(A) = µ(A \ B) + µ(A ∩ B);
2. If µ(A) < ∞, then µ(A ∪ B) = µ(A) + µ(B) − µ(A ∩ B);
3. If A ⊆ B, then µ(A) ≤ µ(B). If in addition µ(A) < ∞, then µ(B \ A) = µ(B) − µ(A);
4. If (A_n)_n is an increasing sequence with A_n ∈ F for every n, then µ(⋃_{n=1}^∞ A_n) = lim_{n→∞} µ(A_n). If the sequence is decreasing and µ(A₁) < ∞, then µ(⋂_{n=1}^∞ A_n) = lim_{n→∞} µ(A_n). Furthermore, the limit is attained increasingly (resp. decreasingly).
Proof. Those are all easy properties that we leave to the reader. For 4, use a
decomposition of your space that allows you to apply σ-additivity.
It will be important for us to define measures based on random variables. Indeed, suppose P is a probability. We will often be interested in understanding the probability related to the values of random variables. Suppose X is a real-valued random variable on Ω (the space, by abuse of notation). We are interested in understanding P({ω : X(ω) ∈ B}) for an event B. We denote this quantity P_X(B). It is not too hard to see that P_X is a measure on Ω′ = R. This will become important in the probability section.
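As a tiny illustration (a sketch with made-up data, not part of the theory itself), on a finite universe one can compute P_X directly from the definition P_X(B) = P({ω : X(ω) ∈ B}):

P = {"a": 0.5, "b": 0.25, "c": 0.25}   # a probability on Omega = {a, b, c}
X = {"a": 1.0, "b": 1.0, "c": 2.0}     # a real-valued random variable

def P_X(B):
    # P_X(B) for a finite set B of real values, straight from the definition
    return sum(p for omega, p in P.items() if X[omega] in B)

print(P_X({1.0}), P_X({2.0}))          # 0.75 and 0.25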
Notice that we haven’t actually talked about the problem of existence of
measures. Indeed, we will not concern ourselves with it. For now, we just
assume there’s a magic theorem (a few, actually) that allows us to construct
all the measures we need. We finish this section by presenting two important
measures.
Proposition 1.7 (Lebesgue measure). There exists a translation-invariant measure λ on (R, B(R)) such that λ([a, b]) = b − a.
Proposition 1.8 (Product measure). Suppose Ω₁, Ω₂ are two measure spaces with measures m₁ (resp. m₂). Then there exists a measure m₁ ⊗ m₂ on Ω₁ × Ω₂ such that for all measurable A ⊆ Ω₁ and B ⊆ Ω₂, we have (m₁ ⊗ m₂)(A × B) = m₁(A)m₂(B).
2 Integration Theory
2.1 Introduction
Let (Ω, F, µ) be a measure space. We wish to define a linear functional f ↦ ∫ f dµ on the space of real-valued measurable functions. When we learn about the Darboux-Riemann integral, we start by subdividing the domain of integration. Here, we take a different approach: in a sense, we subdivide the range of the function and then sum. This subdivision comes in the form of limits of finite sums of indicator functions, which we define next.
Definition 2.1. The indicator function of a set A is the function 1_A which is equal to 1 on A and 0 everywhere else. It is clear that 1_A is measurable if and only if A is. A simple function is a finite linear combination of indicator functions of measurable sets.
We proceed in three steps to define our integral. First we construct our
integral for indicator functions only. Then we define it for simple functions. Fi-
nally, we define it for positive functions and extend that definition to measurable
functions in general.
2.2 Integration of real-valued functions
We now construct the integral for each class of functions discussed above. In
this subsection, a measurable function means a real-valued measurable function.
Definition 2.2 (Existence of the integral). Suppose f is a measurable function on (Ω, F, µ).
1. If f = 1_A is an indicator function, then we define ∫ 1_A dµ = ∫_A dµ = µ(A). This exists because f is assumed measurable (i.e., A ∈ F).
2. If f = Σ_{i=1}^n α_i 1_{A_i} is a simple function, then we define ∫ f dµ = Σ_{i=1}^n α_i µ(A_i). This exists because of 1.
3. If f is positive, we define ∫ f dµ = sup { ∫ g dµ }, where g ranges over the simple functions with 0 ≤ g ≤ f. More generally, for a measurable function we decompose it as f = f⁺ − f⁻ and apply 3 to both terms. Notice that our integral might be undefined here (when both terms are infinite).
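The "subdivide the range" idea can be seen numerically. The following sketch (assuming numpy; the fine grid is only a stand-in for ([0, 1], Lebesgue)) integrates the standard simple functions s_n = min(⌊2ⁿf⌋/2ⁿ, n) sitting below f = x²; their integrals increase towards the true value 1/3.

import numpy as np

x = np.linspace(0.0, 1.0, 1_000_001)   # a fine grid standing in for [0, 1]
f = x**2                               # a positive measurable function

for n in [2, 4, 8, 16]:
    s_n = np.minimum(np.floor(f * 2**n) / 2**n, n)  # simple function s_n <= f
    print(n, s_n.mean())               # grid mean ~ integral; increases to 1/3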
As we have seen in the third part of this definition, it is quite possible that the integral of a general measurable function is undefined. To fix that, we introduce a new class of functions, called integrable functions.
Definition 2.3. Let f be a measurable function. If ∫ |f| dµ < ∞, we say that the function is integrable. The class of integrable functions is a vector space denoted by L¹(µ). We will look more into this space and other similar spaces in section 3.
Definition 2.4 (Almost-everywhere convergence). We say that two functions f and g are equal almost everywhere if f = g except maybe on a set of measure zero. A sequence of functions (f_n) converges to f almost everywhere (a.e., or sometimes almost surely, a.s.) if it converges pointwise to f on the complement of a set of measure zero. Finally, we say that a property P(ω) holds almost everywhere if P holds everywhere except possibly on a set of measure zero. For instance, "for almost all x" means for all x except possibly those in a set N of measure zero.
It is important to notice that almost-everywhere properties pop up a lot in Lebesgue integration. This is so because our integral does not see sets of measure zero: if f = g almost everywhere with f, g ∈ L¹(µ), then ∫ (f − g) dµ = 0 always, as seen in the next proposition.
Proposition 2.5 (Properties of the integral). Suppose f is a measurable function.
1. The operator I(f) = ∫ f dµ is linear; on positive measurable functions it takes values in [0, +∞];
2. If f ≤ g are both in L¹(µ), then ∫ f dµ ≤ ∫ g dµ. The same holds true if the functions are merely positive.
3. If f ∈ L¹(µ), then |∫ f dµ| ≤ ∫ |f| dµ.
4. If A, B are disjoint measurable sets, then ∫_{A∪B} f dµ = ∫_A f dµ + ∫_B f dµ. If A has measure zero, then ∫_A f dµ = 0 always. If f = g a.e., then ∫ f dµ = ∫ g dµ.
Proof. For most of these properties, start by considering them first for indicator functions, then for simple functions, and then apply Theorem 2.7 below. For 4, decompose 1_{A∪B} = 1_A + 1_B; for the measure-zero claims, notice that the measure of the exceptional set N is zero, which makes the corresponding integral zero.
Right now, it is not so clear why this integral is more useful or powerful
than the usual Darboux integral. While there are other forms of integration on
R such as the gauge integral, the Lebesgue integral is extremely well-suited to
analysis for two reasons: the first is its convergence theorems; the second is the main topic of section 3. We now turn to those convergence theorems. They are the crux of Lebesgue integration theory.
Theorem 2.6 (Fatou's Lemma). For any sequence of positive measurable functions (f_n), we have
∫ lim inf_{n→∞} f_n dµ ≤ lim inf_{n→∞} ∫ f_n dµ.
Proof. This inequality comes from the fact that for k ≥ 1,
∫ inf_{n≥k} f_n dµ ≤ inf_{n≥k} ∫ f_n dµ,
because the integral is increasing. Letting k → ∞ and using the definition of lim inf and the properties of the integral above does the job.
Don't be fooled by the name: this is a powerful result in its own right, and the inequality can be strict (for f_n = 1_{[n,n+1]} on (R, λ), the left-hand side is 0 while the right-hand side is 1). It is still called a lemma because it is used in the proof of the dominated convergence theorem we will soon see.
Theorem 2.7 (Monotone convergence theorem). For any increasing sequence of positive measurable functions (f_n) with lim_{n→∞} f_n = f, we have that
lim_{n→∞} ∫ f_n dµ = ∫ f dµ.
Proof. Since (f_n) is increasing, the sequence of integrals is as well, and thus the limit r = lim_{n→∞} ∫ f_n dµ exists in [0, +∞]. This shows r ≤ ∫ f dµ. We just need to show the other inequality now. To do this, fix c in [0, 1) and g a simple function such that g ≤ f. Then, according to the monotonicity of the integral,
∫ f_n dµ ≥ ∫ 1_{f_n ≥ cg} f_n dµ ≥ c ∫ 1_{f_n ≥ cg} g dµ.
Writing out g gives us
∫ 1_{f_n ≥ cg} g dµ = Σ_{x ∈ g(Ω)} µ({g = x} ∩ {f_n ≥ cx}) x.
Using the monotone property of measures, we get as n → ∞ that
Σ_{x ∈ g(Ω)} µ({g = x} ∩ {f_n ≥ cx}) x → Σ_{x ∈ g(Ω)} µ({g = x}) x = ∫ g dµ.
Thus r ≥ c ∫ g dµ, and since both c ∈ [0, 1) and g are arbitrary, we are done.
A very handy theorem; nothing of the sort is true in the case of Riemann integration. Finally, we get to maybe the most powerful theorem in the theory, the celebrated dominated convergence theorem.
Theorem 2.8 (Dominated Convergence Theorem). Let (f_n) be a sequence of real-valued measurable functions. If we have that:
1. f_n converges almost everywhere to a function f;
2. there exists h ∈ L¹(µ) such that |f_n(ω)| ≤ h(ω), for all n;
then
lim_{n→∞} ∫ f_n dµ = ∫ f dµ.
Proof. Let (f_n)_{n≥1} be a convergent sequence of measurable functions dominated by an integrable function h. The measurable functions h − f_n (resp. h + f_n) are thus positive and tend to h − f (resp. h + f) as n → ∞. We can thus apply Fatou's lemma to get
∫ (h ± f) dµ ≤ lim inf_{n→∞} ∫ (h ± f_n) dµ.
Since h is integrable, we can subtract it from both sides. The '+' case gives
lim inf_{n→∞} ∫ f_n dµ ≥ ∫ f dµ,
while the '−' case gives
lim sup_{n→∞} ∫ f_n dµ ≤ ∫ f dµ,
which together give the desired result.
There exists a generalized version of this theorem where each f_n is bounded by a specific g_n with extra conditions, but we will not need it here.
2.3 Multiple Integrals and Derivatives
In this subsection, we propose to put the theorems we've developed in the preceding subsection to good use. Let U be an open subset of a metric space E, and Ω be a measure space. We can use the dominated convergence theorem to analyze the relationship between integrals, continuity and differentiability. More specifically, we are interested in functions of the form
F(x) = ∫_Ω f(x, ω) dµ(ω),
where f(x, ω) is a function from U × Ω to R.
Theorem 2.9 (Continuity under the integral sign). Suppose that:
1. For all x ∈ U, the function ω ↦ f(x, ω) is measurable;
2. For almost all ω ∈ Ω, the function x ↦ f(x, ω) is continuous at x₀ ∈ U;
3. There exists h ∈ L¹(µ) such that |f(x, ω)| ≤ h(ω) for all x, almost surely.
Then F(x) as defined above is well-defined and continuous at x₀.
Proof. We get the existence from property 3. To show continuity, it's enough to show that F(x_n) → F(x₀) for any sequence (x_n) in U such that x_n → x₀. Let (x_n) be such a sequence. Then, from property 2, we have that f(x_n, ω) → f(x₀, ω) almost everywhere. From 3 and the DCT, we get F(x_n) → F(x₀).
Theorem 2.10 (Differentiation under the integral sign). Here we assume that U is an open interval in R. Suppose that:
1. For all x ∈ U, the function ω ↦ f(x, ω) is integrable;
2. For almost all ω ∈ Ω, and for all x ∈ U, the partial derivative ∂f/∂x (x, ω) exists and verifies
|∂f/∂x (x, ω)| ≤ h(ω),
where h ∈ L¹(µ).
Then the function F defined above is well-defined and differentiable. Furthermore, we have that
F′(x) = ∫_Ω ∂f/∂x (x, ω) dµ(ω).
Proof. Let (x_n) be a sequence in U converging to x ∈ U, with x_n ≠ x. Then
g_n(ω) = (f(x_n, ω) − f(x, ω)) / (x_n − x)
converges almost everywhere to ∂f/∂x (x, ω). Applying the mean value theorem, we get
|g_n(ω)| ≤ sup_{0≤θ≤1} |∂f/∂x (θx + (1 − θ)x_n, ω)| ≤ h(ω).
We now use the DCT to conclude that
lim_{n→∞} (F(x_n) − F(x)) / (x_n − x) = lim_{n→∞} ∫_Ω g_n(ω) dµ(ω) = ∫_Ω ∂f/∂x (x, ω) dµ(ω).
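As a sanity check of Theorem 2.10, here is a numerical sketch (assuming numpy; the grid mean is only a stand-in for the integral over Ω = [0, 1] with Lebesgue measure, and the test function is our own choice). We take f(x, ω) = sin(xω), for which |∂f/∂x| = |ω cos(xω)| ≤ 1 =: h(ω) is integrable:

import numpy as np

w = np.linspace(0.0, 1.0, 200_001)       # grid on Omega = [0, 1]

def F(x):
    # F(x) ~ integral of f(x, .) = sin(x w) over [0, 1]
    return np.sin(x * w).mean()

x0, eps = 1.3, 1e-5
finite_diff = (F(x0 + eps) - F(x0 - eps)) / (2 * eps)
under_the_sign = (w * np.cos(x0 * w)).mean()    # integral of df/dx at x0
print(finite_diff, under_the_sign)              # both ~ 0.3078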
These two theorems help us determine when we can naively differentiate under the integral sign. We see here just why the dominated convergence theorem is useful: domination is a hypothesis in both theorems. We now turn to the issue of integration over product spaces. Recall that if (Ω₁, A₁, m₁) and (Ω₂, A₂, m₂) are two measure spaces, we can look at the measure space (Ω₁ × Ω₂, A₁ ⊗ A₂, m₁ ⊗ m₂). One particular question we want to answer is when it is true that integration over the product space is akin to integrating twice over the respective measure spaces. For this, we will need a certain hypothesis.
Definition 2.11. We say that a measure µ on Ω is σ-finite if there exists a countable family of measurable sets (A_i)_{i∈I} such that Ω = ⋃_{i∈I} A_i and µ(A_i) < ∞ for all i ∈ I.
Proposition 2.12 (Measurability of slices). Suppose f(x₁, x₂) is measurable with respect to the product σ-algebra A₁ ⊗ A₂. Then, for each fixed x₁, the slice function x₂ ↦ f(x₁, x₂) is measurable with respect to A₂. Likewise, for each fixed x₂, x₁ ↦ f(x₁, x₂) is measurable with respect to A₁.
Proof. Apply the usual machinery starting from indicator functions, then simple
functions and use the MCT to prove the property for positive functions.
We are now in a position to state the theorem that will allow us to inter-
change multiple integrals. This is the famous Fubini-Tonelli theorem.
Theorem 2.13 (Fubini). Let m₁, m₂ be two σ-finite measures. Then:
1. If f(x₁, x₂) is positive and measurable with respect to the product σ-algebra, then
∫_{Ω₁×Ω₂} f(x₁, x₂) d(m₁ ⊗ m₂) = ∫_{Ω₁} ( ∫_{Ω₂} f(x₁, x₂) dm₂(x₂) ) dm₁(x₁)
= ∫_{Ω₂} ( ∫_{Ω₁} f(x₁, x₂) dm₁(x₁) ) dm₂(x₂).
2. If f(x₁, x₂) isn't positive but is integrable with respect to m₁ ⊗ m₂, the same identities hold.
Proof. We mentioned earlier a few magic theorems that let us assume certain measures exist. In our case these were Carathéodory's extension theorem and the λ-π theorem. While we will not delve into those here, we just mention that the second one is needed for this proof. The idea is to consider the three members of the equality above and apply the usual machinery: first consider indicator functions, then use the λ-π theorem to extend to simple functions, and finally apply the MCT as usual.
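To see why some hypothesis (positivity or integrability) is genuinely needed in part 2, here is a sketch of the classical counterexample on N × N with counting measure: f(m, n) = 1 if m = n, −1 if m = n + 1, 0 otherwise. Each inner sum below ranges over the whole (two-point) support of its row or column, so the printed values are the exact iterated sums truncated at M:

M = 1_000

def f(m, n):
    # +1 on the diagonal, -1 just below it, 0 elsewhere
    return 1 if m == n else (-1 if m == n + 1 else 0)

row_sums = [sum(f(m, n) for n in range(m + 2)) for m in range(M)]
col_sums = [sum(f(m, n) for m in range(n + 2)) for n in range(M)]
print(sum(row_sums), sum(col_sums))   # 1 and 0: the iterated sums disagree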
2.4 Change of Variables
Right after defining measures, we talked about the measure P_X(B). In this section, we will construct tools that help us integrate with respect to such measures. First, we give the formal definition of measures of the form P_X.
Definition 2.14 (Image measure). Suppose f is a measurable function from (Ω, F, µ) to (Ω′, F′). The function µ_f : F′ → [0, ∞] that sends an element A′ of F′ to µ(f⁻¹(A′)) is a measure on (Ω′, F′) called the image measure of µ by f (also called the pushforward of µ). In the case of probability spaces, we call P_X the probability distribution of the random variable X.
Our main topic in this subsection will be to understand how to work with integration with respect to image measures. First, we discuss a somewhat easy result.
Proposition 2.15 (Transfer theorem). Let f : (Ω, F, µ) → (Ω′, F′) be a measurable function. Suppose ϕ : Ω′ → R is measurable with respect to the Borel algebra. Then ϕ ∈ L¹(µ_f) if and only if ϕ ∘ f ∈ L¹(µ), in which case
∫_{Ω′} ϕ dµ_f = ∫_Ω ϕ ∘ f dµ.
This also holds true without any integrability hypothesis if ϕ is positive.
Proof. Left as an exercise for the reader. The proof is the same as usual: start with indicator functions and build up to general functions.
This proposition is quite general and we would like to specialize to a case
where image measures have a very specific form.
Definition 2.16 (Measures with density). Let (Ω, F, µ) be a measure space. We will say a measure ν has a density g with respect to µ if
ν(A) = ∫_A g(ω) dµ(ω).
It is with respect to those measures that integration will be most interesting for our purposes. In particular, we will be interested in changing measures on Rⁿ. This is done via the following change of variables formula, which is quite handy. You might have seen it applied in vector calculus or differential geometry before.
Theorem 2.17 (Change of variables). Suppose f : Ω → Ω′ is a C¹-diffeomorphism between two open subsets of Rⁿ and let ρ : Ω → R₊ be a measurable function. Let µ be the measure of density ρ with respect to the usual (product) Lebesgue measure on Ω: dµ(x) = 1_Ω(x)ρ(x)dx. Then the image measure µ_f is
dµ_f(y) = 1_{Ω′}(y) ρ(f⁻¹(y)) |det Df⁻¹(y)| dy.
Thus, for any measurable function Φ : Ω′ → R that is either positive or in L¹(µ_f), the following holds:
∫_Ω Φ(f(x)) dµ(x) = ∫_{Ω′} Φ(y) ρ(f⁻¹(y)) |det Df⁻¹(y)| dy.
Proof. The proof can be found in any standard text on vector analysis.
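Here is a quick Monte Carlo sketch of the formula in one dimension (assuming numpy; the parameters are our own choice): with X uniform on (0, 1), so ρ = 1, and f(x) = x², the theorem predicts that Y = f(X) has density 1/(2√y) on (0, 1), which a histogram confirms.

import numpy as np

rng = np.random.default_rng(0)
y = rng.random(1_000_000) ** 2          # Y = f(X), X uniform on (0, 1)

hist, edges = np.histogram(y, bins=20, range=(0.0, 1.0), density=True)
mids = (edges[:-1] + edges[1:]) / 2
print(np.c_[mids, hist, 1 / (2 * np.sqrt(mids))][:5])  # empirical vs predicted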
This summarizes the tools we will need to discuss probability properly. Be-
fore that however, we need to discuss spaces of functions and exactly what kind
of convergence there can be between sequences of functions and their limits.
3 Function Spaces
3.1 Introduction
In basic real analysis, one learns early on about pointwise convergence. Usually, this is in the form of ε-δ proofs, where we construct a δ that depends on both x and ε. Then, later on, one realizes pointwise convergence is not strong enough to interchange limits between sequences and other objects, and we require a stronger notion of convergence, namely that of uniform convergence. More specifically, this time we only allow δ to depend on ε and not at all on the x at hand. In a sense, the function itself converges to the limit, and not just each point at its own speed. In this section, we discuss the various relationships between convergence of functions and spaces of functions. We won't prove results in this chapter as they are all very standard results found in any text on functional analysis, and aren't the meat of probability theory.
3.2 L^p spaces
In the integration section, we first defined the space of measurable functions, on which the integral is defined. Then we looked (for a measure µ) at the space L¹. In this subsection, we want to expand this in two directions. First, we want to discuss whether there's a meaning to the space L^p, for a general p ≥ 1. After that, we want to see if those spaces are complete, that is to say, whether Cauchy sequences always converge in the space itself.
Definition 3.1. Let p ∈ [1, +∞). We write L^p for the space of p-integrable functions, that is to say, those measurable functions f with
‖f‖_p = ( ∫ |f|^p dµ )^{1/p} < +∞.
Notice that ‖·‖_p is a norm on that space and thus L^p is a normed vector space.
Recall from functional analysis (or topology if you're Swiss) that every normed vector space has a Banach space completion. Effectively, this means there exists a complete space based on L^p. Its existence is a standard exercise in functional analysis.
Theorem 3.2 (Completion). Let L^p denote the space of equivalence classes of functions in the space of p-integrable functions above, where f ∼ g if and only if f = g almost everywhere. Then L^p is the completion in question: it is a Banach space. Furthermore, if p = 2, it is actually a Hilbert space with inner product defined as
⟨f|g⟩₂ = ∫ f g dµ.
This theorem relies on a few fundamental inequalities that we state next.
Theorem 3.3. We have that:
1. For f, g ∈ L^p: ‖f + g‖_p ≤ ‖f‖_p + ‖g‖_p (Minkowski's Inequality);
2. For f ∈ L^p and g ∈ L^q with 1/p + 1/q = 1: ‖fg‖₁ ≤ ‖f‖_p ‖g‖_q (Hölder's Inequality).
The case p = q = 2 of the second inequality is very important in its own right and is called the Cauchy-Schwarz inequality. For now, we avoid talking about the case p = 1, where q = ∞.
These new spaces give us a new way to talk about convergence. There is a subtle question of what it means exactly to integrate an equivalence class of functions, but do notice that for any two representatives of a class [f], say f and f′, the integrals are equal. Indeed, this is exactly what we wanted and what we defined it to be. This is true because
∫ |f − f′| dµ = ∫_{N ∪ N^c} |f − f′| dµ = ∫_{N^c} |f − f′| dµ = 0,
where N is a set of measure zero outside of which f = f′, and thus integrating over it gives us zero. Notice that here we implicitly assume the convention 0 · ∞ = 0 in this context. This is a convention set to make sure everything holds. Going forward, we will just assume that by f ∈ L^p, we mean a representative of the equivalence class of f. With that out of the way, let us define convergence in our new norm.
Definition 3.4. We say that f_n converges to f in L^p, sometimes denoted by f_n →^{L^p} f, if
lim_{n→∞} ‖f_n − f‖_p = 0.
This gives us a new way to define convergence. In the next section, we finally
introduce probability and we will look at even more modes of convergence.
4 Probability Theory
4.1 Introduction
In this section, we begin our study of probability by considering a probability
space (Ω, F, P) and real-valued random variables X, Y . Effectively, this means
that P(Ω) = 1, and X, Y are measurable with respect to F.
4.2 Expectation and Variance
In elementary probability theory, more specifically when we consider discrete random variables, we have the notion of expectation. Usually, this is defined as
Σ_{i=1}^n P(X = x_i) x_i = Σ_{i=1}^n p_i x_i.
Using measure-theoretic tools, we will generalize this by treating this sum as a special case of a more general construct.
Definition 4.1. Let (Ω, F, P) be a probability space and X a real-valued random variable on it. Then the expectation of X is defined as
E[X] = ∫_Ω X(ω) dP(ω).
This definition is rather abstract, and it is not so clear what the usual defi-
nition of expectation has to do with it. To relate it to our usual definition, we
need to work with random variables that have a density. That is to say, those
for which integrating against the probability measure gives us something that
allows us to transfer the integral over to R, which lets us apply calculus to it.
More specifically:
Definition 4.2 (Probability density). Let X be a real-valued random variable on Ω. Let S ⊆ R be a Borel set. We say f_X(x) is a probability density for X if
P(X ∈ S) = P_X(S) = ∫_R 1_S f_X(x) dm(x) = ∫_S f_X(x) dm(x).
Such a function is necessarily positive a.e. and has integral equal to 1.
This is closely related but not the same as the following:
Definition 4.3 (CDF). Let X be a real-valued random variable on Ω. The cumulative distribution function of X is defined as F(t) = P(X ≤ t), for t ∈ R. If X happens to have a density function f_X, then P(X ≤ t) = ∫_{−∞}^t f_X(x) dm(x). Furthermore, if F is differentiable at t, then F′(t) = f_X(t).
We now apply the transfer theorem to understand how to integrate random
variables with density.
Theorem 4.4 (Transferring probability measures). Suppose X has a density f(x). This means that P_X(A) = ∫ 1_A(x) f(x) dm(x), where m is the Lebesgue measure. Then for any measurable function Φ : R → R₊, we have that
E[Φ(X)] = ∫_Ω Φ ∘ X dP = ∫_R Φ(x) dP_X(x) = ∫_R Φ(x) f(x) dm(x).
In particular, we have that
E[X] = ∫_R x f(x) dm(x).
Proof. Apply the transfer theorem to X.
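For instance (a numerical sketch, assuming numpy; the distribution and the choice Φ(x) = x² are our own), for X ∼ Exp(2) with density f(x) = 2e^{−2x}, the two sides of the theorem can be compared directly:

import numpy as np

rng = np.random.default_rng(0)
X = rng.exponential(scale=0.5, size=1_000_000)   # Exp(2): scale = 1/lambda

Phi = lambda x: x**2
monte_carlo = Phi(X).mean()                      # E[Phi(X)] over Omega

x = np.linspace(0.0, 20.0, 2_000_001)            # tail beyond 20 is negligible
against_density = np.mean(Phi(x) * 2 * np.exp(-2 * x)) * 20.0  # Riemann sum
print(monte_carlo, against_density)              # both ~ E[X^2] = 0.5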
Next we define the last two important concepts from probability theory that we need to prove interesting theorems. The first one is fundamental; the second one we won't use much, but we mention it for completeness' sake.
Definition 4.5 (Variance). Let X be a real-valued random variable. Then the variance of X, denoted V(X), is defined as E[(X − E[X])²]. A quick calculation shows that V(X) = E[X²] − E[X]².
The final concept we wish to define in this subsection is that of characteristic
functions (not to be confused with indicator functions). Those functions arise
from applying Fourier analysis to probability theory.
Proposition 4.6. Let X be a real-valued random variable on Ω. The characteristic function of X is defined as Φ_X(t) = E[e^{itX}]. The characteristic function of X completely determines the distribution of X. In particular, if X and Y are two random variables, then
Φ_X = Φ_Y ⟺ F_X = F_Y.
Proof. We don't concern ourselves with the proof here, as the characteristic function won't be used anywhere else; we just mention it in passing.
4.3 A Few Inequalities
In this subsection, we take a look at a few useful inequalities that will be nec-
essary to prove some important theorems down the line. They are important
in their own right however, which is why we discuss them now. Here, X always
means a real-valued random variable.
Theorem 4.7 (Jensen's Inequality). Let ϕ : R → R be a convex function. If X and ϕ(X) are both integrable, then
ϕ(E[X]) ≤ E[ϕ(X)].
Proof. From convexity, we know that ϕ has at any point x ∈ R a left-derivative ϕ′_g(x) and that
ϕ(y) − ϕ(x) ≥ ϕ′_g(x)(y − x)
for any y ∈ R. Thus, ϕ(X) − ϕ(E[X]) ≥ ϕ′_g(E[X])(X − E[X]). The result is obtained after taking the expectation of this inequality.
Theorem 4.8 (Markov's Inequality). Let X be a real-valued random variable. Then, for all t > 0, we have
P(X ≥ t) ≤ E[|X|] / t.
Proof. It's enough to take the expectation of the pointwise inequality
t 1_{X≥t} ≤ X 1_{X≥t} ≤ |X|.
From this inequality, we get the following:
Theorem 4.9 (Bienaymé-Chebyshev's Inequality). If X² is integrable, then
P(|X − E[X]| ≥ t) ≤ V(X) / t².
Proof. For a positive random variable Y, since {Y ≥ t} = {Y^p ≥ t^p}, we get from Markov's inequality that
P(Y ≥ t) ≤ E[Y^p] / t^p.
Take Y = |X − E[X]| and p = 2, and we get our result by applying the above to Y.
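A quick empirical sketch (assuming numpy; the choice X ∼ N(0, 1), so E[X] = 0 and V(X) = 1, is ours) shows how conservative the bound is: the true tails of a standard normal sit well below V(X)/t² = 1/t².

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal(1_000_000)    # E[X] = 0, V(X) = 1

for t in [1.0, 2.0, 3.0]:
    print(t, (np.abs(X) >= t).mean(), 1 / t**2)   # empirical tail vs bound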
4.4 Independence of Random Variables
In elementary probability theory, we have seen the notion of two events being independent. Usually, this is stated as
P(A ∩ B) = P(A)P(B).
We will generalize this relation to σ-algebras first, and then to random variables themselves, thus answering the question: "What does it mean for two random variables to be independent?"
Definition 4.10 (Independence of σ-algebras). The events {A_i}_{i∈I} are said to be mutually independent if for any finite family of indices {i₁, ..., i_n} in I, we have that
P(A_{i₁} ∩ A_{i₂} ∩ ··· ∩ A_{i_n}) = Π_{k=1}^n P(A_{i_k}).
Similarly, if {F_i}_{i∈I} is a family of σ-algebras on Ω, we say that they are mutually independent if for any finite family of indices {i₁, ..., i_n} in I, and for any choice of A_{i_k} ∈ F_{i_k}, we have that
P(A_{i₁} ∩ A_{i₂} ∩ ··· ∩ A_{i_n}) = Π_{k=1}^n P(A_{i_k}).
This is consistent with what we’ve seen in elementary probability. Of interest
to us is to generalize this notion to random variables themselves.
Definition 4.11 (Independence of random variables). Let {X_i}_{i∈I} be a family of random variables. We say that they are mutually independent, or more often simply independent, if and only if the generated σ-algebras σ(X_i) are independent.
While this is a perfectly fine definition on its own, it's sometimes bothersome to check. Luckily, there's a nice proposition that summarizes equivalences between various ideas of independence for random variables. First, we need to discuss random vectors. Indeed, so far we've only talked about random variables from Ω to R, but it makes perfect sense to think of random variables from Ω to Rⁿ for some n. Those random variables will be called random vectors. They also have a probability distribution and a cumulative distribution function. Usually, we write F(t₁, ..., t_n) = P(X₁ ≤ t₁, ..., X_n ≤ t_n), where (t₁, ..., t_n) is a vector in Rⁿ. Again, some of these will have a density function. Henceforth, we will call it the joint distribution density. To contrast, the density function of a single random variable will be called the marginal distribution density. With that out of the way, here is the proposition.
Proposition 4.12. The following are all equivalent:
1. The variables {X_i} are mutually independent;
2. For any finite family of indices {i₁, ..., i_n}, the joint distribution of the random vector (X_{i₁}, ..., X_{i_n}) is equal to the product of the distributions of each random variable, i.e. P_{X_{i₁},...,X_{i_n}} = Π_{k=1}^n P_{X_{i_k}};
3. For any finite family of indices, and for any choice of measurable bounded functions f_{i_k} : R → R, we have E[f_{i₁}(X_{i₁}) ··· f_{i_n}(X_{i_n})] = Π_{k=1}^n E[f_{i_k}(X_{i_k})].
Proof. It is straightforward that 2 ⟹ 3 ⟹ 1. We deduce 2 from 1 as in the proof of Fubini's theorem: first for (f_{i_k}) indicator functions of measurable (Borel) sets, then for simple functions, and finally we apply the DCT.
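Criterion 3 is easy to probe by simulation (a sketch, assuming numpy, with the bounded test function cos; the distributions are our own choice): for independent X, Y the expectation factorizes, while for the dependent pair (X, X) it visibly does not.

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal(1_000_000)
Y = rng.standard_normal(1_000_000)    # independent of X

# Independent pair: E[cos(X)cos(Y)] ~ E[cos(X)] E[cos(Y)] (both ~ 0.368)
print(np.mean(np.cos(X) * np.cos(Y)), np.cos(X).mean() * np.cos(Y).mean())
# Dependent pair (Y = X): ~ 0.568 vs ~ 0.368, so no factorization
print(np.mean(np.cos(X) * np.cos(X)), np.cos(X).mean() ** 2)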
With this out of the way, we can now begin discussing the main modes of
convergence that exist in probability theory.
5 Modes of Convergence
5.1 Introduction
We come now to the final section. In this section, we set out to do two things.
The first one is to investigate what kind of convergence exists between sequences
of random variables and their limits. The second one is to present two important
theorems in probability and statistics: the law of large numbers and the central
limit theorem.
5.2 Various notions of convergence
So far, we've encountered four different kinds of convergence. We will not be concerned too much with uniform convergence and pointwise convergence of random variables here. Instead, we will concern ourselves with convergence almost everywhere (which we will call convergence almost surely from now on), convergence in L^p, and two new modes of convergence. We assume all our random variables are real-valued.
Definition 5.1. Let (X_n) be a sequence of random variables on (Ω, F, P). We say that the sequence converges to X
1. almost surely, if
P({ω : lim_{n→∞} X_n(ω) = X(ω)}) = 1;
2. in L^p, for p ≥ 1, if X_n and X are in L^p and
lim_{n→∞} ‖X_n − X‖_p = 0;
3. in probability, if for all ε > 0, we have
lim_{n→∞} P(|X_n − X| ≥ ε) = 0;
4. in distribution, or weakly, if
lim_{n→∞} F_n(x) = F(x)
at every continuity point x of F, where F_n and F are the cumulative distribution functions of X_n and X respectively.
We denote convergence in probability by X_n →^P X, and convergence in distribution by X_n →^{distr.} X. Convergence in distribution is the same as saying that the characteristic functions of the X_n converge pointwise to the characteristic function of X.
Notice that this definition of convergence almost surely is precisely the same as converging except possibly on a set of measure zero. Convergence in probability is weaker, however, as the following shows.
Proposition 5.2. We have that X_n → X a.s. if and only if for all ε > 0,
lim_{m→∞} P( sup_{n≥m} |X_n − X| ≥ ε ) = 0.
In particular, X_n →^{a.s.} X implies X_n →^P X.
Proof. By definition of the convergence of a sequence in R, we get
{X_n → X}^c = ⋃_{k≥1} ⋂_{m≥1} ⋃_{n≥m} {|X_n − X| > 1/k} = ⋃_{k≥1} ⋂_{m≥1} { sup_{n≥m} |X_n − X| ≥ 1/k }.
Thus, X_n → X a.s. if and only if, for all k ≥ 1,
P( ⋂_{m≥1} { sup_{n≥m} |X_n − X| ≥ 1/k } ) = 0.
Using the properties of measures discussed before, we know this to be equivalent to: for all k ≥ 1,
lim_{m→∞} P( sup_{n≥m} |X_n − X| ≥ 1/k ) = 0.
But now we are done, since for any ε > 0 we can find a k ≥ 1 such that ε > 1/k, and this is equivalent to the result we were trying to prove.
This is similar to the difference between uniform and pointwise convergence of real functions, where f_n → f uniformly if and only if lim_{n→∞} sup_{x∈R} |f_n(x) − f(x)| = 0. So we have now found one implication for our various modes of convergence. One could ask whether convergence in probability implies convergence almost surely. The answer in general is no, but there is a certain relationship nonetheless.
Proposition 5.3. The sequence (X_n) converges to X in probability if and only if from any increasing sequence of natural numbers (n_k), there exists a subsequence (n_{k_j}) such that X_{n_{k_j}} →^{a.s.} X.
Proof. The proof relies on a few other results that we don't feel are necessary in the presentation of this material, so the proof is excluded.
So while (X_n) itself might not converge almost surely to X, one of its subsequences does. Next we cover the case of convergence in L^p.
Proposition 5.4. Let p ≥ q ≥ 1. If X_n →^{L^p} X, then X_n →^{L^q} X. In particular, L² convergence implies L¹ convergence, and L¹ convergence implies convergence in probability.
Proof. Let p ≥ q > 0 and let α = p/q ≥ 1. From Jensen's inequality applied to the convex function y ↦ y^α, we get
E[|Y|^p] = E[(|Y|^q)^α] ≥ E[|Y|^q]^α.
This shows L^p ⊆ L^q and thus convergence in L^p implies convergence in L^q. That L¹ convergence implies convergence in probability follows from Markov's inequality; more abstractly, convergence in probability is equivalent to convergence in the space L⁰, which we have not defined here.
We now want to see if convergence in probability is equivalent to convergence in L¹. Again, this isn't exactly true, but it holds if we impose an additional requirement on our sequence.
Proposition 5.5 (Uniform integrability). A sequence of random variables (X_n) is said to be uniformly integrable if
lim_{c→∞} sup_{n∈N} E[ |X_n| 1_{|X_n|≥c} ] = 0.
Furthermore, if (X_n) is uniformly integrable, then it is bounded in L¹. Conversely, if (X_n) is dominated by some Y ∈ L¹, or if (X_n) is bounded in L^p for some p > 1, then (X_n) is uniformly integrable.
Proof. We prove both parts separately.
1. We have that sup_n E[|X_n|] ≤ sup_n E[|X_n| 1_{|X_n|≤c}] + sup_n E[|X_n| 1_{|X_n|≥c}]. The first term is bounded by c, while the second is bounded for c large enough because it tends to 0.
2. Suppose |X_n| ≤ Y. Then
E[ |X_n| 1_{|X_n|≥c} ] ≤ E[ Y 1_{Y≥√c} ] + E[ Y 1_{Y<√c} 1_{|X_n|≥c} ]
≤ E[ Y 1_{Y≥√c} ] + √c P(|X_n| ≥ c)
≤ E[ Y 1_{Y≥√c} ] + (√c / c) E[|X_n|]
using Markov's inequality. The first term tends to 0 when c → ∞ via the MCT; the second one is bounded by E[Y]/√c. Suppose now that (X_n) is bounded in L^p for p > 1. We get
E[ |X_n| 1_{|X_n|≥c} ] ≤ ‖X_n‖_p P(|X_n| ≥ c)^{1/q} ≤ ‖X_n‖_p ( E[|X_n|^p] / c^p )^{1/q}
by successively applying Hölder's and Markov's inequalities; since (X_n) is bounded in L^p, this bound tends to 0 as c → ∞, uniformly in n.
With this, we can state the theorem that relates L¹ convergence and convergence in probability.
Theorem 5.6. Let (X_n) be a uniformly integrable sequence of random variables. If (X_n) converges to X in probability, then (X_n) converges to X in L¹.
Proof. First, we need to settle the matter of integrability of X. From the characterization of convergence in probability, we know there exists a subsequence (X_{n_k}) that converges to X almost surely. Using Fatou's lemma combined with the proposition above (uniform integrability implies boundedness in L¹), we deduce
E[|X|] ≤ lim inf_{k→∞} E[|X_{n_k}|] < ∞,
which gives us integrability.
Let Y_n = |X_n − X|. Since (X_n) is uniformly integrable and X is integrable, (Y_n) is uniformly integrable (why?). Thus, for any ε > 0,
E[Y_n] = E[Y_n 1_{Y_n≥ε}] + E[Y_n 1_{Y_n<ε}] ≤ E[Y_n 1_{Y_n≥ε}] + ε.
Choose c > ε such that sup_n E[Y_n 1_{Y_n≥c}] ≤ ε. Then
E[Y_n 1_{Y_n≥ε}] ≤ E[Y_n 1_{Y_n≥c}] + E[Y_n 1_{c>Y_n≥ε}] ≤ ε + c P(Y_n ≥ ε).
Thus,
lim sup_{n→∞} E[Y_n] ≤ 2ε,
since Y_n → 0 in probability. Since ε was arbitrary, we get the desired result.
Lastly, we discuss the final implication we can get. We invite the reader to
try and find counterexamples to the other implications.
Theorem 5.7. If (X_n) converges to X in probability, then (X_n) converges to X in distribution. Convergence in distribution is thus the weakest form of convergence we have defined.
Proof. The proof of this result relies on a technical lemma we do not need in the primary presentation of the material, so we leave it out.
Theorem 5.8. The implications we've discussed are the only ones that exist between our various modes of convergence. For q ≥ p ≥ 1, we have that
(X_n) →^{a.s.} X ⟹ (X_n) →^P X ⟹ (X_n) →^{weakly} X,
(X_n) →^{L^q} X ⟹ (X_n) →^{L^p} X ⟹ (X_n) →^P X,
(X_n) →^{P + U.I.} X ⟺ (X_n) →^{L¹} X.
Proof. Left to the reader, as it gives good intuition into how modes of convergence work. The reader can consult Wikipedia or more academic sources for standard counterexamples.
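For one standard counterexample, here is a sketch of the "typewriter sequence" on Ω = [0, 1] with Lebesgue measure (the code and its names are ours): X_n is the indicator of a dyadic interval of length 2^{−k} that keeps sweeping across [0, 1], so P(X_n ≠ 0) → 0 (convergence in probability and in every L^p), yet every single ω is hit infinitely often, so X_n does not converge almost surely.

def interval(n):
    # n = 2**k + j  ->  the dyadic interval [j / 2**k, (j + 1) / 2**k)
    k = n.bit_length() - 1
    j = n - 2**k
    return j / 2**k, (j + 1) / 2**k

omega = 0.7                          # any fixed sample point
hits = [n for n in range(1, 4096) if interval(n)[0] <= omega < interval(n)[1]]
print(hits)                          # one hit per dyadic generation: X_n(omega) = 1 infinitely often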
5.3 The Law of Large Numbers
We have developed much machinery to discuss probability. We are now able to
define precisely what the word probability itself means, and how to work with
its most basic objects. But how do we reconcile this with the more naive ideas
of elementary probability? For instance, when we flip a fair coin, we know the
probability of getting heads is exactly 1/2. But clearly, this doesn’t mean that
every time I flip a coin ten times in a row, I will get exactly 5 heads. How are
we sure that our intuitive idea of probability is grounded in mathematics? This
comes in the form of laws of large numbers. Briefly, the theorem tells us that the
more we flip the coin, the closer the average value will get to the expectation.
This is how casinos make money. Even if they get lucky winners from time
to time, the law of large numbers provides a mathematical background that
assures the casino that in the long run, it will make money. We only prove the
weak law of large numbers and leave the proofs of the other two results to the
mathematical literature.
Definition 5.9. Let (X_n) be a sequence of random variables. The sample average up to n is the random variable
X̄_n = (1/n) Σ_{k=1}^n X_k.
It will be of interest to us to consider sequences of random variables that are both independent and identically distributed (i.i.d.). This means that F_{X_i} = F_{X_j} for all i, j in the index set. Effectively, the joint distribution is then the product of the marginal distributions, which are all equal. Furthermore, this implies that E[X_i] = E[X₁] for all i. We will usually denote this common expectation by µ. Our theorem comes in two flavors, which we can now appreciate because of our work in the last section. As we have seen, convergence almost surely necessarily implies convergence in probability. This is the distinction between the two following theorems.
Theorem 5.10 (Weak Law of Large Numbers). Let (X_n) be a sequence of independent, identically distributed, integrable random variables. Then the sample average converges in probability to the expectation. Symbolically:
X̄_n →^P µ.
Proof. We assume in addition that the variance of the X_i is finite and equal to σ². The variance of X̄_n is equal to
V( (1/n) Σ_{i=1}^n X_i ) = (1/n²) V( Σ_{i=1}^n X_i ) = nσ²/n² = σ²/n,
where the middle equality comes from the independence of the random variables. Likewise, E[X̄_n] = µ. Using Bienaymé-Chebyshev's inequality, we get
P( |X̄_n − µ| ≥ ε ) ≤ σ² / (nε²).
From this, we get
P( |X̄_n − µ| < ε ) = 1 − P( |X̄_n − µ| ≥ ε ) ≥ 1 − σ² / (nε²).
Letting n → ∞, we get X̄_n →^P µ, which proves the result.
Theorem 5.11 (Strong Law of Large Numbers). If (X_n) is as above, then the sample average actually converges almost surely to the expectation. Symbolically:
X̄_n →^{a.s.} µ.
These tell us that in the long run, our usual interpretation of expectation as
the expected value is correct. Likewise, we have a law of large numbers for our
usual interpretation of probability itself.
Theorem 5.12 (Borel's Law of Large Numbers). Suppose we do repeated independent trials of a probabilistic experiment. Let E be an event and p = P(E) its probability. We let N_n(E) denote the number of times E occurs in the first n trials. Then:
N_n(E)/n →^{a.s.} p as n → ∞.
This is why we can expect to have approximately 50% heads and 50% tails in the long run after flipping fair coins for a long time. You can test this empirically by running a simulation.
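Such a simulation is a few lines (a sketch, assuming numpy): track the relative frequency of heads among the first n flips of a fair coin as n grows.

import numpy as np

rng = np.random.default_rng(0)
flips = rng.integers(0, 2, size=1_000_000)   # 1 = heads, fair coin

for n in [10, 100, 10_000, 1_000_000]:
    print(n, flips[:n].mean())               # N_n(E)/n drifts towards p = 0.5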
5.4 The Central Limit Theorem
In our last subsection, we discuss an important theorem that has applications in statistics. We are again interested in understanding the asymptotic behavior of the sample average. We assume that all our random variables are independent and identically distributed (i.i.d.). Recall from elementary probability that the density of the normal distribution N(µ, σ²) is of the form
(1/(σ√(2π))) e^{−(1/2)((x−µ)/σ)²}.
We know from the previous section that the sample average converges in probability and almost surely to the expectation. We are interested in understanding exactly how that happens. More specifically, we'd like to understand how the distribution itself changes as n tends to infinity. This is given by the following classical theorem.
Theorem 5.13 (Classical Central Limit Theorem). Suppose E[X_i] = µ and V(X_i) = σ² < ∞. Then, as n approaches infinity, the random variables √n (X̄_n − µ) converge in distribution to a normal N(0, σ²). Symbolically:
√n (X̄_n − µ) →^{distr.} N(0, σ²).
What does this imply for statistics? It explains why many density estimates have a bell-curve shape: the shape comes from the normal distribution itself. If we apply this to the coin-flipping example from the last section, we find that flipping many coins gives an approximately normal distribution for the number of heads (or tails, for that matter).
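This too can be visualized with a short simulation (a sketch, assuming numpy; we pick Exp(1) variables, so µ = σ² = 1, as our own example): the standardized sample averages √n(X̄_n − µ) behave like a standard normal.

import numpy as np

rng = np.random.default_rng(0)
n, reps = 1_000, 100_000
Z = np.sqrt(n) * (rng.exponential(size=(reps, n)).mean(axis=1) - 1.0)

print(Z.mean(), Z.std())              # ~ 0 and ~ 1
print(np.mean(np.abs(Z) <= 1.96))     # ~ 0.95, as for N(0, 1)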
Appendices
A Common Distributions
In this appendix, we list a few common probability distributions for the reader. For the reader's sake, we also rewrite the change of variables formula in probabilistic terms, so that they may use it to understand how products or sums modify the distributions of the random variables involved.
Theorem A.1 (Change of variables: probabilistic version). Suppose X is an Rⁿ-valued random vector with density f_X. Then, if φ is a C¹-diffeomorphism, the random variable Y = φ(X) has density
g(y) = f_X(φ⁻¹(y)) |det Dφ⁻¹(y)|.
A.1 Discrete Random Variables
Example A.2 (Binomial distribution). Consider the following probabilistic experiment. You perform n independent yes-no experiments, where yes has probability p and no has probability q = 1 − p. The binomial distribution is the probability distribution of a random variable X that counts the number of successes in the sequence. We write X ∼ B(n, p) and we note that
P(X = k) = (n choose k) p^k (1 − p)^{n−k}.
If n = 1, we call this the Bernoulli distribution.
Example A.3 (Poisson distribution). We say that X follows a Poisson distri-
bution with parameter λ > 0 if
P(X = k) = λ^k e^{−λ} / k!.
This distribution is usually used to compute the probability of a given number
of events occurring in a fixed interval of time, provided these events occur at a
constant mean rate and independently of the time since the last event.
A.2 Continuous Random Variables
Example A.4 (Uniform distribution). The uniform distribution on [a, b] occurs when X has probability density function
f_X(x) = (1/(b − a)) 1_{[a,b]}(x).
Example A.5 (Beta distribution). If X follows a Beta distribution (usually written X ∼ Beta(α, β)), then
f_X(x) = (1/B(α, β)) x^{α−1} (1 − x)^{β−1} for x ∈ (0, 1),
where B(α, β) is a normalization constant equal to
B(α, β) = ∫₀¹ t^{α−1} (1 − t)^{β−1} dt.
Example A.6 (Gamma distribution). We say that X follows a Gamma distribution (X ∼ Gamma(α, β)) if its density function is of the form
f_X(x) = (β^α / Γ(α)) x^{α−1} e^{−βx} 1_{x≥0},
where Γ(α) is the Gamma function applied to α.
Example A.7 (Exponential distribution). We say that X follows an exponential distribution (X ∼ Exp(λ)) if
f_X(x) = λ e^{−λx} 1_{x≥0}.
Furthermore, we have that n · Beta(1, n) → Exp(1) in distribution as n → ∞.
Example A.8 (Normal distribution). As we've seen before, a random variable has a normal distribution, X ∼ N(µ, σ²), if
f_X(x) = (1/(σ√(2π))) e^{−(1/2)((x−µ)/σ)²}.
Example A.9 (Cauchy distribution). This distribution usually has parameters, but we will stick with the simplest version of it. We say X follows a Cauchy distribution if
f_X(x) = (1/π) · 1/(1 + x²).
Try to compute the expectation and variance of this distribution.