Probability Theory:
A Summary
zweistein
Desu-Cartes
Contents
1 Measure Theory
  1.1 Introduction
  1.2 Measures and algebras
2 Integration Theory
  2.1 Introduction
  2.2 Integration of real-valued functions
  2.3 Multiple Integrals and Derivatives
  2.4 Change of Variables
3 Function Spaces
  3.1 Introduction
  3.2 L^p spaces
4 Probability Theory
  4.1 Introduction
  4.2 Expectation and Variance
  4.3 A Few Inequalities
  4.4 Independence of Random Variables
5 Modes of Convergence
  5.1 Introduction
  5.2 Various notions of convergence
  5.3 The Law of Large Numbers
  5.4 The Central Limit Theorem
Appendices
A Common Distributions
  A.1 Discrete Random Variables
  A.2 Continuous Random Variables
1 Measure Theory
1.1 Introduction
In order to define modern probability, we will have to start by making a neces-
sary detour in measure theory and integration. In general, we will not define a
probability measure on all subsets of our space. Instead, perhaps in analogy to
topology and open sets, we define which sets we decide to work with. Those sets
will form a particular algebraic structure, and we will define a probability on it.
After that, we define integration in general and consider the important question
of interchanging limits between our integral and other analytical objects such
as sequences, series and other integrals. With this out of the way, we can define
function spaces and look at what kind of structure our functions form. Finally,
we can define probability theory using all the tools we've built so far, and analyze the various modes of convergence that exist between probabilistic objects. These notes culminate in the strong law of large numbers, which reinforces our intuition about what probability is really about.
1.2 Measures and algebras
Definition 1.1. We consider a universe of events (a set, really) Ω. Let F ⊆ P(Ω) be a family of sets over Ω. We say that F is a σ-algebra if the following properties are verified:
1. For any countable family of subsets in F, their union is in F.
2. For any set A in F, its complement Aᶜ is in F.
3. The universe Ω is in F.
It is clear from this definition that countable intersections of elements of F are in F and that the empty set is in F. We call elements A ∈ F events or, more generally, measurable sets. The pair (Ω, F) is called a measurable space.
In some cases, we can take P(Ω) as our σ-algebra. We will usually consider this σ-algebra when working over countable spaces such as N. Since P(Ω) is always a σ-algebra for any universe Ω, we can use this fact to make sense of the notion of the σ-algebra generated by a family of subsets of Ω.
Definition 1.2. Let X ⊆ P(Ω). Then the σ-algebra generated by X is the intersection of all σ-algebras over Ω that contain X. This intersection is taken over a nonempty collection by the discussion above, and an intersection of σ-algebras is again a σ-algebra. We will denote this σ-algebra by σ(X).
It is not so clear why we need σ-algebras initially. Surely, if P(Ω) always works, why not just take that at all times? Choosing which sets are measurable fine-tunes the notion of "measurability" itself, just like specifying a topology on a space X restricts topological notions such as convergence or continuity. Just as in topology, we also have special maps that preserve the notion of being measurable.
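To make the generated σ-algebra concrete, here is a minimal computational sketch (in Python; the function name and set representation are our own, not part of any standard library). On a finite universe, σ(X) can be obtained by brute force: close the family under complements and pairwise unions until nothing new appears, since on a finite set countable unions reduce to finite ones.

from itertools import combinations

def generated_sigma_algebra(omega, X):
    # Brute-force sigma-algebra generated by a family X of subsets
    # of a *finite* universe omega (sets represented as frozensets).
    omega = frozenset(omega)
    family = {frozenset(), omega} | {frozenset(A) for A in X}
    changed = True
    while changed:
        changed = False
        for A in list(family):
            if omega - A not in family:          # closure under complement
                family.add(omega - A); changed = True
        for A, B in combinations(list(family), 2):
            if A | B not in family:              # closure under (finite) union
                family.add(A | B); changed = True
    return family

# sigma({{1}}) over {1,2,3,4} is {{}, {1}, {2,3,4}, {1,2,3,4}}:
print(sorted(map(sorted, generated_sigma_algebra({1, 2, 3, 4}, [{1}]))))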
Definition 1.3. Let (Ω, F) and (Ω′, F′) be two measurable spaces. We say a function f : (Ω, F) → (Ω′, F′) is measurable if it respects the σ-algebra structure. In other words, for any B ∈ F′, we have that f⁻¹(B) ∈ F. If F′ = σ(X), then it is sufficient to check this for the elements of X instead. From now on, we write f : Ω → Ω′ instead of the more cumbersome notation above.
This makes clearer the notion of "measurability" and how similar it is to topology. In topology, the open sets specify which functions are continuous; in measurable spaces, the elements of the σ-algebra specify which functions are measurable.
Proposition 1.4. Measurable functions respect the usual algebraic operations. Sums, (scalar and pointwise) products, quotients and compositions of measurable functions are measurable wherever these make sense. Furthermore, if the target space of the functions is a metric space, pointwise limits of measurable functions are measurable. Suprema, infima, limits superior and limits inferior of sequences of real-valued measurable functions are measurable as well.
Proof. The proofs are either not very illuminating or extremely easy. We just show composition. Suppose f : Ω₁ → Ω₂ and g : Ω₂ → Ω₃ are measurable functions whose domains and codomains match as one would want. Then the preimage of a set A under g ∘ f is (g ∘ f)⁻¹(A) = f⁻¹(g⁻¹(A)). Suppose A ∈ F₃. Then B = g⁻¹(A) ∈ F₂. In turn, C = f⁻¹(B) ∈ F₁. This holds because both f and g are measurable. Thus for A ∈ F₃, (g ∘ f)⁻¹(A) ∈ F₁, which shows g ∘ f is measurable.
As one can see, the class of measurable functions is quite rich. This is a good thing: since we will define integration on those functions, we want as many functions as possible to be at least measurable. Continuous functions are measurable, provided the σ-algebras on the domain and codomain are the ones generated by the open (or closed) sets. We will call this σ-algebra the Borel algebra. In the case of the real line, we will write it as B(R). We now define the main object of study in measure theory.
Definition 1.5. A measure is a function µ : F → [0, +∞] such that the following holds:
1. µ(∅) = 0;
2. µ(⋃_{i∈I} X_i) = Σ_{i∈I} µ(X_i), for any countable family of pairwise disjoint sets (X_i)_{i∈I} in F.
The second property is called σ-additivity (sometimes also known as countable
additivity). If µ(Ω) = 1, we say that µ is a probability measure, or more simply,
a probability. The triple (Ω, F, µ) is called a measure space. If µ = P is a
probability, then we call the space (Ω, F, P) a probability space, and a measurable
function on it a random variable.
Proposition 1.6 (Properties of measures). Let (Ω, F, µ) be a measure space. Let A, B ∈ F. Then:
1. µ(A) = µ(A \ B) + µ(A ∩ B);
2. If µ(A) < ∞, then µ(A ∪ B) = µ(A) + µ(B) − µ(A ∩ B);
3. If A ⊆ B, then µ(A) ≤ µ(B). If in addition µ(A) < ∞, then µ(B \ A) = µ(B) − µ(A);
4. If (A_n)_n is an increasing sequence with A_n ∈ F for every n, then µ(⋃_{n=1}^∞ A_n) = lim_{n→∞} µ(A_n). If the sequence is decreasing and µ(A₁) < ∞, then µ(⋂_{n=1}^∞ A_n) = lim_{n→∞} µ(A_n). Furthermore, the limit is attained increasingly (resp. decreasingly).
Proof. Those are all easy properties that we leave to the reader. For 4, use a
decomposition of your space that allows you to apply σ-additivity.
It will be important for us to define measures based on random variables. Indeed, suppose P is a probability. We will often be interested in understanding the probability related to the values of random variables. Suppose X is a real-valued random variable on Ω (the space, by abuse of notation). We are interested in understanding P({ω : X(ω) ∈ B}) for an event B. We denote this quantity P_X(B). It is not too hard to see that P_X is a measure on Ω′ = R. This will become important in the probability section.
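As a tiny illustration (a sketch with made-up data, not part of the theory itself), on a finite universe one can compute P_X directly from the definition P_X(B) = P({ω : X(ω) ∈ B}):

P = {"a": 0.5, "b": 0.25, "c": 0.25}   # a probability on Omega = {a, b, c}
X = {"a": 1.0, "b": 1.0, "c": 2.0}     # a real-valued random variable

def P_X(B):
    # P_X(B) for a finite set B of real values, straight from the definition
    return sum(p for omega, p in P.items() if X[omega] in B)

print(P_X({1.0}), P_X({2.0}))          # 0.75 and 0.25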
Notice that we haven’t actually talked about the problem of existence of
measures. Indeed, we will not concern ourselves with it. For now, we just
assume there’s a magic theorem (a few, actually) that allows us to construct
all the measures we need. We finish this section by presenting two important
measures.
Proposition 1.7 (Lebesgue measure). There exists a translation-invariant measure λ on (R, B(R)) such that λ([a, b]) = b − a.
Proposition 1.8 (Product measure). Suppose Ω₁, Ω₂ are two measure spaces with measures m₁ (resp. m₂). Then there exists a measure m₁ ⊗ m₂ on Ω₁ × Ω₂ such that for all measurable A ⊆ Ω₁ and B ⊆ Ω₂, we have (m₁ ⊗ m₂)(A × B) = m₁(A)m₂(B).
2 Integration Theory
2.1 Introduction
Let (Ω, F, µ) be a measure space. We wish to define a linear functional f ↦ ∫ f dµ on the space of real-valued measurable functions. When we learn about the Darboux-Riemann integral, we start by subdividing the domain of integration. Here, we take a different approach: in a sense, we subdivide the range of the function and then sum. This subdivision comes in the form of limits of finite sums of indicator functions, which we define next.
Definition 2.1. The indicator function of a set A is the function 1_A which is equal to 1 on A and 0 everywhere else. It is clear that 1_A is measurable if and only if A is. A simple function is a finite linear combination of indicator functions of measurable sets.
We proceed in three steps to define our integral. First we construct our
integral for indicator functions only. Then we define it for simple functions. Fi-
nally, we define it for positive functions and extend that definition to measurable
functions in general.
2.2 Integration of real-valued functions
We now construct the integral for each class of functions discussed above. In
this subsection, a measurable function means a real-valued measurable function.
Definition 2.2 (Existence of the integral). Suppose f is a measurable function on (Ω, F, µ).
1. If f = 1_A is an indicator function, then we define ∫ 1_A dµ = ∫_A dµ = µ(A). This exists because f is assumed measurable (i.e., A ∈ F).
2. If f = Σ_{i=1}^n α_i 1_{A_i} is a simple function, then we define ∫ f dµ = Σ_{i=1}^n α_i µ(A_i). This exists because of 1.
3. If f is positive, we define ∫ f dµ = sup { ∫ g dµ }, where g ranges over the simple functions with 0 ≤ g ≤ f. More generally, for a measurable function we decompose it as f = f⁺ − f⁻ and apply 3 to both terms. Notice that our integral might be undefined here (when both terms are infinite).
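The "subdivide the range" idea can be seen numerically. The following sketch (assuming numpy; the fine grid is only a stand-in for ([0, 1], Lebesgue)) integrates the standard simple functions s_n = min(⌊2ⁿf⌋/2ⁿ, n) sitting below f = x²; their integrals increase towards the true value 1/3.

import numpy as np

x = np.linspace(0.0, 1.0, 1_000_001)   # a fine grid standing in for [0, 1]
f = x**2                               # a positive measurable function

for n in [2, 4, 8, 16]:
    s_n = np.minimum(np.floor(f * 2**n) / 2**n, n)  # simple function s_n <= f
    print(n, s_n.mean())               # grid mean ~ integral; increases to 1/3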
As we have seen in the third part of this definition, it is quite possible that the integral of a general measurable function is undefined. To fix that, we introduce a new class of functions, called integrable functions.
Definition 2.3. Let f be a measurable function. If ∫ |f| dµ < ∞, we say that the function is integrable. The class of integrable functions is a vector space denoted by L¹(µ). We will look more into this space and other similar spaces in section 3.
Definition 2.4 (Almost-everywhere convergence). We say that two functions f and g are equal almost everywhere if f = g except maybe on a set of measure zero. A sequence of functions (f_n) converges to f almost everywhere (a.e., or sometimes almost surely, a.s.) if it converges pointwise to f on the complement of a set of measure zero. Finally, we say that a property P(ω) holds almost everywhere if P holds everywhere except possibly on a set of measure zero. For instance, "for almost all x" means for all x except possibly those in a set N of measure zero.
It is important to notice that almost-everywhere properties pop up a lot in Lebesgue integration. This is so because our integral does not see sets of measure zero: if f = g almost everywhere with f, g ∈ L¹(µ), then ∫ (f − g) dµ = 0 always, as seen in the next proposition.
Proposition 2.5 (Properties of the integral). Suppose f is a measurable function.
1. The operator I(f) = ∫ f dµ is linear; on positive measurable functions it takes values in [0, +∞];
2. If f ≤ g are both in L¹(µ), then ∫ f dµ ≤ ∫ g dµ. The same holds true if the functions are merely positive.
3. If f ∈ L¹(µ), then |∫ f dµ| ≤ ∫ |f| dµ.
4. If A, B are disjoint measurable sets, then ∫_{A∪B} f dµ = ∫_A f dµ + ∫_B f dµ. If A has measure zero, then ∫_A f dµ = 0 always. If f = g a.e., then ∫ f dµ = ∫ g dµ.
Proof. For most of these properties, start by considering them first for indicator functions, then for simple functions, and then apply Theorem 2.7 below. For 4, decompose 1_{A∪B} = 1_A + 1_B; for the measure-zero claims, notice that the measure of the exceptional set N is zero, which makes the corresponding integral zero.
Right now, it is not so clear why this integral is more useful or powerful
than the usual Darboux integral. While there are other forms of integration on
R such as the gauge integral, the Lebesgue integral is extremely well-suited to
analysis for two reasons: the first is its convergence theorems; the second is the main topic of section 3. We now turn to those convergence theorems. They are the crux of Lebesgue integration theory.
Theorem 2.6 (Fatou's Lemma). For any sequence of positive measurable functions (f_n), we have
∫ lim inf_{n→∞} f_n dµ ≤ lim inf_{n→∞} ∫ f_n dµ.
Proof. This inequality comes from the fact that for k ≥ 1,
∫ inf_{n≥k} f_n dµ ≤ inf_{n≥k} ∫ f_n dµ,
because the integral is increasing. Letting k → ∞ and using the definition of lim inf and the properties of the integral above does the job.
Don't be fooled by the name: this is a powerful result in its own right, and the inequality can be strict (for f_n = 1_{[n,n+1]} on (R, λ), the left-hand side is 0 while the right-hand side is 1). It is still called a lemma because it is used in the proof of the dominated convergence theorem we will soon see.
Theorem 2.7 (Monotone convergence theorem). For any increasing sequence of positive measurable functions (f_n) with lim_{n→∞} f_n = f, we have that
lim_{n→∞} ∫ f_n dµ = ∫ f dµ.
Proof. Since (f_n) is increasing, the sequence of integrals is as well, and thus the limit r = lim_{n→∞} ∫ f_n dµ exists in [0, +∞]. This shows r ≤ ∫ f dµ. We just need to show the other inequality now. To do this, fix c in [0, 1) and g a simple function such that g ≤ f. Then, according to the monotonicity of the integral,
∫ f_n dµ ≥ ∫ 1_{f_n ≥ cg} f_n dµ ≥ c ∫ 1_{f_n ≥ cg} g dµ.
Writing out g gives us
∫ 1_{f_n ≥ cg} g dµ = Σ_{x ∈ g(Ω)} µ({g = x} ∩ {f_n ≥ cx}) x.
Using the monotone property of measures, we get as n → ∞ that
Σ_{x ∈ g(Ω)} µ({g = x} ∩ {f_n ≥ cx}) x → Σ_{x ∈ g(Ω)} µ({g = x}) x = ∫ g dµ.
Thus r ≥ c ∫ g dµ, and since both c ∈ [0, 1) and g are arbitrary, we are done.
A very handy theorem; nothing of the sort is true in the case of Riemann integration. Finally, we get to maybe the most powerful theorem in the theory, the celebrated dominated convergence theorem.
Theorem 2.8 (Dominated Convergence Theorem). Let (f_n) be a sequence of real-valued measurable functions. If we have that:
1. f_n converges almost everywhere to a function f;
2. there exists h ∈ L¹(µ) such that |f_n(ω)| ≤ h(ω), for all n;
then
lim_{n→∞} ∫ f_n dµ = ∫ f dµ.
Proof. Let (f_n)_{n≥1} be a convergent sequence of measurable functions dominated by an integrable function h. The measurable functions h − f_n (resp. h + f_n) are thus positive and tend to h − f (resp. h + f) as n → ∞. We can thus apply Fatou's lemma to get
∫ (h ± f) dµ ≤ lim inf_{n→∞} ∫ (h ± f_n) dµ.
Since h is integrable, we can subtract it from both sides. The '+' case gives
lim inf_{n→∞} ∫ f_n dµ ≥ ∫ f dµ,
while the '−' case gives
lim sup_{n→∞} ∫ f_n dµ ≤ ∫ f dµ,
which together give the desired result.
There exists a generalized version of this theorem where each f_n is bounded by a specific g_n with extra conditions, but we will not need it here.
2.3 Multiple Integrals and Derivatives
In this subsection, we propose to put the theorems we've developed in the preceding subsection to good use. Let U be an open subset of a metric space E, and Ω be a measure space. We can use the dominated convergence theorem to analyze the relationship between integrals, continuity and differentiability. More specifically, we are interested in functions of the form
F(x) = ∫_Ω f(x, ω) dµ(ω),
where f(x, ω) is a function from U × Ω to R.
Theorem 2.9 (Continuity under the integral sign). Suppose that:
1. For all x ∈ U, the function ω ↦ f(x, ω) is measurable;
2. For almost all ω ∈ Ω, the function x ↦ f(x, ω) is continuous at x₀ ∈ U;
3. There exists h ∈ L¹(µ) such that |f(x, ω)| ≤ h(ω) for all x, almost surely.
Then F(x) as defined above is well-defined and continuous at x₀.
Proof. We get the existence from property 3. To show continuity, it's enough to show that F(x_n) → F(x₀) for any sequence (x_n) in U such that x_n → x₀. Let (x_n) be such a sequence. Then, from property 2, we have that f(x_n, ω) → f(x₀, ω) almost everywhere. From 3 and the DCT, we get F(x_n) → F(x₀).
Theorem 2.10 (Differentiation under the integral sign). Here we assume that U is an open interval in R. Suppose that:
1. For all x ∈ U, the function ω ↦ f(x, ω) is integrable;
2. For almost all ω ∈ Ω, and for all x ∈ U, the partial derivative ∂f/∂x (x, ω) exists and verifies
|∂f/∂x (x, ω)| ≤ h(ω),
where h ∈ L¹(µ).
Then the function F defined above is well-defined and differentiable. Furthermore, we have that
F′(x) = ∫_Ω ∂f/∂x (x, ω) dµ(ω).
Proof. Let (x_n) be a sequence in U converging to x ∈ U, with x_n ≠ x. Then
g_n(ω) = (f(x_n, ω) − f(x, ω)) / (x_n − x)
converges almost everywhere to ∂f/∂x (x, ω). Applying the mean value theorem, we get
|g_n(ω)| ≤ sup_{0≤θ≤1} |∂f/∂x (θx + (1 − θ)x_n, ω)| ≤ h(ω).
We now use the DCT to conclude that
lim_{n→∞} (F(x_n) − F(x)) / (x_n − x) = lim_{n→∞} ∫_Ω g_n(ω) dµ(ω) = ∫_Ω ∂f/∂x (x, ω) dµ(ω).
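As a sanity check of Theorem 2.10, here is a numerical sketch (assuming numpy; the grid mean is only a stand-in for the integral over Ω = [0, 1] with Lebesgue measure, and the test function is our own choice). We take f(x, ω) = sin(xω), for which |∂f/∂x| = |ω cos(xω)| ≤ 1 =: h(ω) is integrable:

import numpy as np

w = np.linspace(0.0, 1.0, 200_001)       # grid on Omega = [0, 1]

def F(x):
    # F(x) ~ integral of f(x, .) = sin(x w) over [0, 1]
    return np.sin(x * w).mean()

x0, eps = 1.3, 1e-5
finite_diff = (F(x0 + eps) - F(x0 - eps)) / (2 * eps)
under_the_sign = (w * np.cos(x0 * w)).mean()    # integral of df/dx at x0
print(finite_diff, under_the_sign)              # both ~ 0.3078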
These two theorems help us determine when we can naively differentiate under the integral sign. We see here just why the dominated convergence theorem is useful: domination is a hypothesis in both theorems. We now turn to the issue of integration over product spaces. Recall that if (Ω₁, A₁, m₁) and (Ω₂, A₂, m₂) are two measure spaces, we can look at the measure space (Ω₁ × Ω₂, A₁ ⊗ A₂, m₁ ⊗ m₂). One particular question we want to answer is when it is true that integration over the product space is akin to integrating twice over the respective measure spaces. For this, we will need a certain hypothesis.
Definition 2.11. We say that a measure µ on Ω is σ-finite if there exists a countable family of measurable sets (A_i)_{i∈I} such that Ω = ⋃_{i∈I} A_i and µ(A_i) < ∞ for all i ∈ I.
Proposition 2.12 (Measurability of slices). Suppose f(x₁, x₂) is measurable with respect to the product σ-algebra A₁ ⊗ A₂. Then, for each fixed x₁, the slice function x₂ ↦ f(x₁, x₂) is measurable with respect to A₂. Likewise, for each fixed x₂, x₁ ↦ f(x₁, x₂) is measurable with respect to A₁.
Proof. Apply the usual machinery starting from indicator functions, then simple
functions and use the MCT to prove the property for positive functions.
We are now in a position to state the theorem that will allow us to inter-
change multiple integrals. This is the famous Fubini-Tonelli theorem.
Theorem 2.13 (Fubini). Let m₁, m₂ be two σ-finite measures. Then:
1. If f(x₁, x₂) is positive and measurable with respect to the product σ-algebra, then
∫_{Ω₁×Ω₂} f(x₁, x₂) d(m₁ ⊗ m₂) = ∫_{Ω₁} ( ∫_{Ω₂} f(x₁, x₂) dm₂(x₂) ) dm₁(x₁)
= ∫_{Ω₂} ( ∫_{Ω₁} f(x₁, x₂) dm₁(x₁) ) dm₂(x₂).
2. If f(x₁, x₂) isn't positive but is integrable with respect to m₁ ⊗ m₂, the same identities hold.
Proof. We mentioned earlier a few magic theorems that let us assume certain measures exist. In our case these were Carathéodory's extension theorem and the λ-π theorem. While we will not delve into those here, we just mention that the second one is needed for this proof. The idea is to consider the three members of the equality above and apply the usual machinery: first consider indicator functions, then use the λ-π theorem to extend to simple functions, and finally apply the MCT as usual.
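To see why some hypothesis (positivity or integrability) is genuinely needed in part 2, here is a sketch of the classical counterexample on N × N with counting measure: f(m, n) = 1 if m = n, −1 if m = n + 1, 0 otherwise. Each inner sum below ranges over the whole (two-point) support of its row or column, so the printed values are the exact iterated sums truncated at M:

M = 1_000

def f(m, n):
    # +1 on the diagonal, -1 just below it, 0 elsewhere
    return 1 if m == n else (-1 if m == n + 1 else 0)

row_sums = [sum(f(m, n) for n in range(m + 2)) for m in range(M)]
col_sums = [sum(f(m, n) for m in range(n + 2)) for n in range(M)]
print(sum(row_sums), sum(col_sums))   # 1 and 0: the iterated sums disagree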
2.4 Change of Variables
Right after defining measures, we talked about the measure P_X(B). In this section, we will construct tools that help us integrate with respect to such measures. First, we give the formal definition of measures of the form P_X.
Definition 2.14 (Image measure). Suppose f is a measurable function from (Ω, F, µ) to (Ω′, F′). The function µ_f : F′ → [0, ∞] that sends an element A′ of F′ to µ(f⁻¹(A′)) is a measure on (Ω′, F′) called the image measure of µ by f (also called the pushforward of µ). In the case of probability spaces, we call P_X the probability distribution of the random variable X.
Our main topic in this subsection will be to understand how to work with integration with respect to image measures. First, we discuss a somewhat easy result.
Proposition 2.15 (Transfer theorem). Let f : (Ω, F, µ) → (Ω′, F′) be a measurable function. Suppose ϕ : Ω′ → R is measurable with respect to the Borel algebra. Then ϕ ∈ L¹(µ_f) if and only if ϕ ∘ f ∈ L¹(µ), in which case
∫_{Ω′} ϕ dµ_f = ∫_Ω ϕ ∘ f dµ.
This also holds true without any integrability hypothesis if ϕ is positive.
Proof. Left as an exercise for the reader. The proof is the same as usual: start with indicator functions and build up to general functions.
This proposition is quite general and we would like to specialize to a case
where image measures have a very specific form.
Definition 2.16 (Measures with density). Let (Ω, F, µ) be a measure space. We will say a measure ν has a density g with respect to µ if
ν(A) = ∫_A g(ω) dµ(ω).
It is with respect to those measures that integration will be most interesting for our purposes. In particular, we will be interested in changing measures on Rⁿ. This is done via the following change of variables formula, which is quite handy. You might have seen it applied in vector calculus or differential geometry before.
Theorem 2.17 (Change of variables). Suppose f : Ω → Ω′ is a C¹-diffeomorphism between two open subsets of Rⁿ and let ρ : Ω → R₊ be a measurable function. Let µ be the measure of density ρ with respect to the usual (product) Lebesgue measure on Ω: dµ(x) = 1_Ω(x)ρ(x)dx. Then the image measure µ_f is
dµ_f(y) = 1_{Ω′}(y) ρ(f⁻¹(y)) |det Df⁻¹(y)| dy.
Thus, for any measurable function Φ : Ω′ → R that is either positive or in L¹(µ_f), the following holds:
∫_Ω Φ(f(x)) dµ(x) = ∫_{Ω′} Φ(y) ρ(f⁻¹(y)) |det Df⁻¹(y)| dy.
Proof. The proof can be found in any standard text on vector analysis.
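Here is a quick Monte Carlo sketch of the formula in one dimension (assuming numpy; the parameters are our own choice): with X uniform on (0, 1), so ρ = 1, and f(x) = x², the theorem predicts that Y = f(X) has density 1/(2√y) on (0, 1), which a histogram confirms.

import numpy as np

rng = np.random.default_rng(0)
y = rng.random(1_000_000) ** 2          # Y = f(X), X uniform on (0, 1)

hist, edges = np.histogram(y, bins=20, range=(0.0, 1.0), density=True)
mids = (edges[:-1] + edges[1:]) / 2
print(np.c_[mids, hist, 1 / (2 * np.sqrt(mids))][:5])  # empirical vs predicted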
This summarizes the tools we will need to discuss probability properly. Be-
fore that however, we need to discuss spaces of functions and exactly what kind
of convergence there can be between sequences of functions and their limits.
3 Function Spaces
3.1 Introduction
In basic real analysis, one learns early on about pointwise convergence. Usually, this is in the form of ε-δ proofs, where we construct a δ that depends on both x and ε. Then, later on, one realizes pointwise convergence is not strong enough to interchange limits between sequences and other objects, and we require a stronger notion of convergence, namely that of uniform convergence. More specifically, this time we only allow δ to depend on ε and not at all on the x at hand. In a sense, the function itself converges to the limit, and not just each point at its own speed. In this section, we discuss the various relationships between convergence of functions and spaces of functions. We won't prove results in this chapter as they are all very standard results found in any text on functional analysis, and aren't the meat of probability theory.
3.2 L^p spaces
In the integration section, we first defined the space of measurable functions, on which the integral is defined. Then we looked (for a measure µ) at the space L¹. In this subsection, we want to expand this in two directions. First, we want to discuss whether there's a meaning to the space L^p, for a general p ≥ 1. After that, we want to see if those spaces are complete, that is to say, whether Cauchy sequences always converge in the space itself.
Definition 3.1. Let p ∈ [1, +∞). We write L^p for the space of p-integrable functions, that is to say, those measurable functions f with
‖f‖_p = ( ∫ |f|^p dµ )^{1/p} < +∞.
Notice that ‖·‖_p is a norm on that space and thus L^p is a normed vector space.
Recall from functional analysis (or topology if you're Swiss) that every normed vector space has a Banach space completion. Effectively, this means there exists a complete space based on L^p. Its existence is a standard exercise in functional analysis.
Theorem 3.2 (Completion). Let L^p denote the space of equivalence classes of functions in the space of p-integrable functions above, where f ∼ g if and only if f = g almost everywhere. Then L^p is the completion in question: it is a Banach space. Furthermore, if p = 2, it is actually a Hilbert space with inner product defined as
⟨f|g⟩₂ = ∫ f g dµ.
This theorem relies on a few fundamental inequalities that we state next.
Theorem 3.3. We have that:
1. For f, g ∈ L^p: ‖f + g‖_p ≤ ‖f‖_p + ‖g‖_p (Minkowski's Inequality);
2. For f ∈ L^p and g ∈ L^q with 1/p + 1/q = 1: ‖fg‖₁ ≤ ‖f‖_p ‖g‖_q (Hölder's Inequality).
The case p = q = 2 of the second inequality is very important in its own right and is called the Cauchy-Schwarz inequality. For now, we avoid talking about the case p = 1, where q = ∞.
These new spaces give us a new way to talk about convergence. There is a subtle question of what it means exactly to integrate an equivalence class of functions, but do notice that for any two representatives of a class [f], say f and f′, the integrals are equal. Indeed, this is exactly what we wanted and what we defined it to be. This is true because
∫ |f − f′| dµ = ∫_{N ∪ N^c} |f − f′| dµ = ∫_{N^c} |f − f′| dµ = 0,
where N is a set of measure zero outside of which f = f′, and thus integrating over it gives us zero. Notice that here we implicitly assume the convention 0 · ∞ = 0 in this context. This is a convention set to make sure everything holds. Going forward, we will just assume that by f ∈ L^p, we mean a representative of the equivalence class of f. With that out of the way, let us define convergence in our new norm.
Definition 3.4. We say that f_n converges to f in L^p, sometimes denoted by f_n →^{L^p} f, if
lim_{n→∞} ‖f_n − f‖_p = 0.
This gives us a new way to define convergence. In the next section, we finally
introduce probability and we will look at even more modes of convergence.
4 Probability Theory
4.1 Introduction
In this section, we begin our study of probability by considering a probability
space (Ω, F, P) and real-valued random variables X, Y . Effectively, this means
that P(Ω) = 1, and X, Y are measurable with respect to F.
4.2 Expectation and Variance
In elementary probability theory, more specifically when we consider discrete random variables, we have the notion of expectation. Usually, this is defined as
Σ_{i=1}^n P(X = x_i) x_i = Σ_{i=1}^n p_i x_i.
Using measure-theoretic tools, we will generalize this by treating this sum as a special case of a more general construct.
Definition 4.1. Let (Ω, F, P) be a probability space and X a real-valued random variable on it. Then the expectation of X is defined as
E[X] = ∫_Ω X(ω) dP(ω).
This definition is rather abstract, and it is not so clear what the usual defi-
nition of expectation has to do with it. To relate it to our usual definition, we
need to work with random variables that have a density. That is to say, those
for which integrating against the probability measure gives us something that
allows us to transfer the integral over to R, which lets us apply calculus to it.
More specifically:
Definition 4.2 (Probability density). Let X be a real-valued random variable on Ω. Let S ⊆ R be a Borel set. We say f_X(x) is a probability density for X if
P(X ∈ S) = P_X(S) = ∫_R 1_S f_X(x) dm(x) = ∫_S f_X(x) dm(x).
Such a function is necessarily positive a.e. and has integral equal to 1.
This is closely related but not the same as the following:
Definition 4.3 (CDF). Let X be a real-valued random variable on Ω. The cumulative distribution function of X is defined as F(t) = P(X ≤ t), for t ∈ R. If X happens to have a density function f_X, then P(X ≤ t) = ∫_{−∞}^t f_X(x) dm(x). Furthermore, if F is differentiable at t, then F′(t) = f_X(t).
We now apply the transfer theorem to understand how to integrate random
variables with density.
Theorem 4.4 (Transferring probability measures). Suppose X has a density f(x). This means that P_X(A) = ∫ 1_A(x) f(x) dm(x), where m is the Lebesgue measure. Then for any measurable function Φ : R → R₊, we have that
E[Φ(X)] = ∫_Ω Φ ∘ X dP = ∫_R Φ(x) dP_X(x) = ∫_R Φ(x) f(x) dm(x).
In particular, we have that
E[X] = ∫_R x f(x) dm(x).
Proof. Apply the transfer theorem to X.
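For instance (a numerical sketch, assuming numpy; the distribution and the choice Φ(x) = x² are our own), for X ∼ Exp(2) with density f(x) = 2e^{−2x}, the two sides of the theorem can be compared directly:

import numpy as np

rng = np.random.default_rng(0)
X = rng.exponential(scale=0.5, size=1_000_000)   # Exp(2): scale = 1/lambda

Phi = lambda x: x**2
monte_carlo = Phi(X).mean()                      # E[Phi(X)] over Omega

x = np.linspace(0.0, 20.0, 2_000_001)            # tail beyond 20 is negligible
against_density = np.mean(Phi(x) * 2 * np.exp(-2 * x)) * 20.0  # Riemann sum
print(monte_carlo, against_density)              # both ~ E[X^2] = 0.5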
Next we define the last two important concepts from probability theory that we need to prove interesting theorems. The first one is fundamental; the second one we won't use much, but we mention it for completeness' sake.
Definition 4.5 (Variance). Let X be a real-valued random variable. Then the variance of X, denoted V(X), is defined as E[(X − E[X])²]. A quick calculation shows that V(X) = E[X²] − E[X]².
The final concept we wish to define in this subsection is that of characteristic
functions (not to be confused with indicator functions). Those functions arise
from applying Fourier analysis to probability theory.
Proposition 4.6. Let X be a real-valued random variable on Ω. The characteristic function of X is defined as Φ_X(t) = E[e^{itX}]. The characteristic function of X completely determines the distribution of X. In particular, if X and Y are two random variables, then
Φ_X = Φ_Y ⟺ F_X = F_Y.
Proof. We don't concern ourselves with the proof here, as the characteristic function won't be used anywhere else; we just mention it in passing.
4.3 A Few Inequalities
In this subsection, we take a look at a few useful inequalities that will be nec-
essary to prove some important theorems down the line. They are important
in their own right however, which is why we discuss them now. Here, X always
means a real-valued random variable.
Theorem 4.7 (Jensen's Inequality). Let ϕ : R → R be a convex function. If X and ϕ(X) are both integrable, then
ϕ(E[X]) ≤ E[ϕ(X)].
Proof. From convexity, we know that ϕ has at any point x ∈ R a left-derivative ϕ′_g(x) and that
ϕ(y) − ϕ(x) ≥ ϕ′_g(x)(y − x)
for any y ∈ R. Thus, ϕ(X) − ϕ(E[X]) ≥ ϕ′_g(E[X])(X − E[X]). The result is obtained after taking the expectation of this inequality.
Theorem 4.8 (Markov's Inequality). Let X be a real-valued random variable. Then, for all t > 0, we have
P(X ≥ t) ≤ E[|X|] / t.
Proof. It's enough to take the expectation of the pointwise inequality
t 1_{X≥t} ≤ X 1_{X≥t} ≤ |X|.
From this inequality, we get the following:
Theorem 4.9 (Bienaymé-Chebyshev's Inequality). If X² is integrable, then
P(|X − E[X]| ≥ t) ≤ V(X) / t².
Proof. For a positive random variable Y, since {Y ≥ t} = {Y^p ≥ t^p}, we get from Markov's inequality that
P(Y ≥ t) ≤ E[Y^p] / t^p.
Take Y = |X − E[X]| and p = 2, and we get our result by applying the above to Y.
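A quick empirical sketch (assuming numpy; the choice X ∼ N(0, 1), so E[X] = 0 and V(X) = 1, is ours) shows how conservative the bound is: the true tails of a standard normal sit well below V(X)/t² = 1/t².

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal(1_000_000)    # E[X] = 0, V(X) = 1

for t in [1.0, 2.0, 3.0]:
    print(t, (np.abs(X) >= t).mean(), 1 / t**2)   # empirical tail vs bound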
4.4 Independence of Random Variables
In elementary probability theory, we have seen the notion of two events being independent. Usually, this is stated as
P(A ∩ B) = P(A)P(B).
We will generalize this relation to σ-algebras first, and then to random variables themselves, thus answering the question: "What does it mean for two random variables to be independent?"
Definition 4.10 (Independence of σ-algebras). The events {A_i}_{i∈I} are said to be mutually independent if for any finite family of indices {i₁, ..., i_n} in I, we have that
P(A_{i₁} ∩ A_{i₂} ∩ ··· ∩ A_{i_n}) = Π_{k=1}^n P(A_{i_k}).
Similarly, if {F_i}_{i∈I} is a family of σ-algebras on Ω, we say that they are mutually independent if for any finite family of indices {i₁, ..., i_n} in I, and for any choice of A_{i_k} ∈ F_{i_k}, we have that
P(A_{i₁} ∩ A_{i₂} ∩ ··· ∩ A_{i_n}) = Π_{k=1}^n P(A_{i_k}).
This is consistent with what we’ve seen in elementary probability. Of interest
to us is to generalize this notion to random variables themselves.
Definition 4.11 (Independence of random variables). Let {X_i}_{i∈I} be a family of random variables. We say that they are mutually independent, or more often simply independent, if and only if the generated σ-algebras σ(X_i) are independent.
While this is a perfectly fine definition on its own, it's sometimes bothersome to check. Luckily, there's a nice proposition that summarizes equivalences between various ideas of independence for random variables. First, we need to discuss random vectors. Indeed, so far we've only talked about random variables from Ω to R, but it makes perfect sense to think of random variables from Ω to Rⁿ for some n. Those random variables will be called random vectors. They also have a probability distribution and a cumulative distribution function. Usually, we write F(t₁, ..., t_n) = P(X₁ ≤ t₁, ..., X_n ≤ t_n), where (t₁, ..., t_n) is a vector in Rⁿ. Again, some of these will have a density function. Henceforth, we will call it the joint distribution density. To contrast, the density function of a single random variable will be called the marginal distribution density. With that out of the way, here is the proposition.
Proposition 4.12. The following are all equivalent:
1. The variables {X_i} are mutually independent;
2. For any finite family of indices {i₁, ..., i_n}, the joint distribution of the random vector (X_{i₁}, ..., X_{i_n}) is equal to the product of the distributions of each random variable, i.e. P_{X_{i₁},...,X_{i_n}} = Π_{k=1}^n P_{X_{i_k}};
3. For any finite family of indices, and for any choice of measurable bounded functions f_{i_k} : R → R, we have E[f_{i₁}(X_{i₁}) ··· f_{i_n}(X_{i_n})] = Π_{k=1}^n E[f_{i_k}(X_{i_k})].
Proof. It is straightforward that 2 ⟹ 3 ⟹ 1. We deduce 2 from 1 as in the proof of Fubini's theorem: first for (f_{i_k}) indicator functions of measurable (Borel) sets, then for simple functions, and finally we apply the DCT.
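Criterion 3 is easy to probe by simulation (a sketch, assuming numpy, with the bounded test function cos; the distributions are our own choice): for independent X, Y the expectation factorizes, while for the dependent pair (X, X) it visibly does not.

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal(1_000_000)
Y = rng.standard_normal(1_000_000)    # independent of X

# Independent pair: E[cos(X)cos(Y)] ~ E[cos(X)] E[cos(Y)] (both ~ 0.368)
print(np.mean(np.cos(X) * np.cos(Y)), np.cos(X).mean() * np.cos(Y).mean())
# Dependent pair (Y = X): ~ 0.568 vs ~ 0.368, so no factorization
print(np.mean(np.cos(X) * np.cos(X)), np.cos(X).mean() ** 2)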
With this out of the way, we can now begin discussing the main modes of
convergence that exist in probability theory.
5 Modes of Convergence
5.1 Introduction
We come now to the final section. In this section, we set out to do two things.
The first one is to investigate what kind of convergence exists between sequences
of random variables and their limits. The second one is to present two important
theorems in probability and statistics: the law of large numbers and the central
limit theorem.
5.2 Various notions of convergence
So far, we've encountered four different kinds of convergence. We will not be concerned too much with uniform convergence and pointwise convergence of random variables here. Instead, we will concern ourselves with convergence almost everywhere (which we will call convergence almost surely from now on), convergence in L^p, and two new modes of convergence. We assume all our random variables are real-valued.
Definition 5.1. Let (X_n) be a sequence of random variables on (Ω, F, P). We say that the sequence converges to X
1. almost surely, if
P({ω : lim_{n→∞} X_n(ω) = X(ω)}) = 1;
2. in L^p, for p ≥ 1, if X_n and X are in L^p and
lim_{n→∞} ‖X_n − X‖_p = 0;
3. in probability, if for all ε > 0, we have
lim_{n→∞} P(|X_n − X| ≥ ε) = 0;
4. in distribution, or weakly, if
lim_{n→∞} F_n(x) = F(x)
at every continuity point x of F, where F_n and F are the cumulative distribution functions of X_n and X respectively.
We denote convergence in probability by X_n →^P X, and convergence in distribution by X_n →^{distr.} X. Convergence in distribution is the same as saying that the characteristic functions of the X_n converge pointwise to the characteristic function of X.
Notice that this definition of convergence almost surely is precisely the same as converging except possibly on a set of measure zero. Convergence in probability is weaker, however, as the following shows.
Proposition 5.2. We have that X_n → X a.s. if and only if for all ε > 0,
lim_{m→∞} P( sup_{n≥m} |X_n − X| ≥ ε ) = 0.
In particular, X_n →^{a.s.} X implies X_n →^P X.
Proof. By definition of the convergence of a sequence in R, we get
{X_n → X}^c = ⋃_{k≥1} ⋂_{m≥1} ⋃_{n≥m} {|X_n − X| > 1/k} = ⋃_{k≥1} ⋂_{m≥1} { sup_{n≥m} |X_n − X| ≥ 1/k }.
Thus, X_n → X a.s. if and only if, for all k ≥ 1,
P( ⋂_{m≥1} { sup_{n≥m} |X_n − X| ≥ 1/k } ) = 0.
Using the properties of measures discussed before, we know this to be equivalent to: for all k ≥ 1,
lim_{m→∞} P( sup_{n≥m} |X_n − X| ≥ 1/k ) = 0.
But now we are done, since for any ε > 0 we can find a k ≥ 1 such that ε > 1/k, and this is equivalent to the result we were trying to prove.
This is similar to the difference between uniform and pointwise convergence of real functions, where f_n → f uniformly if and only if lim_{n→∞} sup_{x∈R} |f_n(x) − f(x)| = 0. So we have now found one implication for our various modes of convergence. One could ask whether convergence in probability implies convergence almost surely. The answer in general is no, but there is a certain relationship nonetheless.
Proposition 5.3. The sequence (X_n) converges to X in probability if and only if from any increasing sequence of natural numbers (n_k), there exists a subsequence (n_{k_j}) such that X_{n_{k_j}} →^{a.s.} X.
Proof. The proof relies on a few other results that we don't feel are necessary in the presentation of this material, so the proof is excluded.
So while (X_n) itself might not converge almost surely to X, one of its subsequences does. Next we cover the case of convergence in L^p.
Proposition 5.4. Let p ≥ q ≥ 1. If X_n →^{L^p} X, then X_n →^{L^q} X. In particular, L² convergence implies L¹ convergence, and L¹ convergence implies convergence in probability.
Proof. Let p ≥ q > 0 and let α = p/q ≥ 1. From Jensen's inequality applied to the convex function y ↦ y^α, we get
E[|Y|^p] = E[(|Y|^q)^α] ≥ E[|Y|^q]^α.
This shows L^p ⊆ L^q and thus convergence in L^p implies convergence in L^q. That L¹ convergence implies convergence in probability follows from Markov's inequality; more abstractly, convergence in probability is equivalent to convergence in the space L⁰, which we have not defined here.
We now want to see if convergence in probability is equivalent to convergence in L¹. Again, this isn't exactly true, but it holds if we impose an additional requirement on our sequence.
Proposition 5.5 (Uniform integrability). A sequence of random variables (X_n) is said to be uniformly integrable if
lim_{c→∞} sup_{n∈N} E[ |X_n| 1_{|X_n|≥c} ] = 0.
Furthermore, if (X_n) is uniformly integrable, then it is bounded in L¹. Conversely, if (X_n) is dominated by some Y ∈ L¹, or if (X_n) is bounded in L^p for some p > 1, then (X_n) is uniformly integrable.
Proof. We prove both parts separately.
1. We have that sup_n E[|X_n|] ≤ sup_n E[|X_n| 1_{|X_n|≤c}] + sup_n E[|X_n| 1_{|X_n|≥c}]. The first term is bounded by c, while the second is bounded for c large enough because it tends to 0.
2. Suppose |X_n| ≤ Y. Then
E[ |X_n| 1_{|X_n|≥c} ] ≤ E[ Y 1_{Y≥√c} ] + E[ Y 1_{Y<√c} 1_{|X_n|≥c} ]
≤ E[ Y 1_{Y≥√c} ] + √c P(|X_n| ≥ c)
≤ E[ Y 1_{Y≥√c} ] + (√c / c) E[|X_n|]
using Markov's inequality. The first term tends to 0 when c → ∞ via the MCT; the second one is bounded by E[Y]/√c. Suppose now that (X_n) is bounded in L^p for p > 1. We get
E[ |X_n| 1_{|X_n|≥c} ] ≤ ‖X_n‖_p P(|X_n| ≥ c)^{1/q} ≤ ‖X_n‖_p ( E[|X_n|^p] / c^p )^{1/q}
by successively applying Hölder's and Markov's inequalities; since (X_n) is bounded in L^p, this bound tends to 0 as c → ∞, uniformly in n.
With this, we can state the theorem that relates L¹ convergence and convergence in probability.
Theorem 5.6. Let (X_n) be a uniformly integrable sequence of random variables. If (X_n) converges to X in probability, then (X_n) converges to X in L¹.
Proof. First, we need to settle the matter of integrability of X. From the characterization of convergence in probability, we know there exists a subsequence (X_{n_k}) that converges to X almost surely. Using Fatou's lemma combined with the proposition above (uniform integrability implies boundedness in L¹), we deduce
E[|X|] ≤ lim inf_{k→∞} E[|X_{n_k}|] < ∞,
which gives us integrability.
Let Y_n = |X_n − X|. Since (X_n) is uniformly integrable and X is integrable, (Y_n) is uniformly integrable (why?). Thus, for any ε > 0,
E[Y_n] = E[Y_n 1_{Y_n≥ε}] + E[Y_n 1_{Y_n<ε}] ≤ E[Y_n 1_{Y_n≥ε}] + ε.
Choose c > ε such that sup_n E[Y_n 1_{Y_n≥c}] ≤ ε. Then
E[Y_n 1_{Y_n≥ε}] ≤ E[Y_n 1_{Y_n≥c}] + E[Y_n 1_{c>Y_n≥ε}] ≤ ε + c P(Y_n ≥ ε).
Thus,
lim sup_{n→∞} E[Y_n] ≤ 2ε,
since Y_n → 0 in probability. Since ε was arbitrary, we get the desired result.
Lastly, we discuss the final implication we can get. We invite the reader to
try and find counterexamples to the other implications.
Theorem 5.7. If (X_n) converges to X in probability, then (X_n) converges to X in distribution. Convergence in distribution is thus the weakest form of convergence we have defined.
Proof. The proof of this result relies on a technical lemma we do not need in the primary presentation of the material, so we leave it out.
Theorem 5.8. The implications we've discussed are the only ones that exist between our various modes of convergence. For q ≥ p ≥ 1, we have that
(X_n) →^{a.s.} X ⟹ (X_n) →^P X ⟹ (X_n) →^{weakly} X,
(X_n) →^{L^q} X ⟹ (X_n) →^{L^p} X ⟹ (X_n) →^P X,
(X_n) →^{P + U.I.} X ⟺ (X_n) →^{L¹} X.
Proof. Left to the reader, as it gives good intuition into how modes of convergence work. The reader can consult Wikipedia or more academic sources for standard counterexamples.
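For one standard counterexample, here is a sketch of the "typewriter sequence" on Ω = [0, 1] with Lebesgue measure (the code and its names are ours): X_n is the indicator of a dyadic interval of length 2^{−k} that keeps sweeping across [0, 1], so P(X_n ≠ 0) → 0 (convergence in probability and in every L^p), yet every single ω is hit infinitely often, so X_n does not converge almost surely.

def interval(n):
    # n = 2**k + j  ->  the dyadic interval [j / 2**k, (j + 1) / 2**k)
    k = n.bit_length() - 1
    j = n - 2**k
    return j / 2**k, (j + 1) / 2**k

omega = 0.7                          # any fixed sample point
hits = [n for n in range(1, 4096) if interval(n)[0] <= omega < interval(n)[1]]
print(hits)                          # one hit per dyadic generation: X_n(omega) = 1 infinitely often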
5.3 The Law of Large Numbers
We have developed much machinery to discuss probability. We are now able to
define precisely what the word probability itself means, and how to work with
its most basic objects. But how do we reconcile this with the more naive ideas
of elementary probability? For instance, when we flip a fair coin, we know the
probability of getting heads is exactly 1/2. But clearly, this doesn’t mean that
every time I flip a coin ten times in a row, I will get exactly 5 heads. How are
we sure that our intuitive idea of probability is grounded in mathematics? This
comes in the form of laws of large numbers. Briefly, the theorem tells us that the
more we flip the coin, the closer the average value will get to the expectation.
This is how casinos make money. Even if they get lucky winners from time
to time, the law of large numbers provides a mathematical background that
assures the casino that in the long run, it will make money. We only prove the
weak law of large numbers and leave the proofs of the other two results to the
mathematical literature.
Definition 5.9. Let (X_n) be a sequence of random variables. The sample average up to n is the random variable
X̄_n = (1/n) Σ_{k=1}^n X_k.
It will be of interest to us to consider sequences of random variables that are both independent and identically distributed (i.i.d.). This means that F_{X_i} = F_{X_j} for all i, j in the index set. Effectively, the joint distribution is then the product of the marginal distributions, which are all equal. Furthermore, this implies that E[X_i] = E[X₁] for all i. We will usually denote this common expectation by µ. Our theorem comes in two flavors, which we can now appreciate because of our work in the last section. As we have seen, convergence almost surely necessarily implies convergence in probability. This is the distinction between the two following theorems.
Theorem 5.10 (Weak Law of Large Numbers). Let (X_n) be a sequence of independent, identically distributed, integrable random variables. Then the sample average converges in probability to the expectation. Symbolically:
X̄_n →^P µ.
Proof. We assume in addition that the variance of the X_i is finite and equal to σ². The variance of X̄_n is equal to
V( (1/n) Σ_{i=1}^n X_i ) = (1/n²) V( Σ_{i=1}^n X_i ) = nσ²/n² = σ²/n,
where the middle equality comes from the independence of the random variables. Likewise, E[X̄_n] = µ. Using Bienaymé-Chebyshev's inequality, we get
P( |X̄_n − µ| ≥ ε ) ≤ σ² / (nε²).
From this, we get
P( |X̄_n − µ| < ε ) = 1 − P( |X̄_n − µ| ≥ ε ) ≥ 1 − σ² / (nε²).
Letting n → ∞, we get X̄_n →^P µ, which proves the result.
Theorem 5.11 (Strong Law of Large Numbers). If (X_n) is as above, then the sample average actually converges almost surely to the expectation. Symbolically:
X̄_n →^{a.s.} µ.
These tell us that in the long run, our usual interpretation of expectation as
the expected value is correct. Likewise, we have a law of large numbers for our
usual interpretation of probability itself.
Theorem 5.12 (Borel's Law of Large Numbers). Suppose we do repeated independent trials of a probabilistic experiment. Let E be an event and p = P(E) its probability. We let N_n(E) denote the number of times E occurs in the first n trials. Then:
N_n(E)/n →^{a.s.} p as n → ∞.
This is why we can expect to have approximately 50% heads and 50% tails in the long run after flipping fair coins for a long time. You can test this empirically by running a simulation.
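Such a simulation is a few lines (a sketch, assuming numpy): track the relative frequency of heads among the first n flips of a fair coin as n grows.

import numpy as np

rng = np.random.default_rng(0)
flips = rng.integers(0, 2, size=1_000_000)   # 1 = heads, fair coin

for n in [10, 100, 10_000, 1_000_000]:
    print(n, flips[:n].mean())               # N_n(E)/n drifts towards p = 0.5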
5.4 The Central Limit Theorem
In our last subsection, we discuss an important theorem that has applications in statistics. We are again interested in understanding the asymptotic behavior of the sample average. We assume that all our random variables are independent and identically distributed (i.i.d.). Recall from elementary probability that the density of the normal distribution N(µ, σ²) is of the form
(1/(σ√(2π))) e^{−(1/2)((x−µ)/σ)²}.
We know from the previous section that the sample average converges in probability and almost surely to the expectation. We are interested in understanding exactly how that happens. More specifically, we'd like to understand how the distribution itself changes as n tends to infinity. This is given by the following classical theorem.
Theorem 5.13 (Classical Central Limit Theorem). Suppose E[X_i] = µ and V(X_i) = σ² < ∞. Then, as n approaches infinity, the random variables √n (X̄_n − µ) converge in distribution to a normal N(0, σ²). Symbolically:
√n (X̄_n − µ) →^{distr.} N(0, σ²).
What does this imply for statistics? It explains why many density estimates have a bell-curve shape: the shape comes from the normal distribution itself. If we apply this to the coin-flipping example from the last section, we find that flipping many coins gives an approximately normal distribution for the number of heads (or tails, for that matter).
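This too can be visualized with a short simulation (a sketch, assuming numpy; we pick Exp(1) variables, so µ = σ² = 1, as our own example): the standardized sample averages √n(X̄_n − µ) behave like a standard normal.

import numpy as np

rng = np.random.default_rng(0)
n, reps = 1_000, 100_000
Z = np.sqrt(n) * (rng.exponential(size=(reps, n)).mean(axis=1) - 1.0)

print(Z.mean(), Z.std())              # ~ 0 and ~ 1
print(np.mean(np.abs(Z) <= 1.96))     # ~ 0.95, as for N(0, 1)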
Appendices
A Common Distributions
In this appendix, we list a few common probability distributions for the reader. For the reader's sake, we also rewrite the change of variables formula in probabilistic terms, so that they may use it to understand how products or sums modify the distributions of the random variables involved.
Theorem A.1 (Change of variables: probabilistic version). Suppose X is an Rⁿ-valued random vector with density f_X. Then, if φ is a C¹-diffeomorphism, the random variable Y = φ(X) has density
g(y) = f_X(φ⁻¹(y)) |det Dφ⁻¹(y)|.
A.1 Discrete Random Variables
Example A.2 (Binomial distribution). Consider the following probabilistic experiment. You perform n independent yes-no experiments, where yes has probability p and no has probability q = 1 − p. The binomial distribution is the probability distribution of a random variable X that counts the number of successes in the sequence. We write X ∼ B(n, p) and we note that
P(X = k) = (n choose k) p^k (1 − p)^{n−k}.
If n = 1, we call this the Bernoulli distribution.
Example A.3 (Poisson distribution). We say that X follows a Poisson distri-
bution with parameter λ > 0 if
P(X = k) = λ^k e^{−λ} / k!.
This distribution is usually used to compute the probability of a given number
of events occurring in a fixed interval of time, provided these events occur at a
constant mean rate and independently of the time since the last event.
A.2 Continuous Random Variables
Example A.4 (Uniform distribution). The uniform distribution on [a, b] occurs when X has probability density function
f_X(x) = (1/(b − a)) 1_{[a,b]}(x).
Example A.5 (Beta distribution). If X follows a Beta distribution (usually written X ∼ Beta(α, β)), then
f_X(x) = (1/B(α, β)) x^{α−1} (1 − x)^{β−1} for x ∈ (0, 1),
where B(α, β) is a normalization constant equal to
B(α, β) = ∫₀¹ t^{α−1} (1 − t)^{β−1} dt.
Example A.6 (Gamma distribution). We say that X follows a Gamma distribution (X ∼ Gamma(α, β)) if its density function is of the form
f_X(x) = (β^α / Γ(α)) x^{α−1} e^{−βx} 1_{x≥0},
where Γ(α) is the Gamma function applied to α.
Example A.7 (Exponential distribution). We say that X follows an exponential distribution (X ∼ Exp(λ)) if
f_X(x) = λ e^{−λx} 1_{x≥0}.
Furthermore, we have that n · Beta(1, n) → Exp(1) in distribution as n → ∞.
Example A.8 (Normal distribution). As we've seen before, a random variable has a normal distribution, X ∼ N(µ, σ²), if
f_X(x) = (1/(σ√(2π))) e^{−(1/2)((x−µ)/σ)²}.
Example A.9 (Cauchy distribution). This distribution usually has parameters, but we will stick with the simplest version of it. We say X follows a Cauchy distribution if
f_X(x) = (1/π) · 1/(1 + x²).
Try to compute the expectation and variance of this distribution.