Convergence of Random Variables

Convergence of random variables (sometimes called stochastic convergence) describes how a sequence of random quantities settles toward a particular value. A sequence of numbers (which could represent cars or anything else) can converge, in the mathematical sense, on a single, specific number. Certain processes, distributions, and events can result in convergence, which basically means the values will get closer and closer together. When random variables converge on a single number, they may never settle exactly on that number, but they come very, very close. In notation, xn → x tells us that a sequence of random variables (xn) converges to the value x.

In notation, that’s:

|xn − x| → 0 as n → ∞.

What happens to these variables as they converge can’t be crunched into a single definition.

Convergence of Random Variables can be broken down into many types. The ones you’ll most often come across are:

  1. Convergence in probability,
  2. Convergence in distribution,
  3. Almost sure convergence,
  4. Convergence in mean.

Each of these definitions is quite different from the others. However, for an infinite series of independent random variables, convergence in probability, convergence in distribution, and almost sure convergence are equivalent.

Convergence in probability

If you toss a coin n times, you would expect heads around 50% of the time. However, let’s say you toss the coin 10 times. You might get 7 tails and 3 heads (30% heads), 2 tails and 8 heads (80% heads), or a wide variety of other possible combinations. Eventually though, if you toss the coin enough times (say, 1,000), you’ll probably end up with close to 50% heads. In other words, the percentage of heads will converge to the expected probability.

More formally, convergence in probability can be stated as the following formula:

lim(n→∞) P(|Xn − c| > ε) = 0 for every ε > 0,

where P = probability; Xn = the proportion of observed successes (e.g. heads) in n trials (e.g. tosses of the coin); lim(n→∞) = the limit as the number of trials n goes to infinity; c = the constant that the sequence of random variables converges to in probability (here, the expected proportion of heads); and ε = a positive number representing the maximum allowed distance between the expected value and the observed value.
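
For a quick illustration, here is a minimal Python sketch (my own, not from the source) that simulates fair coin tosses and prints the proportion of heads for increasing n; the sample sizes and seed are arbitrary choices.

    import random

    random.seed(42)  # fixed seed so the illustration is reproducible

    def proportion_of_heads(n_tosses):
        """Toss a fair coin n_tosses times and return the proportion of heads."""
        heads = sum(random.random() < 0.5 for _ in range(n_tosses))
        return heads / n_tosses

    for n in (10, 100, 1_000, 10_000, 100_000):
        prop = proportion_of_heads(n)
        print(f"n = {n:>6}: proportion of heads = {prop:.4f}, distance from 0.5 = {abs(prop - 0.5):.4f}")

As n grows, the printed distance from 0.5 shrinks, which is exactly what the formula above says should happen with high probability.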

Convergence in distribution

Convergence in distribution (sometimes called convergence in law) is based on the distribution of random variables, rather than the individual variables themselves.

In more formal terms, a sequence of random variables converges in distribution if the CDFs for that sequence converge to a single CDF. Let’s say you had a series of random variables, Xn. Each of these variables X1, X2, …, Xn has a CDF FXn(x), which gives us a series of CDFs {FXn(x)}. Convergence in distribution means that the CDFs FXn(x) converge to a single CDF, FX(x), at every point where FX(x) is continuous.

Several methods are available for proving convergence in distribution. For example, Slutsky’s Theorem and the Delta Method can both help to establish convergence.
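
To make convergence of CDFs concrete, here is a small Python sketch (my own illustration, not from the source). If Mn is the maximum of n Uniform(0,1) draws, then Xn = n(1 − Mn) converges in distribution to an Exponential(1) random variable, so the empirical CDF of Xn should approach 1 − e^(−x); the grid of evaluation points and the replicate count are arbitrary.

    import math
    import random

    random.seed(0)

    def empirical_cdf(samples, x):
        """Fraction of samples that are <= x."""
        return sum(s <= x for s in samples) / len(samples)

    def exponential_cdf(x):
        """CDF of the Exponential(1) limit distribution."""
        return 1 - math.exp(-x)

    n_replicates = 20_000
    for n in (5, 50, 500):
        # X_n = n * (1 - max of n Uniform(0,1) draws); X_n converges in
        # distribution to Exponential(1) as n grows.
        samples = [n * (1 - max(random.random() for _ in range(n)))
                   for _ in range(n_replicates)]
        # Largest gap between the empirical CDF of X_n and the limit CDF,
        # checked on a small grid of points.
        gap = max(abs(empirical_cdf(samples, x) - exponential_cdf(x))
                  for x in [0.1 * i for i in range(1, 40)])
        print(f"n = {n:>3}: largest CDF gap on the grid = {gap:.3f}")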

Almost sure convergence

Almost sure convergence (also called convergence with probability one) answers the question: given a random variable X, do the outcomes of the sequence Xn converge to the outcomes of X with a probability of 1? (Mittelhammer, 2013).

As an example of this type of convergence of random variables, let’s say an entomologist is studying feeding habits for wild house mice and records the amount of food consumed per day. The amount of food consumed will vary wildly, but we can be almost sure (quite certain) that amount will eventually become zero when the animal dies. It will almost certainly stay zero after that point. We’re “almost certain” because the animal could be revived, or appear dead for a while, or a scientist could discover the secret for eternal mouse life. In life — as in probability and statistics — nothing is certain.

Almost sure convergence is defined in terms of a scalar sequence or matrix sequence:

Scalar: Xn converges almost surely to X iff: P(Xn → X) = P(lim(n→∞) Xn = X) = 1.

Matrix: Xn converges almost surely to X iff: P(Xn[i,j] → X[i,j]) = P(lim(n→∞) Xn[i,j] = X[i,j]) = 1, for all i and j.
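
As a rough numerical illustration of the scalar definition (my own sketch), the running average of fair coin flips converges almost surely to 0.5 by the strong law of large numbers, so along (almost) every simulated path the running average eventually enters a small band around 0.5 and stays there; the band width and path length below are arbitrary.

    import random

    random.seed(1)

    def last_exit_time(n_steps, epsilon=0.02):
        """Simulate one path of running averages of fair coin flips and return the
        last time the running average is farther than epsilon from 0.5."""
        heads = 0
        last_exit = 0
        for n in range(1, n_steps + 1):
            heads += random.random() < 0.5
            if abs(heads / n - 0.5) > epsilon:
                last_exit = n
        return last_exit

    # Along (almost) every simulated path the running average eventually enters
    # the band (0.5 - eps, 0.5 + eps) and never leaves it again, so the last
    # exit time is finite and much smaller than the path length.
    for path in range(5):
        print(f"path {path}: last time outside the ±0.02 band = {last_exit_time(50_000)}")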

Convergence in mean

Given a real number r ≥ 1, we say that the sequence Xn converges in the r-th mean (or in the Lr-norm) towards the random variable X, if the r-th absolute moments E(|Xn|^r) and E(|X|^r) of Xn and X exist, and

lim(n→∞) E(|Xn − X|^r) = 0,

where the operator E denotes the expected value. Convergence in r-th mean tells us that the expectation of the r-th power of the difference between Xn and X converges to zero.
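
For a concrete check with r = 2 (a minimal sketch of my own), let Xn be the average of n fair coin flips and X the constant 0.5. Then E(|Xn − 0.5|^2) = 0.25/n, which goes to zero, and the Monte Carlo estimate below reproduces that value.

    import random

    random.seed(2)

    def mean_square_error(n, n_replicates=10_000):
        """Monte Carlo estimate of E(|Xn - 0.5|^2), where Xn is the average of n fair coin flips."""
        total = 0.0
        for _ in range(n_replicates):
            xn = sum(random.random() < 0.5 for _ in range(n)) / n
            total += (xn - 0.5) ** 2
        return total / n_replicates

    for n in (10, 100, 1_000):
        print(f"n = {n:>5}: estimated E|Xn - 0.5|^2 = {mean_square_error(n):.5f}  (exact value 0.25/n = {0.25 / n:.5f})")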

Law of Large Numbers

In probability theory, the law of large numbers (LLN) is a theorem that describes the result of performing the same experiment a large number of times. According to the law, the average of the results obtained from a large number of trials should be close to the expected value and will tend to become closer to the expected value as more trials are performed.

The LLN is important because it guarantees stable long-term results for the averages of some random events.

There are two different versions of the law of large numbers, described below. They are called the strong law of large numbers and the weak law of large numbers. Stated for the case where X1, X2, … is an infinite sequence of independent and identically distributed (i.i.d.) Lebesgue integrable random variables with expected value E(X1) = E(X2) = … = µ, both versions of the law state that – with virtual certainty – the sample average

X̄n = (1/n)(X1 + X2 + … + Xn)

converges to the expected value:

X̄n → µ as n → ∞.

(Lebesgue integrability of Xj means that the expected value E(Xj) exists according to Lebesgue integration and is finite. It does not mean that the associated probability measure is absolutely continuous with respect to the Lebesgue measure.)

Weak Law

The weak law of large numbers (also called Khinchin’s law) states that the sample average converges in probability towards the expected value.

That is, for any positive number ε,

lim(n→∞) P(|X̄n − µ| > ε) = 0.

Interpreting this result, the weak law states that for any nonzero margin specified (ε), no matter how small, with a sufficiently large sample there will be a very high probability that the average of the observations will be close to the expected value; that is, within the margin.

The weak law applies in the case of i.i.d. random variables, but it also applies in some other cases.
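
The following Python sketch (my own illustration) estimates P(|X̄n − µ| > ε) for averages of fair die rolls (µ = 3.5) at a few sample sizes; under the weak law this probability should shrink toward zero as n grows. The choice of ε, the sample sizes, and the replicate count are arbitrary.

    import random

    random.seed(3)

    def prob_far_from_mean(n, epsilon=0.1, n_replicates=2_000):
        """Estimate P(|average of n fair die rolls - 3.5| > epsilon) by simulation."""
        far = 0
        for _ in range(n_replicates):
            average = sum(random.randint(1, 6) for _ in range(n)) / n
            far += abs(average - 3.5) > epsilon
        return far / n_replicates

    for n in (10, 100, 1_000, 5_000):
        print(f"n = {n:>5}: estimated P(|average - 3.5| > 0.1) = {prob_far_from_mean(n):.3f}")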

Strong Law

The strong law of large numbers (also called Kolmogorov’s law) states that the sample average converges almost surely to the expected value.

That is,

P(lim(n→∞) X̄n = µ) = 1.

What this means is that the probability that, as the number of trials n goes to infinity, the average of the observations converges to the expected value, is equal to one.

The proof is more complex than that of the weak law. This law justifies the intuitive interpretation of the expected value (for Lebesgue integration only) of a random variable when sampled repeatedly as the “long-term average”.

Almost sure convergence is also called strong convergence of random variables. This version is called the strong law because random variables which converge strongly (almost surely) are guaranteed to converge weakly (in probability). However, the weak law is known to hold in certain conditions where the strong law does not hold, and then the convergence is only weak (in probability).

Binomial convergence to normal and Poisson distributions

Binomial to Poisson

Let’s first talk about the relation between the binomial and Poisson distributions.

At first glance, the binomial distribution and the Poisson distribution seem unrelated. But a closer look reveals a pretty interesting relationship. It turns out the Poisson distribution is just a special case of the binomial — where the number of trials is large, and the probability of success in any given one is small.

The binomial distribution works when we have a fixed number of events n, each with a constant probability of success p.
Imagine we don’t know the number of trials that will happen. Instead, we only know the average number of successes per time period. So we know the rate of successes per day, but not the number of trials n or the probability of success p that led to that rate. We are going to prove what we just stated.

Let λ (lambda) be the rate of successes per day. It’s equal to np. That’s the number of trials n – however many there are – times the chance of success p for each of those trials. Think of it like this: if the chance of success is p and we run n trials per day, we’ll observe np successes per day on average. That’s our observed success rate lambda. Recall that the binomial distribution looks like this:

P(X = k) = [n! / (k!(n − k)!)] p^k (1 − p)^(n−k).

We defined:

λ = np.

Solving for p, we get:

p = λ/n.

What we’re going to do here is substitute this expression for p into the binomial distribution above, take the limit as n goes to infinity, and try to come up with something useful. That is,

P(X = k) = lim(n→∞) [n! / (k!(n − k)!)] (λ/n)^k (1 − λ/n)^(n−k).

Pulling out the constants

λ^k

and

1/k!,

and splitting the term on the right that’s to the power of (n − k) into a term to the power of n and one to the power of −k, we get

P(X = k) = (λ^k / k!) · lim(n→∞) [n! / ((n − k)! n^k)] · (1 − λ/n)^n · (1 − λ/n)^(−k).

Now let’s take the limit of this right-hand side one term at a time. We’ll do this in three steps. The first step is to find the limit of

lim(n→∞) n! / ((n − k)! n^k).

In the numerator, we can expand n! into n terms: (n)(n − 1)(n − 2)…(1). And in the denominator, we can expand (n − k)! into n − k terms: (n − k)(n − k − 1)(n − k − 2)…(1). That is,

lim(n→∞) [(n)(n − 1)(n − 2)…(1)] / [(n − k)(n − k − 1)(n − k − 2)…(1) · n^k].

Written this way, it’s clear that many of the terms on the top and bottom cancel out. The (n − k)(n − k − 1)…(1) terms cancel from both the numerator and denominator, leaving the following:

lim(n→∞) [(n)(n − 1)(n − 2)…(n − k + 1)] / n^k.

Since we canceled out n − k terms, the numerator here is left with k terms, from n down to n − k + 1. So this has k terms in the numerator, and k terms in the denominator, since n is raised to the power of k.

Expanding out the numerator and denominator, we can rewrite this as:

lim(n→∞) (n/n) · ((n − 1)/n) · ((n − 2)/n) · … · ((n − k + 1)/n).

This has k terms. Clearly, every one of these k terms approaches 1 as n approaches infinity, so this portion of the problem simplifies to one. We’re done with the first step.

The second step is to find the limit of the term in the middle of our equation, which is

lim(n→∞) (1 − λ/n)^n.

Recall that the definition of e = 2.718… is given by the following:

e = lim(x→∞) (1 + 1/x)^x.

Our goal here is to find a way to manipulate our expression to look more like the definition of e, which we know the limit of. Let’s define a number x as

x = −n/λ.

Now let’s substitute this into our expression and take the limit as follows (note that as n goes to infinity, |x| goes to infinity as well, and (1 + 1/x)^x still approaches e):

lim(n→∞) (1 − λ/n)^n = lim(n→∞) (1 + 1/x)^(−λx) = [lim(n→∞) (1 + 1/x)^x]^(−λ)

This term just simplifies to e^(−λ). So we’re done with our second step. That leaves only one more term for us to find the limit of. Our third and final step is to find the limit of the last term on the right, which is

lim(n→∞) (1 − λ/n)^(−k).

This is pretty simple. As n approaches infinity, this term becomes 1^(−k), which is equal to one. And that takes care of our last term. Putting these three results together, we can rewrite our original limit as

P(X = k) = (λ^k / k!) · 1 · e^(−λ) · 1.

This just simplifies to the following:

P(X = k) = λ^k e^(−λ) / k!.

This is equal to the familiar probability mass function for the Poisson distribution, which gives us the probability of k successes per period given our parameter lambda.

So we’ve shown that the Poisson distribution is just a special case of the binomial, in which the number of trials n grows to infinity and the chance of success in any particular trial approaches zero. And that completes the proof.
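
As a quick numerical check of this limit (my own sketch, not part of the source), the binomial probabilities should approach the Poisson probabilities when the rate λ = np is held fixed while n grows and p shrinks. The helpers below compute both PMFs directly from their formulas; the values of λ, k, and n are arbitrary.

    import math

    def binomial_pmf(k, n, p):
        """P(X = k) for X ~ Binomial(n, p)."""
        return math.comb(n, k) * p**k * (1 - p)**(n - k)

    def poisson_pmf(k, lam):
        """P(X = k) for X ~ Poisson(lam)."""
        return lam**k * math.exp(-lam) / math.factorial(k)

    lam = 3.0   # the fixed rate lambda = n * p
    k = 2       # number of successes whose probability we compare
    for n in (10, 100, 1_000, 10_000):
        p = lam / n
        print(f"n = {n:>6}, p = {p:.4f}: "
              f"binomial P(X=2) = {binomial_pmf(k, n, p):.5f}, "
              f"Poisson P(X=2) = {poisson_pmf(k, lam):.5f}")

The binomial column settles onto the Poisson value as n grows, mirroring the limit derived above.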

Binomial to normal

If n is large enough, then the skew of the distribution is not too great. In this case, a reasonable approximation to B(n, p) is given by the normal distribution

N(np, np(1 − p)),

and this basic approximation can be improved in a simple way by using a suitable continuity correction. The basic approximation generally improves as n increases (at least 20) and is better when p is not near to 0 or 1. Various rules of thumb may be used to decide whether n is large enough, and p is far enough from the extremes of zero or one:

  • One rule is that for n > 5 the normal approximation is adequate if the absolute value of the skewness is strictly less than 1/3; that is, if

    |1 − 2p| / √(np(1 − p)) < 1/3.

This can be made precise using the Berry–Esseen theorem.

  • A stronger rule states that the normal approximation is appropriate only if everything within 3 standard deviations of its mean is within the range of possible values; that is, only if

    μ ± 3σ = np ± 3√(np(1 − p)) ∈ (0, n).

This 3-standard-deviation rule is equivalent to the following conditions, which also imply the first rule above:

n > 9 (1 − p)/p and n > 9 p/(1 − p).
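
As an illustration of the approximation (my own sketch), the snippet below compares an exact binomial probability with its normal approximation, with and without the continuity correction mentioned above. The particular n, p, and cut-off k are arbitrary choices that satisfy the rules of thumb.

    import math

    def binomial_cdf(k, n, p):
        """Exact P(X <= k) for X ~ Binomial(n, p)."""
        return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

    def normal_cdf(x, mu, sigma):
        """CDF of a Normal(mu, sigma^2) distribution, via the error function."""
        return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

    n, p = 100, 0.3                      # satisfies the rules of thumb above
    mu = n * p                           # mean of B(n, p)
    sigma = math.sqrt(n * p * (1 - p))   # standard deviation of B(n, p)
    k = 35

    print(f"exact P(X <= {k})            = {binomial_cdf(k, n, p):.4f}")
    print(f"normal approximation         = {normal_cdf(k, mu, sigma):.4f}")
    print(f"with continuity correction   = {normal_cdf(k + 0.5, mu, sigma):.4f}")

The continuity-corrected value sits noticeably closer to the exact binomial probability, which is the point of the correction.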

Central Limit Theorem (CLT)

In probability theory, the central limit theorem (CLT) establishes that, in many situations, when independent random variables are summed up, their properly normalized sum tends toward a normal distribution (informally a bell curve) even if the original variables themselves are not normally distributed. The theorem is a key concept in probability theory because it implies that probabilistic and statistical methods that work for normal distributions can be applicable to many problems involving other types of distributions. This theorem has seen many changes during the formal development of probability theory.

Let X1, X2, …, Xn be a random sample of size n, that is, a sequence of independent and identically distributed (i.i.d.) random variables drawn from a distribution with expected value μ and finite variance σ². Suppose we are interested in the sample average

X̄n = (X1 + X2 + … + Xn) / n

of these random variables. By the law of large numbers, the sample averages converge almost surely (and therefore also converge in probability) to the expected value μ as n → ∞. The classical central limit theorem describes the size and the distributional form of the stochastic fluctuations around the deterministic number μ during this convergence. More precisely, it states that as n gets larger, the distribution of the difference between the sample average X̄n and its limit μ, when multiplied by the factor √n (that is, √n(X̄n − μ)), approximates the normal distribution with mean 0 and variance σ². For large enough n, the distribution of X̄n is close to the normal distribution with mean μ and variance σ²/n. The usefulness of the theorem is that the distribution of √n(X̄n − μ) approaches normality regardless of the shape of the distribution of the individual Xi.
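
To see the theorem in action, here is a small Python simulation (my own sketch, not from the source). It draws sample means from a clearly non-normal distribution, the Exponential(1), whose mean and variance are both 1, standardizes them as √n(X̄n − µ)/σ, and checks how often the standardized value falls below 0 and below 1; under the CLT those frequencies should approach Φ(0) ≈ 0.500 and Φ(1) ≈ 0.841. The sample sizes and replicate count are arbitrary.

    import math
    import random

    random.seed(4)

    def standardized_means(n, n_replicates=20_000):
        """Draw n_replicates values of sqrt(n) * (sample mean - mu) / sigma for
        samples of n Exponential(1) variables (so mu = sigma = 1)."""
        values = []
        for _ in range(n_replicates):
            mean = sum(random.expovariate(1.0) for _ in range(n)) / n
            values.append(math.sqrt(n) * (mean - 1.0) / 1.0)
        return values

    for n in (2, 10, 100):
        z = standardized_means(n)
        below_0 = sum(v <= 0 for v in z) / len(z)
        below_1 = sum(v <= 1 for v in z) / len(z)
        print(f"n = {n:>3}: P(Z <= 0) ≈ {below_0:.3f} (normal value 0.500), "
              f"P(Z <= 1) ≈ {below_1:.3f} (normal value 0.841)")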

References

Central limit theorem – Wikipedia

Binomial distribution – Wikipedia

Deriving the Poisson Distribution from the Binomial Distribution | by Andrew Chamberlain, Ph.D. | Medium

Law of large numbers – Wikipedia

Convergence of random variables – Wikipedia

Convergence of Random Variables: Simple Definition – Calculus How To

