A quick recap of the CLT
In probability theory, the central limit theorem (CLT) states that the distribution of a sample mean approximates a normal distribution (i.e., a “bell curve”) as the sample size becomes larger, assuming that all samples are identical in size, and regardless of the population’s actual distribution shape.
Put another way, the CLT is a statistical premise that, given a sufficiently large sample size from a population with a finite level of variance, the mean of all sampled variables from the same population will be approximately equal to the mean of the whole population. Furthermore, as the sample size gets larger, these sample means approximate a normal distribution whose variance is approximately the variance of the population divided by the sample size, in accordance with the law of large numbers.
Introduction to Wiener Process (also called Brownian Motion)
Brownian motion was first observed by the botanist Robert Brown as the erratic motion of particles suspended in liquid, and was later explained theoretically by Einstein. Since fluid dynamics are so chaotic and rapid at the molecular level, this process is best modelled by assuming the particles move randomly and independently of their past motion. We can also think of Brownian motion as the limit of a random walk as its time and space increments shrink to 0. In addition to its physical importance, Brownian motion is a central concept in stochastic calculus, which can be used in finance and economics to model stock prices and interest rates.
Definition of Wiener Process
A standard (one-dimensional) Wiener process (also called Brownian motion) is a stochastic process $\{W_t\}_{t \ge 0}$ indexed by nonnegative real numbers $t$ with the following properties:
- $W_0 = 0$.
- With probability 1, the function $t \mapsto W_t$ is continuous in $t$.
- The process $\{W_t\}_{t \ge 0}$ has stationary, independent increments.
- The increment $W_{t+s} - W_s$ has the $N(0, t)$ distribution.
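To make the definition concrete, here is a minimal Python sketch (my own illustration, assuming NumPy; not part of the original text) that builds a discretized Wiener path directly from these properties, using independent N(0, dt) increments:

import numpy as np

rng = np.random.default_rng(0)

def wiener_path(T=1.0, n_steps=1000):
    # By the definition above, increments over disjoint intervals of
    # length dt are independent N(0, dt) random variables, and W_0 = 0.
    dt = T / n_steps
    increments = rng.normal(loc=0.0, scale=np.sqrt(dt), size=n_steps)
    return np.concatenate([[0.0], np.cumsum(increments)])

# Sanity check: over many paths, W_1 should have mean 0 and variance 1.
samples = np.array([wiener_path()[-1] for _ in range(2000)])
print(samples.mean(), samples.var())  # approximately 0 and 1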
Wiener Process as a scaling limit of Random Walks
One of the many reasons that Brownian motion is important in probability theory is that it is, in a certain sense, a limit of rescaled simple random walks. Let $\xi_1, \xi_2, \ldots$ be a sequence of independent, identically distributed random variables with mean 0 and variance 1. For each $n \ge 1$ define a continuous-time stochastic process $W_n(t)$ by:

$$W_n(t) = \frac{1}{\sqrt{n}} \sum_{1 \le j \le \lfloor nt \rfloor} \xi_j$$

This is a random step function with jumps of size $\pm 1/\sqrt{n}$ at times $k/n$, where $k \in \mathbb{Z}_+$. Since the random variables $\xi_j$ are independent, the increments of $W_n(t)$ are independent. Moreover, for large $n$ the distribution of $W_n(t+s) - W_n(s)$ is close to the $N(0, t)$ distribution, by the Central Limit Theorem. Thus, it requires only a small leap of faith to believe that, as $n \to \infty$, the distribution of the random function $\{W_n(t)\}_{t \ge 0}$ approaches (in a sense made precise below) that of a standard Brownian motion.
Why is this important? First, it explains, at least in part, why the Wiener process arises so commonly in nature. Many stochastic processes behave, at least for long stretches of time, like random walks with small but frequent jumps. The argument above suggests that such processes will look, at least approximately, and on the appropriate time scale, like Brownian motion.
Second, it suggests that many important “statistics” of the random walk will have limiting distributions and that the limiting distributions will be the distributions of the corresponding statistics of Brownian motion. The simplest instance of this principle is the central limit theorem: the distribution of $W_n(1)$ is, for large $n$, close to that of $W(1)$ (the Gaussian distribution with mean 0 and variance 1). Other important instances do not follow so easily from the central limit theorem. For example, the distribution of:

$$M_n := \max_{0 \le t \le 1} W_n(t)$$

converges, as $n \to \infty$, to that of:

$$M := \max_{0 \le t \le 1} W(t)$$
Donsker’s Theorem
In probability theory, Donsker’s theorem (also known as Donsker’s invariance principle, or the functional central limit theorem), named after Monroe D. Donsker, is a functional extension of the central limit theorem.
Let $X_1, X_2, X_3, \ldots$ be a sequence of independent and identically distributed (i.i.d.) random variables with mean 0 and variance 1.
Let $S_n := \sum_{i=1}^{n} X_i$.
The stochastic process $S := (S_n)_{n \in \mathbb{N}}$ is known as a random walk. Define the diffusively rescaled random walk (partial-sum process) by

$$W^{(n)}(t) := \frac{S_{\lfloor nt \rfloor}}{\sqrt{n}}, \qquad t \in [0, 1].$$

The central limit theorem asserts that $W^{(n)}(1)$ converges in distribution to a standard Gaussian random variable $W(1)$ as $n \to \infty$. Donsker’s invariance principle extends this convergence to the whole function $W^{(n)} := (W^{(n)}(t))_{t \in [0,1]}$. More precisely, in its modern form, Donsker’s invariance principle states that: as random variables taking values in the Skorokhod space* $\mathcal{D}[0,1]$, the random function $W^{(n)}$ converges in distribution to a standard Brownian motion $W := (W(t))_{t \in [0,1]}$ as $n \to \infty$.
*The set of all càdlàg functions from E to M is often denoted by D(E; M) (or simply D) and is called Skorokhod space after the Ukrainian mathematician Anatoliy Skorokhod.
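As a rough numerical check of this statement (a sketch of mine, assuming NumPy and using ±1 steps for the Xi), the endpoint W(n)(1) = Sn/√n should behave like a standard Gaussian:

import numpy as np

rng = np.random.default_rng(1)

def rescaled_walk(n):
    # The diffusively rescaled random walk W^(n) evaluated on the grid k/n.
    steps = rng.choice([-1, 1], size=n)  # i.i.d. steps with mean 0, variance 1
    return np.concatenate([[0.0], np.cumsum(steps)]) / np.sqrt(n)

# The endpoint W^(n)(1) = S_n / sqrt(n) should be approximately N(0, 1):
endpoints = np.array([rescaled_walk(1000)[-1] for _ in range(5000)])
print("mean ~ 0:", endpoints.mean())
print("variance ~ 1:", endpoints.var())
# For standard BM, P(W(1) <= 1) = Phi(1) ~ 0.8413; compare empirically:
print("P(W_n(1) <= 1):", (endpoints <= 1).mean())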
What is Brownian motion and why are we interested?
Much of probability theory is devoted to describing the macroscopic picture emerging in random systems defined by a host of microscopic random effects. Brownian motion is the macroscopic picture emerging from a particle moving randomly in d-dimensional space without making very big jumps. On the microscopic level, at any time step, the particle receives a random displacement, caused for example by other particles hitting it or by an external force, so that, if its position at time zero is $S_0$, its position at time $n$ is given as $S_n = S_0 + \sum_{i=1}^{n} X_i$, where the displacements $X_1, X_2, X_3, \ldots$ are assumed to be independent, identically distributed random variables with values in $\mathbb{R}^d$. The process $\{S_n : n \ge 0\}$ is a random walk, and the displacements represent the microscopic inputs. It turns out that not all the features of the microscopic inputs contribute to the macroscopic picture. Indeed, if they exist, only the mean and covariance of the displacements are shaping the picture. In other words, all random walks whose displacements have the same mean and covariance matrix give rise to the same macroscopic process, and even the assumption that the displacements have to be independent and identically distributed can be substantially relaxed. This effect is called universality, and the macroscopic process is often called a universal object. It is a common approach in probability to study various phenomena through the associated universal objects.
If the jumps of a random walk are sufficiently tame to become negligible in the macroscopic picture, in particular if it has finite mean and variance, any continuous-time stochastic process $\{X(t) : t \ge 0\}$ describing the macroscopic features of this random walk should have the following properties:
(1) for all times $0 \le t_1 \le t_2 \le \cdots \le t_n$ the random variables $X(t_n) - X(t_{n-1}), X(t_{n-1}) - X(t_{n-2}), \ldots, X(t_2) - X(t_1)$ are independent; we say that the process has independent increments,
(2) the distribution of the increment $X(t+h) - X(t)$ does not depend on $t$; we say that the process has stationary increments,
(3) the process $\{X(t) : t \ge 0\}$ has almost surely continuous paths,
(4) for every $t \ge 0$ and $h > 0$ the increment $X(t+h) - X(t)$ is multivariate normally distributed with mean $h\mu$ and covariance matrix $h\Sigma$, for some vector $\mu \in \mathbb{R}^d$ and matrix $\Sigma$.
Hence any process with the features (1)-(3) above is characterised by just three parameters:
- the initial distribution, i.e. the law of $X(0)$,
- the drift vector $\mu$,
- the diffusion matrix $\Sigma$.
If the drift vector is zero and the diffusion matrix is the identity, we say the process is a Brownian motion. If $X(0) = 0$, i.e. the motion is started at the origin, we use the term standard Brownian motion.
Suppose we have a standard Brownian motion $\{B(t) : t \ge 0\}$. If $X_0$ is a random variable with values in $\mathbb{R}^d$, $\mu \in \mathbb{R}^d$ a vector, and $\sigma$ a $d \times d$ matrix, then it is easy to check that $\{X(t) : t \ge 0\}$ given by

$$X(t) = X_0 + \mu t + \sigma B(t), \qquad t \ge 0,$$

is a process with the properties (1)-(4) with initial distribution $X_0$, drift vector $\mu$ and diffusion matrix $\sigma\sigma^{\mathsf{T}}$. Hence the macroscopic picture emerging from a random walk with finite variance can be fully described by a standard Brownian motion.
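As a quick illustration (my own sketch, with arbitrary example values for X0, µ and σ), the affine transformation above can be applied to a simulated standard Brownian path:

import numpy as np

rng = np.random.default_rng(2)

T, n = 1.0, 1000
dt = T / n
t = np.linspace(0.0, T, n + 1)

# Standard Brownian motion in d = 2 dimensions.
dB = rng.normal(0.0, np.sqrt(dt), size=(n, 2))
B = np.vstack([np.zeros(2), np.cumsum(dB, axis=0)])

x0 = np.array([1.0, -1.0])        # initial position X_0 (example value)
mu = np.array([0.5, 0.2])         # drift vector (example value)
sigma = np.array([[1.0, 0.0],     # sigma; the diffusion matrix is sigma @ sigma.T
                  [0.3, 0.8]])

# X(t) = X_0 + mu*t + sigma*B(t): a process with properties (1)-(4).
X = x0 + np.outer(t, mu) + B @ sigma.T
print(X[-1])  # position at time T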
References
The previous graph is taken from application 11_A; as we can see, the following distributions are present:
- Poisson distribution
- Uniform distribution
- Exponential Distribution
Poisson process
A Poisson process is a model for a series of discrete events where the mean time between events is known, but the exact timing of the events is random. The arrival of an event is independent of the previous event. The important point is that we know the mean time between events but they are randomly spaced (stochastic). We could have consecutive failures, but we could also spend years between failures due to the randomness of the process.
A Poisson process satisfies the following criteria (in reality many phenomena modeled as Poisson processes do not exactly satisfy them):
- Events are independent of each other. The occurrence of one event does not affect the likelihood of another event occurring.
- The average rate (events per time period) is constant.
- Two events cannot occur at the same time. This means that we can think of any sufficiently small sub-interval of a Poisson process as a Bernoulli trial, i.e. a success or a failure.
Poisson distribution
The Poisson process is the model we use to describe randomly occurring events. Using the Poisson Distribution we could find the probability of a number of events over a period of time or find the probability of waiting some time until the next event.
The probability function of the Poisson distribution gives the probability of observing k events over a period of time, given the length of the period and the time-averaged rate of events:

$$P(k \text{ events in time period}) = e^{-\left(\frac{\text{events}}{\text{time}} \,\cdot\, \text{time period}\right)} \cdot \frac{\left(\frac{\text{events}}{\text{time}} \,\cdot\, \text{time period}\right)^k}{k!}$$

This is a little convoluted, and events/time × time period is usually simplified into a single parameter, λ, lambda, the rate parameter. With this substitution, the Poisson distribution probability function now has one parameter:

$$P(k \text{ events in interval}) = e^{-\lambda} \cdot \frac{\lambda^k}{k!}$$
Lambda can be thought of as the expected number of events in the interval.
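As a quick check (a sketch of mine, not from the referenced article), the one-parameter form of the PMF can be evaluated directly and compared with simulated event counts:

import math
import numpy as np

def poisson_pmf(k, lam):
    # P(k events in interval) = exp(-lambda) * lambda^k / k!
    return math.exp(-lam) * lam**k / math.factorial(k)

lam = 3.0
rng = np.random.default_rng(3)
counts = rng.poisson(lam, size=100_000)

for k in range(6):
    print(k, poisson_pmf(k, lam), (counts == k).mean())
# The analytic PMF and the empirical frequencies agree closely.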
When we change the parameter λ, we change the probability of seeing different numbers of events in an interval. The graph below is the probability mass function of the Poisson distribution, showing the probability of a number of events occurring in an interval for different rate parameters.

References
The Poisson Distribution and Poisson Process Explained | by Will Koehrsen | Towards Data Science
Order statistics are a very useful concept in statistical sciences. They have a wide range of applications including modelling auctions, car races, and insurance policies, optimizing production processes, estimating parameters of distributions, and so forth.
Suppose we have a set of random variables X1, X2, …, Xn, which are independent and identically distributed (i.i.d). By independence, we mean that the value taken by a random variable is not influenced by the values taken by other random variables. By identical distribution, we mean that the probability density function (PDF) (or equivalently, the Cumulative distribution function, CDF) for the random variables is the same. The kth order statistic for this set of random variables is defined as the kth smallest value of the sample.
To better understand this concept, we’ll take 5 random variables X1, X2, X3, X4, X5. We’ll observe a random realization/outcome from the distribution of each of these random variables. Suppose we get the following values: X1 = 4, X2 = 2, X3 = 7, X4 = 11, X5 = 5.
The kth order statistic for this experiment is the kth smallest value from the set {4, 2, 7, 11, 5}. So, the 1st order statistic is 2 (the smallest value), the 2nd order statistic is 4 (the next smallest), and so on. The 5th order statistic is the fifth smallest value (the largest value), which is 11. We repeat this process many times, i.e., we draw samples from the distribution of each of these i.i.d. random variables, and find the kth smallest value for each set of observations. The probability distribution of these values gives the distribution of the kth order statistic.
In general, if we arrange random variables X1, X2, …, Xn in ascending order, then the kth order statistic is shown as:

$$X_{(1)} \le X_{(2)} \le \cdots \le X_{(k)} \le \cdots \le X_{(n)}$$

The general notation of the kth order statistic is X(k). Note X(k) is different from Xk. Xk is the kth random variable from our set, whereas X(k) is the kth order statistic from our set. X(k) takes the value of Xk if Xk is the kth value when the realizations are arranged in ascending order.
The 1st order statistic X(1) is the minimum of the realizations of the set of n random variables. The nth order statistic X(n) is the maximum (the nth smallest value) of the realizations of the set of n random variables. They can be expressed as:

$$X_{(1)} = \min(X_1, X_2, \ldots, X_n), \qquad X_{(n)} = \max(X_1, X_2, \ldots, X_n)$$
Distribution of Order Statistics
We’ll now try to find out the distribution of order statistics. We’ll first describe the distribution of the nth order statistic, then the 1st order statistic & finally the kth order statistic in general.
A) Distribution of the nth Order Statistic:
Let the probability density function (PDF) and cumulative distribution function (CDF) of our random variables be fX(x) and FX(x) respectively. By definition of the CDF,

$$F_X(x) = P(X \le x)$$

Since our random variables are identically distributed, they have the same PDF fX(x) and CDF FX(x). We’ll now calculate the CDF of the nth order statistic, Fn(x), as follows:

$$F_n(x) = P\left(X_{(n)} \le x\right) = P(X_1 \le x,\; X_2 \le x,\; \ldots,\; X_n \le x)$$

The random variables X1, X2, …, Xn are also independent. Therefore, by the property of independence,

$$F_n(x) = P(X_1 \le x)\, P(X_2 \le x) \cdots P(X_n \le x) = [F_X(x)]^n$$

The PDF of the nth order statistic, fn(x), is calculated by differentiation:

$$f_n(x) = \frac{d}{dx} F_n(x) = n\, [F_X(x)]^{n-1} f_X(x)$$

Thus, the expression for the PDF and CDF of the nth order statistic has been obtained.
B) Distribution of the 1st Order Statistic:
The CDF of a random variable can also be calculated as one minus the probability that the random variable X takes a value greater than x. Mathematically,

$$F_X(x) = P(X \le x) = 1 - P(X > x)$$

We’ll determine the CDF of the 1st order statistic, F1(x), as follows:

$$F_1(x) = P\left(X_{(1)} \le x\right) = 1 - P\left(X_{(1)} > x\right) = 1 - P(X_1 > x,\; X_2 > x,\; \ldots,\; X_n > x)$$

Once again, using the property of independence of random variables,

$$F_1(x) = 1 - [1 - F_X(x)]^n$$

The PDF of the 1st order statistic, f1(x), is calculated by differentiation:

$$f_1(x) = \frac{d}{dx} F_1(x) = n\, [1 - F_X(x)]^{n-1} f_X(x)$$

Thus, the expression for the PDF and CDF of the 1st order statistic has been obtained.
C) Distribution of the kth Order Statistic:
For the kth order statistic, in general, the following equation describes its CDF, Fk(x):

$$F_k(x) = P\left(X_{(k)} \le x\right) = \sum_{j=k}^{n} \binom{n}{j} [F_X(x)]^{j}\, [1 - F_X(x)]^{n-j}$$

The PDF of the kth order statistic, fk(x), is expressed as:

$$f_k(x) = \frac{n!}{(k-1)!\,(n-k)!}\, [F_X(x)]^{k-1}\, [1 - F_X(x)]^{n-k}\, f_X(x)$$
To avoid confusion, we’ll use a geometric argument to understand the equation. As discussed before, the set of random variables have the same PDF fX(x). The following graph shows a sample PDF with the kth order statistic obtained from random sampling:
So, the PDF of the random variables fX(x) is defined on the interval [a, b]. The kth order statistic for a random sample is shown by the red line. The other variable realizations (for the random sample) are shown by the small black lines on the x-axis.
There are exactly (k − 1) random variable observations that fall in the yellow region of the graph (the region between a and the kth order statistic). The probability that a particular observation falls in this region is given by the CDF of the random variables, FX(x). Since (k − 1) observations fall in this region, independence gives us the term [FX(x)]^(k−1).
There are exactly (n − k) random variable observations that fall in the blue region of the graph (the region between the kth order statistic and b). The probability that a particular observation falls in this region is 1 − FX(x). Since (n − k) observations fall in this region, independence gives us the term [1 − FX(x)]^(n−k).
Finally, exactly 1 observation falls at the kth order statistic itself, contributing the density fX(x). Thus, the product of the 3 terms gives us an idea of the geometric meaning of the equation for the PDF of the kth order statistic. But where does the factorial term come from? The above scenario just showed one of the many orderings; there can be many such combinations. The total number of such combinations is:

$$\frac{n!}{(k-1)! \cdot 1! \cdot (n-k)!}$$
Thus, the product of all of these terms gives us the general distribution of the kth order statistic.
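To make the formula concrete, here is a small Monte Carlo sketch (my own illustration; Uniform(0,1) is chosen for convenience, since then FX(x) = x) comparing the empirical density of the kth order statistic with the expression above:

import numpy as np
from math import factorial

rng = np.random.default_rng(4)
n, k = 5, 2  # sample size and order statistic index

# Empirical: draw many samples of size n, record the kth smallest value.
samples = rng.uniform(0, 1, size=(100_000, n))
kth = np.sort(samples, axis=1)[:, k - 1]

# Analytic PDF for Uniform(0,1): f_k(x) = n!/((k-1)!(n-k)!) x^(k-1) (1-x)^(n-k)
x = 0.3
coef = factorial(n) / (factorial(k - 1) * factorial(n - k))
pdf = coef * x**(k - 1) * (1 - x)**(n - k)

# Compare the analytic density with an empirical histogram estimate near x.
h = 0.01
empirical = ((kth > x - h) & (kth < x + h)).mean() / (2 * h)
print(pdf, empirical)  # the two values should be close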
Useful Functions of Order Statistics
Order statistics give rise to various useful functions. Among them, the notable ones include sample range and sample median.
1) Sample range: It is defined as the difference between the largest and smallest values of the sample. It is expressed as follows:

$$R = X_{(n)} - X_{(1)}$$

2) Sample median: The sample median divides the random sample (realizations from the set of random variables) into two halves, one that contains the samples with lower values, and the other that contains the samples with higher values. It’s like the middle/central order statistic. It is mathematically defined as:

$$\tilde{X} = \begin{cases} X_{\left(\frac{n+1}{2}\right)} & \text{if } n \text{ is odd} \\[4pt] \tfrac{1}{2}\left(X_{\left(\frac{n}{2}\right)} + X_{\left(\frac{n}{2}+1\right)}\right) & \text{if } n \text{ is even} \end{cases}$$
Joint PDF of Order Statistics
A joint probability density function can help us better understand the relationship between two random variables (two order statistics in our case). The joint PDF for any 2 order statistics X(a) and X(b), such that 1 ≤ a < b ≤ n, is given by the following equation (for x ≤ y):

$$f_{X_{(a)}, X_{(b)}}(x, y) = \frac{n!}{(a-1)!\,(b-a-1)!\,(n-b)!}\, [F_X(x)]^{a-1}\, [F_X(y) - F_X(x)]^{b-a-1}\, [1 - F_X(y)]^{n-b}\, f_X(x)\, f_X(y)$$
References
Order Statistics | What is Order Statistics | Introduction to Order Statistics (analyticsvidhya.com)
In mathematics, a random walk is a mathematical object, known as a stochastic or random process, that describes a path that consists of a succession of random steps on some mathematical space such as the integers.
An elementary example of a random walk is the random walk on the integer number line, Z, which starts at 0 and each step moves +1 or −1 with equal probability. Other examples include the path traced by a molecule as it travels in a liquid or a gas (see Brownian motion), the search path of a foraging animal, the price of a fluctuating stock and the financial status of a gambler: all can be approximated by random walk models, even though they may not be truly random in reality.
As illustrated by those examples, random walks have applications to engineering and many scientific fields including ecology, psychology, computer science, physics, chemistry, biology, economics, and sociology. Random walks explain the observed behaviours of many processes in these fields and thus serve as a fundamental model for the recorded stochastic activity. As a more mathematical application, the value of π can be approximated by the use of a random walk in an agent-based modelling environment. The term random walk was first introduced by Karl Pearson in 1905.
In the simplest context, the walk is in discrete time, that is, a sequence of random variables (Xt) = (X1, X2, …) indexed by the natural numbers. However, it is also possible to define random walks which take their steps at random times, and in that case, the position Xt has to be defined for all times t ∈ [0, +∞).
A random walk is a process by which randomly-moving objects wander away from where they started. The video below shows 7 black dots that start in one place randomly walking away.

How can we describe this mathematically?
The simplest random walk to understand is a 1-dimensional walk. Suppose that the black dot below is sitting on a number line. The black dot starts in the centre.

Then, it takes a step, either forward or backwards, with equal probability. It keeps taking steps either forward or backwards each time. Let’s call the 1st step a1, the second step a2, the third step a3 and so on. Each “a” is either equal to +1 (if the step is forward) or -1 (if the step is backward). The picture below shows a black dot that has taken 5 steps and ended up at -1 on the number line.

Suppose we put the black dot at 0 and then let it take N steps (where N is any number). Now we want to know how far the black dot travels after it has taken N steps. Of course, the distance travelled after N steps will vary each time we repeat the experiment, so what we want to know is, if we repeat the experiment many many times, how far the black dot will have travelled on average. Let’s call the distance that the black dot has travelled “d”. Keep in mind that d can either be positive or negative, depending on whether the black dot ends up to the right or left of 0. We know that for any one time that we repeat the experiment,
d = a1 + a2 + a3 + … + aN
Now we use the notation <d> to mean “the average of d if we repeated the experiments many times”:
<d> = <(a1 + a2 + a3 + … + aN)> = <a1> + <a2> + <a3> + … + <aN>
But <a1>=0, because if we repeated the experiment many many times, and a1 has an equal probability of being -1 or +1, we expect the average of a1 to be 0. So then,
<d> = <a1> + <a2> + <a3> + … + <aN> = 0 + 0 + 0 + … + 0 = 0
This isn’t too surprising if you think about it. After all, <d> is the average location of the black dot after N steps, and since the dot is equally likely to move forward or backwards, we expect d to be 0, on average.
Well, that was useless!
That exercise didn’t tell us anything about how far the black dot gets after N steps! We are going to have to be a little more clever if we want to find out any kind of useful information.
Here’s an idea… Even though d can be positive or negative, d² is always positive, so it can’t average out to 0. What if we found the average value of d²?
<d²> = <(a1 + a2 + a3 + … + aN)²> = <(a1 + a2 + a3 + … + aN)(a1 + a2 + a3 + … + aN)>
= (<a1²> + <a2²> + <a3²> + … + <aN²>) + 2(<a1a2> + <a1a3> + … + <a1aN> + <a2a3> + … + <a2aN> + …)
First let’s think about what <a1²> is equal to. Well, a1 can either be +1 or −1. Either way, a1² = 1. Then on average a1² is 1 and <a1²> = 1. The same applies to <a2²>, <a3²>, and all the way up to <aN²>.
Now let’s think about what <a1a2> is equal to. There are 4 possible combinations of a1 and a2, each with equal probability:
a1 = +1, a2 = +1 → a1a2 = +1
a1 = +1, a2 = −1 → a1a2 = −1
a1 = −1, a2 = +1 → a1a2 = −1
a1 = −1, a2 = −1 → a1a2 = +1
Since a1a2 is equally likely to be +1 or −1, <a1a2> = 0. The same is true of <a1a3>, <a1aN>, <a2a3>, <a2aN> and all of the other terms containing two different steps. Then:
<d²> = (<a1²> + <a2²> + <a3²> + … + <aN²>) + 2(<a1a2> + <a1a3> + … + <a1aN> + <a2a3> + … + <a2aN> + …)
= (1 + 1 + 1 + … + 1) + 2(0 + 0 + … + 0 + 0 + …) = N
Now we’re getting somewhere! The average of the square of the distance is equal to N. If we take the square root of this equation, we realize that sqrt(<d²>) = sqrt(N).
Since sqrt(<d²>) is something like the average positive distance away from 0 after N steps (technically, it’s called the “root-mean-squared” distance), we expect that after N steps the black dot will be roughly sqrt(N) steps away from where it started. So for 25 steps, we expect the black dot to have moved roughly 5 total spaces from 0 in either direction. Of course, sometimes it will move more and sometimes fewer total spaces, but 5 is roughly what we might expect.
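The sqrt(N) scaling is easy to check numerically; the following short sketch (mine, assuming NumPy) estimates <d> and sqrt(<d²>) over many repeated walks:

import numpy as np

rng = np.random.default_rng(5)
N = 25           # number of steps per walk
trials = 100_000

# Each walk: N steps of +1 or -1 with equal probability; d is the endpoint.
steps = rng.choice([-1, 1], size=(trials, N))
d = steps.sum(axis=1)

print("<d>         ~", d.mean())                   # close to 0
print("sqrt(<d^2>) ~", np.sqrt((d ** 2).mean()))   # close to sqrt(N) = 5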
Random walks in more than one dimension
Of course, the 1-dimensional random walk is easy to understand, but not as commonly found in nature as the 2D and 3D random walk, in which an object is free to move along a 2D plane or a 3D space instead of a 1D line (think of gas particles bouncing around in a room, able to move in 3D). We won’t work out the math here, but it turns out to be governed by the same rule as a 1D random walk — the total distance travelled from where it started is approximately sqrt(N) where N is the number of steps taken.
Above we assumed a dimensionless step size of 1. In real life, step sizes are measured in units of distance, and sqrt(<d²>) = <r>·sqrt(N), where “r” is the size of a step, measured in units of length, and <r> is the average size of a step if the random steps vary in size (which they typically do in nature).
Now that we know a little more about random walks, let’s take another look at the video we saw before. The video shows the paths of 7 black dots undergoing a random walk in 2D. All of the dots start in the same place and then start to wander around. Notice that the steps they take are not all the same length. By the end of the video, some of the dots ended up above where they started, some below, some to the right, and some to the left. Some don’t end up very far from where they started and others manage to travel quite a distance.

Not all random walks are “random”
So far all of the random walks we have considered allowed an object to move with equal probability in any direction. But not all random walks follow this rule. A biased random walk is a random walk that is biased in one direction, leading to a net drift on average of particles in one specific direction.
There are several ways that a random walk can be biased. Think back to our 1D number line.
- Suppose that instead of an equal probability of moving left to right, there was a higher probability of moving to the right.
- Or, suppose the probabilities of moving to the left or right remained equal, but whenever the dot moved to the right, it moved 2 spaces, and when it moved to the left, it only moved 1 space.
In both of these cases, we would expect a net drift to the right. In other words, the random walk is biased toward the right. The video below shows a biased random walk in 2D.

Each black dot has an equal probability of stepping towards the right or the left. But the step size is on average longer if the step is towards the right than if the step is towards the left.
At the centre of statistics lies the normal distribution, known to millions of people as the bell curve, or the bell-shaped curve. This is a two-parameter family of curves that are graphs of the equation

$$y = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
Not only is the bell curve familiar to these millions, but they also know of its main use: to describe the general, or idealized, shape of graphs of data. It has, of course, many other uses and plays as significant a role in the social sciences as differentiation does in the natural sciences.
An approximation tool
The origins of the mathematical theory of probability are justly attributed to the famous correspondence between Fermat and Pascal, which was instigated in 1654 by the queries of the gambling Chevalier de Mere.
Among the various types of problems they considered were binomial distributions, which today would be described by:

$$\sum_{k=i}^{j} \binom{n}{k} p^k (1-p)^{n-k} \tag{2}$$
This sum denotes the likelihood of between i and j successes in n trials with success probability p. Such a trial—now called a Bernoulli trial—is the most elementary of all random experiments. It has two outcomes, usually termed success and failure.
As the binomial examples Fermat and Pascal worked out involved only small values of n, they were not concerned with the computational challenge presented by the evaluation of general sums of this type. However, more complicated computations were not long in coming.
For example, in 1712 the Dutch mathematician Gravesande tested the hypothesis that male and female births are equally likely against the actual births in London over the 82 years 1629–1710.
A few years earlier Jacob Bernoulli had found estimates for binomial sums of the type of (2). These estimates, however, did not involve the exponential function e^x.
De Moivre began his search for such approximations in 1721. In 1733, he proved that:

$$\binom{n}{\frac{n}{2}+d} \left(\frac{1}{2}\right)^{n} \approx \frac{2}{\sqrt{2\pi n}}\, e^{-2d^2/n} \tag{4}$$
De Moivre also asserted that (4) could be generalized to a similar asymmetrical context, with x varying from n/2 to d + n/2. This is easily done, with the precision of the approximation clarified by De Moivre’s proof.

FIGURE 2 demonstrates how the binomial probabilities associated with 50 independent repetitions of a Bernoulli trial with probability p = 0.3 of success are approximated by such an exponential curve. De Moivre’s discovery is standard fare in all introductory statistics courses, where it is called the normal approximation to the binomial and rephrased as

$$P\!\left(a \le \frac{X_n - np}{\sqrt{np(1-p)}} \le b\right) \approx \frac{1}{\sqrt{2\pi}} \int_a^b e^{-x^2/2}\, dx$$

where $X_n$ denotes the number of successes in $n$ trials.
Since this integral is easily evaluated by numerical methods and quite economically described by tables, it does indeed provide a very practical approximation for cumulative binomial probabilities.
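As a quick numerical illustration (a sketch of mine, not from the article), one can compare the exact cumulative binomial probability with the normal approximation for the FIGURE 2 parameters n = 50 and p = 0.3:

from math import comb, erf, sqrt

n, p = 50, 0.3

def binom_cdf(j):
    # Exact P(X <= j) for X ~ Binomial(n, p).
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(j + 1))

def normal_cdf(z):
    # Standard normal CDF via the error function.
    return 0.5 * (1 + erf(z / sqrt(2)))

mu, sigma = n * p, sqrt(n * p * (1 - p))
j = 18
# Normal approximation (with a 0.5 continuity correction):
approx = normal_cdf((j + 0.5 - mu) / sigma)
print(binom_cdf(j), approx)  # the two values agree to about two decimals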
The search for an error curve
Astronomy was the first science to call for accurate measurements. Consequently, it was also the first science to be troubled by measurement errors and to face the question of how to proceed in the presence of several distinct observations of the same quantity.
The first scientist to note in print that measurement errors are deserving of a systematic and scientific treatment was Galileo in his famous Dialogue Concerning the Two Chief Systems of the World—Ptolemaic and Copernican, published in 1632.
Thus, even as late as the mid-18th century, doubts persisted about the value of repetition of experiments. More important, however, was Simpson’s experimentation with specific error curves—probability densities that model the distribution of random errors. In the two propositions, Simpson computed the probability that the error in the mean of several observations does not exceed a given bound when the individual errors take on the values

$$-r, \ldots, -2, -1, 0, 1, 2, \ldots, r$$

either with equal probabilities, or with probabilities proportional to the triangular weights

$$1, 2, \ldots, r, r+1, r, \ldots, 2, 1.$$

Simpson’s choice of error curves may seem strange, but they were in all likelihood dictated by the state of the art of probability at that time. For r = 1 (the simplest case), these two distributions yield the two top graphs of FIGURE 4. One year later Simpson, while effectively inventing the notion of a continuous error distribution, dealt with similar problems in the context of the error curves described at the bottom of FIGURE 4.

In 1774, Laplace proposed the first of his error curves. Denoting this function by φ(x), he stipulated that it must be symmetric in x and monotone decreasing for x > 0. Furthermore, he proposed that
… as we have no reason to suppose a different law for the ordinates than for their differences, it follows that we must, subject to the rules of probabilities, suppose the ratio of two infinitely small consecutive differences to be equal to that of the corresponding ordinates.
Laplace’s argument can be paraphrased as follows. Aside from their being symmetrical and descending (for x > 0), we know nothing about either φ(x) or φ′(x). Hence, presumably by Occam’s razor, it must be assumed that they are proportional (the simpler assumption of equality leads to φ(x) = Ce^|x|, which is impossible). The resulting differential equation is easily solved, and the extracted error curve, φ(x) = (m/2)e^(−m|x|), is displayed in FIGURE 5. There is no indication that Laplace was in any way disturbed by this curve’s non-differentiability at x = 0. We are about to see that he was perfectly willing to entertain even more drastic singularities.

Laplace must have been aware of the shortcomings of his rationale, for three short years later he proposed an alternative curve [23]. Let a be the supremum of all the possible errors (in the context of a specific experiment) and let n be a positive integer.
Choose n points at random within the unit interval, thereby dividing it into n + 1 spacings. Order the spacings as

$$d_1 \ge d_2 \ge \cdots \ge d_{n+1}, \qquad d_1 + d_2 + \cdots + d_{n+1} = 1.$$

Let $\bar{d}_i$ be the expected value of $d_i$. Draw the points $(i/n, \bar{d}_i)$, $i = 1, 2, \ldots, n+1$, and let n become infinitely large. The limit configuration is a curve that is proportional to $\ln(a/x)$ on $(0, a]$. Symmetry and the requirement that the total probability must be 1 then yield Laplace’s second candidate for the error curve (FIGURE 6):

$$\phi(x) = \frac{1}{2a} \ln\frac{a}{|x|}, \qquad -a \le x \le a.$$
This curve, with its infinite singularity at 0 and finite domain (a reversal of the properties of the error curve of FIGURE 5 and the bell-shaped curve) constitutes a step backwards in the evolutionary process and one suspects that Laplace was seduced by the considerable mathematical intricacies of the curve’s derivation. So much so that he seemed compelled to comment on the curve’s excessive complexity and to suggest that error analyses using this curve should be carried out only in “very delicate” investigations, such as the transit of Venus across the sun.
The next important development had its roots in a celestial event that occurred on January 1, 1801. On that day the Italian astronomer Giuseppe Piazzi sighted a heavenly body that he strongly suspected to be a new planet. He announced his discovery and named it Ceres. Unfortunately, six weeks later, before enough observations had been taken to make possible an accurate determination of its orbit and to ascertain that it was indeed a planet, Ceres disappeared behind the sun and was not expected to reemerge for nearly a year. Interest in this possibly new planet was widespread, and astronomers throughout Europe prepared themselves by computing the location where Ceres was most likely to reappear. The young Gauss, who had already made a name for himself as an extraordinary mathematician, proposed that an area of the sky be searched that was quite different from those suggested by the other astronomers, and he turned out to be right.
Gauss explained that he used the least-squares criterion to locate the orbit that best fit the observations. This criterion was justified by a theory of errors that was based on the following three assumptions:
- Small errors are more likely than large errors.
- For any real number ε, the likelihood of errors of magnitudes ε and −ε are equal.
- In the presence of several measurements of the same quantity, the most likely value of the quantity being measured is their average.
Based on these assumptions, he concluded that the probability density for the error (that is, the error curve) is:

$$\phi(x) = \frac{h}{\sqrt{\pi}}\, e^{-h^2 x^2}$$

where h is a positive constant that Gauss thought of as the “precision of the measurement process”. We recognize this as the bell curve determined by µ = 0 and σ = 1/(√2 h).
Gauss’s ingenious derivation of this error curve made use of only some basic probabilistic arguments and standard calculus facts. As it falls within the grasp of undergraduate mathematics majors with a course in calculus-based statistics, his proof is presented here with only minor modifications.
Convergence of Random Variables
Convergence of random variables (sometimes called stochastic convergence) describes the idea that a sequence of random quantities settles on a particular value. A sequence of numbers (which could represent cars or anything else) can converge (mathematically, this time) on a single, specific number. Certain processes, distributions, and events can result in convergence— which basically means the values will get closer and closer together. When random variables converge on a single number, they may not settle exactly on that number, but they come very, very close. In notation, xn → x tells us that a sequence of random variables (xn) converges to the value x.
In notation, that’s:
|xn − x| → 0 as n → ∞.
What happens to these variables as they converge can’t be crunched into a single definition.
Convergence of Random Variables can be broken down into many types. The ones you’ll most often come across:
- Convergence in probability,
- Convergence in distribution,
- Almost sure convergence,
- Convergence in mean.
Each of these definitions is quite different from the others. However, for an infinite series of independent random variables: convergence in probability, convergence in distribution, and almost sure convergence are equivalent.
Convergence in probability
If you toss a coin n times, you would expect heads around 50% of the time. However, let’s say you toss the coin 10 times. You might get 7 heads and 3 tails (70% heads), 2 heads and 8 tails (20% heads), or a wide variety of other possible proportions. Eventually though, if you toss the coin enough times (say, 1,000), you’ll probably end up with about 50% heads. In other words, the percentage of heads will converge to the expected probability.
More formally, convergence in probability can be stated as the following formula:

$$\lim_{n \to \infty} P\left(|X_n - c| \ge \varepsilon\right) = 0$$

where P = probability, Xn = the sequence of random variables (e.g. the proportion of heads observed in n tosses of the coin), lim (n→∞) = the limit as the number of trials n approaches infinity, c = the constant that the sequence of random variables converges to in probability, and ε = a positive number representing the distance between the expected value and the observed value.
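A small simulation sketch (my own, not from the source) shows this probability shrinking as n grows for the coin example, with c = 0.5 and ε = 0.01:

import numpy as np

rng = np.random.default_rng(6)
c, eps, trials = 0.5, 0.01, 10_000

for n in [100, 1_000, 10_000, 100_000]:
    # Proportion of heads in n tosses, repeated over many experiments.
    proportions = rng.binomial(n, c, size=trials) / n
    print(n, np.mean(np.abs(proportions - c) >= eps))
# The printed probability P(|X_n - 0.5| >= 0.01) decreases toward 0.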
Convergence in distribution
Convergence in distribution (sometimes called convergence in law) is based on the distribution of random variables, rather than the individual variables themselves.
In more formal terms, a sequence of random variables converges in distribution if the CDFs for that sequence converge to a single CDF. Let’s say you had a series of random variables, Xn. Each of these variables X1, X2, …, Xn has a CDF FXn(x), which gives us a series of CDFs {FXn(x)}. Convergence in distribution implies that the CDFs converge to a single CDF, FX(x).
Several methods are available for proving convergence in distribution. For example, Slutsky’s Theorem and the Delta Method can both help to establish convergence.
Almost sure convergence
Almost sure convergence (also called convergence with probability one) answers the question: given a random variable X, do the outcomes of the sequence Xn converge to the outcomes of X with a probability of 1? (Mittelhammer, 2013).
As an example of this type of convergence of random variables, let’s say an entomologist is studying feeding habits for wild house mice and records the amount of food consumed per day. The amount of food consumed will vary wildly, but we can be almost sure (quite certain) that amount will eventually become zero when the animal dies. It will almost certainly stay zero after that point. We’re “almost certain” because the animal could be revived, or appear dead for a while, or a scientist could discover the secret for eternal mouse life. In life — as in probability and statistics — nothing is certain.
Almost sure convergence is defined in terms of a scalar sequence or matrix sequence:
Scalar: Xn has almost sure convergence to X iff: P(limn→∞ Xn = X) = 1.
Matrix: Xn has almost sure convergence to X iff: P(limn→∞ Xn[i, j] = X[i, j]) = 1, for all i and j.
Convergence in mean
Given a real number r ≥ 1, we say that the sequence Xn converges in the r-th mean (or in the Lr-norm) towards the random variable X if the r-th absolute moments E(|Xn|^r) and E(|X|^r) of Xn and X exist, and

$$\lim_{n \to \infty} E\left(|X_n - X|^r\right) = 0$$

where the operator E denotes the expected value. Convergence in r-th mean tells us that the expectation of the r-th power of the difference between Xn and X converges to zero.
Law of Large Numbers
In probability theory, the law of large numbers (LLN) is a theorem that describes the result of performing the same experiment a large number of times. According to the law, the average of the results obtained from a large number of trials should be close to the expected value and will tend to become closer to the expected value as more trials are performed.
The LLN is important because it guarantees stable long-term results for the averages of some random events.
There are two different versions of the law of large numbers that are described below. They are called the strong law of large numbers and the weak law of large numbers. Stated for the case where X1, X2, … is an infinite sequence of independent and identically distributed (i.i.d.) Lebesgue integrable random variables with expected value E(X1) = E(X2) = … = µ, both versions of the law state that – with virtual certainty – the sample average

$$\bar{X}_n = \frac{1}{n}(X_1 + X_2 + \cdots + X_n)$$

converges to the expected value:

$$\bar{X}_n \to \mu \quad \text{as } n \to \infty.$$
(Lebesgue integrability of Xj means that the expected value E(Xj) exists according to Lebesgue integration and is finite. It does not mean that the associated probability measure is absolutely continuous with respect to the Lebesgue measure.)
Weak Law
The weak law of large numbers (also called Khinchin’s law) states that the sample average converges in probability towards the expected value.
That is, for any positive number ε,

$$\lim_{n \to \infty} P\left(\left|\bar{X}_n - \mu\right| > \varepsilon\right) = 0.$$

Interpreting this result, the weak law states that for any nonzero margin specified (ε), no matter how small, with a sufficiently large sample there will be a very high probability that the average of the observations will be close to the expected value; that is, within the margin.
The weak law applies in the case of i.i.d. random variables, but it also applies in some other cases.
Strong Law
The strong law of large numbers (also called Kolmogorov’s law) states that the sample average converges almost surely to the expected value.
That is,

$$P\left(\lim_{n \to \infty} \bar{X}_n = \mu\right) = 1.$$
What this means is that the probability that, as the number of trials n goes to infinity, the average of the observations converges to the expected value, is equal to one.
The proof is more complex than that of the weak law. This law justifies the intuitive interpretation of the expected value (for Lebesgue integration only) of a random variable when sampled repeatedly as the “long-term average”.
Almost sure convergence is also called strong convergence of random variables. This version is called the strong law because random variables which converge strongly (almost surely) are guaranteed to converge weakly (in probability). However, the weak law is known to hold in certain conditions where the strong law does not hold, and then the convergence is only weak (in probability).
Binomial convergence to normal and Poisson distributions
Binomial to Poisson
Let’s first talk about the relation between the binomial and Poisson distributions.
At first glance, the binomial distribution and the Poisson distribution seem unrelated. But a closer look reveals a pretty interesting relationship. It turns out the Poisson distribution is just a special case of the binomial — where the number of trials is large, and the probability of success in any given one is small.
The binomial distribution works when we have a fixed number of events n, each with a constant probability of success p.
Imagine we don’t know the number of trials that will happen. Instead, we only know the average number of successes per time period. So we know the rate of successes per day, but not the number of trials n or the probability of success p that led to that rate. We are going to prove what we just stated.
$$\lambda = np$$

Let this be the rate of successes per day. It’s equal to np, the number of trials n – however many there are – times the chance of success p for each of those trials. Think of it like this: if the chance of success is p and we run n trials per day, we’ll observe np successes per day on average. That’s our observed success rate lambda. Recall that the binomial distribution looks like this:

$$P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$$
We defined:

$$\lambda = np$$

Solving for p, we get:

$$p = \frac{\lambda}{n}$$

What we’re going to do here is substitute this expression for p into the binomial distribution above, take the limit as n goes to infinity, and try to come up with something useful. That is,

$$\lim_{n \to \infty} P(X = k) = \lim_{n \to \infty} \binom{n}{k} \left(\frac{\lambda}{n}\right)^{k} \left(1 - \frac{\lambda}{n}\right)^{n-k}$$
Pulling out the constants

$$\frac{1}{k!} \quad \text{and} \quad \lambda^k$$

and splitting the term on the right that’s to the power of (n − k) into a term to the power of n and one to the power of −k, we get

$$\frac{\lambda^k}{k!} \lim_{n \to \infty} \frac{n!}{(n-k)!\; n^k} \left(1 - \frac{\lambda}{n}\right)^{n} \left(1 - \frac{\lambda}{n}\right)^{-k}$$
Now let’s take the limit of this right-hand side one term at a time. We’ll do this in three steps. The first step is to find the limit of

$$\lim_{n \to \infty} \frac{n!}{(n-k)!\; n^k}$$

In the numerator, we can expand n! into n terms of (n)(n−1)(n−2)…(1). And in the denominator, we can expand (n−k)! into n−k terms of (n−k)(n−k−1)(n−k−2)…(1). That is,

$$\lim_{n \to \infty} \frac{(n)(n-1)(n-2)\cdots(1)}{(n-k)(n-k-1)(n-k-2)\cdots(1)\; n^k}$$

Written this way, it’s clear that many of the terms on the top and bottom cancel out. The (n−k)(n−k−1)…(1) terms cancel from both the numerator and denominator, leaving the following:

$$\lim_{n \to \infty} \frac{(n)(n-1)(n-2)\cdots(n-k+1)}{n^k}$$

Since we cancelled out n−k terms, the numerator here is left with k terms, from n down to n−k+1. So this has k terms in the numerator, and k terms in the denominator since n is raised to the power of k. Expanding out the numerator and denominator, we can rewrite this as:

$$\lim_{n \to \infty} \frac{n}{n} \cdot \frac{n-1}{n} \cdot \frac{n-2}{n} \cdots \frac{n-k+1}{n}$$

This has k terms. Clearly, every one of these k terms approaches 1 as n approaches infinity. So we know this portion of the problem just simplifies to 1. So we’re done with the first step.
The second step is to find the limit of the term in the middle of our equation, which is

$$\lim_{n \to \infty} \left(1 - \frac{\lambda}{n}\right)^{n}$$

Recall that the definition of e = 2.718… is given by the following:

$$e = \lim_{x \to \infty} \left(1 + \frac{1}{x}\right)^{x}$$

Our goal here is to find a way to manipulate our expression to look more like the definition of e, which we know the limit of. Let’s define a number x as

$$x = -\frac{n}{\lambda}$$

Now let’s substitute this into our expression and take the limit as follows:

$$\lim_{n \to \infty} \left(1 - \frac{\lambda}{n}\right)^{n} = \lim_{x \to \infty} \left(1 + \frac{1}{x}\right)^{-\lambda x} = \left[\lim_{x \to \infty} \left(1 + \frac{1}{x}\right)^{x}\right]^{-\lambda}$$

(The limit of (1 + 1/x)^x is e whether x tends to +∞ or −∞, so the sign of our substitution does not matter.) This term just simplifies to e^(−λ). So we’re done with our second step. That leaves only one more term for us to find the limit of. Our third and final step is to find the limit of the last term on the right, which is
$$\lim_{n \to \infty} \left(1 - \frac{\lambda}{n}\right)^{-k}$$

This is pretty simple. As n approaches infinity, this term becomes

$$(1 - 0)^{-k}$$

which is equal to one. And that takes care of our last term. Putting these three results together, we can rewrite our original limit as

$$\lim_{n \to \infty} P(X = k) = 1 \cdot e^{-\lambda} \cdot \frac{\lambda^k}{k!} \cdot 1$$

This just simplifies to the following:

$$P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}$$

This is the familiar probability mass function for the Poisson distribution, which gives us the probability of k successes per period given our parameter lambda.
So we’ve shown that the Poisson distribution is just a special case of the binomial, in which the number of n trials grows to infinity and the chance of success in any particular trial approaches zero. And that completes the proof.
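A short numerical sketch (mine, for illustration) confirms the convergence: for fixed λ, the Binomial(n, λ/n) probabilities approach the Poisson(λ) probability as n grows:

from math import comb, exp, factorial

lam, k = 4.0, 2

def binom_pmf(n):
    # P(X = k) for X ~ Binomial(n, lam/n).
    p = lam / n
    return comb(n, k) * p**k * (1 - p)**(n - k)

poisson = exp(-lam) * lam**k / factorial(k)
for n in [10, 100, 1_000, 10_000]:
    print(n, binom_pmf(n))
print("Poisson limit:", poisson)  # the binomial values approach this number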
Binomial to normal
If n is large enough, then the skew of the distribution is not too great. In this case, a reasonable approximation to B(n, p) is given by the normal distribution

$$\mathcal{N}\left(np,\; np(1-p)\right),$$

and this basic approximation can be improved in a simple way by using a suitable continuity correction. The basic approximation generally improves as n increases (at least 20) and is better when p is not near to 0 or 1. Various rules of thumb may be used to decide whether n is large enough, and p is far enough from the extremes of zero or one:
- One rule is that for n > 5 the normal approximation is adequate if the absolute value of the skewness is strictly less than 1/3; that is, if

$$\frac{|1 - 2p|}{\sqrt{np(1-p)}} < \frac{1}{3}.$$

This can be made precise using the Berry–Esseen theorem.
- A stronger rule states that the normal approximation is appropriate only if everything within 3 standard deviations of its mean is within the range of possible values; that is, only if

$$\mu \pm 3\sigma = np \pm 3\sqrt{np(1-p)} \in (0, n).$$

This 3-standard-deviation rule is equivalent to the following conditions, which also imply the first rule above:

$$n > 9\left(\frac{1-p}{p}\right) \quad \text{and} \quad n > 9\left(\frac{p}{1-p}\right).$$
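Both rules of thumb are easy to evaluate in code; the sketch below (my own) checks them for a few example (n, p) pairs:

from math import sqrt

def skewness_rule(n, p):
    # |1 - 2p| / sqrt(np(1-p)) < 1/3 (adequate for n > 5).
    return abs(1 - 2 * p) / sqrt(n * p * (1 - p)) < 1 / 3

def three_sigma_rule(n, p):
    # np +/- 3 sqrt(np(1-p)) must stay inside (0, n).
    mu, sigma = n * p, sqrt(n * p * (1 - p))
    return 0 < mu - 3 * sigma and mu + 3 * sigma < n

for n, p in [(20, 0.5), (50, 0.3), (50, 0.05), (500, 0.05)]:
    print(n, p, skewness_rule(n, p), three_sigma_rule(n, p))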
Central Limit Theorem (CLT)
In probability theory, the central limit theorem (CLT) establishes that, in many situations, when independent random variables are summed up, their properly normalized sum tends toward a normal distribution (informally a bell curve) even if the original variables themselves are not normally distributed. The theorem is a key concept in probability theory because it implies that probabilistic and statistical methods that work for normal distributions can be applicable to many problems involving other types of distributions. This theorem has seen many changes during the formal development of probability theory.
Let $X_1, \ldots, X_n$ be a random sample of size n — that is, a sequence of independent and identically distributed (i.i.d.) random variables drawn from a distribution of expected value given by μ and finite variance given by σ². Suppose we are interested in the sample average

$$\bar{X}_n = \frac{X_1 + \cdots + X_n}{n}$$

of these random variables. By the law of large numbers, the sample averages converge almost surely (and therefore also converge in probability) to the expected value μ as $n \to \infty$. The classical central limit theorem describes the size and the distributional form of the stochastic fluctuations around the deterministic number μ during this convergence. More precisely, it states that as n gets larger, the distribution of the difference between the sample average $\bar{X}_n$ and its limit μ, when multiplied by the factor $\sqrt{n}$ (that is, $\sqrt{n}(\bar{X}_n - \mu)$), approximates the normal distribution with mean 0 and variance σ². For large enough n, the distribution of $\bar{X}_n$ is close to the normal distribution with mean μ and variance σ²/n. The usefulness of the theorem is that the distribution of $\sqrt{n}(\bar{X}_n - \mu)$ approaches normality regardless of the shape of the distribution of the individual $X_i$.
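To see the theorem in action, here is a brief simulation sketch (my addition, assuming NumPy) using exponential random variables, which are far from normally distributed:

import numpy as np

rng = np.random.default_rng(7)
mu = 1.0  # Exponential(1) has mean 1 and variance sigma^2 = 1

for n in [5, 50, 500]:
    # sqrt(n) * (sample mean - mu) over many independent samples of size n
    means = rng.exponential(mu, size=(20_000, n)).mean(axis=1)
    z = np.sqrt(n) * (means - mu)
    print(n, z.mean(), z.var())  # mean -> 0, variance -> sigma^2 = 1
# The skewed exponential shape washes out: z looks ever more Gaussian.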
References
Central limit theorem – Wikipedia
Binomial distribution – Wikipedia
Law of large numbers – Wikipedia
Convergence of random variables – Wikipedia
Convergence of Random Variables: Simple Definition – Calculus How To
Our goal is to find the median of all the elements read so far starting from the first integer till the last integer. This is also called the Median of Running Integers. The data stream can be any source of data, for example, a file, an array of integers, an input stream, etc.
But what is the median? The median can be defined as the element in the data set which separates the higher half of the data sample from the lower half. In other words: when the input size is odd, the median is the middle element of the sorted data; when the input size is even, it is the average of the middle two elements of the sorted stream.
How can we find the median using algorithms? If we keep each number in a sorted sequence, then the cost of a single insertion is O(n) and finding the median is O(n). A slight modification is to keep a pointer to the middle element and adjust it based on whether each insertion lands on its left or right side. In that case, finding the median after an insertion is O(1). But the overall cost per element remains O(n), as an insertion into the sorted sequence is necessary after each number is entered.
So which approach could we use to find the median more efficiently? There are several methods to find the median:
- Method 1: Insertion Sort
If we can sort the data as it appears, we can easily locate the median element. Insertion Sort is one such online algorithm that sorts the data seen so far. At any instance of sorting, say after sorting the i-th element, the first i elements of the array are sorted. Insertion sort doesn’t depend on future data to sort the input seen up to that point. In other words, insertion sort considers only the data sorted so far while inserting the next element. This is the key property of insertion sort that makes it an online algorithm.
However, insertion sort takes O(n²) time to sort n elements. We can use binary search to find the location of the next element in O(log n) time, but we can’t do the data movement in O(log n) time. No matter how efficient the implementation is, insertion sort takes quadratic time overall.
- Method 2: Augmented self-balancing binary search tree (AVL, RB, etc.)
At every node of the BST, maintain the number of elements in the subtree rooted at that node. We can use a node as the root of a simple binary tree, whose left child is a self-balancing BST with elements less than the root and whose right child is a self-balancing BST with elements greater than the root. The root element always holds the effective median.
If the left and right subtrees contain the same number of elements, the root node holds the average of the left and right subtree root data. Otherwise, the root contains the same data as the root of the subtree which has more elements. After processing an incoming element, the left and right subtrees (BSTs) differ in size by at most 1.
Self-balancing BSTs are costly in managing the balancing factor. Moreover, they provide fully sorted data, which we don’t need; we need only the median. The next method makes use of heaps to track the median.
- Method 3: Heaps
Similar to balancing BST in Method 2 above, we can use a max heap on the left side to represent elements that are less than the effective median, and a min-heap on the right side to represent elements that are greater than the effective median.
After processing an incoming element, the sizes of the heaps differ by at most 1. When both heaps contain the same number of elements, we take the average of the heaps’ root data as the effective median. When the heaps are not balanced, we select the effective median from the root of the heap containing more elements.
Useful code:
// C# program to find the median in a stream of running integers
using System;
using System.Collections.Generic;

public class MedianMaintain
{
    // Method to print the running median of a stream
    public static void printMedian(int[] a)
    {
        double med = a[0];

        // List emulating a max-heap: stores the smaller half of the elements,
        // kept sorted in descending order so that index 0 is the maximum.
        List<int> smaller = new List<int>();

        // List emulating a min-heap: stores the greater half of the elements,
        // kept sorted in ascending order so that index 0 is the minimum.
        List<int> greater = new List<int>();

        smaller.Add(a[0]);
        Console.WriteLine(med);

        // Read the elements of the stream one by one.
        /* At any time we keep the two halves balanced, with sizes differing
           by at most 1. If they are balanced, the median is the average of
           smaller[0] and greater[0]; if they are unbalanced, the median is
           the top element of the larger half. */
        for (int i = 1; i < a.Length; i++)
        {
            int x = a[i];

            // Case 1: the left half has more elements
            if (smaller.Count > greater.Count)
            {
                if (x < med)
                {
                    // Move the largest of the smaller half across, then add x
                    smaller.Sort();
                    smaller.Reverse();
                    greater.Add(smaller[0]);
                    smaller.RemoveAt(0);
                    smaller.Add(x);
                }
                else
                    greater.Add(x);

                smaller.Sort();
                smaller.Reverse();
                greater.Sort();
                med = (double)(smaller[0] + greater[0]) / 2;
            }
            // Case 2: both halves are balanced
            else if (smaller.Count == greater.Count)
            {
                if (x < med)
                {
                    smaller.Add(x);
                    smaller.Sort();
                    smaller.Reverse();
                    med = (double)smaller[0];
                }
                else
                {
                    greater.Add(x);
                    greater.Sort();
                    med = (double)greater[0];
                }
            }
            // Case 3: the right half has more elements
            else
            {
                if (x > med)
                {
                    // Move the smallest of the greater half across, then add x
                    greater.Sort();
                    smaller.Add(greater[0]);
                    greater.RemoveAt(0);
                    greater.Add(x);
                }
                else
                    smaller.Add(x);

                smaller.Sort();
                smaller.Reverse();
                greater.Sort();
                med = (double)(smaller[0] + greater[0]) / 2;
            }

            Console.WriteLine(med);
        }
    }

    // Driver code
    public static void Main(String[] args)
    {
        // Stream of integers
        int[] arr = new int[] { 5, 15, 10, 20, 3 };
        printMedian(arr);
    }
}
Output:
5
10
10
12.5
10
Complexity Analysis:
- Time Complexity: O(n log n).
The time to insert an element into a heap is O(log n), so inserting n elements costs O(n log n). (Note that the sample code above emulates the heaps with repeatedly sorted lists for simplicity; a true priority-queue implementation is needed to actually achieve this bound.) - Auxiliary Space: O(n).
The space required to store the elements in the heaps is O(n).
Demo:
Median of Stream of Running Integers using STL | GeeksforGeeks – YouTube
References
Median of Stream of Running Integers using STL – GeeksforGeeks
Data Structures and Algorithm: Running Median (dsalgo.com)
Median in a stream of integers (running integers) – GeeksforGeeks
This example shows how to generate random numbers using the uniform distribution inversion method. This is useful for distributions when it is possible to compute the inverse cumulative distribution function, but there is no support for sampling from the distribution directly.
Step 1. Generate random numbers from the standard uniform distribution.
Use rand to generate 1000 random numbers from the uniform distribution on the interval (0,1).
rng('default') % For reproducibility
u = rand(1000,1);
The inversion method relies on the principle that continuous cumulative distribution functions (CDFs) range uniformly over the open interval (0,1). If u is a uniform random number on (0,1), then x = F⁻¹(u) generates a random number x from any continuous distribution with the specified CDF F.
Step 2. Generate random numbers from the Weibull distribution.
Use the inverse cumulative distribution function to generate the random numbers from a Weibull distribution with parameters A = 1 and B = 1 that correspond to the probabilities in u. Plot the results.
x = wblinv(u,1,1); histogram(x,20);

The histogram shows that the random numbers generated using the Weibull inverse CDF function wblinv have a Weibull distribution.
Step 3. Generate random numbers from the standard normal distribution.
The same values in u can generate random numbers from any distribution, for example, the standard normal, by following the same procedure using the inverse CDF of the desired distribution.
x_norm = norminv(u,0,1); histogram(x_norm,20)

The histogram shows that, by using the standard normal inverse CDF norminv, the random numbers generated from u now have a standard normal distribution.
References
Generate Random Numbers Using Uniform Distribution Inversion – MATLAB & Simulink – MathWorks Italia