In probability theory and statistics, Bayes’ theorem, named after Thomas Bayes, describes the probability of an event, based on prior knowledge of conditions that might be related to the event. For example, if the risk of developing health problems is known to increase with age, Bayes’ theorem allows the risk to an individual of a known age to be assessed more accurately (by conditioning it on their age) than simply assuming that the individual is typical of the population as a whole.

Statement of theorem

Bayes’ theorem is stated mathematically as the following equation:

    \[ P(A|B)=\frac{P(B|A)P(A)}{P(B)} \]

where A and B are events and P(B) \neq 0:

  • P(A\mid B) is a conditional probability: the probability of event A occurring given that B is true. It is also called the posterior probability of A given B.
  • P(B\mid A) is also a conditional probability: the probability of event B occurring given that A is true. It can also be interpreted as the likelihood of A given a fixed B because P(B\mid A)=L(A\mid B).
  • P(A) and P(B) are the probabilities of observing A and B respectively without any given conditions; they are known as the marginal probabilities or prior probabilities.
  • A and B must be different events.
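
As a concrete illustration, here is a minimal C# sketch (the prior, likelihood and false-positive rate are made-up numbers) that applies the formula above, using the law of total probability to compute P(B):

using System;

class BayesExample
{
    static void Main()
    {
        // Hypothetical numbers: prior P(A), likelihood P(B|A)
        // and false-positive rate P(B|not A).
        double pA = 0.01;
        double pBGivenA = 0.95;
        double pBGivenNotA = 0.05;

        // Law of total probability: P(B) = P(B|A)P(A) + P(B|not A)P(not A)
        double pB = pBGivenA * pA + pBGivenNotA * (1 - pA);

        // Bayes' theorem: P(A|B) = P(B|A)P(A) / P(B)
        double pAGivenB = pBGivenA * pA / pB;

        Console.WriteLine($"P(A|B) = {pAGivenB:F4}"); // about 0.1610
    }
}

Even with a fairly accurate test, the posterior stays modest here because the prior P(A) is small.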

There are several paradigms within statistical inference such as:

  1. early Bayesian inference;
  2. Fisherian inference;
  3. Neyman-Pearson inference;
  4. Neo-Bayesian inference;
  5. Likelihood inference.

Early Bayesian inference

Bayesian inference refers back to the English clergyman Thomas Bayes (1702-1761), who was the first to attempt to use probability calculus as a tool for inductive reasoning, and who gave his name to what later became known as Bayes' law. However, the one who essentially laid the basis for Bayesian inference was the French mathematician Pierre Simon Laplace (1749-1827). His approach amounts to placing a non-informative prior distribution on the unknown quantity of interest and, given observed data, updating the prior by Bayes' law, giving a posterior distribution whose expectation (or mode) can be taken as an estimate of the unknown quantity. This way of reasoning was frequently called inverse probability and was picked up by Gauss (1777-1855).

Fisherian inference

The major paradigm change came with Ronald A. Fisher (1890-1962), probably the most influential statistician of all time, who laid the basis for a quite different type of objective reasoning. Around 1922 he had concluded that inverse probability was not a suitable scientific paradigm. In his words a few years later, “the theory of inverse probability is founded upon an error, and must be wholly rejected”.

Instead, he advocated the use of the method of maximum likelihood and demonstrated its usefulness as well as its theoretical properties. One of his supportive arguments was that inverse probability gave different results depending on the choice of parameterization, while the method of maximum likelihood did not, e.g. when estimating unknown odds instead of the corresponding probability. Fisher clarified the notion of a parameter, and the difference between the true parameter and its estimate, which had often been confusing for earlier writers. He also separated the concepts of probability and likelihood and introduced several new and important concepts, most notably information, sufficiency, consistency, and efficiency.

Neyman-Pearson inference

In the late 1920s, the duo Jerzy Neyman (1890-1981) and Egon Pearson (1895-1980) arrived on the scene. They are largely responsible for the concepts related to confidence intervals and hypothesis testing. Their ideas had a clear frequentist interpretation, with significance level and confidence level as risk and coverage probabilities attached to the method used in repeated application. While Fisher tested a hypothesis with no alternative in mind, Neyman and Pearson pursued the idea of test performance against specific alternatives and the uniform optimality of tests. Moreover, Neyman imagined a wider role for tests as a basis for decisions: “Tests are not rules for inductive inference, but rules of behavior”. Fisher strongly opposed these ideas as the basis for scientific inference, and the fight between Fisher and Neyman went on for decades.

Neo-Bayesian inference

In the period 1920-1960, the Bayes position was dormant, except for a few writers. Important for the reawakening of Bayesian ideas was the new axiomatic foundation for probability and utility established by von Neumann and Morgenstern (1944), which was based on axioms of coherent behavior. When Savage and his followers looked at the implications of coherency for statistics, they demonstrated that many common statistical procedures, although reasonable within the given paradigm, were not coherent. On the other hand, they demonstrated that a Bayesian paradigm could be based on coherency axioms, which imply the existence of (subjective) probabilities following the common rules of probability calculus. Also of interest here are the contributions of Howard Raiffa (1924-) and Robert Schlaifer (1914-1994), who looked at statistics in the context of business decisions.

Likelihood inference

Of the five paradigms, this one has the shortest history; it was formulated by the statisticians George Barnard (1915-2002), Allan Birnbaum (1923-1976), and Anthony Edwards (1935-). They also found the classical paradigms unsatisfactory, but for different reasons than the Bayesians. Their concern was the role of statistical evidence in scientific inference, and they found that the decision-oriented classical paradigm sometimes worked against a reasonable scientific process. The Likelihoodists saw it as a necessity to have a paradigm that provides a measure of evidence, regardless of prior probabilities and regardless of available actions and their consequences. Moreover, this measure should reflect the collection of evidence as a cumulative process. They found the classical paradigms (Neyman-Pearson or Fisher), with their significance levels and P-values, unable to provide this, and they argued that the evidence of observations for a given hypothesis has to be judged against an alternative (and thus Fisher's claim that P-values fairly represent a measure of evidence is invalid). They suggested instead basing scientific reporting on the likelihood function itself and using likelihood ratios as measures of evidence. This gives a kind of plausibility guarantee, but no direct probability guarantees as with the other paradigms, except that probability considerations can be performed when planning the investigation and its size.

References

Bayes’ theorem – Wikipedia

statistical_inference.pdf (nhh.no)

Following Kolmogorov, probability is founded axiomatically, based on measure theory. A measure is a quantity defined on sets that is non-negative, is zero on the empty set, and is countably additive.

Given a set X and its power set P(X) (the set of all subsets of X), a sigma-algebra is defined as a subset of the power set of X with certain properties: it contains X and is closed under complementation and countable unions.

Probability is a particular case of a measure with a particular property: it takes values in the range [0, 1] and assigns measure 1 to the whole space. A probability space is represented as a triple (Ω, F, P), analogous to the measure-space triple (X, Σ, μ).

Ω is the set of all elementary outcomes, F is the sigma-algebra on Ω, and P is the probability measure.

Starting from empirical objects, one can derive an infinite number of theoretical models, indexed by the parameter Θ (the state of nature), and from these one can determine the most probable model, thanks to the role that probability plays in statistics.

Introduction

Probability and statistics are the branches of mathematics concerned with the laws governing random events, including the collection, analysis, interpretation, and display of numerical data. Probability has its origin in the study of gambling and insurance in the 17th century, and it is now an indispensable tool of both the social and natural sciences. Statistics may be said to have its origin in census counts taken thousands of years ago; as a distinct scientific discipline, however, it was developed in the early 19th century as the study of populations, economies, and moral actions, and later in that century as the mathematical tool for analyzing such numbers.

The assumptions needed to set up the axioms can be summarised as follows:

Let (Ω, F, P) be a measure space with P(E) the probability of some event E, and P(Ω) = 1.

Then (Ω, F, P) is a probability space, with sample space Ω, event space F and probability measure P.

First axiom

The probability of an event is a non-negative real number:

    \[ P(E)\in\mathbb{R},\quad P(E)\geq 0 \qquad \text{for all } E \in F \]

where F is the event space. It follows that P(E) is always finite, in contrast with more general measure theory. Theories that assign negative probability relax the first axiom.

Second axiom

This is the assumption of unit measure: the probability that at least one of the elementary events in the entire sample space will occur is 1:

    \[ P(\Omega) = 1 \]

Third axiom

This is the assumption of σ-additivity:

Any countable sequence of disjoint sets (synonymous with mutually exclusive events) E_1, E_2, \ldots satisfies

    \[ P\left(\bigcup_{i=1}^{\infty} E_i\right)=\sum_{i=1}^{\infty} P(E_i) \]

Many important laws are derived from Kolmogorov’s three axioms. For example, the Law of Large Numbers can be deduced from the axioms by logical reasoning (Tijms, 2004).
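
As a minimal sketch of this frequentist reading (a hypothetical fair-coin simulation in C#; the seed and the trial counts are arbitrary), the relative frequency of an event approaches its probability as the number of trials grows:

using System;

class LawOfLargeNumbers
{
    static void Main()
    {
        var rng = new Random(42);

        foreach (int n in new[] { 100, 10_000, 1_000_000 })
        {
            int heads = 0;
            for (int i = 0; i < n; i++)
                if (rng.NextDouble() < 0.5) heads++;   // simulate one fair-coin toss

            // The relative frequency should get closer and closer to 0.5.
            Console.WriteLine($"n = {n,9}: relative frequency = {(double)heads / n:F4}");
        }
    }
}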

Mathematical Statistics

Mathematical statistics is the application of probability theory, a branch of mathematics, to statistics, as opposed to techniques for collecting statistical data.

Data analysis is divided into:

  • descriptive statistics – the part of statistics that describes data, i.e. summarises the data and their typical properties.
  • inferential statistics – the part of statistics that draws conclusions from data (using some model for the data): for example, inferential statistics involves selecting a model for the data, checking whether the data fulfil the conditions of that model, and quantifying the involved uncertainty (e.g. using confidence intervals, as in the sketch after this list).
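
For instance, here is a minimal C# sketch of the last point (the sample values are made up, and the 1.96 factor uses the normal approximation; for such a small sample a t quantile would be more appropriate):

using System;
using System.Linq;

class ConfidenceInterval
{
    static void Main()
    {
        // Hypothetical sample drawn from an approximately normal population.
        double[] sample = { 4.8, 5.1, 5.3, 4.9, 5.0, 5.2, 4.7, 5.4 };

        int n = sample.Length;
        double mean = sample.Average();
        double variance = sample.Sum(x => (x - mean) * (x - mean)) / (n - 1);
        double stdError = Math.Sqrt(variance) / Math.Sqrt(n);

        // Approximate 95% confidence interval for the mean: mean +/- 1.96 * standard error.
        double lower = mean - 1.96 * stdError;
        double upper = mean + 1.96 * stdError;

        Console.WriteLine($"95% CI for the mean: [{lower:F3}, {upper:F3}]");
    }
}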

The Concrete and the Real

When abstract probability theory makes a distinction between the concrete sample ω (also known as a random outcome or trial) and the event A that is realised if ω ∈ A, it does something entirely new: this is essentially a distinction between the concrete and the real. When probability no longer pertains to the random outcome as such, but only to the event, then the probability is separated from randomness. The great foundational gesture of abstract probability theory was to shatter our image of randomness. There is no random generator any longer, and it is no longer a matter of expecting random outcomes. Once it is understood that the random outcome matters only in so far as it is the set-theoretic element of an event, then set theory becomes the foundation of probability theory, and everything relating to expectation and the concrete field of randomness is reduced to the sole measurement of sets. And when we examine the strong law of large numbers, which is what lends tense to the notion of probability and gives us the impression of expecting something to happen with some probability, we realise that measure theory has only been extended to sets of non-denumerable cardinality, and that we now only measure the set of typical (infinite) random sequences, which is of measure 1 and in which no sequence is distinguished in particular.

Our intuitive image of randomness and the random trial is that of drawing balls from an urn; it is that of the materialisation of the random trial, of the manifestation of the concrete; but in the formalism of probability, everything points in the opposite direction, that of the measure of sets alone, that of the infinite and non-constructive limit where, precisely, individual trials are indistinguishable and lose their identity.

In the real world, statistics is used in many fields; for example:

Statistics in the Health Industry

Statistics plays its part in the health industry. It helps doctors record and manage their patients’ data. Apart from that, the WHO also uses statistics to generate its annual report on the health of populations around the world. Thanks to statistics, medical science has developed many vaccines and antidotes to fight major diseases.

Education

Statistics is also beneficial in education: teachers can act as researchers in their own classrooms, recognizing which teaching techniques work for which pupils and understanding why. They also need to analyse test data to determine, statistically, whether students are performing as expected.

Government

Governments use statistics to make judgments about health, population, education, and much more. Statistics may help a government work out which education programme is beneficial for students, or what the progress of high-school students following a particular curriculum looks like. A government can also assemble specific data about the country’s population using a census.

Statistics in Economics

Whoever studies economics also learns statistics: statistics and economics are interrelated, and it is impossible to separate them. The development of advanced statistics has opened new ways to make extensive use of statistics in economics.

Almost every branch of economics uses statistics: consumption, production, distribution, public finance. All these branches use statistics for comparison, presentation, interpretation, and so on.

Income and spending patterns of various sections of the population, national wealth production, the adjustment of demand and supply, and the effects of economic policies all indicate the importance of statistics in the field of economics and its various branches. Governments also use economic statistics to calculate GDP and per capita income.

References

Frequentist inference – Wikipedia

Parametric statistics – Wikipedia

Mathematical statistics – Wikipedia

Probability axioms – Wikipedia

Concept and Intuition in Abstract Probability Theory – Urbanomic

statanalytica.com

With the Window to Viewport Transformation, our goal is to transform real-world coordinates (i.e. pairs (x, y) giving the positions on the x-axis and the y-axis) into viewport coordinates (screen coordinates).
As a first step, we have to identify the real-world window that we want to project onto our viewport, because we may be interested in a specific portion of our real-world chart, and this also helps maintain its original shape. How can we do this? We can easily obtain it with a linear transform: in practice, it is as if we take the real-world window and stretch or shrink it to fit inside the viewport.
The coordinates of the viewport on the screen are completely different from those of the real-world window: in the first case we are measuring pixels, in the second we are measuring two variables that could have different units of measure. So we need to know how to transform each real-world point (x_w, y_w) into a viewport point (x_d, y_d); first of all, remember that the origin of device coordinates is not in the bottom-left corner but in the top-left, therefore the y-axis on the device is flipped with respect to the cartesian system.

What we want to find are 2 functions that given the real-world point give us its x and y to represent that point in the viewport:

    \[ \begin{cases} x= f_x(x_w)\\ y=f_y(y_w) \\ \end{cases} \]

To find those functions that can solve our problem, we need to keep track of the viewport size and also of the coordinates of its top-left corner that we indicate with (L, T); for what regards the real world, we will need: MinX, MinY, MaxX and MaxY.

What is written below refers to a figure (not reproduced here) showing the real-world window and the viewport, with coloured segments marking the corresponding horizontal and vertical distances.

The green segment should correspond to the blue one: for instance, if the green segment is 1/3 of Rx, we also want the blue one to be 1/3 of W in order to maintain the proportion. So (x - MinX)/Rx is the proportion factor which, multiplied by W, gives us the blue segment; however, to obtain the viewport x, we have to add L to our blue segment.
To conclude, the viewport x is given by the following:

    \[ x_d=L+\frac{x_w-\text{MinX}}{R_x}\cdot W \]

The same can be done for Y, paying attention to the fact that the y-axis is flipped in the viewport with respect to the cartesian system.
The proportion of the red segment is: (y – MinY)/Ry, which multiplied by H, gives us the yellow segment.
What we are searching for, though, isn’t the yellow segment, but the orange one.
So, how could we obtain the orange segment? It’s just H – yellowSegment, i.e.  H – H * (y – MinY)/Ry, and adding T we obtain the viewport y.
Here’s the formula to obtain the viewport y:

    \[ y_d=T+H-\frac{y_w-\text{MinY}}{R_y}\cdot H \]

Here is some useful code for converting a real-world point to the corresponding viewport point:

// Maps a real-world x coordinate to a horizontal pixel position inside the viewport
// (Rectangle comes from System.Drawing).
private int X_Viewport(double realX, double minX_Window, double rangeX, Rectangle viewport)
{
   // proportion of the horizontal range covered by realX, scaled to the viewport width
   return (int)(viewport.Left + viewport.Width * (realX - minX_Window) / rangeX);
}

// Maps a real-world y coordinate to a vertical pixel position, flipping the axis
// because the device origin is in the top-left corner.
private int Y_Viewport(double realY, double minY_Window, double rangeY, Rectangle viewport)
{
   // proportion of the vertical range, scaled to the viewport height and subtracted from it
   return (int)(viewport.Top + viewport.Height - viewport.Height * (realY - minY_Window) / rangeY);
}
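
For instance, a hypothetical call from inside the same class (the window bounds and the viewport rectangle below are made-up values) would look like this:

// Map the real-world point (2.5, 40.0) from the window [0, 10] x [0, 100]
// into a 400x300 viewport whose top-left corner is at (20, 20).
var viewport = new Rectangle(20, 20, 400, 300);
int xd = X_Viewport(2.5, 0.0, 10.0, viewport);   // 20 + 400 * (2.5 - 0) / 10 = 120
int yd = Y_Viewport(40.0, 0.0, 100.0, viewport); // 20 + 300 - 300 * (40 - 0) / 100 = 200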

A measure of central tendency is a single value that attempts to describe a set of data by identifying the central position within that set of data. As such, measures of central tendency are sometimes called measures of central location. They are also classed as summary statistics. The mean (often called the average) is most likely the measure of central tendency that you are most familiar with, but there are others, such as the median and the mode.
The mean, median, and mode are all valid measures of central tendency, but under different conditions, some measures of central tendency become more appropriate to use than others. Below, we will look at the mean, mode, and median, and understand how to calculate them and under what conditions they are most appropriate to be used.

Mean (or Average)

The (arithmetic) mean, or average, of n observations, written \bar{x} (x bar), is simply the sum of the observations divided by the number of observations; thus:

    \[\bar{x}=\frac{1}{n}\sum_{i=1}^{n} x_i\]

In this equation, x_i represents the individual sample values and \sum x_i their sum.

Median

The median is defined as the middle point of the ordered data. It is estimated by first ordering the data from smallest to largest and then counting upwards for half the observations. The estimate of the median is either the observation at the center of the ordering in the case of an odd number of observations or the simple average of the middle two observations if the total number of observations is even.

There is an obvious disadvantage: the median uses the position of data points rather than their values. In that way some valuable information is lost, and we have to rely on other kinds of measures, such as measures of dispersion, to get more information about the data.

Mode

The third measure of location is the mode. This is the value that occurs most frequently, or, if the data are grouped, the grouping with the highest frequency. It may be useful for categorical data to describe the most frequent category. The expression ‘bimodal’ distribution is used to describe a distribution with two peaks in it. This can be caused by mixing populations.
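
To make the three measures concrete, here is a minimal C# sketch (the sample values are made up) that computes the mean, the median and the mode of a small data set using LINQ:

using System;
using System.Linq;

class CentralTendency
{
    static void Main()
    {
        // Small illustrative sample (hypothetical values).
        double[] data = { 2, 3, 3, 5, 7, 8, 9 };

        // Mean: sum of the observations divided by their number.
        double mean = data.Average();

        // Median: middle value of the ordered data (average of the two
        // middle values when the count is even).
        double[] sorted = data.OrderBy(x => x).ToArray();
        int n = sorted.Length;
        double median = n % 2 == 1
            ? sorted[n / 2]
            : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;

        // Mode: the value that occurs most frequently.
        double mode = data.GroupBy(x => x)
                          .OrderByDescending(g => g.Count())
                          .First().Key;

        Console.WriteLine($"mean = {mean:F2}, median = {median}, mode = {mode}");
    }
}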

Distance

The measures of central tendency are not adequate to describe data on their own. Two data sets can have the same mean but be entirely different. Thus, to describe data, one also needs to know the extent of variability. This is given by the measures of dispersion. Dispersion is important because the smaller the dispersion, the more representative the central value is of the data. But to calculate the dispersion we first have to define the distance.

In mathematical terms, distance is a numerical measurement of how far apart objects or points are; a distance function, or metric, is a generalization of the concept of physical distance: it is a way of describing what it means for elements of some space to be “close to” or “far away from” each other.

In machine learning, a distance measure is an objective score that summarizes the relative difference between two objects in a problem domain. Most commonly, the two objects are rows of data that describe a subject (such as a person, car, or house), or an event (such as a purchase, a claim, or a diagnosis).

In Statistics, distance is a measure calculated between two records that are typically part of a larger dataset, where rows are records and columns are variables.

There are different kinds of distances:

1. Euclidean Distance:

It is a distance measure that best can be explained as the length of a segment connecting two points. The formula is rather straightforward as the distance is calculated from the cartesian coordinates of the points using the Pythagorean theorem.

    \[D(x, y)=\sqrt{\sum_{i=1}^n (x_i-y_i)^2}\]

Euclidean distance works great when you have low-dimensional data and the magnitude of the vectors is important to measure.

2. Hamming Distance:

Hamming distance is the number of values that are different between two vectors. It is typically used to compare two binary strings of equal length. It can also be used for strings, to compare how similar they are to each other, by counting the number of characters that differ:

    \[D(x, y)=\sum_{i=1}^n \mathbb{1}(x_i \neq y_i)\]

where \mathbb{1}(x_i \neq y_i) equals 1 when the i-th components differ and 0 otherwise.

3. Manhattan Distance:

The Manhattan distance calculates the distance between real-valued vectors. Imagine vectors that describe objects on a uniform grid such as a chessboard: Manhattan distance then refers to the distance between two vectors if they could only move at right angles, with no diagonal movement involved in calculating the distance:

    \[D(x, y)=\sum_{i=1}^n |x_i-y_i|\]


All these distances can be used in different fields such as machine learning, data science, and mathematics; but obviously they will give different results, because each of them is based on a different definition of distance.
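
As a minimal C# sketch of the three measures (the two vectors are made-up values; the Hamming count is normally applied to binary strings or vectors of equal length):

using System;
using System.Linq;

class DistanceMeasures
{
    static void Main()
    {
        // Two illustrative real-valued vectors.
        double[] x = { 1.0, 2.0, 3.0 };
        double[] y = { 4.0, 0.0, 3.0 };

        // Euclidean: square root of the sum of squared differences.
        double euclidean = Math.Sqrt(x.Zip(y, (a, b) => (a - b) * (a - b)).Sum());

        // Manhattan: sum of absolute differences (no diagonal movement).
        double manhattan = x.Zip(y, (a, b) => Math.Abs(a - b)).Sum();

        // Hamming: number of positions where the components differ.
        int hamming = x.Zip(y, (a, b) => a != b ? 1 : 0).Sum();

        Console.WriteLine($"Euclidean = {euclidean:F3}, Manhattan = {manhattan}, Hamming = {hamming}");
    }
}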

References

Distance – Wikipedia

Measures of Location and Dispersion and their appropriate uses | Health Knowledge

Distance – Statistics.com: Data Science, Analytics & Statistics Courses

4 Distance Measures for Machine Learning (machinelearningmastery.com)


In graphical user interfaces such as Microsoft Windows, drawing on the screen is an important task.

Everything displayed on the screen is based on simple drawing operations. In Visual Studio .NET, developers have easy access to that drawing functionality whenever they need it through a technology called GDI+. Using GDI+, developers can easily perform drawing operations such as generating graphs or building custom controls.
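
As a minimal sketch of the kind of drawing GDI+ makes possible (this assumes a Windows Forms project; the class name, the bar values and the layout constants are made up), a simple bar graph can be drawn by overriding a form's OnPaint method:

using System.Drawing;
using System.Windows.Forms;

class BarGraphForm : Form
{
    private readonly int[] values = { 30, 80, 45, 60 }; // hypothetical data

    protected override void OnPaint(PaintEventArgs e)
    {
        base.OnPaint(e);
        int barWidth = 40, spacing = 20, baseline = 200;

        for (int i = 0; i < values.Length; i++)
        {
            int x = spacing + i * (barWidth + spacing);
            // Bars grow upwards from the baseline; the y-axis is flipped on screen.
            e.Graphics.FillRectangle(Brushes.SteelBlue, x, baseline - values[i], barWidth, values[i]);
        }
    }

    static void Main() => Application.Run(new BarGraphForm());
}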

In statistics, data are represented through tables and/or graphs. The raw data of a statistical survey are not easily interpretable: to make the information and the meaning of the survey understandable, it is necessary to synthesize and represent the statistical data.

Good graphs convey information quickly and easily to the user. Graphs highlight the salient features of the data. They can show relationships that are not obvious from studying a list of numbers. They can also provide a convenient way to compare different sets of data.

Different situations call for different types of graphs, and it helps to have a good knowledge of what types are available. The type of data often determines what graph is appropriate to use.

Pareto Diagram or Bar Graph

A Pareto diagram or bar graph is a way to visually represent qualitative data. Data is displayed either horizontally or vertically and allows viewers to compare items, such as amounts, characteristics, times, and frequency. The bars are arranged in order of frequency, so more important categories are emphasized. By looking at all the bars, it is easy to tell at a glance which categories in a set of data dominate the others. Bar graphs can be either single, stacked, or grouped.

Pie Chart or Circle Graph

Another common way to represent data graphically is a pie chart. It gets its name from the way it looks: just like a circular pie that has been cut into several slices. This kind of graph is helpful when graphing qualitative data, where the information describes a trait or attribute and is not numerical. Each slice of the pie represents a different category or trait, with some slices usually noticeably larger than others. By looking at all of the pie pieces, you can compare how much of the data fits in each category, or slice.

Histogram

A histogram is another kind of graph that uses bars in its display. This type of graph is used with quantitative data. Ranges of values, called classes, are listed at the bottom, and the classes with greater frequencies have taller bars.

Scatterplots

A scatterplot displays data that is paired by using a horizontal axis (the x-axis), and a vertical axis (the y-axis). The statistical tools of correlation and regression are then used to show trends on the scatterplot. A scatterplot usually looks like a line or curve moving up or down from left to right along the graph with points “scattered” along the line.

Time-Series Graphs

A time-series graph displays data at different points in time, so it is another kind of graph to be used for certain kinds of paired data. As the name implies, this type of graph measures trends over time, but the timeframe can be minutes, hours, days, months, years, decades, or centuries. For example, you might use this type of graph to plot the population of the United States over the course of a century. The y-axis would list the growing population, while the x-axis would list the years, such as 1900, 1950, 2000.

References

7 Graphs Commonly Used in Statistics (thoughtco.com)

Marginal distribution

In probability theory and statistics, the marginal distribution of a subset of a collection of random variables is the probability distribution of the variables contained in the subset. It gives the probabilities of various values of the variables in the subset without reference to the values of the other variables. This contrasts with a conditional distribution, which gives the probabilities contingent upon the values of the other variables.
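
As a minimal C# sketch of the idea (the joint probabilities below are made-up values), the marginal of X is obtained by summing the joint distribution over all values of Y:

using System;

class MarginalDistribution
{
    static void Main()
    {
        // Hypothetical joint distribution P(X, Y) for X in {0, 1}, Y in {0, 1, 2}.
        double[,] joint =
        {
            { 0.10, 0.20, 0.10 },  // X = 0
            { 0.25, 0.15, 0.20 },  // X = 1
        };

        // Marginal of X: sum the joint probabilities over all values of Y.
        for (int x = 0; x < joint.GetLength(0); x++)
        {
            double pX = 0.0;
            for (int y = 0; y < joint.GetLength(1); y++)
                pX += joint[x, y];
            Console.WriteLine($"P(X = {x}) = {pX:F2}");
        }
    }
}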

Floating-point representation is an alternative technique based on scientific notation.

Though we’d like to use scientific notation, we’ll base our scientific notation on powers of 2, not powers of 10, because we’re working with computers that prefer binary.

Once we have a number in binary scientific notation, we still must have a technique for mapping that into a set of bits:

We use the first bit to represent the sign (1 for negative, 0 for positive), the next four bits for the sum of 7 and the actual exponent (we add 7 to allow for negative exponents), and the last three bits for the mantissa’s fractional part. Note that we omit the integer part of the mantissa: Since the mantissa must have exactly one nonzero bit to the left of its decimal point, and the only nonzero bit is 1, we know that the bit to the left of the decimal point must be a 1. There’s no point in wasting space in inserting this 1 into our bit pattern, so we include only the bits of the mantissa to the right of the decimal point.

We call this floating-point representation because the values of the mantissa bits “float” along with the decimal point, based on the exponent’s given value. This is in contrast to fixed-point representation, where the decimal point is always in the same place among the bits given.
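
Here is a minimal C# sketch of decoding a value stored in the 8-bit toy format described above (1 sign bit, 4 exponent bits with a bias of 7, 3 fraction bits with an implicit leading 1); it ignores special cases such as zero, which a real format would reserve bit patterns for:

using System;

class ToyFloat
{
    // Decode an 8-bit pattern: sign, biased exponent, fraction.
    static double Decode(byte bits)
    {
        int sign = (bits >> 7) & 1;
        int exponent = ((bits >> 3) & 0b1111) - 7; // remove the bias of 7
        int fraction = bits & 0b111;

        double mantissa = 1.0 + fraction / 8.0;    // implicit leading 1
        double value = mantissa * Math.Pow(2, exponent);
        return sign == 1 ? -value : value;
    }

    static void Main()
    {
        // 0 1001 101  ->  +1.101 (binary) * 2^(9 - 7)  =  1.625 * 4  =  6.5
        Console.WriteLine(Decode(0b0_1001_101));
    }
}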

Approximation

Squeezing infinitely many real numbers into a finite number of bits requires an approximate representation. Although there are infinitely many integers, in most programs the result of integer computations can be stored in 32 bits. In contrast, given any fixed number of bits, most calculations with real numbers will produce quantities that cannot be exactly represented using that many bits. Therefore the result of a floating-point calculation must often be rounded to fit back into its finite representation. This rounding error is the characteristic feature of floating-point computation.

Comparison

Due to rounding errors, most floating-point numbers end up being slightly imprecise. As long as this imprecision stays small, it can usually be ignored. However, it also means that numbers expected to be equal (e.g. when calculating the same result through different correct methods) often differ slightly, and a simple equality test fails. For example:

float a = 0.15 + 0.15
float b = 0.1 + 0.2
if(a == b) // can be false!
if(a >= b) // can also be false!

The solution is to check not whether the numbers are the same, but whether their difference is very small. The error margin that the difference is compared to is often called epsilon.
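
A minimal C# illustration of the epsilon comparison (using doubles; the tolerance value is an arbitrary choice and should be picked to suit the magnitudes involved):

using System;

class FloatComparison
{
    static void Main()
    {
        double a = 0.15 + 0.15;
        double b = 0.1 + 0.2;

        // Direct equality is unreliable because of rounding error.
        Console.WriteLine(a == b);                    // may print False

        // Compare the difference against a small tolerance (epsilon) instead.
        const double epsilon = 1e-9;
        Console.WriteLine(Math.Abs(a - b) < epsilon); // True
    }
}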

Rounding

Because floating-point numbers have a limited number of digits, they cannot represent all real numbers accurately: when there are more digits than the format allows, the leftover ones are omitted – the number is rounded. There are three reasons why this can be necessary:

  • Too many significant digits – The great advantage of floating-point is that leading and trailing zeroes (within the range provided by the exponent) don’t need to be stored. But if without those, there are still more digits than the significand can store, rounding becomes necessary. In other words, if your number simply requires more precision than the format can provide, you’ll have to sacrifice some of it, which is no big surprise. For example, with a floating-point format that has 3 digits in the significand, 1000 does not require rounding, and neither does 10000 or 1110 – but 1001 will have to be rounded. With a large number of significant digits available in typical floating-point formats, this may seem to be a rarely encountered problem, but if you perform a sequence of calculations, especially multiplication and division, you can very quickly reach this point.
  • Periodical digits – Any (irreducible) fraction where the denominator has a prime factor that does not occur in the base requires an infinite number of digits that repeat periodically after a certain point, and this can already happen for very simple fractions. For example, in decimal 1/4, 3/5 and 8/20 are finite, because 2 and 5 are the prime factors of 10. But 1/3 is not finite, nor is 2/3 or 1/7 or 5/6 because 3 and 7 are not factors of 10. Fractions with a prime factor of 5 in the denominator can be finite in base 10, but not in base 2 – the biggest source of confusion for most novice users of floating-point numbers.
  • Non-rational numbers – Non-rational numbers cannot be represented as a regular fraction at all, and in positional notation (no matter what base) they require an infinite number of non-recurring digits.

References

Floating-point representation (cburch.com)

What Every Computer Scientist Should Know About Floating-Point Arithmetic (oracle.com)

The Floating-Point Guide – What Every Programmer Should Know About Floating-Point Arithmetic (floating-point-gui.de)



In computer science, an online algorithm is one that can process its input piece-by-piece in a serial fashion, i.e., in the order that the input is fed to the algorithm, without having the entire input available from the start.
In contrast, an offline algorithm is given the whole problem data from the beginning and is required to output an answer which solves the problem at hand.



Data analysis is a process of inspecting, cleansing, transforming, and modelling data with the goal of discovering useful information, informing conclusions, and supporting decision-making. Data analysis has multiple facets and approaches, encompassing diverse techniques under a variety of names, and is used in different business, science, and social science domains. In today’s business world, data analysis plays a role in making decisions more scientific and helping businesses operate more effectively.