Researches about theory

4_R. Explain what marginal, joint and conditional distributions are, and how Bayes' theorem can be explained using relative frequencies. Explain the concept of statistical independence and why, in the case of independence, the relative joint frequencies are equal to the products of the corresponding marginal frequencies.


Applications

4_A. Create a program – in both languages C# and VB.NET (and optionally in js) – to read data from a CSV file and store it in a suitable collection of suitably designed objects for further processing. Compute the mean, standard deviation and frequency distribution for at least one of the variables, and for one pair of variables.



5_A. Compute – in both languages C# and VB.NET (and optionally in js) – a frequency distribution of the meaningful words from any text file and create a personal graphical representation of the corresponding “word cloud” (using animation if you wish), taking into account the frequencies of the words.


Researches about applications

2_RA. Write a review of charts useful for statistics and data presentation (some examples: StatCharts.txt). Which chart type impressed you most, and why?



3_RA. Do comprehensive research on the GRAPHICS object and all its members (to get ready to create any statistical chart).



In graphical user interfaces such as Microsoft Windows, drawing on the screen is an important task.

Everything displayed on the screen is based on simple drawing operations. In Visual Studio .NET, developers have easy access to that drawing functionality whenever they need it through a technology called GDI+. Using GDI+, developers can easily perform drawing operations such as generating graphs or building custom controls.

In statistics, data are presented through tables and graphs. The raw data of a statistical survey are not easily interpretable. To make the meaning of the survey understandable, the data must be summarized and represented in a suitable form.

Good graphs convey information quickly and easily to the user. Graphs highlight the salient features of the data. They can show relationships that are not obvious from studying a list of numbers. They can also provide a convenient way to compare different sets of data.

Different situations call for different types of graphs, and it helps to have a good knowledge of what types are available. The type of data often determines what graph is appropriate to use.

Pareto Diagram or Bar Graph

A Pareto diagram or bar graph is a way to visually represent qualitative data. Data is displayed either horizontally or vertically and allows viewers to compare items, such as amounts, characteristics, times, and frequency. The bars are arranged in order of frequency, so more important categories are emphasized. By looking at all the bars, it is easy to tell at a glance which categories in a set of data dominate the others. Bar graphs can be either single, stacked, or grouped.

Pie Chart or Circle Graph

Another common way to represent data graphically is a pie chart. It gets its name from the way it looks, just like a circular pie that has been cut into several slices. This kind of graph is helpful when graphing qualitative data, where the information describes a trait or attribute and is not numerical. Each slice of pie represents a different category, and each trait corresponds to a different slice of the pie; some slices are usually noticeably larger than others. By looking at all of the pie pieces, you can compare how much of the data fits in each category, or slice.

Histogram

A histogram is another kind of graph that uses bars in its display. This type of graph is used with quantitative data. Ranges of values, called classes, are listed at the bottom, and the classes with greater frequencies have taller bars.
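
The class frequencies behind a histogram can be counted with a few lines of code. The following is a minimal Python sketch: the function name `class_frequencies`, the sample values, and the choice of equal-width classes that include their lower bound are illustrative assumptions, not something prescribed by the text.

```python
from collections import Counter

def class_frequencies(values, start, width):
    """Count how many values fall in each class [start + k*width, start + (k+1)*width)."""
    counts = Counter()
    for v in values:
        k = int((v - start) // width)  # index of the class that contains v
        counts[k] += 1
    # One (lower bound, upper bound, frequency) triple per non-empty class, in order
    return [(start + k * width, start + (k + 1) * width, counts[k])
            for k in sorted(counts)]

heights = [150, 152, 158, 161, 163, 167, 171, 174, 179, 181]  # made-up sample
for low, high, freq in class_frequencies(heights, start=150, width=10):
    print(f"[{low}, {high}): {'#' * freq}")  # taller bar = greater frequency
```

Each printed row plays the role of one bar: the more values a class contains, the longer its row.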

Scatterplots

A scatterplot displays paired data using a horizontal axis (the x-axis) and a vertical axis (the y-axis). The statistical tools of correlation and regression are then used to show trends on the scatterplot. When the two variables are related, the points tend to be “scattered” around a line or curve moving up or down from left to right along the graph.

Time-Series Graphs

A time-series graph displays data at different points in time, so it is another kind of graph to be used for certain kinds of paired data. As the name implies, this type of graph measures trends over time, but the timeframe can be minutes, hours, days, months, years, decades, or centuries. For example, you might use this type of graph to plot the population of the United States over the course of a century. The y-axis would list the growing population, while the x-axis would list the years, such as 1900, 1950, 2000.

References

7 Graphs Commonly Used in Statistics (thoughtco.com)

Marginal distribution

In probability theory and statistics, the marginal distribution of a subset of a collection of random variables is the probability distribution of the variables contained in the subset. It gives the probabilities of various values of the variables in the subset without reference to the values of the other variables. This contrasts with a conditional distribution, which gives the probabilities contingent upon the values of the other variables.
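
With observed data, the same ideas can be expressed through relative frequencies. The sketch below is a minimal Python illustration on made-up paired observations: it builds the joint relative frequencies, obtains the marginal of X by summing over Y, derives a conditional distribution, and checks whether the joint frequencies equal the products of the marginals (the frequency version of independence). All names and data are assumptions chosen for the example.

```python
import math
from collections import Counter, defaultdict

# Made-up paired observations (x, y) of two categorical variables
pairs = [("A", "yes"), ("A", "no"), ("A", "yes"), ("B", "no"),
         ("B", "no"), ("B", "yes"), ("A", "yes"), ("B", "no")]
n = len(pairs)

# Joint relative frequencies f(x, y)
joint = {xy: count / n for xy, count in Counter(pairs).items()}

# Marginal relative frequencies: sum the joint ones over the other variable
marg_x, marg_y = defaultdict(float), defaultdict(float)
for (x, y), f in joint.items():
    marg_x[x] += f
    marg_y[y] += f

def conditional_y_given_x(x):
    """Relative frequencies of Y restricted to the observations where X == x."""
    return {y: joint.get((x, y), 0.0) / marg_x[x] for y in marg_y}

# Independence would require f(x, y) == f(x) * f(y) for every pair
independent = all(
    math.isclose(joint.get((x, y), 0.0), marg_x[x] * marg_y[y])
    for x in marg_x for y in marg_y
)

print(dict(marg_x))
print(conditional_y_given_x("A"))
print(independent)
```

In this sample the conditional frequencies of Y given X = "A" differ from the marginal frequencies of Y, so the independence check fails.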


This video shows how application 3_A works (in both C# and VB.NET).

The form is composed of:

  • a pictureBox;
  • a bitmap;
  • a rectangle.

Upon execution, a basic rectangle is shown. Holding down the left mouse button inside the rectangle moves it, while holding down the right mouse button resizes it.


This video shows how application 2_A works (in both C# and VB.NET).

The form is composed of:

  • two buttons;
  • nine labels;
  • a richTextBox;
  • a timer;
  • seven textBoxes.

Pressing the “Generate values” button on the left displays the random values, the mean and the distribution in the richTextBox. Three textBoxes specify how many values to generate and their range.

Pressing the “Generate values” button on the right makes a new random value appear in the richTextBox at each tick of timer1, together with the online mean and the distribution, both of which are updated with each new value. Four textBoxes specify how many values to generate, their range, and the size of the intervals for the distribution.

Researches about theory

2_R. Describe the most common configuration of data repositories in the real world and corporate environment. Cover concepts such as Operational or Transactional systems (OLTP), Data Warehouse (DW), Data Marts, Analytical and statistical systems (OLAP), etc. Try to draw a conceptual picture of how all these components may work together and how the flow of data and information is processed to extract useful knowledge from raw data.



3_R. Show how we can obtain an online algo for the arithmetic mean and explain the various possible reasons why it is preferable to the “naive” algo based on the definition.


Applications

2_A. Create – in both languages C# and VB.NET – a demonstrative program which computes the online arithmetic mean (if it’s a numeric variable) and your own algo to compute the distribution for a discrete variable and for a continuous variable (can use values simulated with RANDOM object).



3_A. Create an object providing a rectangular area which can be moved and resized using the mouse. This area will hold our future charts and graphics.


Researches about applications

1_RA. Understand how the floating point representation works and describe systematically (possibly using categories) all the possible problems that can happen. Try to classify the various issues and limitations (representation, comparison, rounding, propagation, approximation, loss of significance, cancellation, etc.) and provide simple examples for each of the categories you have identified.


Floating-point representation is an alternative technique based on scientific notation.

Though we’d like to use scientific notation, we’ll base our scientific notation on powers of 2, not powers of 10, because we’re working with computers that prefer binary.

Once we have a number in binary scientific notation, we still must have a technique for mapping that into a set of bits:

In this simple 8-bit format, we use the first bit to represent the sign (1 for negative, 0 for positive), the next four bits for the sum of 7 and the actual exponent (we add 7 to allow for negative exponents), and the last three bits for the mantissa’s fractional part. Note that we omit the integer part of the mantissa: since the mantissa must have exactly one nonzero bit to the left of its decimal point, and the only nonzero bit is 1, we know that the bit to the left of the decimal point must be a 1. There’s no point in wasting space in inserting this 1 into our bit pattern, so we include only the bits of the mantissa to the right of the decimal point.
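
The layout just described (1 sign bit, 4 exponent bits storing the exponent plus 7, 3 fraction bits with an implicit leading 1) can be decoded as in the following minimal Python sketch. The function name `decode_float8` is hypothetical, and the sketch assumes no special cases such as zero, infinity or denormalized values, since the text above does not describe them.

```python
def decode_float8(bits):
    """Decode an 8-character bit string: 1 sign bit, 4 exponent bits (excess 7), 3 fraction bits."""
    if len(bits) != 8 or set(bits) - {"0", "1"}:
        raise ValueError("expected a string of 8 binary digits")
    sign = -1 if bits[0] == "1" else 1
    exponent = int(bits[1:5], 2) - 7       # stored field is the actual exponent + 7
    mantissa = 1 + int(bits[5:], 2) / 8    # implicit leading 1, then three fraction bits
    return sign * mantissa * 2.0 ** exponent

print(decode_float8("00111000"))  # exponent field 0111 -> 2**0, mantissa 1.000 -> 1.0
print(decode_float8("11001010"))  # negative, exponent field 1001 -> 2**2, mantissa 1.010 -> -5.0
```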

We call this floating-point representation because the values of the mantissa bits “float” along with the decimal point, based on the exponent’s given value. This is in contrast to fixed-point representation, where the decimal point is always in the same place among the bits given.

Approximation

Squeezing infinitely many real numbers into a finite number of bits requires an approximate representation. Although there are infinitely many integers, in most programs the result of integer computations can be stored in 32 bits. In contrast, given any fixed number of bits, most calculations with real numbers will produce quantities that cannot be exactly represented using that many bits. Therefore the result of a floating-point calculation must often be rounded to fit back into its finite representation. This rounding error is the characteristic feature of floating-point computation.
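
A quick way to see this approximation is to ask for the exact value that is actually stored. The minimal Python sketch below uses the standard `decimal` and `fractions` modules and assumes the usual IEEE 754 double-precision `float`.

```python
from decimal import Decimal
from fractions import Fraction

x = 0.1
print(Decimal(x))    # exact decimal expansion of the stored double, not 0.1
print(Fraction(x))   # the same stored value as a ratio with a power-of-two denominator
print(Fraction(x) == Fraction(1, 10))      # False: 0.1 was rounded when stored
print(Fraction(0.5) == Fraction(1, 2))     # True: 0.5 is exactly representable
```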

Comparison

Due to rounding errors, most floating-point numbers end up being slightly imprecise. As long as this imprecision stays small, it can usually be ignored. However, it also means that numbers expected to be equal (e.g. when calculating the same result through different correct methods) often differ slightly, and a simple equality test fails. For example:

double a = 0.15 + 0.15;
double b = 0.1 + 0.2;
bool equal = (a == b);     // can be false!
bool atLeast = (a >= b);   // can also be false!

The solution is to check not whether the numbers are the same, but whether their difference is very small. The error margin that the difference is compared to is often called epsilon.
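
A minimal Python sketch of such a tolerance-based comparison follows. The helper name `nearly_equal` and the default tolerances are illustrative assumptions; suitable values depend on the magnitude of the numbers and on how much error the preceding calculations may have accumulated.

```python
import math

def nearly_equal(a, b, rel_eps=1e-9, abs_eps=1e-12):
    """True if a and b differ by no more than a tolerance scaled to their magnitude."""
    diff = abs(a - b)
    # The absolute tolerance covers comparisons against values at or near zero
    return diff <= max(rel_eps * max(abs(a), abs(b)), abs_eps)

a = 0.15 + 0.15
b = 0.1 + 0.2
print(a == b)               # False with IEEE 754 doubles
print(nearly_equal(a, b))   # True
print(math.isclose(a, b))   # True; the standard library's version of the same idea
```

A purely absolute epsilon is simpler but breaks down for very large or very small numbers, which is why the sketch combines a relative and an absolute tolerance.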

Rounding

Because floating-point numbers have a limited number of digits, they cannot represent all real numbers accurately: when there are more digits than the format allows, the leftover ones are omitted – the number is rounded. There are three reasons why this can be necessary:

  • Too many significant digits – The great advantage of floating-point is that leading and trailing zeroes (within the range provided by the exponent) don’t need to be stored. But if, even without those zeroes, there are still more digits than the significand can store, rounding becomes necessary. In other words, if your number simply requires more precision than the format can provide, you’ll have to sacrifice some of it, which is no big surprise. For example, with a floating-point format that has 3 digits in the significand, 1000 does not require rounding, and neither does 10000 or 1110 – but 1001 will have to be rounded. With a large number of significant digits available in typical floating-point formats, this may seem to be a rarely encountered problem, but if you perform a sequence of calculations, especially multiplication and division, you can very quickly reach this point.
  • Periodical digits – Any (irreducible) fraction where the denominator has a prime factor that does not occur in the base requires an infinite number of digits that repeat periodically after a certain point, and this can already happen for very simple fractions. For example, in decimal 1/4, 3/5 and 8/20 are finite, because 2 and 5 are the prime factors of 10. But 1/3 is not finite, nor is 2/3 or 1/7 or 5/6 because 3 and 7 are not factors of 10. Fractions with a prime factor of 5 in the denominator can be finite in base 10, but not in base 2 – the biggest source of confusion for most novice users of floating-point numbers.
  • Non-rational numbers – Non-rational numbers cannot be represented as a regular fraction at all, and in positional notation (no matter what base) they require an infinite number of non-recurring digits.
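
The rule about periodic digits can be checked mechanically: a fraction has a finite expansion in a base exactly when every prime factor of its reduced denominator also divides the base. The following minimal Python sketch tests this; the function name `terminates` is hypothetical.

```python
from fractions import Fraction
from math import gcd

def terminates(numerator, denominator, base):
    """True if numerator/denominator has a finite expansion in the given base."""
    d = Fraction(numerator, denominator).denominator  # reduce to lowest terms first
    g = gcd(d, base)
    while g > 1:
        while d % g == 0:   # strip every factor that d shares with the base
            d //= g
        g = gcd(d, base)
    return d == 1           # nothing left means only the base's prime factors occurred

print(terminates(3, 5, 10))   # True: 3/5 = 0.6 in decimal
print(terminates(1, 3, 10))   # False: 1/3 repeats in decimal
print(terminates(1, 10, 2))   # False: 0.1 repeats in binary
print(terminates(1, 4, 2))    # True: 1/4 = 0.01 in binary
```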

References

Floating-point representation (cburch.com)

What Every Computer Scientist Should Know About Floating-Point Arithmetic (oracle.com)

The Floating-Point Guide – What Every Programmer Should Know About Floating-Point Arithmetic (floating-point-gui.de)



In computer science, an online algorithm is one that can process its input piece-by-piece in a serial fashion, i.e., in the order that the input is fed to the algorithm, without having the entire input available from the start.
In contrast, an offline algorithm is given the whole problem data from the beginning and is required to output an answer which solves the problem at hand.
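
The running arithmetic mean requested in 3_R is a standard example of an online algorithm: each new value updates the current mean without storing or re-reading the earlier values. The following is a minimal Python sketch; the class name `OnlineMean` and the sample stream are assumptions made for illustration.

```python
class OnlineMean:
    """Running arithmetic mean that never stores the values it has seen."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, x):
        self.count += 1
        # m_n = m_(n-1) + (x_n - m_(n-1)) / n
        self.mean += (x - self.mean) / self.count
        return self.mean

stream = [4.0, 8.0, 6.0, 2.0]   # made-up values arriving one at a time
running = OnlineMean()
for x in stream:
    print(running.update(x))    # prints 4.0, 6.0, 6.0, 5.0
```

Because the update adds a small correction to the current mean instead of accumulating a large sum, it needs constant memory and avoids building up a running total that could overflow or lose precision.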



Data analysis is a process of inspecting, cleansing, transforming, and modelling data with the goal of discovering useful information, informing conclusions, and supporting decision-making. Data analysis has multiple facets and approaches, encompassing diverse techniques under a variety of names, and is used in different business, science, and social science domains. In today’s business world, data analysis plays a role in making decisions more scientific and helping businesses operate more effectively.