July 29, 2020 10 min to read

Statistics - Descriptive Statistics - Visualization

This article is part of the Statistics 101 series. You can access the full series here:

Foundation
- Basic terminology
Descriptive Statistics
- Summarizing data
- Visualization (you are here!)
Inferential Statistics
Type of statistical study
- Probability and distribution
- Hypothesis testing
- Estimation
- Regression

Welcome to the third article of the Statistics 101 series. In the last article, you learned how to summarize data using numbers.

But “a picture is worth a thousand words”, right? Visualization is an integral part of a Data Scientist’s toolbox, not only because some people prefer graphs and charts to numbers, but also because it can reveal useful information hidden behind the numerical figures.

This article covers:

1. Overview
2. Types of plots
3. Wrap up

1. Overview

1.1. Why do we even need visualization?

As mentioned before, visualization plays a crucial part in every data project. To better understand its importance, consider the example below:

Can you guess what happened here? This is a famous exercise by Francis Anscombe (1973), known as Anscombe's quartet.

The graphs plot 4 different datasets that share nearly identical summary statistics: the means, the variances, the correlation, and even the fitted regression line are the same. Relying only on numerical summary statistics to determine the true nature of the data can therefore be misleading, which visualization helps to avoid.

For example, the top right dataset would be better represented by a polynomial regression function, not the linear one drawn in the graph. And the point near (19, 12.5) in the bottom right graph can almost surely be classified as an outlier, and should probably be removed from the analysis.

Such information is not apparent without directly seeing the data in graphs and charts. That’s when data visualization comes in handy: it supplements the information we gather from summary figures.
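To see how closely the numbers agree, here is a minimal sketch that computes the summary statistics for the four datasets using only the Python standard library. The values are the published quartet; the helper name `summarize` is only an illustration and is not part of the original exercise.

```python
from math import sqrt

# Anscombe's quartet (Anscombe, 1973). Datasets I-III share the same x values.
X_SHARED = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X_SHARED, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X_SHARED, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X_SHARED, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Return means, sample variances, correlation and least-squares line."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return {
        "mean_x": mean_x,
        "mean_y": mean_y,
        "var_x": sxx / (n - 1),
        "var_y": syy / (n - 1),
        "corr": sxy / sqrt(sxx * syy),
        "slope": slope,
        "intercept": mean_y - slope * mean_x,
    }


SUMMARIES = {name: summarize(x, y) for name, (x, y) in QUARTET.items()}

for name, stats in SUMMARIES.items():
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

Rounded to two decimals, the four rows are almost indistinguishable, even though the scatterplots look nothing alike.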

1.2. Types of visualization

Broadly, there are two types of visualization, each with a very different target audience and process.

Exploratory:
- Target: Specific audience like yourself and your colleagues
- When to use: Start of the project, to better understand the data you are working with
- Unique properties: Easily generated, data-heavy
Explanatory:
- Target: Broader audience
- When to use: Presenting the findings to others (BOD, decision-makers, etc.)
- Unique properties: Labor-intensive, data-specific

1.3. The grammar of visualization

Before we dive into the different types of plots, you first need to learn the vocabulary of visualization.

Element	Description	Example
Data	The dataset being plotted	iris dataset
Aesthetics	The scale onto which we map our data	x, y
Geometries	The visual elements used for our data	points, lines, areas
Facets	Plotting small multiples	dividing into smaller plots on different flower types
Statistics	Representation of the data to aid understanding	linear regression line, confidence interval
Coordinates	The space on which the data will be plotted	zooming, flipping
Themes	All non-data ink	fonts, axis ticks

2. Types of plots

2.1. Proportion

2.1.1. Pie chart

Advantage:

Intuitive, easy to understand
Popular to viewers
Compact

Disadvantage:

Always needs accompanying text that states the percentage each slice of the chart occupies.
With many classes and no text, slices are hard to compare even within the same chart because there is no anchoring point (illustrated in the right chart above)
Not very precise because data is encoded in angles.

2.1.2. Waffle chart

The waffle chart was designed to address these weaknesses of the pie chart. By dividing the space into 100 equal squares, each standing for 1%, you can easily compare the proportions of classes within one chart by counting how many squares a class occupies. However, comparing a class across several charts is still a pain, since the lack of anchoring points lingers in this type of plot too.
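The only real computation behind a waffle chart is deciding how many of the 100 squares each class gets. Below is a rough sketch of one way to do it. The largest-remainder rounding is my own choice (so the squares always add up to exactly 100), and the class counts are made up for illustration.

```python
from math import floor


def waffle_squares(counts, total_squares=100):
    """Allocate squares to classes in proportion to their counts."""
    total = sum(counts.values())
    exact = {name: c / total * total_squares for name, c in counts.items()}
    squares = {name: floor(share) for name, share in exact.items()}
    leftover = total_squares - sum(squares.values())
    # Hand the leftover squares to the classes with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda name: exact[name] - squares[name], reverse=True)
    for name in by_remainder[:leftover]:
        squares[name] += 1
    return squares


def render(squares, per_row=10):
    """Draw the waffle as text, one letter per square (uses each name's first letter)."""
    cells = "".join(name[0] * n for name, n in squares.items())
    return "\n".join(cells[i:i + per_row] for i in range(0, len(cells), per_row))


# Hypothetical class counts, for illustration only.
counts = {"A": 50, "B": 37, "C": 16}
squares = waffle_squares(counts)
grid = render(squares)
print(squares)
print(grid)
```

Counting letters in the printed grid gives each class's percentage directly, which is exactly what the chart asks the reader to do.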

Advantage:

Can handle a moderately large number of classes

Disadvantage:

Not very compact
Cannot compare across charts because there is no anchoring point

2.1.3. Stacked bar

Since neither the pie chart nor the waffle chart solves the problem of comparing multiple proportions, the stacked bar chart was designed to achieve that goal.

By presenting each proportion as a single column and placing multiple columns side by side, a stacked bar chart lets us visually compare the value of a class across multiple proportions.

Advantage:

Allows each population to share the same y-axis

Disadvantage:

Lacks an anchoring point for in-group comparison
Worse in isolation
The number of classes needs to stay small

2.2. Point

2.2.1. Bar chart

One of the most popular charts for representing data points is the bar chart. A bar chart is used when a categorical variable is mapped to one of the axes.

Advantage:

Popular
Simple
Accurate

Disadvantage:

A large number of classes is a problem
Takes up a lot of space

2.2.2. Point chart

The point chart is an alternative to the bar chart that fixes the problem of requiring a lot of space. To make the chart easier to interpret, the results should be sorted, as in the figure above.

Advantage:

High precision
Efficient representation (requires less space)
Simple

Disadvantage:

Nothing so far

2.3. Distribution

2.3.1. Histogram

A histogram is an approximate representation of the distribution of numerical data.

To construct a histogram, the first step is to “bin” (or “bucket”) the range of values—that is, divide the entire range of values into a series of intervals—and then count how many values fall into each interval.

Choosing an appropriate number of bins for a histogram requires good data sense, and patience too. Too many bins can lead to a graph that looks like a bar code (as in the following chart), while too few would hide useful insights from the data.
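The binning step described above can be sketched in a few lines of plain Python. This is a simplified illustration with equal-width bins and a made-up sample; the function names are hypothetical, and plotting libraries have their own rules for picking bin edges.

```python
def histogram_counts(values, n_bins):
    """Split [min, max] into n_bins equal-width bins and count the values in each."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical, so there is no range to bin")
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum sits on the last edge, so clamp it into the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, counts


def show(values, n_bins):
    """Print a rough text histogram: one row per bin, one '#' per value."""
    edges, counts = histogram_counts(values, n_bins)
    for left, right, c in zip(edges, edges[1:], counts):
        print(f"{left:5.1f} to {right:5.1f} | {'#' * c}")


# Made-up sample, for illustration only.
values = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]
edges4, counts4 = histogram_counts(values, n_bins=4)
edges2, counts2 = histogram_counts(values, n_bins=2)
show(values, 4)
print()
show(values, 2)
```

Changing `n_bins` changes the picture: with 2 bins the isolated value at 9 is lumped together with 5, while 4 bins show it sitting on its own.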

Advantage:

Useful for one distribution at a time
Can use count/density
Intuitive
Interpretable

Disadvantage:

Sensitive to bin placements
Iffy with small amounts of data (creates data spikes, as in the chart below)

2.3.2. Kernel Density Plot

To avoid the data spikes mentioned in the previous part, a kernel density estimate (KDE) plot is usually adopted.

In a KDE plot, each data point is replaced by a normal (Gaussian) curve centered on that point, all with the same bandwidth. The overlapping curves are then summed up to produce the final plot.
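The idea of summing one curve per data point can be written down directly. Here is a minimal sketch with a Gaussian kernel and a fixed bandwidth chosen by hand; the sample and the name `make_kde` are made up for illustration, and real libraries pick the bandwidth with more care.

```python
from math import exp, pi, sqrt


def make_kde(data, bandwidth):
    """Return a density function: the average of one Gaussian bump per data point."""
    n = len(data)
    norm = 1.0 / (n * bandwidth * sqrt(2 * pi))

    def density(x):
        # Each data point contributes a bell curve centered on itself.
        return norm * sum(exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data)

    return density


# Made-up sample with two clusters, for illustration only.
data = [1.0, 1.5, 2.0, 6.0, 6.5]
density = make_kde(data, bandwidth=0.5)

# Evaluate the curve on a grid from -3 to 11 and check that it integrates to about 1.
step = 0.05
grid = [i * step for i in range(-60, 221)]
area = sum(density(x) for x in grid) * step

print(f"density at 1.5: {density(1.5):.3f}")
print(f"density at 4.0: {density(4.0):.3f}")
print(f"area under the curve: {area:.3f}")
```

A larger `bandwidth` makes every bump wider, so the two clusters start to merge into one hill; a smaller one makes the curve spikier.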

Advantage:

Can be used with data that has multiple strong peaks
Can deal with small data

Disadvantage:

Choosing the bandwidth can be a pain (like choosing bins in a histogram)

2.3.3. Kernel Density Plot together with rug plot

Because a KDE plot smooths out the distribution of numerical data, some insights can be hidden. As a best practice, use a KDE plot together with a rug plot to fully extract information from the dataset.

A rug plot is usually located under the KDE plot, with each data point represented by a black line. The darker an area, the more data points are located there.

2.4. Comparing distribution

2.4.1. Boxplot

We know that we can use a histogram or a KDE plot to show a distribution, but such charts are not effective for comparing different distributions, since they cannot be stacked together for comparison.

In such cases, the most popular chart is the boxplot, since it packs in a lot of information: where most of the data is located (median, 25th percentile, 75th percentile, whiskers) and potential outliers. If you forgot about IQR, refer to the last article.

However, using a boxplot alone can be confusing, since all the measures used (median, 25th percentile, 75th percentile) are built only from the relative position of a value in the dataset, which means they do not take into account the actual values of the other data points. As a result, two similar boxplots can have very different underlying data, as shown here:

For that reason, using a boxplot together with a (jittered) scatterplot yields more information and a more accurate conclusion. However, there is a downside to this, which will be discussed in the next section.
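As a quick refresher, the numbers a boxplot draws can be computed as below. This sketch assumes the common convention that whiskers reach the most extreme points within 1.5 × IQR of the box and that anything beyond is flagged as a potential outlier; plotting libraries may use slightly different quartile rules, and the sample is made up.

```python
from statistics import quantiles


def boxplot_stats(values, whisker=1.5):
    """Compute the quartiles, whisker ends and outliers that a boxplot displays."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - whisker * iqr, q3 + whisker * iqr
    inside = [v for v in values if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        # Whiskers stop at the most extreme data points still inside the fences.
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": sorted(v for v in values if v < low_fence or v > high_fence),
    }


# Made-up sample, for illustration only.
values = [2, 4, 4, 5, 6, 7, 8, 9, 10, 25]
stats = boxplot_stats(values)
print(stats)
```

Note that only a handful of numbers survive this summary, which is why very different datasets can end up with similar boxes.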

Advantage:

Can spot outliers

Disadvantage:

Skew & bimodality can be tricky
Too many data points can be tricky

2.4.2. Beeswarm

Using a boxplot and a scatterplot together is very useful in most cases. But with a medium to large amount of data, the combination creates even more confusion.

That’s when plots like beeswarm and violin come in handy. While a box plot only shows summary statistics such as the mean/median and interquartile range, the violin plot shows the full distribution of the data.

The beeswarm plot is built on the basis of the scatterplot and the histogram. In a beeswarm plot, every data point is shown, packed next to the others. Individual points are clumped together as close to the axis as possible but are not allowed to overlap (this is called smart jittering).
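Smart jittering can be approximated with a simple greedy rule: place each point as close to the axis as possible without touching a point that is already placed. The sketch below is a naive version of that idea, not the algorithm of any particular plotting library; the marker radius, the search step and the sample values are assumptions for illustration.

```python
from math import hypot


def beeswarm_offsets(values, radius=0.5, step=0.1):
    """Give each value a sideways offset so that no two markers overlap.

    Each marker is treated as a circle of the given radius centered at
    (offset, value). Returns the list of (offset, value) positions.
    """
    placed = []  # markers that already have a position
    for v in sorted(values):
        k = 0
        while True:
            # Try offsets 0, +step, -step, +2*step, -2*step, ... (closest to the axis first).
            candidates = [0.0] if k == 0 else [k * step, -k * step]
            free = [
                o for o in candidates
                if all(hypot(o - po, v - pv) >= 2 * radius for po, pv in placed)
            ]
            if free:
                placed.append((free[0], v))
                break
            k += 1
    return placed


# Made-up sample with a clump around 1.0, for illustration only.
values = [1.0, 1.0, 1.0, 1.2, 3.0]
placed = beeswarm_offsets(values)
for offset, value in placed:
    print(f"value={value:4.1f}  offset={offset:+.1f}")
```

The isolated value stays on the axis while the clump fans out sideways. Because points are placed one at a time, the result depends on the order in which they are processed.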

Advantage:

Can be used with a medium number of observations
Shows the distributional shape

Disadvantage:

Gets hard to read with lots of data, because points are not allowed to overlap
Arbitrary stacking: which point gets drawn first matters

2.4.3. Violin

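Remember that the beeswarm plot is roughly a combination of the scatterplot and the histogram? And that the histogram has a problem that the KDE plot fixes?

The same relationship holds between the beeswarm plot and the violin plot: a violin plot replaces the stacked points with a mirrored KDE curve.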

Advantage:

Shows the KDE, mirrored around the axis
Easy to compare (because of symmetry)
Every data point contributes in a less arbitrary manner (since there is no stacking order bias)
Can deal with a large amount of data

Disadvantage:

The kernel bandwidth has to be chosen
Not every data point is seen
Can’t use rug (use point/boxplot instead)

2.4.4. Ridgeline

The ridgeline plot is a special plot used to compare distributions. It is specifically used for ordinal data. In a ridgeline plot, one ordered dimension is mapped onto one of the axes, which is useful for comparison.

The ridgeline plot is built on KDE. As a result, it shares pretty much every advantage and disadvantage of the KDE plot.

Advantage:

Conveys shifts in distribution along an ordinal axis

Disadvantage:

Smaller plots can be overlapped by their neighbors
Hard to choose the bandwidth

Blog by Aiden
