Exploratory data analysis (EDA) refers to summarising the variables in a study using descriptive statistics, both numerical and graphical.

Variables

A quantitative variable is a variable where the values of the observations are numerical.

Discrete variables are quantitative variables whose possible values are countable.
- population
- number of pets
Continuous variables are quantitative variables which have a continuum of infinitely many possible values.
- temperature

A categorical variable is a variable where the values of the observations belong to a set of categories.

Nominal variables are categorical variables whose categories cannot be ordered.
- gender
- race
Ordinal variables are categorical variables whose categories can be ordered.
- ratings (good to bad)

Single-Variable Exploration

Numerical Summaries

Frequency Table

A frequency table lists the possible values of a variable together with the frequency of each value. The modal category is the category with the highest frequency. The proportion of a category is the number of observations in that category divided by the total number of observations. Proportions and percentages are relative frequencies.

	A	B	C	D	E	total
frequency
proportion
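The table above can be filled in with a few lines of code. This is a minimal sketch using made-up observations; the category labels and the variable names (`observations`, `frequency`, `proportion`) are assumptions for illustration only.

```python
from collections import Counter

# Hypothetical sample of a categorical variable (made-up data).
observations = ["A", "B", "A", "C", "A", "B", "D", "A", "E", "B"]

frequency = Counter(observations)        # count of observations per category
total = sum(frequency.values())          # total number of observations
proportion = {cat: count / total for cat, count in frequency.items()}

# Modal category: the category with the highest frequency.
modal_category, modal_count = frequency.most_common(1)[0]

# Print one row per category: label, frequency, proportion.
for cat in sorted(frequency):
    print(f"{cat}\t{frequency[cat]}\t{proportion[cat]:.2f}")
print(f"total\t{total}\t{sum(proportion.values()):.2f}")
print(f"modal category: {modal_category} ({proportion[modal_category]:.0%})")
```

The proportions always sum to 1, which is a quick sanity check on the table.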

Note

When summarising a frequency table, mention

modal category

proportion or percentage for modal category

To use a frequency table for a quantitative variable, group its values into ranges and treat the ranges as categories.
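Grouping a quantitative variable into ranges might look like the sketch below. The ages, the bin width of 10, and the helper `bin_label` are all hypothetical; the labels assume integer values.

```python
from collections import Counter

# Hypothetical integer ages (made-up data) and a hypothetical bin width.
ages = [3, 15, 22, 27, 31, 38, 39, 45, 52, 67]
bin_width = 10

def bin_label(value, width):
    """Label the range of `width` consecutive integers that contains `value`."""
    low = (value // width) * width
    return f"{low}-{low + width - 1}"

# Each range is now treated as a category of a frequency table.
binned = Counter(bin_label(age, bin_width) for age in ages)

for label, count in sorted(binned.items(), key=lambda item: int(item[0].split("-")[0])):
    print(f"{label}\t{count}\t{count / len(ages):.2f}")
```

The choice of bin width is a judgment call: wider ranges hide detail, narrower ranges leave many categories nearly empty.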

Center

Two common measures to summarise center are:

mean
median

The sample mean of the observations $x_1, \dots, x_n$ is given by the formula:

$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$

Median

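The middle value of the ordered observations, where $n$ is the number of observations:

if $n$ is odd: the $\frac{n+1}{2}$-th largest observation

if $n$ is even: the average of the $\frac{n}{2}$-th and $(\frac{n}{2} + 1)$-th largest observations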
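As a sketch, both measures of center can be computed directly from their definitions: the mean as the sum divided by the count, and the median as the middle ordered value (odd count) or the average of the two middle values (even count). The function names here are illustrative only.

```python
def mean(values):
    """Sum of the observations divided by the number of observations."""
    return sum(values) / len(values)

def median(values):
    """Middle value of the ordered observations."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the single middle observation (0-based index n // 2).
        return ordered[mid]
    # Even n: average of the two middle observations.
    return (ordered[mid - 1] + ordered[mid]) / 2

print(mean([1, 2, 3, 4]))    # 2.5
print(median([3, 1, 2]))     # 2
print(median([4, 1, 3, 2]))  # 2.5
```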

Note

Mean is sensitive to extreme observations, unlike median.

If dataset is:

highly skewed: report median

symmetric and bell-shaped: report mean
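The sensitivity of the mean can be seen by adding a single extreme observation to a small dataset. The numbers below are made up for illustration.

```python
from statistics import mean, median

# Hypothetical observations (made-up data).
values = [30, 32, 35, 36, 40]
with_outlier = values + [400]   # add one extreme observation

mean_before, median_before = mean(values), median(values)
mean_after, median_after = mean(with_outlier), median(with_outlier)

print(f"mean:   {mean_before} -> {mean_after}")
print(f"median: {median_before} -> {median_after}")
```

The mean moves from 34.6 to 95.5, while the median only moves from 35 to 35.5, which is why the median is preferred for highly skewed data.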

Variability

The common measurements of variability:

range
- always
variance and standard deviation
- used in conjunction with the mean if distribution is approximately bell-shaped
interquartile range (IQR)
- used in conjunction with the median to summarise sample if distribution not bell-shaped

Range

The difference between the largest and smallest observations in a dataset.

easy to compute

sensitive to extreme observations

Variance & standard deviation

The variance is defined to be the average of the squared deviations of the values from the mean.

The standard deviation is defined to be the square root of the variance.

The standard deviation tells us how close the values are to the mean:

the larger it is, the more spread out the values are from the mean
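A step-by-step sketch of the computation follows. It assumes the usual sample convention of dividing by n - 1 rather than n (the notes only say "average"), and the data is made up.

```python
import math
import statistics

# Hypothetical observations (made-up data).
values = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(values)

mean = sum(values) / n
squared_deviations = [(x - mean) ** 2 for x in values]

# Assumption: sample variance, so the divisor is n - 1 rather than n.
sample_variance = sum(squared_deviations) / (n - 1)
sample_sd = math.sqrt(sample_variance)

print(f"mean = {mean}, variance = {sample_variance:.3f}, sd = {sample_sd:.3f}")

# Cross-check against the standard library, which uses the same convention.
print(statistics.variance(values), statistics.stdev(values))
```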

Linear Transformations

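Under a linear transformation $y = ax + b$, adding the constant $b$ does not affect the variance (or standard deviation) of the transformed data; multiplying by $a$ scales the variance by $a^2$ and the standard deviation by $|a|$.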
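A quick numerical check, assuming the transformation has the form y = a*x + b with hypothetical constants `a` and `b`: shifting by `b` alone leaves the variance unchanged, while scaling by `a` multiplies it by a squared.

```python
import statistics

# Hypothetical data and constants (made up for illustration).
x = [1, 2, 3, 4, 5]
a, b = 3, 10

shifted = [xi + b for xi in x]          # y = x + b
transformed = [a * xi + b for xi in x]  # y = a*x + b

var_x = statistics.variance(x)
var_shifted = statistics.variance(shifted)
var_transformed = statistics.variance(transformed)

print(var_x, var_shifted, var_transformed)
print(statistics.stdev(x), statistics.stdev(transformed))
```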

Empirical Rule

If a distribution is bell-shaped:

~68% of observations fall within 1 standard deviation of the mean
~95% of observations fall within 2 standard deviations of the mean
~99.7% of observations fall within 3 standard deviations of the mean
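These percentages can be checked against an exact normal distribution, assuming "bell-shaped" is taken to mean normally distributed. The sketch uses `statistics.NormalDist` from the Python standard library.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Share of a normal distribution within k standard deviations of the mean.
within = {}
for k in (1, 2, 3):
    within[k] = standard_normal.cdf(k) - standard_normal.cdf(-k)
    print(f"within {k} sd: {within[k]:.4f}")
```

Real data is only approximately bell-shaped, so observed shares will differ somewhat from these values.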

Quantile/Percentiles

The $p$-th percentile is a value such that $p$% of the values fall below or at that value.

Quartile	$Q_1$ (25th percentile)	$Q_2$ (50th percentile)	$Q_3$ (75th percentile)
	lower quartile	median	upper quartile

Interquartile range

The difference between the upper quartile ($Q_3$) and the lower quartile ($Q_1$): $\text{IQR} = Q_3 - Q_1$.
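A small sketch of computing the quartiles and the interquartile range with `statistics.quantiles`. Several conventions exist for interpolating quartiles, so other tools may give slightly different values; this example relies on the library's default method, and the data is made up.

```python
from statistics import quantiles

# Hypothetical observations (made-up data, deliberately unsorted).
data = [7, 1, 5, 3, 2, 6, 4]

# n=4 splits the data into four equal parts and returns the three cut points.
lower_quartile, median, upper_quartile = quantiles(data, n=4)
iqr = upper_quartile - lower_quartile

print(lower_quartile, median, upper_quartile, iqr)
```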

Graphical Summaries

Histogram

A histogram uses bars to portray the (relative) frequencies of the possible outcomes for a quantitative variable.

Note

When analysing a histogram, mention

overall pattern (is there clustering)

modality (is the distribution unimodal, bimodal or multimodal)

skew (is the distribution symmetric, right skew, or left skew)

Two-Variable Exploration

Association

An association exists between two variables if a particular value of one variable is more likely to occur with certain values of the other variable.

The response variable is the variable on which comparisons are made. The explanatory variable is the variable on which the response is (believed to be) dependent.

Numerical Summaries

Contingency Table

Used for two categorical variables

Rows list the categories of one variable, and columns list the categories of the other variable. Each entry in the table is the number of observations in the sample at a particular combination of categories of the two variables.

	`y1`	`y2`	`y3`	`...`
`x1`
`x2`
`...`

Using relative frequencies like percentage, the contingency table is:

	`y1`	`y2`	`y3`	`...`
`x1`
`x2`
`...`
`Total`	`100`	`100`	`100`	`100`
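Both tables can be built from paired observations. The sketch below uses made-up pairs with the placeholder labels from the tables above, and assumes percentages are taken within each column so that every column totals 100.

```python
from collections import Counter

# Hypothetical paired observations (x category, y category); made-up data.
pairs = [
    ("x1", "y1"), ("x1", "y2"), ("x2", "y1"), ("x2", "y1"),
    ("x1", "y1"), ("x2", "y2"), ("x2", "y2"), ("x2", "y2"),
]

counts = Counter(pairs)                      # one count per (x, y) combination
rows = sorted({x for x, _ in pairs})
cols = sorted({y for _, y in pairs})

# Column totals, then each entry as a percentage of its column.
col_totals = {y: sum(counts[(x, y)] for x in rows) for y in cols}
col_percent = {
    (x, y): 100 * counts[(x, y)] / col_totals[y] for x in rows for y in cols
}

print("\t" + "\t".join(cols))
for x in rows:
    print(x + "\t" + "\t".join(f"{col_percent[(x, y)]:.0f}" for y in cols))
```

Comparing the percentages across columns shows whether the distribution of one variable changes with the category of the other, which is what an association looks like in a table.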

Bar Plots

Boxplot

Used for one categorical and one quantitative variable

We can look for:

outliers
skew of distributions
spread
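A side-by-side boxplot is drawn from a five-number summary (minimum, lower quartile, median, upper quartile, maximum) of the quantitative variable within each category. The sketch below computes those summaries for made-up records; `five_number_summary` is a hypothetical helper, and the quartiles follow the default convention of `statistics.quantiles`.

```python
from collections import defaultdict
from statistics import quantiles

# Hypothetical (category, value) records; the data is made up.
records = [
    ("a", 1), ("a", 2), ("a", 3), ("a", 4), ("a", 5), ("a", 6), ("a", 7),
    ("b", 2), ("b", 4), ("b", 6), ("b", 8), ("b", 10), ("b", 12), ("b", 30),
]

# Group the quantitative values by category.
groups = defaultdict(list)
for category, value in records:
    groups[category].append(value)

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for one group of values."""
    q1, q2, q3 = quantiles(values, n=4)
    return (min(values), q1, q2, q3, max(values))

summaries = {cat: five_number_summary(vals) for cat, vals in groups.items()}
for cat, summary in sorted(summaries.items()):
    print(cat, summary)
```

In this example group "b" has a wider box and a maximum far above its upper quartile, which hints at greater spread, right skew, and a possible outlier.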

Scatterplot

Look for

relationship/association between two variables
- positive, negative or no association?

Correlation

The correlation $r$ is always between $-1$ and $1$.
A positive value indicates a positive association, and a negative value indicates a negative association.
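A minimal sketch of the computation, assuming the correlation meant here is Pearson's correlation coefficient (the notes do not name it). The function name and data are illustrative only.

```python
import math

def correlation(xs, ys):
    """Pearson correlation of two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data (made up for illustration).
x = [1, 2, 3, 4]
r_positive = correlation(x, [2, 4, 6, 8])   # y increases with x
r_negative = correlation(x, [8, 6, 4, 2])   # y decreases with x
r_weak = correlation(x, [3, 1, 4, 2])       # no clear pattern

print(r_positive, r_negative, r_weak)
```

Values near 1 or -1 indicate a strong linear association; values near 0 indicate a weak or no linear association.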
