Exploratory data analysis refers to the analysis of the variables (descriptive statistics) in a study.
Variables
A quantitative variable is a variable where the values of the observations are numerical.
- discrete variables are quantitative variables whose possible values are countable
- population
- number of pets
- continuous variables are quantitative variables which have a continuum of infinitely many possible values.
- temperature
A categorical variable is a variable where the values of the observations are a set of categories.
- nominal variables are categorical variables that cannot be ordered.
- gender
- race
- ordinal variables are categorical variables whose categories can be ordered
- ratings (good to bad)
Single-Variable Exploration
Numerical Summaries
Frequency Table
A frequency table lists the possible values of a variable together with the frequency of each value. The modal category is the category with the highest frequency. The proportion of a category is the count of observations in that category divided by the total number of observations. Proportions and percentages are relative frequencies.
category | A | B | C | D | E | total
---|---|---|---|---|---|---
frequency | | | | | |
proportion | | | | | |
Note
When summarising a frequency table, mention
- modal category
- proportion or percentage for the modal category
To use a frequency table for a quantitative variable, group its values into ranges and treat the ranges as categories.
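As a minimal sketch of the definitions above, the following builds a frequency table with frequencies and proportions and picks out the modal category. The `grades` sample and the helper name `frequency_table` are made up for illustration.

```python
from collections import Counter

# Hypothetical sample of a categorical variable (letter grades).
grades = ["A", "B", "B", "C", "B", "A", "D", "B", "C", "A"]

def frequency_table(values):
    """Return {category: (frequency, proportion)} for a list of observations."""
    counts = Counter(values)
    total = len(values)
    return {cat: (n, n / total) for cat, n in sorted(counts.items())}

table = frequency_table(grades)

# The modal category is the one with the highest frequency.
modal_category = max(table, key=lambda cat: table[cat][0])

for cat, (freq, prop) in table.items():
    print(f"{cat}: frequency={freq}, proportion={prop:.2f}")
print("modal category:", modal_category)
```

For this sample, a summary would mention that B is the modal category with a proportion of 0.4 (40%).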
Center
Two common measures to summarise center are:
- mean
- median
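The sample mean
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
where $x_1, \dots, x_n$ are the observations and $n$ is the sample size.
Median
The middle value of the ordered observations:
- if $n$ is odd: the $\frac{n+1}{2}$-th largest observation
- if $n$ is even: the average of the $\frac{n}{2}$-th and $\left(\frac{n}{2}+1\right)$-th largest observations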
Note
The mean is sensitive to extreme observations, unlike the median.
If the dataset is:
- highly skewed: report median
- symmetric and bell-shaped: report mean
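The sensitivity of the mean to extreme observations can be illustrated with a small sketch. The income figures are hypothetical, chosen only so that one extreme value is added to an otherwise symmetric sample.

```python
import statistics

# Hypothetical monthly incomes, roughly symmetric.
incomes = [2800, 3000, 3100, 3200, 3400]
# The same sample with one extreme observation appended.
incomes_with_outlier = incomes + [50000]

mean_before = statistics.mean(incomes)
median_before = statistics.median(incomes)
mean_after = statistics.mean(incomes_with_outlier)
median_after = statistics.median(incomes_with_outlier)

print("without outlier: mean =", mean_before, "median =", median_before)
print("with outlier:    mean =", round(mean_after, 2), "median =", median_after)
```

The single extreme value pulls the mean far above most of the observations, while the median barely moves, which is why the median is reported for highly skewed data.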
Variability
The common measurements of variability:
- range
- always
- variance and standard deviation
- used in conjunction with the mean if the distribution is approximately bell-shaped
- interquartile range (IQR)
- used in conjunction with the median to summarise the sample if the distribution is not bell-shaped
Range
The difference between the largest and smallest observations in a dataset.
- easy to compute
- sensitive to extreme observations
Variance & standard deviation
The variance is defined to be the average of the squared deviations of the values from the mean.
The standard deviation is defined to be the square root of the variance.
- a larger standard deviation means the values are more spread out from the mean
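A minimal sketch of the computation, assuming the usual sample convention of dividing the sum of squared deviations by $n - 1$ rather than $n$ (the notes do not state which divisor is intended). The data and the helper name `sample_variance` are hypothetical; the result is cross-checked against Python's `statistics.variance`, which uses the same convention.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1 (assumed convention)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

s2 = sample_variance(data)  # variance
s = math.sqrt(s2)           # standard deviation is the square root of the variance

print("variance:", round(s2, 3))
print("standard deviation:", round(s, 3))
print("matches statistics.variance:", math.isclose(s2, statistics.variance(data)))
```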
Linear Transformations
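For a linear transformation $y = ax + b$, adding the constant $b$ does not affect the variance (or standard deviation) of the transformed data. Multiplying by $a$ multiplies the variance by $a^2$ and the standard deviation by $|a|$.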
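A quick numerical check of how shifting and scaling affect the variance, using hypothetical temperature readings. Adding a constant should leave the variance unchanged, while multiplying by a constant $a$ should scale it by $a^2$.

```python
import math
import statistics

celsius = [18.0, 21.5, 23.0, 19.5, 25.0]  # hypothetical temperatures

shifted = [c + 273.15 for c in celsius]       # shift only: Celsius to Kelvin
fahrenheit = [1.8 * c + 32 for c in celsius]  # scale by 1.8, then shift by 32

var_c = statistics.variance(celsius)
var_shifted = statistics.variance(shifted)
var_f = statistics.variance(fahrenheit)

print("shift leaves variance unchanged:", math.isclose(var_c, var_shifted, rel_tol=1e-9))
print("scaling by 1.8 multiplies variance by 1.8**2:",
      math.isclose(var_f, 1.8 ** 2 * var_c, rel_tol=1e-9))
```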
Empirical Rule
If a distribution is bell-shaped:
- ~68% of observations fall within 1 standard deviation of the mean
- ~95% of observations fall within 2 standard deviations of the mean
- ~99.7% of observations fall within 3 standard deviations of the mean
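These percentages can be checked against an exactly normal distribution, which is the idealised model of "bell-shaped" assumed here. The sketch uses `statistics.NormalDist` from the Python standard library; the helper name `share_within` is made up for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

shares = {k: share_within(k) for k in (1, 2, 3)}
for k, share in shares.items():
    print(f"within {k} standard deviation(s): {share:.1%}")
```

For a normal distribution the shares are roughly 68.3%, 95.4% and 99.7%. Real data that is only approximately bell-shaped will deviate from these figures.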
Quantile/Percentiles
The $p$-th percentile is a value such that $p\%$ of the values fall below or at that value.
Quartile | $Q_1$ | $Q_2$ | $Q_3$
---|---|---|---
Name | lower quartile | median | upper quartile
Percentile | 25th | 50th | 75th
Interquartile range
The difference between the upper and lower quartiles.
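A small sketch of computing the quartiles and the interquartile range with `statistics.quantiles`. Textbooks and software use slightly different interpolation rules for quartiles; this example relies on Python's default (`method='exclusive'`), which is an assumption and may not match the convention intended in these notes. The data is hypothetical.

```python
import statistics

data = [3, 7, 8, 5, 12, 14, 21]  # hypothetical observations

# n=4 splits the ordered data into four parts, giving three cut points.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1  # upper quartile minus lower quartile

print("lower quartile:", q1)
print("median:", q2)
print("upper quartile:", q3)
print("IQR:", iqr)
```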
Graphical Summaries
Histogram
A histogram uses bars to portray the (relative) frequencies of the possible outcomes of a quantitative variable.
Note
When analysing a histogram, mention
- overall pattern (is there clustering)
- modality (is the distribution unimodal, bimodal or multimodal)
- skew (is the distribution symmetric, right skew, or left skew)
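To make the idea concrete, here is a minimal sketch that bins a quantitative variable into equal-width intervals and prints a text histogram. The ages, the bin settings and the helper name `histogram_counts` are all hypothetical.

```python
def histogram_counts(values, start, width, n_bins):
    """Count observations in n_bins equal-width bins [start + i*width, start + (i+1)*width)."""
    counts = [0] * n_bins
    for v in values:
        index = int((v - start) // width)
        if 0 <= index < n_bins:  # values outside the covered range are ignored
            counts[index] += 1
    return counts

ages = [21, 22, 23, 25, 27, 31, 33, 38, 45, 62]  # hypothetical sample
start, width, n_bins = 20, 10, 5
counts = histogram_counts(ages, start, width, n_bins)

for i, c in enumerate(counts):
    lo = start + i * width
    print(f"[{lo}, {lo + width}): {'#' * c}")
```

In this sample the bars form a single peak in the lowest bin with a tail toward larger values, so the distribution would be described as unimodal and right skewed.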
Two-Variable Exploration
Association
An association exists between two variables if a particular value of one variable is more likely to occur with certain values of the other variable.
The response variable is the variable on which comparisons are made. The explanatory variable is the variable that the response is (believed to be) dependent on.
Numerical Summaries
Contingency Table
Used for two categorical variables.
In a contingency table, the rows list the categories of one variable and the columns list the categories of the other variable. Each entry in the table is the number of observations in the sample at a particular combination of categories of the two variables.
x / y | y1 | y2 | y3 | ...
---|---|---|---|---
x1 | | | |
x2 | | | |
... | | | |
Using relative frequencies such as percentages (here column percentages, so that each column totals 100), the contingency table is:
x / y | y1 | y2 | y3 | ...
---|---|---|---|---
x1 | | | |
x2 | | | |
... | | | |
Total | 100 | 100 | 100 | 100
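A minimal sketch of building both versions of the table from raw paired observations: first the counts, then column percentages so that each column totals 100. The observations and the labels `x1`, `x2`, `y1`, `y2` are hypothetical placeholders.

```python
from collections import Counter

# Hypothetical paired observations: (x category, y category).
pairs = [
    ("x1", "y1"), ("x1", "y1"), ("x2", "y1"), ("x2", "y1"),
    ("x1", "y2"), ("x2", "y2"), ("x2", "y2"), ("x2", "y2"),
]

counts = Counter(pairs)  # number of observations per (x, y) combination
x_levels = sorted({x for x, _ in pairs})
y_levels = sorted({y for _, y in pairs})

# Column totals: number of observations in each y category.
column_totals = {y: sum(counts[(x, y)] for x in x_levels) for y in y_levels}

# Column percentages: each entry as a percentage of its column total.
percentages = {
    (x, y): 100 * counts[(x, y)] / column_totals[y]
    for x in x_levels
    for y in y_levels
}

for x in x_levels:
    row = " | ".join(f"{percentages[(x, y)]:.0f}" for y in y_levels)
    print(f"{x} | {row}")
```

Comparing the percentages across columns (for example, the share of `x2` within `y1` versus within `y2`) is one way to look for an association between the two variables.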
Bar Plots
Boxplot
Used for one categorical variable and one quantitative variable.
We can look for:
- outliers
- skew of distributions
- spread
Scatterplot
Look for
- relationship/association between two variables
- positive, negative or no association?
Correlation
The correlation $r$ is always between $-1$ and $1$.
- a positive value indicates a positive association, and a negative value indicates a negative association
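The notes do not give a formula, so the sketch below assumes the Pearson correlation coefficient, the measure most commonly meant by "correlation" for two quantitative variables. The data and the helper name `pearson_correlation` are hypothetical.

```python
import math

def pearson_correlation(xs, ys):
    """Pearson correlation: co-variation of x and y scaled by their individual variation."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

hours = [1, 2, 3, 4, 5]         # hypothetical explanatory variable
scores = [52, 55, 61, 70, 72]   # tends to increase with hours
errors = [9, 7, 6, 4, 1]        # tends to decrease with hours

r_pos = pearson_correlation(hours, scores)
r_neg = pearson_correlation(hours, errors)

print("hours vs scores:", round(r_pos, 3))
print("hours vs errors:", round(r_neg, 3))
```

The first pair gives a value close to 1 (positive association) and the second a value close to -1 (negative association).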