Exploratory data analysis refers to the analysis of the variables (descriptive statistics) in a study.

Variables

Diagram: variables are either quantitative (discrete or continuous) or categorical (ordinal or nominal).

A quantitative variable is a variable where the values of the observations are numerical.

  • discrete variables are quantitative variables whose possible values are countable
    • population
    • number of pets
  • continuous variables are quantitative variables which have a continuum of infinitely many possible values.
    • temperature

A categorical variable is a variable where the values of the observations are a set of categories.

  • nominal variables are categorical variables that cannot be ordered.
    • gender
    • race
  • ordinal variables are categorical variables whose categories can be ordered.
    • ratings (good to bad)

Single-Variable Exploration

Numerical Summaries

Frequency Table

A frequency table is a listing of the possible values of a variable, together with the frequency of each value. The modal category is the category with the highest frequency. The proportion of a category is the count of observations in that category divided by the total number of observations. Proportions and percentages are relative frequencies.

           | A | B | C | D | E | total
frequency  |   |   |   |   |   |
proportion |   |   |   |   |   |
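As a minimal sketch of the table layout above, the snippet below builds a frequency table with the standard library's `collections.Counter`. The observations are made-up data and the variable names are illustrative assumptions, not part of the original notes.

```python
from collections import Counter

# Hypothetical observations of a categorical variable (made-up data).
observations = ["A", "B", "A", "C", "A", "B", "D", "A", "E", "B"]

counts = Counter(observations)
total = sum(counts.values())

# Frequency table: category -> (frequency, proportion).
table = {cat: (freq, freq / total) for cat, freq in sorted(counts.items())}

for cat, (freq, prop) in table.items():
    print(f"{cat}: frequency={freq}, proportion={prop:.2f}")

# Modal category: the category with the highest frequency.
modal_category, modal_freq = counts.most_common(1)[0]
print(f"modal category: {modal_category} ({modal_freq / total:.0%} of observations)")
```

The last line reports the two things a summary of a frequency table usually mentions: the modal category and its share of the observations.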

Note

When summarising a frequency table, mention

  • modal category
  • proportion or percentage for modal category

To use a frequency table for a quantitative variable, group its values into ranges and treat those ranges as categories.
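Grouping a quantitative variable into ranges can be sketched as follows. The ages and the bin width of 20 are assumptions chosen for illustration; in practice the width should suit the data.

```python
from collections import Counter

# Hypothetical ages in whole years (made-up data).
ages = [3, 17, 25, 31, 38, 42, 47, 55, 61, 79]
bin_width = 20  # assumed width, chosen only for this example

# Map each value to the lower bound of the range that contains it.
bin_counts = Counter((age // bin_width) * bin_width for age in ages)

for low in sorted(bin_counts):
    label = f"{low}-{low + bin_width - 1}"
    freq = bin_counts[low]
    print(f"{label}: frequency={freq}, proportion={freq / len(ages):.2f}")
```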


Center

Two common measures to summarise center are:

  • mean
  • median

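The sample mean is the sum of the observations divided by the number of observations $n$:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$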

Median

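The median is the middle value of the ordered observations. For $n$ observations:

  • if $n$ is odd: the $\frac{n+1}{2}$-th largest observation
  • if $n$ is even: the average of the $\frac{n}{2}$-th and $\left(\frac{n}{2}+1\right)$-th largest observations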

Note

Mean is sensitive to extreme observations, unlike median.

If the distribution of the dataset is:

  • highly skewed: report median
  • symmetric and bell-shaped: report mean
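To illustrate how the two measures react to an extreme observation, here is a small sketch using Python's `statistics` module. The income figures are invented for the example.

```python
import statistics

# Hypothetical incomes in thousands (made-up data).
incomes = [30, 32, 35, 38, 40]
with_outlier = incomes + [400]  # add one extreme observation

mean_before = statistics.mean(incomes)
median_before = statistics.median(incomes)      # odd n: the middle value
mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)  # even n: average of the two middle values

print(f"without outlier: mean={mean_before}, median={median_before}")
print(f"with outlier:    mean={mean_after:.2f}, median={median_after}")
```

The single extreme value pulls the mean far above most of the data, while the median barely moves.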

Variability

The common measurements of variability:

  • range
    • can always be reported, whatever the shape of the distribution
  • variance and standard deviation
    • used in conjunction with the mean if distribution is approximately bell-shaped
  • interquartile range (IQR)
    • used in conjunction with the median to summarise sample if distribution not bell-shaped

Range

The range is the difference between the largest and smallest observations in a dataset.

  • easy to compute
  • sensitive to extreme observations

Variance & standard deviation

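The variance is defined to be the average of the squared deviations of the values from the mean. For a sample of $n$ observations with sample mean $\bar{x}$, the sum of squared deviations is divided by $n - 1$:

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$$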

The standard deviation is defined to be the square root of the variance.

The standard deviation tells us how close the values are to the mean:

  • the larger the standard deviation, the more spread out the values are from the mean
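The definitions can be checked numerically with the `statistics` module. This is a sketch on made-up data: `pvariance` and `pstdev` divide by $n$ (the literal average of squared deviations), while `variance` and `stdev` divide by $n - 1$, the usual convention for a sample.

```python
import statistics

# Made-up data with mean 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)
squared_deviations = [(x - mean) ** 2 for x in data]

# Average of squared deviations (divide by n).
population_variance = sum(squared_deviations) / len(data)
# Sample variance (divide by n - 1).
sample_variance = sum(squared_deviations) / (len(data) - 1)

print(population_variance, statistics.pvariance(data))
print(sample_variance, statistics.variance(data))
print(statistics.pstdev(data), statistics.stdev(data))
```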

Linear Transformations

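  • adding a constant to every value shifts the mean but does not affect the variance (or standard deviation) of the transformed data
  • multiplying every value by a constant $a$ multiplies the variance by $a^2$ and the standard deviation by $|a|$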
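As a quick numerical check of how a linear transformation affects spread, the sketch below shifts and scales a small made-up dataset. The constants `a` and `b` are arbitrary choices for the example.

```python
import math
import statistics

x = [1, 2, 3, 4, 5]  # made-up data
a, b = 3, 10         # arbitrary scale and shift

shifted = [xi + b for xi in x]  # add a constant
scaled = [a * xi for xi in x]   # multiply by a constant

var_x = statistics.variance(x)
var_shifted = statistics.variance(shifted)
var_scaled = statistics.variance(scaled)

print(var_x, var_shifted, var_scaled)
# Adding b leaves the variance unchanged; multiplying by a scales it by a**2.
print(math.isclose(var_shifted, var_x), math.isclose(var_scaled, a**2 * var_x))
```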

Empirical Rule

If a distribution is bell-shaped

  • ~68% of observations fall within 1 standard deviation of the mean
  • ~95% of observations fall within 2 standard deviations of the mean
  • ~99.7% of observations fall within 3 standard deviations of the mean
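These percentages can be checked by simulation. The sketch below draws values from a normal distribution (an assumed stand-in for "bell-shaped", with an arbitrary mean of 170 and standard deviation of 10) and counts the share that falls within 1, 2 and 3 standard deviations of the sample mean.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Simulated bell-shaped data; the parameters are arbitrary assumptions.
sample = [random.gauss(170, 10) for _ in range(100_000)]
mean = statistics.fmean(sample)
sd = statistics.stdev(sample)

def share_within(k):
    """Fraction of observations within k standard deviations of the mean."""
    return sum(abs(x - mean) <= k * sd for x in sample) / len(sample)

for k in (1, 2, 3):
    print(f"within {k} sd: {share_within(k):.1%}")
```

The printed shares should land close to the rule's figures, with small differences due to sampling variation.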

Quantile/Percentiles

The $p$-th percentile is a value such that $p$% of the values fall below or at that value.

Quartile   | $Q_1$          | $Q_2$  | $Q_3$
Percentile | 25th           | 50th   | 75th
Name       | lower quartile | median | upper quartile

Interquartile range

The difference between the upper and lower quartiles: $\text{IQR} = Q_3 - Q_1$.
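A short sketch of computing the quartiles and the interquartile range with `statistics.quantiles`. The data are made up, and the `"inclusive"` method is one of several quartile conventions; other conventions can give slightly different values on small datasets.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # made-up data

# n=4 splits the data into quarters and returns the three cut points.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print(f"lower quartile={q1}, median={q2}, upper quartile={q3}, IQR={iqr}")
```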


Graphical Summaries

Histogram

A histogram uses bars to portray the (relative) frequencies of the possible outcomes of a quantitative variable.

Note

When analysing a histogram, mention

  • overall pattern (is there clustering)
  • modality (is the distribution unimodal, bimodal or multimodal)
  • skew (is the distribution symmetric, right skew, or left skew)

Diagram: features to describe in a histogram

  • Modality: unimodal or multimodal
  • Symmetry:
    • symmetric: median roughly equal to mean
    • left-skewed: mean lower than median
    • right-skewed: median lower than mean
  • Clusters: are the values clustered together? is there a gap?
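A rough text histogram is enough to read off these features. The sketch below uses made-up waiting times and an assumed bin width of 2, then compares the mean and median as a check on the direction of skew.

```python
import statistics
from collections import Counter

# Hypothetical waiting times in minutes (made-up data).
waits = [1, 2, 2, 3, 3, 3, 4, 4, 5, 6, 8, 12]
bin_width = 2  # assumed width for this example

# Count observations per bin, keyed by the bin's lower bound.
bins = Counter((w // bin_width) * bin_width for w in waits)

# Print every bin up to the largest one, including empty bins (gaps).
for low in range(0, max(bins) + bin_width, bin_width):
    print(f"{low:>2}-{low + bin_width - 1:<2} {'#' * bins[low]}")

mean = statistics.mean(waits)
median = statistics.median(waits)
print(f"mean={mean:.2f}, median={median}")
```

Here the bars form a single peak with a long tail and a gap towards the larger values, and the median is lower than the mean, which matches a right-skewed distribution.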


Two-Variable Exploration

Association

An association exists between two variables if a particular value of one variable is more likely to occur with certain values of the other variable.

The response variable is the variable on which comparisons are made. The explanatory variable is the variable on which the response is (believed to be) dependent.

Numerical Summaries

Contingency Table

Used for two categorical variables

Rows list the categories of one variable, and columns list the categories of the other variable. Each entry in the table is the number of observations in the sample at a particular combination of categories of the two variables.

    | y1 | y2 | y3 | ...
x1  |    |    |    |
x2  |    |    |    |
... |    |    |    |

Using relative frequencies such as percentages, where each column sums to 100%, the contingency table becomes:

      | y1  | y2  | y3  | ...
x1    |     |     |     |
x2    |     |     |     |
...   |     |     |     |
Total | 100 | 100 | 100 | 100
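Both versions of the table can be sketched with `collections.Counter`. The paired observations are made-up data, and the labels `x1`, `x2`, `y1`, `y2` simply mirror the placeholders in the tables above. Percentages are computed within each column, so every column totals 100.

```python
from collections import Counter

# Hypothetical (row category, column category) pairs (made-up data).
pairs = [
    ("x1", "y1"), ("x1", "y1"), ("x1", "y2"), ("x1", "y2"),
    ("x2", "y1"), ("x2", "y2"), ("x2", "y2"), ("x2", "y2"),
]

counts = Counter(pairs)
rows = sorted({x for x, _ in pairs})
cols = sorted({y for _, y in pairs})

# Column totals, then each cell as a percentage of its column.
col_totals = {y: sum(counts[(x, y)] for x in rows) for y in cols}
col_pct = {(x, y): 100 * counts[(x, y)] / col_totals[y] for x in rows for y in cols}

print("counts")
for x in rows:
    print(x, [counts[(x, y)] for y in cols])

print("column percentages")
for x in rows:
    print(x, [round(col_pct[(x, y)], 1) for y in cols])
```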

Bar Plots

Figures: clustered bar plot and stacked bar plot.

Boxplot

Used for one categorical variable and one quantitative variable.

Figure: side-by-side boxplots of the quantitative variable, one box per category (value1, value2) of the categorical variable.

We can look for:

  • outliers
  • skew of distributions
  • spread
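The numbers a boxplot draws can be computed directly. The sketch below finds the five-number summary for each category and flags outliers using the 1.5 × IQR fences, a common plotting convention that is assumed here and not stated in these notes. The two groups are made-up data.

```python
import statistics

# Hypothetical quantitative values for two categories (made-up data).
groups = {
    "value1": [4, 5, 5, 6, 7, 8, 20],
    "value2": [2, 3, 3, 4, 4, 5, 6],
}

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the 'inclusive' quartile method."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)

def outliers(values):
    """Values beyond 1.5 * IQR from the quartiles (assumed convention)."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    return [v for v in values if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]

for name, values in groups.items():
    print(name, five_number_summary(values), "outliers:", outliers(values))
```

Comparing the summaries side by side shows the same things the plot does: which group is more spread out, which is skewed, and which has outliers.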

Scatterplot

Figure: scatterplots showing the type of association (positive, negative, none) and how much the points vary about the approximation (low variance, high variance).

Look for

  • relationship/association between two variables
    • positive, negative or no association?

Correlation

  • is always between $-1$ and $1$.
  • a positive value indicates a positive association, and a negative value indicates a negative association
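Assuming the correlation meant here is the Pearson correlation coefficient $r$, it can be sketched from its definition as below. The function name `pearson_r` and the three small datasets are illustrative assumptions.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up data: hours studied, test score, and a tiredness rating.
hours = [1, 2, 3, 4, 5]
score = [52, 55, 61, 64, 70]
tired = [9, 7, 6, 4, 2]

r_positive = pearson_r(hours, score)
r_negative = pearson_r(hours, tired)

print(f"hours vs score: r={r_positive:.3f}")  # positive association
print(f"hours vs tired: r={r_negative:.3f}")  # negative association
```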