Data Visualization and Exploration

Tables and data manipulation

Introduction

Recap

Applications of histograms and densities

  • breaks vs. bandwidth
  • hatching vs. shading
  • frequency vs. probability

Tables

Tables are a form of data visualization.

Generating tabular displays (I)

Within “base R”, personal favorites are table() and aggregate().

table(warpbreaks$wool)
#> 
#>  A  B 
#> 27 27
table(warpbreaks[, c("wool", "tension")])
#>     tension
#> wool L M H
#>    A 9 9 9
#>    B 9 9 9
  • It is often useful to assign these tables to a temporary variable such as tab or tmp.
  • Then you can append to, reorder, or recode its values.
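
As a minimal sketch of that workflow: the name tab follows the suggestion above, while the use of addmargins() and prop.table() is an illustrative choice rather than something these notes prescribe.

```r
# Store the two-way table so it can be modified afterwards
tab <- table(warpbreaks[, c("wool", "tension")])

# Reorder the tension columns from high to low
tab_rev <- tab[, c("H", "M", "L")]

# Append row and column totals
tab_tot <- addmargins(tab)

# Re-express counts as proportions within each wool type (rows sum to 1)
tab_prop <- prop.table(tab, margin = 1)
```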

Generating tabular displays (II)

We can apply any relevant summary function within the levels of a single grouping variable.

aggregate(warpbreaks$breaks, by = list(warpbreaks$wool), mean)
#>   Group.1        x
#> 1       A 31.03704
#> 2       B 25.25926

Or by more groups.

agg2 <- aggregate(warpbreaks$breaks, by = list(warpbreaks$wool, 
                                               warpbreaks$tension), mean)

Explore relevant functions, possibly starting from ?mean to look for related calculations.
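
For instance, mean can be swapped for another summary. The sketch below uses median() and a small anonymous function; the name spread is only illustrative.

```r
# Median number of breaks for each wool type
med <- aggregate(warpbreaks$breaks, by = list(warpbreaks$wool), median)

# A custom function works too: here, the range (max minus min) per group
spread <- aggregate(warpbreaks$breaks, by = list(warpbreaks$wool),
                    function(v) diff(range(v)))
```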

Model syntax

Alternate “model” syntax is possible!

agg2b <- aggregate(breaks ~ wool + tension, warpbreaks, mean)

This is especially nice because the “group names” are preserved.
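
A quick way to see the difference is to compare the column names from the two calls. This sketch repeats both versions so that it runs on its own.

```r
# "by =" interface: grouping columns get generic names
agg2 <- aggregate(warpbreaks$breaks,
                  by = list(warpbreaks$wool, warpbreaks$tension), mean)

# Formula interface: the variable names carry through
agg2b <- aggregate(breaks ~ wool + tension, data = warpbreaks, FUN = mean)

names(agg2)   # "Group.1" "Group.2" "x"
names(agg2b)  # "wool"    "tension" "breaks"
```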

Merging and renaming

We can use the related command merge() to combine multiple aggregated results, and names() to give the combined result readable column names.

agg1 <- aggregate(warpbreaks$breaks, by = list(warpbreaks$wool,
                                               warpbreaks$tension), mean)
agg2 <- aggregate(warpbreaks$breaks, by = list(warpbreaks$wool, 
                                               warpbreaks$tension), sd)
agg <- merge(agg1, agg2, by = c("Group.1", "Group.2"))

The names Group.1 and Group.2 are automatically assigned.
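
The renaming step might look like the sketch below. Because both inputs have a summary column called x, merge() distinguishes them with its default .x and .y suffixes; the replacement names chosen here are only a suggestion.

```r
agg1 <- aggregate(warpbreaks$breaks,
                  by = list(warpbreaks$wool, warpbreaks$tension), mean)
agg2 <- aggregate(warpbreaks$breaks,
                  by = list(warpbreaks$wool, warpbreaks$tension), sd)
agg <- merge(agg1, agg2, by = c("Group.1", "Group.2"))

names(agg)  # "Group.1" "Group.2" "x.x" "x.y"

# Replace the automatic names with descriptive ones
names(agg) <- c("wool", "tension", "mean", "sd")
```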

More aggregating

Using the iris data, we can quickly create comprehensive summaries.

Each row in the data contains a species label and 4 numerical measurements. We can simultaneously compute multiple variable means within species.

aggregate(iris[, 1:4], by = list(iris[, 5]), mean)
#>      Group.1 Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1     setosa        5.006       3.428        1.462       0.246
#> 2 versicolor        5.936       2.770        4.260       1.326
#> 3  virginica        6.588       2.974        5.552       2.026

We could have selected the columns by name, but here the column numbers are convenient.
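
For comparison, here is a sketch of the same summary with columns selected by name. Naming the element of the by list (Species here) also replaces the default Group.1 label, and the formula version at the end is an equivalent alternative.

```r
measures <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")

# Select the measurement columns by name and label the grouping column
by_name <- aggregate(iris[, measures], by = list(Species = iris$Species), mean)

# Formula interface: "." stands for all remaining columns
by_formula <- aggregate(. ~ Species, data = iris, FUN = mean)
```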

Data “shapes”

Data is too often recorded in “wide” form, but “long” form is preferred.

The reshape() command can be used for converting between “shapes”.

This can get slow and messy, so we will save this for a later date.
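
For orientation only, a minimal sketch of a wide-to-long conversion is shown here. The data frame wide and its column names are hypothetical, invented for this illustration.

```r
# Hypothetical wide data: one row per site, one column per year
wide <- data.frame(id = 1:3,
                   y2020 = c(5, 3, 8),
                   y2021 = c(6, 4, 9))

# Stack the yearly columns into a single "count" column
long <- reshape(wide, direction = "long",
                varying = c("y2020", "y2021"),  # columns to stack
                v.names = "count",              # name of the stacked column
                timevar = "year",               # new column identifying the source
                times   = c(2020, 2021),        # values for the new column
                idvar   = "id")
```

Each of the 3 rows in wide becomes 2 rows in long, one per year.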

Dates

Dates, entered in a variety of formats, make analyses challenging.

Use substr() to extract parts of character-formatted dates, then aggregate observations by year (or another time increment).

dat <- read.csv("./data/inat-brief.csv")
dat$year <- substr(dat$observed_on, 1, 4)
tab <- table(dat[, c("year", "species_guess")])
tab <- as.data.frame(tab)
tab <- tab[tab$Freq > 0, ]
  • The command substr() above takes a column, and row-by-row extracts characters 1 through 4 (a 4-digit year).
  • The remaining lines tabulate year by species, convert the table to a long-form data frame, and drop unobserved combinations.
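
To try these steps without the course file, the sketch below builds a small stand-in data frame. The column names match those used above, but the rows are invented for illustration. Note that substr() only works here because every date is written as YYYY-MM-DD.

```r
# Hypothetical stand-in for the iNaturalist file; values are invented
dat <- data.frame(
  observed_on   = c("2019-05-02", "2019-07-14", "2020-06-30", "2020-08-01"),
  species_guess = c("Monarch", "Monarch", "Monarch", "Luna Moth")
)

# Characters 1 through 4 of each date string give the year
dat$year <- substr(dat$observed_on, 1, 4)

# Count observations per year and species, in long form
tab <- as.data.frame(table(dat[, c("year", "species_guess")]))

# Keep only the combinations that were actually observed
tab <- tab[tab$Freq > 0, ]
```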

Dates (II)

Use julian() to compute a “julian” date: the day expressed as a number of days since an origin (1970-01-01 by default), not the day of the year.

dat <- read.csv("./data/inat-brief.csv")
dat$julian <- julian(as.Date(dat$observed_on, "%Y-%m-%d"))
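
The sketch below, using two arbitrary example dates, shows what julian() returns and how to get a day-of-year number instead. The names d and doy are only illustrative.

```r
d <- as.Date(c("1970-01-10", "2024-03-01"), format = "%Y-%m-%d")

# Days since the default origin, 1970-01-01
since_epoch <- julian(d)

# Days since a chosen origin, here the start of 2024
since_2024 <- julian(d, origin = as.Date("2024-01-01"))

# Day of the year (1 to 366), often what is wanted for seasonal plots
doy <- as.integer(format(d, "%j"))
```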

Group activity

  1. Form groups.
  2. Explore the provided datasets and assess visualization options for one or more of them.
  3. Create one or more visualizations.
  4. Share results within your group.
  5. As a group, pick one (or more) graphics to refine. If you were the original developer, hand your work off to someone else.

Ask questions along the way!

Some advice

If you have,

  • multiple columns of numbers, consider a scatterplot
  • numbers and categories, consider boxplots, histograms, or densities

Consider annotating with,

  • color, symbol, linestyle, or shading
  • positioned text()
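
As a starting point, the suggestions above can be sketched with built-in data. The color and symbol choices below are arbitrary, and the names cols, pchs, and ctr are only illustrative.

```r
# Numbers and categories: boxplots of breaks for each wool type
boxplot(breaks ~ wool, data = warpbreaks,
        xlab = "Wool type", ylab = "Number of breaks")

# Multiple columns of numbers: a scatterplot, annotated by species
cols <- c(setosa = "steelblue", versicolor = "darkorange",
          virginica = "forestgreen")
pchs <- c(setosa = 16, versicolor = 17, virginica = 15)
sp <- as.character(iris$Species)

plot(iris$Petal.Length, iris$Petal.Width,
     col = cols[sp], pch = pchs[sp],
     xlab = "Petal length (cm)", ylab = "Petal width (cm)")

# Positioned text: label each species just above its mean position
ctr <- aggregate(iris[, c("Petal.Length", "Petal.Width")],
                 by = list(Species = iris$Species), mean)
text(ctr$Petal.Length, ctr$Petal.Width,
     labels = as.character(ctr$Species), pos = 3)
```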