Axes, Boxplots, and Histograms
Differences between `.Rmd` and `.qmd`
Similarities? Pretty much everything else of relevance.
Name files in the format `lastname-firstname-basename.<ext>`.
Create a `.zip` file of the main directory.
Submit the `.zip` file (format `lastname-firstname-basename.zip`) to the relevant D2L Assignments folder.
To follow.
Read in the data `exponential.csv`, give it a brief inspection, and make a plot that shows all three trials plotted against time. This could be done with `matplot()`.
Now plot just the first two rows. You can do this within the plot by putting `1:2` in the row position (the first index) of the data frame.
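As a minimal sketch of these steps, the block below builds a stand-in for `exponential.csv`, because the actual layout of the file is not shown here. The column names `time`, `trial1`, `trial2`, `trial3` and the generated values are assumptions for illustration only; with the real file, replace the stand-in with the `read.csv()` call.

```r
# Hypothetical stand-in for exponential.csv (column names and values are assumed).
# With the real file, use instead: dat <- read.csv("exponential.csv")
time <- 0:10
dat <- data.frame(
  time   = time,
  trial1 = 1.0 * exp(0.30 * time),
  trial2 = 2.0 * exp(0.25 * time),
  trial3 = 0.5 * exp(0.40 * time)
)

# Brief inspection
str(dat)
head(dat)

# All three trials against time: matplot() draws one line per column
matplot(dat$time, dat[, -1], type = "l", lty = 1,
        xlab = "time", ylab = "value")

# Only the first two rows: 1:2 goes in the row position of dat[rows, columns]
first_two <- dat[1:2, ]
matplot(first_two$time, first_two[, -1], type = "b", pch = 1, lty = 1,
        xlab = "time", ylab = "value")
```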
Suppose we are interested in “percent change” as a metric.
Percent change | Multiplier |
---|---|
700% | 8 |
100% | 2 |
0% | 1 |
-50% | \(\frac{1}{2}\) |
-87.5% | \(\frac{1}{8}\) |
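The table follows from the relationship multiplier = 1 + (percent change)/100. A short sketch reproduces the rows and also takes the base-2 logarithm of each multiplier, which shows why a log scale treats 700% and -87.5% as equal and opposite changes. The variable names are only for illustration.

```r
# Percent changes from the table above
pct <- c(700, 100, 0, -50, -87.5)

# Multiplier = 1 + percent change / 100
mult <- 1 + pct / 100

# On a log scale, multiplying by 8 and by 1/8 are the same distance from "no change"
data.frame(pct = pct, multiplier = mult, log2_multiplier = log2(mult))
```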
Graph all three columns of the data in `exponential.csv`, first on a linear scale and next on a log scale.
Observe how the use of vertical space draws attention to the “action” in the data. In other words, “what do you notice?”
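One way to set up the comparison is side by side, using the `log = "y"` argument that `matplot()` passes along to the underlying plot. This is a sketch on made-up exponential curves (the names `trial1` to `trial3` and their parameters are assumptions), not the actual contents of `exponential.csv`.

```r
# Hypothetical stand-in data: three exponential curves with different rates
time <- 0:10
trials <- cbind(
  trial1 = 1.0 * exp(0.30 * time),
  trial2 = 2.0 * exp(0.25 * time),
  trial3 = 0.5 * exp(0.40 * time)
)

# Two panels side by side; restore the old settings afterwards
op <- par(mfrow = c(1, 2))
matplot(time, trials, type = "l", lty = 1, main = "Linear scale",
        xlab = "time", ylab = "value")
matplot(time, trials, type = "l", lty = 1, log = "y", main = "Log scale",
        xlab = "time", ylab = "value")
par(op)
```

On the linear panel the late, large values take up most of the vertical space; on the log panel each curve is a straight line and the early values are no longer squashed against the axis.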
The default R axes are not necessarily pleasing.
As it is, few things will change that.
What to plot, what to label?
Regardless of choice, how do we make it user-friendly and “nice” aesthetically?
Suppress axes with `axes = F`, but invite them back with the `axis()` command.
You’ll have to ask and answer
Fortunately we do not have to worry about the horizontal axis (here representing “time”) because that pretty much only makes sense on a linear scale.
Suppose we wanted to show tick mark labels as powers of ten.
After setting `axes = F`, we could try a variety of things using `axis()`. We could

- specify `at` locations
- specify `at` locations and labels
- specify `at` locations and labels (one at a time)

Or we could (and will) use more advanced tools or specialized packages.
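A minimal sketch of the "locations and labels" approach, using only base R. The data are made up so that they span several powers of ten, and building the labels with `parse()` is just one possible choice for getting superscripts.

```r
# Hypothetical data spanning several orders of magnitude
time <- 0:10
y <- 2 * exp(0.9 * time)

# Draw the plot on a log scale with no axes at all
plot(time, y, log = "y", axes = FALSE, type = "b",
     xlab = "time", ylab = "y")

# Bring the horizontal axis back with its defaults
axis(1)

# Vertical axis: ticks at powers of ten, labelled as 10^k with real superscripts
pow <- 1:4
tick_labels <- parse(text = paste0("10^", pow))
axis(2, at = 10^pow, labels = tick_labels, las = 1)

# Redraw the frame that axes = FALSE removed
box()
```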
To illustrate value, use a linear scale; to inspect rate of change, use a log scale.
Consider the log-transformation of \(y = ae^{bt}\) which becomes \(\ln(y) = \ln(a) + bt\). (Verify this.)
For convenience we use the natural logarithm, but the rules work the same for any other choice of base.
This means the underlying exponential parameter \(b\) is emphasized as the slope when graphed on the log scale.
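A quick numerical check of this claim: generate data from \(y = ae^{bt}\) with known parameters, fit a straight line to \(\ln(y)\) against \(t\), and compare. The values of `a` and `b` below are arbitrary choices for illustration.

```r
# Arbitrary parameters for y = a * exp(b * t)
a <- 2
b <- 0.3
t <- 0:10
y <- a * exp(b * t)

# Fit a line to ln(y) versus t
fit <- lm(log(y) ~ t)

# The intercept should recover ln(a) and the slope should recover b
intercept <- unname(coef(fit)[1])
slope     <- unname(coef(fit)[2])
c(intercept = intercept, ln_a = log(a), slope = slope, b = b)
```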
Recall the “preattentive pop-out” data stored in `cards.csv`.
What visualizations are relevant?
A basic boxplot, as we’ve seen, comes pretty quickly.
What does it mean?
It might help to know what in our data is reflected in the plot.
For simplicity we will switch to a built-in dataset, `warpbreaks`. Use `?warpbreaks` to view its description and citation.
Roughly, it contains columns

- `breaks`: the number of breaks per length of yarn
- `wool`: the type of wool (A or B)
- `tension`: the weaving tension (L, M, H)

Notice that the order of specifying the variables matters!
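To see the effect of the ordering, here is a sketch using the formula interface of `boxplot()`. The first variable on the right-hand side varies fastest across the boxes, so swapping the two changes which groups sit next to each other.

```r
# Boxes grouped by tension, alternating wool A and B within each tension level
bp1 <- boxplot(breaks ~ wool + tension, data = warpbreaks,
               main = "breaks ~ wool + tension")

# Boxes grouped by wool, stepping through tension L, M, H within each wool
bp2 <- boxplot(breaks ~ tension + wool, data = warpbreaks,
               main = "breaks ~ tension + wool")

# boxplot() returns its summary invisibly; the names show the box order
bp1$names
bp2$names
```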
We could experiment with
There are some interesting features that emerge.
“Recently” boxplots (circa 1977) have evolved into violin plots (circa 1997).
More simply, boxplots can be overplotted with dots.
Since many of the dots might overlap, we might be interested in incorporating horizontal jitter to improve readability.
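A minimal sketch of overplotting with jitter, using `warpbreaks` grouped by `tension` only. It relies on the fact that `boxplot()` draws its boxes at x = 1, 2, 3, which match the integer codes of the factor levels; the jitter amount of 0.15 is an arbitrary choice.

```r
set.seed(1)  # make the random jitter reproducible

# Draw the boxes without outlier symbols so no point is drawn twice;
# ylim is set from the full data so the hidden outliers still fit
boxplot(breaks ~ tension, data = warpbreaks, outline = FALSE,
        ylim = range(warpbreaks$breaks))

# Convert the categorical variable to its integer codes (L = 1, M = 2, H = 3)
x_pos <- as.numeric(warpbreaks$tension)

# Nudge each point left or right by at most 0.15
x_jit <- jitter(x_pos, amount = 0.15)

points(x_jit, warpbreaks$breaks, pch = 16, col = "gray40")
```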
Things get a little slippery here. `tension` and `wool` are categorical, so it might take a bit of scratchwork to determine sensible, numerical input values.
The boxplot easily shows some major summary statistics. Sometimes we want more resolution.
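A histogram is one way to get that resolution. The sketch below draws the `breaks` column of `warpbreaks` twice, once as counts and once as densities, and stores the result of `hist()` in an object so its numbers can be inspected. The object name `hst` is simply a name that does not collide with the function `hist`.

```r
# Counts on the vertical axis (the default for equally spaced bins)
hst <- hist(warpbreaks$breaks, main = "Counts", xlab = "breaks")

# Densities on the vertical axis: bar areas sum to one
hst_dens <- hist(warpbreaks$breaks, freq = FALSE,
                 main = "Density", xlab = "breaks")

# The stored object holds the bin edges, counts, densities, and midpoints
str(hst)

# Total area under the density histogram
area <- sum(hst_dens$density * diff(hst_dens$breaks))
area
```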
A few comments:

- To show counts, use `freq = TRUE` (or nothing).
- To show densities, use `freq = FALSE`. This scales the graph to an AUC of one.
- Storing the histogram in an object such as `hst` (NOT “hist”) allows us to peek at the numerical properties.

It is possible to “smooth” a histogram by computing a “kernel density estimate”, analogous to an empirically-generated “probability density function”.
The `density()` command is worth some independent investigation if it interests you.
density(warpbreaks[warpbreaks$wool == "A", "breaks"], from = 0, to = 100)
#>
#> Call:
#> density.default(x = warpbreaks[warpbreaks$wool == "A", "breaks"], from = 0, to = 100)
#>
#> Data: warpbreaks[warpbreaks$wool == "A", "breaks"] (27 obs.); Bandwidth 'bw' = 5.733
#>
#> x y
#> Min. : 0 Min. :3.000e-09
#> 1st Qu.: 25 1st Qu.:1.851e-03
#> Median : 50 Median :6.585e-03
#> Mean : 50 Mean :9.956e-03
#> 3rd Qu.: 75 3rd Qu.:1.551e-02
#> Max. :100 Max. :3.152e-02
That could be plotted with `plot()`, or `plot(..., add = T)`.
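A sketch of both options for the wool A density estimate. For the overlay, this example swaps in `lines()` on top of a density-scaled histogram, on the assumption that `plot()` for a density object starts a new figure rather than adding to the current one.

```r
# Breaks for wool type A only
a_breaks <- warpbreaks[warpbreaks$wool == "A", "breaks"]

# Kernel density estimate evaluated on a grid from 0 to 100
d <- density(a_breaks, from = 0, to = 100)

# Option 1: the density curve on its own
plot(d, main = "Wool A: kernel density estimate")

# Option 2: histogram on the density scale with the curve drawn over it
hist(a_breaks, freq = FALSE, xlim = c(0, 100),
     main = "Wool A: histogram with density", xlab = "breaks")
lines(d, lwd = 2)

# Approximate area under the estimated curve (should be close to one)
approx_area <- sum(d$y) * diff(d$x)[1]
approx_area
```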