How can we see if there’s a difference between the trends over the years for these two groups? What will be suitable? ANOVA or T-tests ? How to pass Assumptions tests?
My answer: use an Analysis of Covariance (ANCOVA) and test for difference in slopes.
I tried to simulate some data somewhat similar to yours. The details are not important but I include the code in case you’re interested. One thing I do is make year a number instead of categories of “97-00”, “01-02”, etc.
# counts
intl_n <- c(11, 9, 19, 14, 15, 3)
us_n <- c(10, 28, 47, 46, 25, 13)
# data values
grp <- c("intl", "us")
yr <- 1:6
# simulate data
year <- c(rep(yr, intl_n), rep(yr, us_n))
group <- rep(grp, c(sum(intl_n), sum(us_n)))
Simulating a count is somewhat tricky. Below I simulate a count for which all ANCOVA assumptions will be satisfied! Notice the formula is basically 2 lines:
Notice the line generated depends whether group=="us"
. I also a bit of “noise” by way of drawing random values from a Normal distribution with mean = 0 and a standard deviation of 2. These are two key ANCOVA assumptions: constant variance and normally distributed residuals or noise.
count <- 4 + 2*(group=="us") + 1.5*year + -1*year*(group=="us") +
rnorm(n = sum(intl_n) + sum(us_n), mean = 0, sd = 2)
# create a data frame of our simulated data and round count to whole number
d <- data.frame(count = round(count), year, group)
It’s not quite the same as your data but close enough for a demo. The smooth straight line represents the trend for each group. It’s visibly different.
library(ggplot2)
ggplot(d) +
aes(x = year, y = count) +
geom_jitter(width = 0.1, height = 0) +
geom_smooth(method = "lm") +
facet_wrap(~group)
## `geom_smooth()` using formula 'y ~ x'
Are the trends different, and if so, how different are they?
Use the lm
function to run the ANCOVA. Truth be told, ANCOVA is simply regression with one numeric predictor and one categorical predictor.
A significant interaction is good sign the trends are honestly different. We can see that with the anova
function, which returns the familiar ANOVA table, and with the summary
function which summarizes the regression result. Again ANCOVA and ANOVA are both just special cases of regression.
m <- lm(count ~ year*group, data = d)
anova(m)
## Analysis of Variance Table
##
## Response: count
## Df Sum Sq Mean Sq F value Pr(>F)
## year 1 272.45 272.445 66.021 2.502e-14 ***
## group 1 88.61 88.613 21.473 5.936e-06 ***
## year:group 1 136.04 136.044 32.967 2.869e-08 ***
## Residuals 236 973.89 4.127
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(m)
##
## Call:
## lm(formula = count ~ year * group, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.4344 -1.2755 0.0764 1.3144 6.3039
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.5091 0.6044 5.806 2.05e-08 ***
## year 1.6036 0.1674 9.577 < 2e-16 ***
## groupus 2.6636 0.7535 3.535 0.000491 ***
## year:groupus -1.1831 0.2060 -5.742 2.87e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.031 on 236 degrees of freedom
## Multiple R-squared: 0.3379, Adjusted R-squared: 0.3295
## F-statistic: 40.15 on 3 and 236 DF, p-value: < 2.2e-16
The trends appear different. How different are they? We can analyze that with the emtrends function from the emmeans package. (You may need to install the emmeans package.)
The result says the difference in slopes is about 1.2. It also estimates the trend for “intl” to be about 1.5 and the trend for “us” to be about 0.3. Notice these are close to the “true” estimates we used to generate the data.
library(emmeans)
emtrends(m, pairwise ~ group, var = "year")
## $emtrends
## group year.trend SE df lower.CL upper.CL
## intl 1.604 0.167 236 1.274 1.934
## us 0.421 0.120 236 0.184 0.657
##
## Confidence level used: 0.95
##
## $contrasts
## contrast estimate SE df t.ratio p.value
## intl - us 1.18 0.206 236 5.742 <.0001
We can calculate a 95% confidence interval for the estimated difference of 1.2.
1.2 + (c(-1, 1) * 1.2 * 0.203 * qt(0.025, df = 236, lower.tail = FALSE))
## [1] 0.7200917 1.6799083
The easiest way to do this is with diagnostic plots. Simply call plot
on the model object. By default the 3 most “extreme” measures will be labeled, even if they’re not extreme relative to the rest of the data.
Check constant variance. This plot looks perfect. We want a fairly straight red line and even vertical scatter of points.
plot(m, which = 3)
Check normality of residuals. This plot looks perfect. We want the points to fall close to the diagonal line.
plot(m, which = 2)
Both plots look great because I simulated the data according to what ANOVA/ANCOVA models assume.