<- read.csv("63x 24hpf # Phago (by cell).csv")
d_count <- read.csv("63x 24hpf Avg Phago Vol (by cell).csv")
d_avg
# reshape data
library(tidyr)
<- pivot_longer(d_count, cols = 1:2,
d_count names_to = "genotype", values_to = "count")
$genotype <- factor(d_count$genotype)
d_count<- na.omit(d_count)
d_count <- pivot_longer(d_avg, cols = 1:2,
d_avg names_to = "genotype", values_to = "avg")
$genotype <- factor(d_avg$genotype)
d_avg<- na.omit(d_avg) d_avg
Statistical Analysis Approaches
Request
KL: I’ve recently run into some issues regarding statistical analyses of some data. I have been troubleshooting by Googling and asking a colleague, but we would like to bounce ideas off someone who has more knowledge than us. I was wondering if it was possible to set up an appointment? Following are some details of the data I’m working with:
Data from two different measurements were collected for two different genotypes
- organized by measurement, comparing one genotype against anothee
For both measurements, data is non-normal and has unequal variance
For data collected it may be possible that we need to further subset the measurement data as a possible difference between genotypes may be from a subset of data points
Have been using GraphPad Prism v9.5.0
Unsure of:
- whether to log transform the data and how that affects the biological & statistical interpretation
- whether/how to parse out outliers versus a potential subset of data points
- how to address the unequal variance
Import and visualize data
library(ggplot2)
ggplot(d_count) +
aes(x = genotype, y = count, fill = genotype) +
geom_jitter(width = 0.05) +
geom_violin(alpha = 1/4)
ggplot(d_avg) +
aes(x = genotype, y = avg, fill = genotype) +
geom_jitter(width = 0.05) +
geom_violin(alpha = 1/4)
Boxplots identify outliers as those observations that are more than 1.5 times the IQR (Interquartile Range) away from the median. The IQR is the range (max - min) of the middle 50% of the data. Multiply that value by 1.5. Values beyond that are labeled a statistical “outlier” relative to the rest of the data. They may not actually be an outlier in any scientific sense.
ggplot(d_count) +
aes(x = genotype, y = count) +
geom_boxplot()
ggplot(d_avg) +
aes(x = genotype, y = avg) +
geom_boxplot()
We can use the boxplot()
function to extract the outlying values and which “group” they belong to. Group 1 is blb, group 2 is WT.het.
<- boxplot(count ~ genotype, data = d_count, plot = FALSE)
box1 cbind(box1$group, box1$out)
[,1] [,2]
[1,] 1 12
[2,] 1 11
[3,] 1 14
[4,] 2 13
[5,] 2 17
[6,] 2 19
[7,] 2 16
[8,] 2 18
[9,] 2 27
[10,] 2 14
[11,] 2 14
[12,] 2 25
[13,] 2 19
[14,] 2 13
[15,] 2 31
[16,] 2 16
<- boxplot(avg ~ genotype, data = d_avg, plot = FALSE)
box2 cbind(box2$group, box2$out)
[,1] [,2]
[1,] 1 9.928139
[2,] 1 11.007095
[3,] 1 9.605910
[4,] 1 9.124490
[5,] 1 17.307765
[6,] 1 8.460956
[7,] 1 11.392602
[8,] 2 8.217241
[9,] 2 9.123331
[10,] 2 9.516083
[11,] 2 7.909019
[12,] 2 7.529833
[13,] 2 8.358252
[14,] 2 11.727533
[15,] 2 10.826498
[16,] 2 11.093107
[17,] 2 28.236500
[18,] 2 8.752085
Some analysis approaches
Traditional T-test
Null: means of two groups are the same. Small p-values provide against this hypothesis. The confidence interval on the difference of means is more informative.
<- t.test(count ~ genotype, data = d_count)
t_out_count t_out_count
Welch Two Sample t-test
data: count by genotype
t = -0.60912, df = 177.56, p-value = 0.5432
alternative hypothesis: true difference in means between group blb and group WT.het is not equal to 0
95 percent confidence interval:
-1.1955546 0.6315816
sample estimates:
mean in group blb mean in group WT.het
4.560606 4.842593
<- t.test(avg ~ genotype, data = d_avg)
t_out_avg t_out_avg
Welch Two Sample t-test
data: avg by genotype
t = 2.0846, df = 98.367, p-value = 0.0397
alternative hypothesis: true difference in means between group blb and group WT.het is not equal to 0
95 percent confidence interval:
0.04576414 1.85876651
sample estimates:
mean in group blb mean in group WT.het
3.292480 2.340214
Log-transformed T-test
<- t.test(log(count) ~ genotype, data = d_count)
t_out_Lcount t_out_Lcount
Welch Two Sample t-test
data: log(count) by genotype
t = 1.0507, df = 143.54, p-value = 0.2952
alternative hypothesis: true difference in means between group blb and group WT.het is not equal to 0
95 percent confidence interval:
-0.08519907 0.27855733
sample estimates:
mean in group blb mean in group WT.het
1.344350 1.247671
<- t.test(log(avg) ~ genotype, data = d_avg)
t_out_Lavg t_out_Lavg
Welch Two Sample t-test
data: log(avg) by genotype
t = 2.5549, df = 119.19, p-value = 0.01188
alternative hypothesis: true difference in means between group blb and group WT.het is not equal to 0
95 percent confidence interval:
0.1145066 0.9033791
sample estimates:
mean in group blb mean in group WT.het
0.56216633 0.05322348
The average of log-transformed values is called the geometric mean. We exponentiate to return to the original scale of the data.
exp(t_out_Lavg$estimate)
mean in group blb mean in group WT.het
1.754469 1.054665
The exponentiated confidence interval on the difference in log-transformed means is for the ratio of the geometric means. The blb genotype geometric mean is estimated to be anywhere from 12% to 2.5 times bigger than the mean for the WT genotype.
exp(t_out_Lavg$conf.int)
[1] 1.121320 2.467928
attr(,"conf.level")
[1] 0.95
Wilcoxon test
Null: the distributions of the two groups are the same. A small p-value provides evidence against this hypothesis. The confidence interval estimates the median of the difference between a sample from one genotype and a sample from the other genotype.
<- wilcox.test(count ~ genotype, data = d_count, conf.int = TRUE)
w_out_count w_out_count
Wilcoxon rank sum test with continuity correction
data: count by genotype
W = 7661, p-value = 0.3543
alternative hypothesis: true location shift is not equal to 0
95 percent confidence interval:
-4.284189e-05 9.999645e-01
sample estimates:
difference in location
4.392141e-05
<- wilcox.test(avg ~ genotype, data = d_avg, conf.int = TRUE)
w_out_avg w_out_avg
Wilcoxon rank sum test with continuity correction
data: avg by genotype
W = 8624, p-value = 0.009903
alternative hypothesis: true location shift is not equal to 0
95 percent confidence interval:
0.142138 1.266976
sample estimates:
difference in location
0.6963284
The results agree with the previous two approaches.
Permutation test
In a permutation test we hold the group labels fixed, shuffle the values around, and then calculate the difference in means between the two groups. If there is no difference in group means then it shouldn’t matter how we label the values. We do this many times and then see how often shuffling the values resulted in a larger test statistic (ie, difference in means) than what we observed. If there is no difference in group means, then we should see this happen a lot. If there is a difference in group means, then reshuffling the labels will make a difference and result in few instance of exceeding the observed difference in means.
<- d_avg$genotype == "WT.het" # the wt labels
wt <- !wt # The blb labels
blb <- d_avg$avg # the observed values
avg # observed difference in means
<- mean(avg[wt]) - mean(avg[blb])
test_stat test_stat
[1] -0.9522653
Now do a for loop 9999 times randomly reshuffling the values and calculate the difference in means each time.
<- double(9999)
meandiffs for(i in 1:length(meandiffs)){
<- sample(avg)
savg <- mean(savg[wt] - mean(savg[blb]))
meandiffs[i] }
Calculate proportion of test statistics larger than or smaller than -0.9522653.
<- abs(meandiffs) > abs(test_stat)
greater mean(greater)
[1] 0.02410241
The proportion of test statistics is anologous to a p-value and is the area in the histogram to the left and the right of the straight lines, which are the positive and negative values of the observed difference in means.
hist(meandiffs)
abline(v = test_stat)
abline(v = -test_stat)
This is fairly close, but a bit larger, than the p-value calculated by the first t-test.
The coin package in R will do this for us if you don’t feel like coding a for loop.
library(coin)
independence_test(avg ~ genotype, data= d_avg)
Asymptotic General Independence Test
data: avg by genotype (blb, WT.het)
Z = 2.2052, p-value = 0.02744
alternative hypothesis: two.sided
The result for the count data agrees with all previous analyses.
independence_test(count ~ genotype, data= d_count)
Asymptotic General Independence Test
data: count by genotype (blb, WT.het)
Z = -0.47708, p-value = 0.6333
alternative hypothesis: two.sided
Bootstrap analysis : difference in means
In a bootstrap analysis we resample the data (with replacement) over and over and calculate a statistic of interest such as a difference in means or medians. Below we resample the row numbers and then use those row numbers to select observations from the data frame using indexing brackets in the call to aggregate()
. We then take the difference means. We use the replicate()
function to replicate this chunk of code 999 times. When finished we find the 2.5 and 97.5 percentiles to form a 95% confidence interval on the difference in means.
<- replicate(n = 999, expr = {
boot_out <- sample(nrow(d_avg), replace = TRUE)
i <- aggregate(avg ~ genotype, data = d_avg[i,], FUN = mean)
d $avg[1] - d$avg[2]
d
})quantile(boot_out, probs = c(0.025, 0.975))
2.5% 97.5%
0.09464155 1.81291284
The boot package makes this a little easier and provides more options for the confidence interval. The “bca” type is the adjusted bootstrap percentile (BCa) interval.
library(boot)
<- function(d, i){
diffmeans <- aggregate(avg ~ genotype, data = d[i,], FUN = mean)
d $avg[1] - d$avg[2]
d
}<- boot(data = d_avg, statistic = diffmeans, R = 999)
b_out boot.ci(b_out, type = c("perc", "bca"))
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 999 bootstrap replicates
CALL :
boot.ci(boot.out = b_out, type = c("perc", "bca"))
Intervals :
Level Percentile BCa
95% ( 0.0968, 1.9252 ) ( 0.1224, 1.9846 )
Calculations and Intervals on Original Scale
Again the differences between the means in the count data is so small we can’t tell which one is bigger.
<- function(d, i){
diffmeans <- aggregate(count ~ genotype, data = d[i,], FUN = mean)
d $count[1] - d$count[2]
d
}<- boot(data = d_count, statistic = diffmeans, R = 999)
b_out2 boot.ci(b_out2, type = c("perc", "bca"))
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 999 bootstrap replicates
CALL :
boot.ci(boot.out = b_out2, type = c("perc", "bca"))
Intervals :
Level Percentile BCa
95% (-1.2952, 0.6346 ) (-1.2584, 0.6798 )
Calculations and Intervals on Original Scale
Bootstrap analysis : difference in medians
We might also want to look at a difference in medians. We simply change mean
to median
and re-run the code.
<- function(d, i){
diffmedian <- aggregate(avg ~ genotype, data = d[i,], FUN = median)
d $avg[1] - d$avg[2]
d
}<- boot(data = d_avg, statistic = diffmedian, R = 999)
b_out boot.ci(b_out, type = c("perc", "bca"))
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 999 bootstrap replicates
CALL :
boot.ci(boot.out = b_out, type = c("perc", "bca"))
Intervals :
Level Percentile BCa
95% ( 0.174, 1.835 ) ( 0.272, 1.969 )
Calculations and Intervals on Original Scale
<- function(d, i){
diffmedian <- aggregate(count ~ genotype, data = d[i,], FUN = median)
d $count[1] - d$count[2]
d
}<- boot(data = d_count, statistic = diffmedian, R = 999)
b_out2 boot.ci(b_out2, type = c("perc", "bca"))
Warning in norm.inter(t, adj.alpha): extreme order statistics used as endpoints
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 999 bootstrap replicates
CALL :
boot.ci(boot.out = b_out2, type = c("perc", "bca"))
Intervals :
Level Percentile BCa
95% (-1.0, 1.5 ) (-1.0, 0.0 )
Calculations and Intervals on Original Scale
Warning : BCa Intervals used Extreme Quantiles
Some BCa intervals may be unstable
Notice the warning issued when this is tried with the count data. I think this is probably due to the fact that the counts are highly distinct and skewed and that there just isn’t much difference in the medians, as seen when we compare summaries of counts between the two genotypes.
tapply(d_count$count, d_count$genotype, summary)
$blb
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.000 3.000 4.000 4.561 5.750 14.000
$WT.het
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.000 2.000 4.000 4.843 6.000 31.000
Conclusion
All of these results agree in substance. All show a difference in means/medians between the averages and show no differences in means/medians between the counts. In my opinion it hardly makes a difference which one you pick, though I suppose because of the skewed nature of your data reviewers will probably want to see something besides a basic t-test on the raw data.
However, as we discussed in the meeting, I think your analysis could be improved by using all the data in mixed-effect model. Some of the “significant” results of the average data could be due to a few large raw values skewing the average value for a cell.