
Demonstration: Simulating measurement reliability

Let’s start by recalling some notation introduced in lecture. For a measurement (test score) X, the variance
of the score, Var(X), can be broken down as Var(X) = Var(T) + Var(e), where Var(T) is the “true” variance
in test scores and Var(e) is the error variance for this measurement. (This assumes T and e are uncorrelated;
otherwise Var(X) = Var(T) + Var(e) + 2Cov(T, e), but no need to complicate things here.)
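Here’s a quick numerical check of that decomposition, in case it helps build intuition (a minimal sketch; the _demo variables are just for this aside and aren’t used later):

set.seed(42) # arbitrary seed so this aside reproduces
T_demo <- rnorm(1e5, mean = 50, sd = 10) # "true" scores, Var(T) = 100
e_demo <- rnorm(1e5, mean = 0, sd = 5)   # independent errors, Var(e) = 25
X_demo <- T_demo + e_demo
c(var_X = var(X_demo), var_T_plus_var_e = var(T_demo) + var(e_demo))
# both entries should be close to 125, since Cov(T, e) is ~0 by construction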
In practice, we never know the true score variance Var(T). Here, however, we’re going to simulate data X = T + e
so that you can see how our reliability and effect-size estimates are affected when we vary Var(T) and Var(e).
(While this is meant to provide an example of how you could simulate data in R – in case you ever need or
want to – we aren’t going to make you do that for this class.)
To simulate data in R, you can use R’s built-in distribution functions. Distribution functions that start
with the letter r (for “random”) generate random variables. For example, rnorm, which we’ll use here, generates
normally distributed random variables.
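(As an aside, each built-in distribution actually comes as a family of four functions, prefixed d for the density, p for the cumulative probability, q for quantiles, and r for random draws. For example:)

dnorm(0) # standard normal density at 0
pnorm(1.96) # P(Z <= 1.96), about 0.975
qnorm(0.975) # about 1.96; the inverse of pnorm
rbinom(3, size = 10, prob = 0.5) # three Binomial(10, 0.5) draws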
So let’s pretend we have 250 subjects, and we want to know how much their height influences their true skill
T on our test. (Maybe T is basketball skill, I don’t know.) So we’ll start by simulating heights, then we’ll
simulate a T that’s correlated with it.
Interlude: Random Seeds
A quick note about simulating random variables: since they’re supposed to be random, you wouldn’t expect
running the same code twice to produce the same results.
print(rnorm(10, mean = 60, sd = 10))

[1] 65.46131 57.03320 67.29597 37.39969 41.78761 57.95927 67.88379 64.63283
[9] 56.42395 59.32222

print(rnorm(10, mean = 60, sd = 10))

[1] 71.91635 43.98500 52.34182 70.90530 63.07640 59.75175 44.15960 45.72911
[9] 70.22320 67.31072

But sometimes we want to be able to reproduce our simulation results exactly, which is where set.seed
comes in. It resets R’s random number generator to a specific state, so we can get the same “random” values
every time.
set.seed(1234) # this is an arbitrary integer you choose
print(rnorm(10, mean = 60, sd = 10))

[1] 47.92934 62.77429 70.84441 36.54302 64.29125 65.06056 54.25260 54.53368
[9] 54.35548 51.09962

set.seed(1234)
print(rnorm(10, mean = 60, sd = 10))

[1] 47.92934 62.77429 70.84441 36.54302 64.29125 65.06056 54.25260 54.53368
[9] 54.35548 51.09962

Back to the Simulation: A Single Measurement
Let’s make some random heights.
set.seed(1)
n_subjects <- 250
height <- rnorm(n_subjects, 5, .5)
height[1:10] # show first ten

[1] 4.686773 5.091822 4.582186 5.797640 5.164754 4.589766 5.243715 5.369162
[9] 5.287891 4.847306

Now let’s simulate T to be correlated with height.
mean_T <- 10 + height * 7
T <- rnorm(n_subjects, mean = mean_T, sd = 10)
T[1:10]

[1] 44.16963 49.71443 41.37875 48.10684 53.10879 53.59064 22.67504 53.31153
[9] 50.76248 39.67846

Now we have T, but in a real study, we don’t observe their “true” score/skill – we observe their true score
plus measurement noise. So let’s add some noise; we’ll call the noise for test i ei, and define measurement i
as Xi = T + ei. Our first test measurement is then X1 = T + e1.
std_e = 15 # square root of the variance of e
e1 <- rnorm(n_subjects, mean = 0, sd = std_e)
X1 <- T + e1
X1[1:10]

[1] 45.32918 45.26140 23.63012 48.27623 67.98280 77.50016  2.08437 49.56737
[9] 68.15385 22.96513

Let’s combine X1 and height into a dataframe as if we just loaded in a dataset.
df <- data.frame(height, T, X1)
head(df)

    height        T       X1
1 4.686773 44.16963 45.32918
2 5.091822 49.71443 45.26140
3 4.582186 41.37875 23.63012
4 5.797640 48.10684 48.27623
5 5.164754 53.10879 67.98280
6 4.589766 53.59064 77.50016

Obviously, in a real dataset, we wouldn’t have access to T, but we’ll plot it here so you can see what the
“true” effect we’re trying to study is.
library(ggpubr)

Loading required package: ggplot2

Warning: replacing previous import 'vctrs::data_frame' by 'tibble::data_frame'
when loading 'dplyr'

ggscatter(df, x = "height", y = "T",
          add = "reg.line", # Add regression line
          add.params = list(color = "blue", fill = "lightgray"),
          conf.int = TRUE # Add confidence interval
) + stat_cor(aes(label = ..r.label..), label.x = 3)

geom_smooth() using formula 'y ~ x'

[Figure: scatterplot of T against height with fitted regression line and confidence band; R = 0.34]
Looks like heights are indeed correlated with true skill in our test. Now let’s look at the measurement we’d
have access to in a real study, which is X1 = T + e1 (a.k.a. T plus noise).
ggscatter(df, x = "height", y = "X1",
add = "reg.line", # Add regressin line
add.params = list(color = "blue", fill = "lightgray"),
conf.int = TRUE # Add confidence interval
) + stat_cor(aes(label = ..r.label..), label.x = 3)

geom_smooth() using formula 'y ~ x'

[Figure: scatterplot of X1 against height with fitted regression line and confidence band; R = 0.13]
Our effect estimate is biased to be much lower than the true effect size! How can we fix that? One idea
is that, if we can measure T multiple times (say, give multiple tests), then we can average over all of our
measurements for each subject to assign them a score. Let’s try that and see if it helps.
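Before we do, a note on why the estimate shrinks: under this uncorrelated-noise model, the observed correlation equals the true correlation times the square root of the reliability, Var(T)/(Var(T) + Var(e)). Here’s a minimal sketch checking that against our simulated objects (rel_X1 and predicted are demo names, not part of the exercise):

rel_X1 <- var(df$T) / (var(df$T) + std_e^2) # estimated reliability of one test
predicted <- cor(df$height, df$T) * sqrt(rel_X1)
c(predicted = predicted, observed = cor(df$height, df$X1))
# the two should roughly agree, up to sampling noise in e1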
Averaging over Multiple Measurements
We’ll simulate e2 through e5 and add them to T to make X2 through X5, which we’ll add to our dataset.

# simulate new data
new_e <- matrix(rnorm(4 * n_subjects, 0, std_e), ncol = 4)
new_X <- new_e + T

# let's put the new data in a dataframe
new_df <- data.frame(new_X)
colnames(new_df) <- c('X2', 'X3', 'X4', 'X5')

# and add the new dataframe to the old dataframe
df <- cbind(df, new_df)
head(df)

    height        T       X1       X2       X3       X4       X5
1 4.686773 44.16963 45.32918 22.34926 61.19411 45.14863 56.92028
2 5.091822 49.71443 45.26140 37.02961 66.39341 65.24432 35.83473
3 4.582186 41.37875 23.63012 22.62156 28.31709 75.28199 54.78247
4 5.797640 48.10684 48.27623 58.11616 51.26781 67.82728 33.99169
5 5.164754 53.10879 67.98280 33.74724 54.14972 40.05844 61.19307
6 4.589766 53.59064 77.50016 23.06559 28.65091 46.04369 50.86103
Great, now we have more data. Let’s see if averaging across our Xi measurements helps.
X_avg <- rowMeans(df[, 3:7]) # mean of columns 3-7
df$X_avg <- X_avg # add average to dataframe
head(df)

    height        T       X1       X2       X3       X4       X5    X_avg
1 4.686773 44.16963 45.32918 22.34926 61.19411 45.14863 56.92028 46.18829
2 5.091822 49.71443 45.26140 37.02961 66.39341 65.24432 35.83473 49.95269
3 4.582186 41.37875 23.63012 22.62156 28.31709 75.28199 54.78247 40.92664
4 5.797640 48.10684 48.27623 58.11616 51.26781 67.82728 33.99169 51.89584
5 5.164754 53.10879 67.98280 33.74724 54.14972 40.05844 61.19307 51.42625
6 4.589766 53.59064 77.50016 23.06559 28.65091 46.04369 50.86103 45.22428

ggscatter(df, x = "height", y = "X_avg",
add = "reg.line", # Add regressin line
add.params = list(color = "blue", fill = "lightgray"),
conf.int = TRUE # Add confidence interval
) + stat_cor(aes(label = ..r.label..), label.x = 3)

geom_smooth() using formula 'y ~ x'

[Figure: scatterplot of X_avg against height with fitted regression line and confidence band; R = 0.28]
Now our correlation estimate is much closer to the true correlation between T and height!
Of course, averaging over multiple measurements (be it multiple tests, multiple items on a scale, repeated
recordings of physiological data, etc.) will only help us insofar as those measurements are all measuring the
same thing.
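To see why averaging helped: the true-score part of each Xi is identical across tests, while the error terms are independent, so the error variance of an average of k measurements is Var(e)/k. A minimal sketch with our simulated data (e_mat is a demo name):

e_mat <- as.matrix(df[, 3:7]) - df$T # recover the five simulated error columns
var(rowMeans(e_mat)) # error variance of the average...
std_e^2 / 5 # ...should be near Var(e)/k = 225/5 = 45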
Estimating Consistency with Cronbach’s Alpha
In this case, we know that they do, because we know they were all generated from T, which is what we’re trying
to measure. However, we normally don’t have access to T, so we have to estimate whether our measurements
are reliable, which is to say that a sufficiently large fraction of the variance in each of our measurements Xi
is accounted for by some unobserved T.
This is often estimated with Cronbach’s alpha, which is defined as

Cronbach’s α = [k / (k − 1)] × [(total score variance − sum of item variances) / total score variance]

where k is the number of measurements per subject/unit.
Let’s see what that looks like for our simulated X.
library(psych) # you may need to install this package

Attaching package: 'psych'

The following objects are masked from 'package:ggplot2':

    %+%, alpha

X <- df[, 3:7]
alpha(X) # alpha function is from the psych package

Number of categories should be increased in order to count frequencies.

Reliability analysis
Call: alpha(x = X)

  raw_alpha std.alpha G6(smc) average_r S/N   ase mean sd median_r
       0.72      0.72    0.68      0.34 2.6 0.028   45 13     0.33

    lower alpha upper     95% confidence boundaries
     0.67  0.72  0.78

 Reliability if an item is dropped:
   raw_alpha std.alpha G6(smc) average_r S/N alpha se   var.r med.r
X1      0.66      0.66    0.60      0.33 1.9    0.035 0.00199  0.32
X2      0.69      0.69    0.63      0.36 2.2    0.032 0.00143  0.34
X3      0.67      0.67    0.61      0.34 2.1    0.034 0.00270  0.34
X4      0.66      0.66    0.60      0.33 2.0    0.035 0.00076  0.33
X5      0.68      0.68    0.62      0.35 2.2    0.033 0.00265  0.35

 Item statistics
     n raw.r std.r r.cor r.drop mean sd
X1 250  0.71  0.71  0.61   0.51   44 19
X2 250  0.66  0.66  0.52   0.44   45 19
X3 250  0.69  0.69  0.57   0.48   45 19
X4 250  0.70  0.71  0.60   0.51   46 19
X5 250  0.67  0.67  0.53   0.45   45 19

Our reliability estimate α correctly indicates that we have decent reliability.
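For reference, we can check this against the value implied by our simulation settings (something you could never do with real data, since it uses Var(T)): if each test has reliability rho = Var(T)/(Var(T) + Var(e)), the expected α for k = 5 parallel tests is k·rho/(1 + (k − 1)·rho). A sketch, with rho as a demo name:

rho <- var(df$T) / (var(df$T) + std_e^2) # single-test reliability, about 0.33
5 * rho / (1 + 4 * rho) # about 0.71, close to the raw_alpha of 0.72 above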
What if our data weren’t reliable? To see what happens to α when the variance of our measurement due
to noise, Var(e), greatly exceeds the variance due to our construct of interest, Var(T), we’ll make a new
variable Y which is a much noisier measurement.

# simulate measurements that are three times as noisy
new_e <- matrix(rnorm(5 * n_subjects, 0, 3*std_e), ncol = 5)
Y <- new_e + T
alpha(Y)

Number of categories should be increased in order to count frequencies.

Reliability analysis
Call: alpha(x = Y)

  raw_alpha std.alpha G6(smc) average_r  S/N   ase mean sd median_r
       0.31      0.32    0.28     0.085 0.47 0.069   46 25    0.089

    lower alpha upper     95% confidence boundaries
     0.18  0.31  0.45

 Reliability if an item is dropped:
   raw_alpha std.alpha G6(smc) average_r  S/N alpha se   var.r med.r
V1      0.23      0.24    0.20     0.074 0.32    0.079 0.00412 0.071
V2      0.27      0.27    0.22     0.085 0.37    0.076 0.00175 0.078
V3      0.27      0.27    0.23     0.085 0.37    0.076 0.00459 0.092
V4      0.22      0.22    0.18     0.065 0.28    0.081 0.00219 0.078
V5      0.35      0.35    0.29     0.119 0.54    0.067 0.00084 0.116

 Item statistics
     n raw.r std.r r.cor r.drop mean sd
V1 250  0.56  0.55  0.33   0.19   44 51
V2 250  0.53  0.52  0.28   0.15   43 50
V3 250  0.51  0.52  0.27   0.15   48 48
V4 250  0.54  0.57  0.38   0.21   44 45
V5 250  0.45  0.44  0.11   0.06   48 50
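With nine times the error variance (we tripled the error standard deviation), α drops sharply and correctly flags that these measurements are unreliable. The same back-of-the-envelope calculation as above roughly predicts the drop (a sketch):

rho_noisy <- var(df$T) / (var(df$T) + (3 * std_e)^2) # about 0.05 per test
5 * rho_noisy / (1 + 4 * rho_noisy) # about 0.2, within the 95% CI (0.18, 0.45) above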

Example 2.1: More Consistency Measures
For this part, and for the practice problem, you’ll need to (possibly install with install.packages('psych')
and) load the psych package (if you haven’t already for estimating α).
library(psych)
We’ll prepare some mock data.
df = data.frame(
x1 = c(0,0,0,1,0,1,0,1,0,0),
x2 = c(0,0,0,0,0,1,0,0,0,1),
x3 = c(0,1,0,1,0,1,0,1,1,1),
x4 = c(0,1,1,1,0,1,1,1,0,1)
)
head(df)

  x1 x2 x3 x4
1  0  0  0  0
2  0  0  1  1
3  0  0  0  1
4  1  0  1  1
5  0  0  0  0
6  1  1  1  1

2.1.1 Inter-item Correlation
R = lowerCor(df)

   x1   x2   x3   x4
x1 1.00
x2 0.22 1.00
x3 0.53 0.41 1.00
x4 0.43 0.33 0.36 1.00

corPlot(R)
[Figure: corPlot heatmap of the correlation matrix for x1–x4]
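(lowerCor is a convenience wrapper from psych; base R’s cor gives the same numbers as a full matrix, and the mean of its off-diagonal entries is the average inter-item correlation that alpha reports. A sketch, with R_full as a demo name:)

R_full <- cor(df) # full 4 x 4 correlation matrix
mean(R_full[lower.tri(R_full)]) # average inter-item correlation, about 0.38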
2.1.2 Split-Half Methods
There are a number of reliability metrics we can compute that consist of splitting the set of measurements
into two halves and checking whether results (across subjects) are consistent between the halves.
So let’s start by making an odd-even split of the measurements (a.k.a. the columns of our dataframe). To do
so, it’ll be helpful to use R’s seq command, which can generate a sequence of ordered values.
seq(1, 10, 2) # every 2nd value from 1-10

[1] 1 3 5 7 9

seq(2, 15, 3) # every 3rd value from 2-15

[1] 2 5 8 11 14

So we can get the index for every other column with seq(1, ncol(df), 2) (for odd-numbered columns) and
seq(2, ncol(df), 2) (for even-numbered columns). We’ll sum across the odd measurements for the first half
of our split and across the even measurements for the second half.

# make an odd-even split
halfA = rowSums(df[,seq(1, ncol(df), 2)])
halfB = rowSums(df[,seq(2, ncol(df), 2)])
Now we can compute various estimates.

# Spearman-Brown formula
r <- cor(halfA, halfB) # correlation between splits
n_splits <- 2 # number of splits
sb <- (n_splits * r) / (1 + (n_splits - 1) * r)
sb

[1] 0.665547

# Rulon's method
1 - (var(halfA - halfB) / var(halfA + halfB))

[1] 0.6590909
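Note that the odd-even split is only one of several possible ways to split four items into halves, and different splits generally give different estimates. If you’re curious, the psych package also has a splitHalf function that summarizes over possible splits; a sketch, hedged because the exact output depends on your package version:

splitHalf(df) # reports, e.g., the maximum, minimum, and average split-half reliability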

2.1.3 Hoyt’s Method

# this is a bit unwieldy but does what you see in the course notes
n <- nrow(df)
k <- ncol(df)
ssp <- k * var(rowMeans(df)) * (n - 1) # sum of squares for persons
dfp <- n - 1
msp <- ssp / dfp
ssi <- n * var(colMeans(df)) * (k - 1) # sum of squares for items
dfi <- k - 1
sst <- sum(apply(df, 1, var) * (k - 1)) # within-person sum of squares
ssr <- sst - ssi # residual sum of squares
dfr <- n * (k - 1) - dfi
msr <- ssr / dfr

# MS_person and MS_residual
c(ms_p = msp, df_p = dfp, ms_r = msr, df_r = dfr)

      ms_p      df_p      ms_r      df_r
 0.4888889 9.0000000 0.1407407 27.0000000

# reliability estimate
(msp - msr) / msp

[1] 0.7121212
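If you know ANOVA in R, here’s an optional cross-check of those mean squares: a sketch using base R’s aov on the same data reshaped to long format (long and fit are demo names, not from the course notes):

# reshape to long format: one row per (person, item) score
long <- data.frame(
  score = unlist(df), # column-major flattening of the dataframe
  item = factor(rep(colnames(df), each = nrow(df))),
  person = factor(rep(seq_len(nrow(df)), times = ncol(df)))
)
fit <- aov(score ~ person + item, data = long)
summary(fit) # the person and Residuals mean squares should match msp and msr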

Practice Problem
You can estimate Cronbach’s α, as we did in the simulations above, with the psych package’s alpha function.
Go ahead and do that with the dataframe we made in Example 2.1; it should give the output below.


Reliability analysis
Call: alpha(x = df)

  raw_alpha std.alpha G6(smc) average_r S/N  ase mean   sd median_r
       0.71      0.71    0.67      0.38 2.4 0.15 0.45 0.35     0.38

    lower alpha upper     95% confidence boundaries
     0.42  0.71     1

 Reliability if an item is dropped:
   raw_alpha std.alpha G6(smc) average_r S/N alpha se  var.r med.r
x1      0.63      0.63    0.54      0.36 1.7     0.20 0.0017  0.36
x2      0.70      0.70    0.62      0.44 2.4     0.16 0.0080  0.43
x3      0.59      0.59    0.50      0.32 1.4     0.22 0.0111  0.33
x4      0.66      0.65    0.59      0.39 1.9     0.18 0.0254  0.41

 Item statistics
    n raw.r std.r r.cor r.drop mean   sd
x1 10  0.76  0.75  0.64   0.53  0.3 0.48
x2 10  0.64  0.67  0.49   0.40  0.2 0.42
x3 10  0.80  0.79  0.70   0.58  0.6 0.52
x4 10  0.72  0.72  0.57   0.48  0.7 0.48

 Non missing response frequency for each item
     0   1 miss
x1 0.7 0.3    0
x2 0.8 0.2    0
x3 0.4 0.6    0
x4 0.3 0.7    0

Your estimate should be α ≈ 0.71. However, it’s also relatively simple to estimate α without using the psych
package.
Recall the formula for α, which is

Cronbach’s α = [k / (k − 1)] × [(total score variance − sum of item variances) / total score variance]
Let’s write code that computes it. First, define a variable k that holds the number of measurements. Remember
that each measurement gets its own column in our dataframe, so the number of measurements is just the
number of columns.
k <- # your code here
We also need the total score variance. The total score for each subject is the sum of their scores across all
measurements, and the total score variance is the variance of those totals.
tot_score <- # sum (across columns) of each row
tot_score_var <- var(tot_score)
And finally, we need the sum of item variances. While you could compute the variance of each item (a.k.a.
each column) by hand (i.e. call var for each column individually), it is more efficient to tell R to apply it to
all the columns at once. We haven’t told you how to do this, so here’s how:
item_vars <- apply(df, 2, var)
apply applies the function you specify (in this case, var) along the specified axis of the input dataframe. (In
this case, that is the second axis, which is columns; the first axis would be rows.) So we just computed the
variance within each column, which is what we need.
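(Equivalently, since a dataframe is a list of columns, sapply(df, var) applies var to each column and returns the same named vector; item_vars_alt below is just a demo name.)

item_vars_alt <- sapply(df, var) # same per-column variances
all.equal(item_vars_alt, item_vars) # should be TRUE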
Okay, now the moment of truth. Use k, tot_score_var, and sum(item_vars) to compute Cronbach’s α
with the formula given above.
alpha <- # your code here
alpha

[1] 0.7121212

You should again get α ≈ 0.71.
