R: Sample Variance And SD - InfluentialPoints

"It has long been an axiom of mine that the little things are infinitely the most important" (Sherlock Holmes)

Sample variance and Standard Deviation using R

Variance and SD

R can calculate the sample variance and sample standard deviation of our cattle weight data using these instructions:

# enter data
y=c(445, 530, 540, 510, 570, 530, 545, 545, 505, 535, 450, 500, 520, 460, 430, 520, 520, 430, 535, 535, 475, 545, 420, 495, 485, 570, 480, 495, 470, 490)

var(y)
sd(y)

Giving:

> var(y)
[1] 1713.333
> sd(y)
[1] 41.39243

Note:
  • var(y) instructs R to calculate the sample variance of y. In other words it uses n-1 'degrees of freedom', where n is the number of observations in y.

  • sd(y) instructs R to return the sample standard deviation of y, using n-1 degrees of freedom.

  • sd(y) = sqrt(var(y)). In other words, it is simply the square root of the n-1 variance, with no further correction for the small bias that taking the square root introduces.

  • The var function cannot give the 'population variance', which divides by n rather than n-1. But there are two simple ways to obtain it:

    # 'pop. var.' where n > 1
    n=length(y); var(y)*(n-1)/n
    # 'pop. var.' where n > 0
    mean((y-mean(y))^2)

  • Remember that if n=1 the second variance formula will always yield zero, because the mean of y will equal y, whereas the first formula will always yield NA, because var(y) would have to evaluate 0/(1-1) = 0/0, which cannot be done.

  • Similarly, to obtain the 'population' standard deviation, use:

    sqrt(mean((y-mean(y))^2))
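
The 'population' formulas in the notes above can be wrapped as small helper functions and checked against each other. This is only a sketch: the names pop_var and pop_sd are hypothetical (they are not part of base R), and the data are the cattle weights entered at the top of this page.

```r
# cattle weight data, as entered above
y <- c(445, 530, 540, 510, 570, 530, 545, 545, 505, 535,
       450, 500, 520, 460, 430, 520, 520, 430, 535, 535,
       475, 545, 420, 495, 485, 570, 480, 495, 470, 490)

# hypothetical helpers (not base R): variance and SD with n, not n-1, as divisor
pop_var <- function(x) mean((x - mean(x))^2)
pop_sd  <- function(x) sqrt(pop_var(x))

n <- length(y)

# both routes to the 'population' variance agree
all.equal(pop_var(y), var(y) * (n - 1) / n)   # TRUE

# sd() is the square root of var()
all.equal(sd(y), sqrt(var(y)))                # TRUE

# with a single observation the two formulas behave differently
pop_var(445)              # 0
var(445) * (1 - 1) / 1    # NA
```

Defining the helpers from the second formula means they still return a value (zero) when there is only one observation, whereas rescaling var() would return NA.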

Variance from frequencies and midpoints

R can calculate the variance from the frequencies (f) of a frequency distribution with class midpoints (y) using these instructions:

y=c(110, 125, 135, 155)
f=c(23, 15, 6, 2)
ybar=sum(y*f)/sum(f)
sum(f*(y-ybar)^2) / (sum(f)-1)

Giving:

[1] 143.8768

Note:
  • y=c(110, 125, 135, 155) copies the class interval midpoints into a variable called y.

  • f=c(23, 15, 6, 2) copies the frequency of each class into a variable called f.

  • ybar=sum(y*f)/sum(f) creates a variable called ybar, containing the arithmetic mean - as calculated from these frequencies and midpoints.

    Even if you have a more accurate arithmetic mean, calculated directly from the observations themselves, you still need to use the mean given by this formula. The sum of squared deviations is smallest when taken about the mean of the same grouped data, so substituting any other mean will make your estimated variance too high.

  • sum(f*(y-ybar)^2) / (sum(f)-1) calculates the sample variance from the frequencies, f, midpoints, y, and the mean estimated from them, ybar.

    Alternatively, you could combine two of these instructions as:

    sum(f*(y-sum(y*f)/sum(f))^2)/(sum(f)-1)

  • Remember that this only provides an estimate of the variance you would obtain from the original data, and that it depends upon the choice of midpoints and the number of class intervals used.
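
As a check on the grouped calculation, the sketch below wraps it in a function and compares it with var() applied to the midpoints, each repeated according to its frequency. The function name grouped_var and its centre argument are hypothetical, not part of base R. It also illustrates the point about the mean: the alternative value of 121 is arbitrary, made up purely for illustration.

```r
y <- c(110, 125, 135, 155)   # class midpoints
f <- c(23, 15, 6, 2)         # class frequencies

# hypothetical helper (not base R): sample variance from midpoints and
# frequencies; 'centre' defaults to the mean estimated from the grouped data
grouped_var <- function(y, f, centre = sum(y * f) / sum(f)) {
  sum(f * (y - centre)^2) / (sum(f) - 1)
}

grouped_var(y, f)   # 143.8768, matching the result above

# same result from var() on the midpoints, each repeated f times
all.equal(grouped_var(y, f), var(rep(y, f)))   # TRUE

# using any other mean (121 is an arbitrary value) gives a larger figure
grouped_var(y, f, centre = 121) > grouped_var(y, f)   # TRUE
```

The rep() route is convenient for small tables, but the frequency formula avoids building a long vector when the counts are large.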