18.443, February 6, 2008

The R environment; gravitational constant

In Athena, you can get into the statistical computing environment R by typing "add r" and then "R" (without quotes). I entered the data on 8 observations of the Newtonian gravitational constant from 1987 through 1998 by this command:

    x <- c(6.6670,6.6500,6.6656,6.6685,6.7154,6.6729,6.6740,6.6873)

"c" can be interpreted as "concatenate." It takes the numbers to be components of a vector, and "<-" is the assignment symbol, defining the left side to equal the right, like "=" in many computer languages. I created a further data vector y to include also the 1981 observation with the command

    y <- c(6.1,x)

which gives a vector with 9 components.

There is a command in R,

    sink("filename")

which causes output to be written to a file called filename rather than displayed on the screen. When you're finished creating the file you then type

    sink()

to return to displaying output on the screen.

There is a test of whether data could reasonably have come from a normal distribution (we haven't defined the notion of hypothesis test yet, but we will later in the course). The test is called the Shapiro-Wilk test. I started writing to a file with sink, then gave the following sequence of commands:

    x
    shapiro.test(x)
    y
    shapiro.test(y)

Then I stopped writing to the file with sink() and got out of R by q() [quit]. Here is the file. The [1] at the start of each line of printed data is the index of the first component shown on that line. Each vector here fits on one line, so only [1] appears. (R treats x and y as plain vectors, not as matrices.)

---------

[1] 6.6670 6.6500 6.6656 6.6685 6.7154 6.6729 6.6740 6.6873

        Shapiro-Wilk normality test

data:  x
W = 0.88, p-value = 0.1881

[1] 6.1000 6.6670 6.6500 6.6656 6.6685 6.7154 6.6729 6.6740 6.6873

        Shapiro-Wilk normality test

data:  y
W = 0.4782, p-value = 3.538e-06

- - - - - - - - - -

The quantity W ("Shapiro-Wilk test statistic," I guess W is Wilk's initial) is between 0 and 1.
Values closer to 1 indicate a better fit to a normal distribution (which could have any mean and variance). As the number n of observations gets larger, W needs to be closer to 1 to be consistent with normality. The p-value is the probability, if the data are i.i.d. normal, of observing a W at least as far from 1 as the one observed, for the given number of data points (8 or 9 in this case).

You can see that for x, the p-value is about 0.19, quite substantial, so we wouldn't reject the hypothesis that the data in x are normally distributed. However, for y, the p-value is less than 4 times 10^{-6}, very small. So for y, we reject the hypothesis of normality.

The components of x are grouped rather closely together, whereas the observation 6.1 is far from them; it's what's called an outlier. If all 9 observations came from one normal distribution with a variance like that seen in x, an observation that far from the rest would be extremely improbable. The observation 6.1 could have come from a normal distribution with the same mean as the others, but a larger variance.
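
The description of the p-value above can be checked by simulation. The following is only a rough sketch, not part of the original session: it repeats the two tests, then draws many normal samples of size 8 and counts how often their W is at most the W observed for x. The names res_x, res_y, w_sim, p_mc, n_sim and z_outlier, the seed, and the number of simulations are all arbitrary choices made for this illustration. Standard normal samples are enough because W does not change when the data are shifted or rescaled.

```r
# Data from the handout: 8 observations (x), plus the 1981 value (y).
x <- c(6.6670, 6.6500, 6.6656, 6.6685, 6.7154, 6.6729, 6.6740, 6.6873)
y <- c(6.1, x)

# shapiro.test returns a list: $statistic holds W, $p.value the p-value.
res_x <- shapiro.test(x)
res_y <- shapiro.test(y)

# Monte Carlo version of the p-value for x (seed and n_sim are
# illustrative assumptions): simulate W for i.i.d. normal samples of
# the same size and find the fraction at least as far from 1 as the
# observed W, that is, no larger than it.
set.seed(18443)
n_sim <- 10000
w_sim <- replicate(n_sim, unname(shapiro.test(rnorm(length(x)))$statistic))
p_mc <- mean(w_sim <= res_x$statistic)

# How far 6.1 lies from the other 8 observations, measured in their
# own sample standard deviations.
z_outlier <- (6.1 - mean(x)) / sd(x)

cat("p-value for x from shapiro.test:", res_x$p.value, "\n")
cat("p-value for x by simulation:    ", p_mc, "\n")
cat("p-value for y from shapiro.test:", res_y$p.value, "\n")
cat("6.1 in standard deviations of x:", z_outlier, "\n")
```

The simulated fraction should come out close to the p-value that shapiro.test reports for x, though the two need not agree exactly, since R computes its p-value from an approximation and the simulation has sampling error. The last line printed shows how many sample standard deviations of x separate 6.1 from the mean of x, which is one way to see why it counts as an outlier.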