
For a given data set, a five-number summary consists of:

- The median of the data
- The upper and lower quartiles of the data
- The minimum value of the data
- The maximum value of the data

The median of a numerical data set is a number m such that the number of data points less than m is the same as the number of data points greater than m.

For an odd number of data points, the median is chosen to be the middle data point, while for an even number of data points, the median is usually chosen to be the average of the two data points nearest the middle of the data.

The lower quartile splits the data below the median in the same way as the median splits the whole data set, and the upper quartile splits the data above the median in the same way.
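This splitting can be sketched in code. The following Python sketch (ours, using the median-of-halves convention just described; R's **summary()** uses a slightly different quartile rule) computes a five-number summary:

```python
import statistics

def five_number_summary(data):
    """Min, lower quartile, median, upper quartile, max,
    using the median-of-halves convention for the quartiles."""
    xs = sorted(data)
    n = len(xs)
    med = statistics.median(xs)
    lower = xs[:n // 2]            # data strictly below the median position
    upper = xs[(n + 1) // 2:]      # data strictly above the median position
    return (min(xs), statistics.median(lower), med,
            statistics.median(upper), max(xs))
```

For example, `five_number_summary([1, 2, 3, 4, 5, 6, 7, 8, 9])` gives `(1, 2.5, 5, 7.5, 9)`.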

A more formal definition of the median of a set of numerical data $x_1, x_2, \ldots, x_n$ is that it is a number $m$ that minimizes

$$f(m) = \sum_{i=1}^{n} |x_i - m|$$

as a function of $m$. In other words, for all numbers $x$ we have:

$$\sum_{i=1}^{n} |x_i - m| \le \sum_{i=1}^{n} |x_i - x|$$
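We can check this minimizing property numerically; a small Python sketch (illustrative only):

```python
import statistics

def total_abs_deviation(data, m):
    # f(m) = sum of |x_i - m| over the data
    return sum(abs(x - m) for x in data)

data = [2, 3, 5, 8, 13]
m = statistics.median(data)          # 5
# the median gives a total absolute deviation no larger
# than any other candidate point
best = total_abs_deviation(data, m)  # 16
```

Trying any other value, such as 4 (total 17) or 8 (total 19), gives a larger sum.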

The five-number summary is built into R as the **summary()** function, although this R summary is really a six-number summary since it includes the mean as well as the median of the data.

On that note, prior to Tukey, Arthur Lyon Bowley, in 1915, used a seven-number summary, consisting of the deciles in addition to the median, upper and lower quartiles, and the minimum and maximum.

A five-number summary of data is often presented in visual format as a box and whisker plot in which the bottom and top of the box are the first and third quartiles, and a line inside the box indicates the median. Whiskers extending from the top and bottom of the box are used to represent a variety of aspects of the data including the maximum and minimum values of the data, or one standard deviation above and below the mean. In the latter case, data not included in the whiskers is often indicated as an outlier by a dot.

In R, a boxplot for a data set can be produced using the **boxplot()** function. For example, with the data for birth weights (in ounces) of babies of a sample of mothers who did not smoke during pregnancy (details posted here), we get the following boxplot in R, using the code **boxplot(nosmokedata, range=1.5)**

The specification **range** determines how far the whiskers extend from the box: a range of 1.5 means that the whiskers extend to the extreme data point that is no more than 1.5 times the interquartile range (the difference between the upper and lower quartiles) from the box.
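In code, the whisker rule looks something like this (a Python sketch of the range × IQR convention; the quartile choice here is the median-of-halves one, whereas R's quantile algorithms differ slightly):

```python
import statistics

def whiskers(data, rng=1.5):
    """Whisker endpoints under the rng * IQR rule; points beyond
    the fences are reported separately as outliers."""
    xs = sorted(data)
    n = len(xs)
    q1 = statistics.median(xs[:n // 2])        # lower quartile
    q3 = statistics.median(xs[(n + 1) // 2:])  # upper quartile
    iqr = q3 - q1
    lo_fence, hi_fence = q1 - rng * iqr, q3 + rng * iqr
    inside = [x for x in xs if lo_fence <= x <= hi_fence]
    outliers = [x for x in xs if x < lo_fence or x > hi_fence]
    return min(inside), max(inside), outliers
```

With the data 1 through 10 plus an extreme value 100, the whiskers run from 1 to 10 and 100 is flagged as an outlier.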

There are various options to the **boxplot()** function that allow us to jazz up the resulting image. For example, we can add color and a notch in the box to indicate the median more clearly. In the following we have done that and also increased the range to 2:

The input to the **boxplot()** function in R is either a vector or a data frame. Using a data frame allows us to do a box and whisker plot, on the same output image, for more than one data set.

There is one proviso, however: the data sets have to be the same length to incorporate them into a data frame. What happens when we have related data sets that are of different lengths?

For example, the Stat Labs Data Page also has data on birth weights of babies of mothers who smoked during pregnancy, and that data is not the same size as the data on the babies of non-smoking mothers.

How do we combine these two different size data sets into a data frame?

Fortunately for us, this problem has arisen often enough that the prodigious Hadley Wickham has written a useful package called **plyr** that, among other things, will combine data sets of different lengths into a single data frame.

When you are using RStudio, for example, you can install the **plyr** package using the command **install.packages('plyr')**.

With the plyr package installed, you load it using the library command **library(plyr)**.

The plyr package is now ready to be used, and for our purposes we will utilize the **rbind.fill()** function in the plyr package.

The function **rbind()** can be used to join two data vectors of the same length into a data frame: it does not require the plyr package. The **rbind.fill()** function in the plyr package creates two new columns, each of total length equal to the sum of the lengths of the original data sets, placing NA values after the first data set and before the second, and then uses **rbind()** to combine these two padded columns into a single data frame.
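The NA-padding idea can be sketched in Python (a hypothetical helper of ours, not plyr itself), with NaN playing the role of NA:

```python
import math

def pad_to_common_length(a, b):
    """Pad the shorter list with NaN so both lists have equal length,
    mimicking the NA-filling that rbind.fill performs on columns."""
    n = max(len(a), len(b))
    pad = lambda v: list(v) + [math.nan] * (n - len(v))
    return pad(a), pad(b)
```

For example, padding `[1, 2, 3]` against `[4, 5]` leaves the first list unchanged and appends one NaN to the second.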

After importing the data on birth weights of babies of mothers who smoked during pregnancy – smokedata – we used **rbind.fill()** to combine these two data sets into a data frame:

> combined<-rbind.fill(nosmokedata,smokedata)

and then used the **boxplot()** function with the data frame combined as input to plot both box plots together:

> boxplot(combined, ylab="Birthweight (ounces)", col=c("royalblue2","red"))

Here we have colored the boxes differently for visual distinction, and labelled the vertical axis.

This is a basic skill in descriptive and exploratory data analysis – something that should be carried out for all data sets you encounter.

As a wanna-be, or practicing, data analyst it’s a skill you need at your fingertips. It is basic and, like all basic skills, should be practiced regularly.

x

*Box-plot with R – Tutorial* from R Bloggers, by Fabio Veronesi

Today I want to show how this might be easier to do in Mathematica®.

First let's define the function **zsf[data, k]** which calculates the proportion of data that lies within *k* standard deviations of the mean of the given data set:

**zsf[data_, k_] := Length[Cases[data, x_ /; Abs[x - Mean[data]] <= k*StandardDeviation[data]]]/Length[data]**

The code **Cases[data, x_ /; Abs[x - Mean[data]] <= k*StandardDeviation[data]]** keeps those instances, called x, of the data set that are within *k* standard deviations of the mean of the data.

As in R, we import the data as a text file from a URL:

**nosmokedata = Import["http://www.blog.republicofmath.com/wp-content/uploads/2015/06/nosmoke.txt", "List"]**

The **“List”** option tells Mathematica® to import the data string as a formatted (ordered) list, which in R would be seen as a vector.

We plot a histogram of the data:

**Histogram[nosmokedata]**

We calculate the fraction of data that lies within 1 standard deviation of the mean and express that both as a fraction and a floating point number:

**zsf[nosmokedata, 1]**

**N[%]**

270/371

0.727763

Then we plot *zsf[k]* as a function of *k* over the range 0 through 4, subdivided into 20 equal intervals, as well as present the results in table form:

**T = Table[{N[k], N[zsf[nosmokedata, k]]}, {k, 0, 4, 4/20}];**

**TableForm[T, TableHeadings -> {None, {"k", "zsf[k]"}}]**

**ListPlot[T, Joined -> True, Mesh -> All]**

Well, that’s it … the result could have been written very nicely in Mathematica® and saved as a PDF, or as a CDF and placed as an interactive document on the Web.

R has similar capabilities, so you pays your money and takes your choices.

I just feel data analysts should be aware there is a choice.

Now if Wolfram (Steve) could lower the price of Mathematica® to $50 …


For programming purposes, one should think of a function as a type of object that takes certain types of INPUTS, and – once those inputs are given – produces a definite OUTPUT.

A simple example is the AVERAGE function, for which the input is a list of numbers $x_1, x_2, \ldots, x_n$ and the output is the average

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$

We can think of the AVERAGE function schematically as follows:

The “object” that acts as input to the function is a finite list of numeric data. The “object” that is output by the function is a single numeric value. The function itself is also an “object”.


You can define functions yourself in R using the **function** command. When you create a function it is an **object** that can be used in R like any other object.

A function in R has 3 components:

- The function body. This is the entire code inside the function.
- The function arguments. These are the types of input needed for the function to work. They are specified in parentheses in the function definition.
- The function environment. This is a specification of the location of the inputs to the function.

The structure of a function will typically look like this:

functionname<-function(argument1, argument2, ...){computations using the arguments}

In this example we use a function to compute the fraction of data points, from a given data set, that lie within $k$ standard deviations of the mean.

We INPUT a vector of numerical data and a positive number $k$.

Note that the input vector is a single object in R: it has numerical components, yet so far as R is concerned it is a column vector (= an ordered list).

We OUTPUT a single number: the fraction of data points, from the input vector, that lie within $k$ standard deviations of the mean of the data.

In specifying this function we will do something that is very common – we will make use of other, existing R functions, such as **mean** and **standard deviation** (among others).

Let's name the function **zsf** (for "z-score function").

So our definition will begin:

zsf<-function(argument1, argument2, ...){computations using the arguments}

and we have to fill in the arguments and the computations to get the output.

The function arguments will be a column vector and a numeric value. Let's call the data vector argument **data**, and the numeric argument **k**:

zsf<-function(data, k){computations using the arguments}

To do the computation in the body of the function we have to compute the **mean** and **standard deviation** of the input data set:

mean(data), sd(data)

Note that the functions **mean** and **sd** are part of the R stats package, which loads automatically when you open R.

Now we have to carry out the remaining computations that will give us the output:

- First, we extract from the data those data points that are within *k* standard deviations of the mean. We do this using the built-in **subset** and **abs** functions:

subset(data, abs(data-mean(data))<=k*sd(data))

- Then we want the ratio of the size of this data subset to the full data set:

length(subset(data, abs(data-mean(data))<=k*sd(data)))/length(data)

Stringing this together, the final form for the function is:

zsf<-function(data, k){length(subset(data, abs(data-mean(data))<=k*sd(data)))/length(data)}
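For comparison, the same function can be sketched in Python (illustrative only; `statistics.stdev`, like R's **sd**, computes the sample standard deviation):

```python
import statistics

def zsf(data, k):
    """Fraction of data points within k sample standard
    deviations of the mean of the data."""
    m = statistics.mean(data)
    s = statistics.stdev(data)
    return sum(1 for x in data if abs(x - m) <= k * s) / len(data)
```

For example, `zsf([0, 0, 0, 10], 1)` returns 0.75: the mean is 2.5, the sample standard deviation is 5, and only the point 10 lies further than one standard deviation from the mean.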

To realize this function in R we first need to import a data set. Here is a data set consisting of birth weights, in ounces, of babies of mothers who did not smoke during pregnancy, taken from the Stat Labs Data Page, and formatted as a .txt file: **nosmoke**

By right-clicking on the “nosmoke” file link you can get the URL of this file, which is *http://www.blog.republicofmath.com/wp-content/uploads/2015/06/nosmoke.txt*

We import this data into R, name it “nosmoke” and define the z-score function as above, and use it to compute what fraction of the data is within 1 standard deviation of the mean:

- First, the **read.table** command reads in the nosmoke data, which has no header row, and for which we have blank separators:

> nosmoke<-read.table("http://www.blog.republicofmath.com/wp-content/uploads/2015/06/nosmoke.txt", header=FALSE, sep="")

- However, when R reads the data into memory in this format, it is a **data frame** and not a simple column vector. You can see this by entering:

> str(nosmoke)

to get the structure of the data frame “nosmoke”. You will see in the structure of the data frame a header for the birth weights. In our case the header, given by R, is ‘V1’:

‘data.frame’: 742 obs. of 1 variable:

$ V1: int 120 113 123 136 138 132 120 140 114 115 …

We can extract the column headed ‘V1’ as follows:

> nosmokedata=nosmoke[['V1']]

The “nosmokedata” is now a column vector, and we can, for example, plot a histogram of this data:

> hist(nosmokedata)

- Now we define the z-score function:

> zsf<-function(data,k){length(subset(data,abs(data-mean(data))<=k*sd(data)))/length(data)}

- And then we use the function to calculate what fraction of the data is within 1 standard deviation of the mean:

> zsf(nosmokedata,1)

[1] 0.7277628

So, approximately 72.8% of the data is within 1 standard deviation of the mean.

Given a data set, such as the “nosmokedata” above, we can vary *k* and plot the fraction of data within *k* standard deviations of the mean, as simply a function of *k* (with the data as a given: in other words, with the data argument fixed).

- First we need to calculate a potential range of values for the variable *k*.

We calculate:

> (mean(nosmokedata)-min(nosmokedata))/sd(nosmokedata)

[1] 3.911052

and

> (max(nosmokedata)-mean(nosmokedata))/sd(nosmokedata)

[1] 3.043495

and so see that all the data lies within 4 standard deviations of the mean.

- We then create a vector of possible values for *k* using the **seq** function:

> values<-seq(0, 4, by=4/20)

Here we have chosen the range from 0 through 4, and divided that into 20 equally spaced intervals. A readout shows us the computed range of k values:

> values

[1] 0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 1.6 1.8 2.0 2.2 2.4 2.6 2.8 3.0 3.2 3.4 3.6 3.8 4.0

We check that “values” is a vector:

> is.vector(values)

[1] TRUE

- Now we apply a function of 1 variable, *k*, to this vector of values.

First we define the function of the variable *k*:

> func<-function(k){zsf(nosmokedata,k)}

Then we use the function **lapply** to create a list as a result of applying this function of 1 variable to the vector "values", and we **unlist** the result to get a numeric vector:

> y=unlist(lapply(values, func))

- We combine these two vectors with the **cbind** function, to get a two-column array of *k* values paired with *zsf(nosmokedata, k)* values, and then plot it:

> pairs<-cbind(values,y)

> plot(pairs, type="b", main="zsf(k) as a function of k", xlab="k", ylab="zsf(k)")

The plot allows us to see visually how the proportion of data within *k* standard deviations of the mean varies with *k*.

Note: the option type="b" in the plot function draws both points and lines.

A skilled R programmer could have done this quicker and slicker: I hope I have laid out enough information so that you can see how functions are defined, can be applied to vectors, and plotted to give useful visual information.

Having defined a function such as the **zsf** function above, we would like to save it for future use.

*Customizing R* at *Quick R* by Rob Kabacoff has a useful discussion on loading user-defined functions into R on startup.


- User-written Functions from *Quick R* by Rob Kabacoff
- Functions from *Advanced R* by Hadley Wickham


How could we find similar numbers, and could there be a larger number than 3608528850368400786036725 with this property: that for every n, the number formed by its first n digits is divisible by n? (We will call this the *Vitale property*.)

The numbers with 2 digits having the Vitale property are just the *even* numbers between 10 and 98.

We will investigate this issue using Mathematica®, so here’s our list of such 2 digit numbers generated in Mathematica®:

**vitaleproperty[2] = Range[10, 99, 2]**

{10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,

52,54,56,58,60,62,64,66,68,70,72,74,76,78,80,82,84,86,88,90,92,94,96,98}

How about numbers with 3 digits with the Vitale property?

They have to be obtained from 2 digit numbers with the Vitale property by appending a digit on the right, and must be divisible by 3.

Here’s how Mathematica® can generate all the numbers obtained by appending a digit from a given number:

**adddigits[n_] := Table[FromDigits[Append[IntegerDigits[n], d]], {d, 0, 9}]**

For example, here’s what we get by applying the **adddigits** function to 248:

**adddigits[248]**

{2480, 2481, 2482, 2483, 2484, 2485, 2486, 2487, 2488, 2489}

To obtain 3 digit numbers with the Vitale property we have to append a digit to even numbers, and check which resulting 3 digit numbers are divisible by 3:

**Cases[Flatten[Map[adddigits, vitaleproperty[2]]], x_ /; Mod[x, 3] == 0]**

{102,105,108,120,123,126,129,141,144,147,162,165,168,180,183,186,189,201,

204,207,222,225,228,240,243,246,249,261,264,267,282,285,288,300,303,

306,309,321,324,327,342,345,348,360,363,366,369,381,384,387,402,405,

408,420,423,426,429,441,444,447,462,465,468,480,483,486,489,501,504,

507,522,525,528,540,543,546,549,561,564,567,582,585,588,600,603,606,

609,621,624,627,642,645,648,660,663,666,669,681,684,687,702,705,708,

720,723,726,729,741,744,747,762,765,768,780,783,786,789,801,804,807,

822,825,828,840,843,846,849,861,864,867,882,885,888,900,903,906,909,

921,924,927,942,945,948,960,963,966,969,981,984,987}

These are the 3 digit numbers divisible by 3 whose first 2 digits form a number divisible by 2.

This suggests how we can *recursively* build n digit numbers with the Vitale property, from n-1 digit numbers with the Vitale property:

**vitaleproperty[n_] := vitaleproperty[n] = Cases[Flatten[Map[adddigits, vitaleproperty[n - 1]]], x_ /; Mod[x, n] == 0]**
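The same recursion is easy to reproduce in Python (a sketch of ours; the function name is not from the original code):

```python
def vitale_numbers(n):
    """All n-digit numbers whose leading j-digit prefix is divisible
    by j for every j from 1 to n, built recursively as above."""
    if n == 1:
        return list(range(1, 10))   # every 1-digit number is divisible by 1
    return [10 * m + d
            for m in vitale_numbers(n - 1)
            for d in range(10)
            if (10 * m + d) % n == 0]
```

Here `len(vitale_numbers(2))` is 45, matching the count of even 2-digit numbers above.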

For example:

**vitaleproperty[4]**

yields:

{1020,1024,1028,1052,1056,1080,1084,1088,1200,1204,1208,1232,1236,1260,1264,

1268,1292,1296,1412,1416,1440,1444,1448,1472,1476,1620,1624,1628,1652,1656,

1680,1684,1688,1800,1804,1808,1832,1836,1860,1864,1868,1892,1896,2012,2016,

2040,2044,2048,2072,2076,2220,2224,2228,2252,2256,2280,2284,2288,2400,

2404,2408,2432,2436,2460,2464,2468,2492,2496,2612,2616,2640,2644,2648,2672,

2676,2820,2824,2828,2852,2856,2880,2884,2888,3000,3004,3008,3032,3036,

3060,3064,3068,3092,3096,3212,3216,3240,3244,3248,3272,3276,3420,3424,3428,

3452,3456,3480,3484,3488,3600,3604,3608,3632,3636,3660,3664,3668,3692,3696,

3812,3816,3840,3844,3848,3872,3876,4020,4024,4028,4052,4056,4080,4084,

4088,4200,4204,4208,4232,4236,4260,4264,4268,4292,4296,4412,4416,4440,

4444,4448,4472,4476,4620,4624,4628,4652,4656,4680,4684,4688,4800,4804,

4808,4832,4836,4860,4864,4868,4892,4896,5012,5016,5040,5044,5048,5072,

5076,5220,5224,5228,5252,5256,5280,5284,5288,5400,5404,5408,5432,5436,

5460,5464,5468,5492,5496,5612,5616,5640,5644,5648,5672,5676,5820,5824,5828,

5852,5856,5880,5884,5888,6000,6004,6008,6032,6036,6060,6064,6068,6092,

6096,6212,6216,6240,6244,6248,6272,6276,6420,6424,6428,6452,6456,6480,6484,

6488,6600,6604,6608,6632,6636,6660,6664,6668,6692,6696,6812,6816,6840,

6844,6848,6872,6876,7020,7024,7028,7052,7056,7080,7084,7088,7200,7204,

7208,7232,7236,7260,7264,7268,7292,7296,7412,7416,7440,7444,7448,7472,7476,

7620,7624,7628,7652,7656,7680,7684,7688,7800,7804,7808,7832,7836,7860,

7864,7868,7892,7896,8012,8016,8040,8044,8048,8072,8076,8220,8224,8228,

8252,8256,8280,8284,8288,8400,8404,8408,8432,8436,8460,8464,8468,8492,

8496,8612,8616,8640,8644,8648,8672,8676,8820,8824,8828,8852,8856,8880,

8884,8888,9000,9004,9008,9032,9036,9060,9064,9068,9092,9096,9212,9216,

9240,9244,9248,9272,9276,9420,9424,9428,9452,9456,9480,9484,9488,9600,

9604,9608,9632,9636,9660,9664,9668,9692,9696,9812,9816,9840,9844,9848,9872,9876}

The size of the collection of n-digit numbers with the Vitale property grows with n for a while, but then begins to decrease:

**T = Table[{n, Length[vitaleproperty[n]]}, {n, 2, 15}]**

{{2,45},{3,150},{4,375},{5,750},{6,1200},{7,1713},{8,2227},{9,2492},{10,2492},{11,2225},{12,2041},{13,1575},{14,1132},{15,770}}

A plot shows this initial increase and subsequent decrease quite markedly:

**ListPlot[T, PlotStyle -> Red, PlotMarkers -> {Automatic, Medium}]**

So, let's calculate, and plot, the number of n-digit numbers with the Vitale property versus n for n from 2 through 26:

**T = Table[{n, Length[vitaleproperty[n]]}, {n, 2, 26}]**

{{2,45},{3,150},{4,375},{5,750},{6,1200},{7,1713},{8,2227},{9,2492},{10,2492},{11,2225},{12,2041},{13,1575},{14,1132},{15,770},{16,571},{17,335},{18,180},{19,90},{20,44},{21,18},{22,12},{23,6},{24,3},{25,1},{26,0}}

**ListPlot[T, PlotStyle -> Red, PlotMarkers -> {Automatic, Medium}]**

So we see that there is exactly one 25 digit number with the Vitale property – the venerable 3608528850368400786036725 – and that appending any of the digits 0 through 9 to this number does not result in a 26 digit number that is divisible by 26. This means that 3608528850368400786036725 is the LARGEST number with the Vitale property.

Kudos to Ben Vitale for finding it!

**Postscript**: Éric Angelini points out that the number 3608528850368400786036725, and its properties discussed above, appears in the Online Encyclopedia of Integer Sequences.

Publisher: Republic of Mathematics

Publication Date: 2014

Number of Pages: 389

Format: Paperback

ISBN: 9780692262306


____________________________________________________________

Coffee, Love, and Matrix Algebra is a delightful work of fiction that chronicles the events of roughly a year in the life of the mathematics department at a university in Rhode Island. Can there be entertainment value in a book whose principal characters are math professors? Believe it or not, this is a page turner: The reader becomes emotionally invested in the ten to fifteen central characters and humor abounds throughout.

Many of the familiar features of academia are present. Faculty spar with administrators and faculty spar with faculty. In this particular department, there are essentially two groups: an energetic society of junior faculty engaged in exciting research, mostly in applied mathematics and/or statistics, and a smaller collection of senior faculty who no longer are — or never were — doing research. Members of the latter group for the most part stand in the way of the various initiatives that the younger faculty propose and strive to realize. Which group will exert more influence?

The main character is the very self-absorbed Jeffrey Albacete, whose fame in mathematics circles is due entirely to his very popular textbook *Matrix Algebra*, now in its 9th edition. Jeffrey is content, knowing that he is regarded internationally as an expert in matrix algebra. In Jeffrey's opinion, the International Linear Algebra Society puts too much emphasis on linear algebra. The breakaway International Matrix Algebra Society, of which Jeffrey is a founding member and past president, puts the emphasis in the right place — on matrix algebra — and holds his textbook *Matrix Algebra* in high esteem. This is academic math humor done right.

Jeffrey enjoys sitting in his office and looking over all of the editions of his influential textbook. He also enjoys drinking coffee. Part of his routine is to walk to the campus gym, spend a few very leisurely minutes on the exercise bike, and then head over to the Daily Grind, a campus coffee shop. While he waits in line to order coffee, he counts the number of customers ahead of him in the queue and estimates the number of bricks on the wall. He muses about how humans start with counting and then progress to advanced topics like matrix algebra, of which he is an acknowledged expert. (Readers who share Jeffrey’s counting compulsion might want to count how many blueberry muffins he consumes over the course of the story.)

Fortunately, most of the abundant humor in the book derives neither from mathematical compulsions nor from the perceived strangeness of mathematicians. Most of the characters would be at home in any intellectual line of work, and much of the book's humor is founded in the characters' humanity, especially in their relationships with each other. Numerous ironic comments and observations, ranging from explicit to subtle, focus on characters' foibles and interactions; the best of these remarks are worthy of Austen or Trollope. So one need not be a math professor to enjoy the book, although it is undoubtedly helpful. Granted, plenty of the mirth derives from the absurdities of academia, and the story delights those of us on "the inside" because it paints such an accurate picture of our work lives. Thankfully, however, perhaps even this academic humor can find an audience beyond the ivory tower: When situations arise that might be opaque to "outsiders," the author routinely devotes a paragraph to spelling things out.

The characters who receive the most favorable treatment in the book — including those outside the university — are young, bright, driven, and for the most part technologically proficient. Their home is scientific computation, broadly defined, as they work to both advance and disseminate knowledge. One character who does not dwell in this realm is Alex the dog, who like Snoopy doesn’t talk but has powers beyond those of the typical canine. Another non-human character is perhaps the author’s favorite: Wolfram Research’s Computable Document Format. It is lauded throughout, even to the point that on more than one occasion someone conjectures that MathWorks must be sweating bullets.

Most of the main characters have good intentions as they strive to excel at their jobs and positively influence those around them. The characters frequently offer each other (and us) a healthy dose of simple wisdom, such as to view the vagaries of life not as problems but as opportunities. Over the course of the story, the optimistic and energetic characters do exert an influence on the pessimistic and moribund, but exactly how I will leave to the reader to discover. This is an enjoyable read and highly recommended.


The gain statistic assesses the amount by which individual students increase their test scores from initial-test to final-test, expressed as a proportion of the possible increase for each student.
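For a single student, the gain statistic as described can be written as follows (a sketch; the variable names are ours):

```python
def gain(initial, final, max_score):
    """Gain: the achieved score increase as a proportion of the
    largest increase possible for that student."""
    return (final - initial) / (max_score - initial)
```

For example, a student who scores 40 on the initial test and 70 on the final test, out of a maximum of 100, has a gain of 30/60 = 0.5.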

We examined the written work in mathematics classes of pre-service elementary teachers with very high gain and those with very low gain and showed that these groups exhibit distinct psychological attitudes and dispositions to learning mathematics.

We showed a statistically significant, small, increase in average gain when course goals focus on patterns, connections, and meaning making in mathematics.

A common belief is that students with low initial-test scores will have higher gains, and students with high initial-test scores will have lower gains. We showed that this is not correct for a cohort of pre-service elementary teachers.

For example, if we produce one data set of 500 numbers between 0 and 1, chosen uniformly and randomly (at least as randomly as we can using a pseudo-random number generator), and another data set of 500 numbers distributed normally, with mean 0 and variance 1, then eyeballing their histograms tells us, even if we did not know, that they are not similarly distributed:


Mathematica® incorporates these, and other goodness-of-fit tests in the function **DistributionFitTest[]**

These goodness-of-fit tests basically perform a hypothesis test with the null hypothesis being that the data sets are identically distributed, and an alternative hypothesis that they are not.

The goodness-of-fit tests return a p-value, and a small p-value indicates it is unlikely the data sets are similarly distributed.

So, if we carry out the Pearson Chi-Square test on the uniform and normal data sets, as above, we get an exceptionally small p-value, indicating very strongly that the two data sets are not similarly distributed.

The p-value from a Pearson Chi-Square test is a random variable: if we carry out the test on two other data sets, one from a uniform distribution, the other from a normal distribution, we will get a somewhat different p-value. How different? The plot below shows the variation in p-values when we simulated choosing 500 uniformly distributed numbers and 500 normally distributed numbers 1,000 times:

We see that despite some variation the values are all very small, as we would expect.

Now let’s see what happens if we choose 500 points from a uniform distribution, and 500 more points from the same uniform distribution. We expect the Pearson Chi-Square test to return a reasonably high p-value, indicating that we cannot reject the idea that the data come from the same distribution. We did this once and got a satisfying 0.741581 as the p-value.

But what if we repeat this experiment 1,000 times? How will the p-values vary?

The plot below shows the result of 1,000 simulations of choosing two data sets of 500 points, each from the same uniform distribution:

These p-values seem reasonably uniformly spread between 0 and 1. Are they? The Cramér-von Mises goodness-of-fit test indicates that we cannot reject the hypothesis that these p-values are uniformly distributed in the interval [0,1].

We set the significance level for the Pearson Chi-Square test at 0.01, so we could expect that 1 time in 100 the Pearson Chi-Square test will indicate that the two data sets are not from the same distribution, even though they are. In 1,000 trials we could expect about 10 such instances, and that is more or less what we find.

The uniform distribution of the p-values is, at first glance, quite surprising, but since the p-values themselves are random values we expect that they will indicate something other than what we know to be the case every so often, depending on the significance level we set beforehand. For example, with the significance level set at 0.05, we see that about 5% of the time the Pearson Chi-Square test indicates that the two data sets are not from the same distribution even though they are.

Notes:

- We randomly reset the seed for the pseudo-random number generator in Mathematica® at each of the 1,000 simulations.
- The result of uniformly distributed p-values for data sets from the same distribution is not peculiar to the Pearson Chi-Square test.
- The uniform distribution of p-values under the null hypothesis is proved here.
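The uniformity of p-values under the null hypothesis comes from the probability integral transform: if the test statistic has a continuous CDF F under the null, then 1 − F(statistic) is uniform on [0, 1]. A small Python simulation (ours, not the Mathematica code used above) illustrates this for a normally distributed statistic:

```python
import math
import random

def p_value(z):
    # one-sided p-value for a standard normal test statistic:
    # 1 - Phi(z), where Phi is the standard normal CDF
    return 1 - 0.5 * (1 + math.erf(z / math.sqrt(2)))

random.seed(42)
# under the null the statistic really is N(0, 1), so the resulting
# p-values should be (approximately) uniformly distributed on [0, 1]
ps = [p_value(random.gauss(0, 1)) for _ in range(20000)]
frac_below_quarter = sum(1 for p in ps if p < 0.25) / len(ps)
```

Here `frac_below_quarter` comes out close to 0.25, as uniformity predicts.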


Sometimes you will see this sort of reasoning:

$$\sum_{k=0}^{\infty} x^k = \frac{1}{1-x}, \quad \text{so} \quad \sum_{k=1}^{\infty} k x^k = \frac{x}{(1-x)^2}, \quad \text{so} \quad \sum_{k=1}^{\infty} \frac{k}{2^k} = \frac{1/2}{(1-1/2)^2} = 2$$

which is somewhat suspect in light of Euler's "argument":

$$\sum_{k=0}^{\infty} x^k = \frac{1}{1-x}, \quad \text{so} \quad 1 + 2 + 4 + 8 + \cdots = \frac{1}{1-2} = -1.$$

We need first to know that $\sum_{k=1}^{\infty} \frac{k}{2^k}$ converges, and the ratio test does the job: the $(k+1)^{\text{st}}$ term divided by the $k^{\text{th}}$ term is $\frac{(k+1)/2^{k+1}}{k/2^k} = \frac{k+1}{2k}$, which approaches $\frac{1}{2} < 1$.

Another application of the ratio test shows that the series $\sum_{k=1}^{\infty} \frac{k^2}{2^k}$ is absolutely convergent: the $k^{\text{th}}$ term of this series is $\frac{k^2}{2^k}$, the $(k+1)^{\text{st}}$ term is $\frac{(k+1)^2}{2^{k+1}}$, and their ratio is $\frac{(k+1)^2}{2k^2}$, which approaches $\frac{1}{2}$ as $k \to \infty$.

If we denote $S = \sum_{k=1}^{\infty} \frac{k}{2^k}$ then a simple, and legitimate, calculation shows that $2S - S = 1 + \sum_{k=1}^{\infty} \frac{1}{2^k}$, which we know to be $2$; in other words, $S = 2$.

In fact, for any natural number $p$ the ratio test shows that the series $\sum_{k=1}^{\infty} \frac{k^p}{2^k}$ is absolutely convergent.

A Mathematica® calculation:

**TableForm[Table[{p, Limit[Sum[k^p/2^k, {k, 1, n}], n -> Infinity]}, {p, 0, 10}], TableHeadings -> {None, {"p", "S(p)"}}]**

yields the following results for $p = 0, 1, \ldots, 10$:

The first curious thing is that the results are all whole numbers. Why is that?

The second curious thing is that if we enter this sequence of whole numbers into the The On-Line Encyclopedia of Integer Sequences we get a match:

these whole numbers match the number of necklaces of partitions of p+1 labeled beads. They also match the sequence of cumulants of the probability distribution of the number of tails before the first head in a sequence of fair coin tosses. Is that right? If so, why? The case *p* = 2 is the first really interesting one.
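We can verify the first few whole-number values independently of Mathematica; a quick Python check (ours) using exact rational arithmetic, where 500 terms is far more than enough for these small p:

```python
from fractions import Fraction

def S(p, terms=500):
    """Partial sum of sum_{k>=1} k^p / 2^k, computed exactly."""
    return sum(Fraction(k ** p, 2 ** k) for k in range(1, terms + 1))

# the partial sums land within a hair of whole numbers
values = [round(float(S(p))) for p in range(5)]   # [1, 2, 6, 26, 150]
```

The values 1, 2, 6, 26, 150 agree with the sums $S(0), \ldots, S(4)$.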

A detailed account of these and similar series is given by Mircea Cîrnu, "Determinantal formulas for sum of generalized arithmetic-geometric series", Boletín de la Asociación Matemática Venezolana, Vol. XVIII, No. 1 (2011), 13-25.

Enjoy! It’s connections like these that give mathematicians a buzz.
