Data analysis: R functions for beginners

x

What is a function, anyway?

x

For programming purposes, one should think of a function as a type of object that takes certain types of INPUTS, and – once those inputs are given – produces a definite OUTPUT.

A simple example is the AVERAGE function, for which the input is a list of numbers

x_1,\ldots, x_n

and the output is the average

\frac{x_1+\ldots +x_n}{n}

We can think of the AVERAGE function schematically as follows:

Average function machine

The “object” that acts as input to the function is a finite list of numeric data. The “object” that is output by the function is a single numeric value. The function itself is also an “object”.

Functions in R

x

You can define functions yourself in R using the function command. When you create a function it is an object, that can be used in R like any other object.

A function in R has 3 components:

  • The function body. This is the entire code inside the function.
  • The function arguments. These are the types of input needed for the function to work. They are specified in the body of the function.
  • The function environment. This is a specification of the location of the inputs to the function.

The structure of a function will typically look like this:

functionname<-function(argument1, argument2, … ){computations using the arguments}

An example

In this example we use a function to compute the fraction of data points, from a given data set, that lie within k>0 standard deviations from the mean.

We INPUT a vector of numerical data c(x_1,\ldots , x_n) and a positive number k.

Note that the input vector is a single object in R: it has numerical components, yet so far as R is concerned it is a column vector (= an ordered list).

We OUTPUT a single number: the fraction of data points, fromc(x_1,\ldots , x_n), that lie within k>0 standard deviations from the mean of c(x_1,\ldots , x_n).

In specifying this function we will do something that is very common – we will make use of other, existing R functions, such as mean and standard deviation (among others).

Let’s see how this might work

Let’s name the function zsf (for “z-score function”).

So our definition will begin:

zsf<-function(argument1, argument2, … ){computations using the arguments}

and we have to fill in the arguments and the computations to get the output.

The function arguments will be a column vector and a numeric value. Let’s call the data vector argument data, and the numeric argument k:

zsf<-function(data, k){computations using the arguments}

To do the computation in the body of the function we have to compute the mean and standard deviation of the input data set:

mean(data), sd(data)

Note that the functions mean and sd are part of the R stats package, which loads automatically when you open R.

Now we have to carry out the remaining computations that will give us the output:

  • First, we extract from the data those data points that are within k standard deviations of the mean m. We do this using the built-in subset and abs function:

subset(data, abs(data-mean(data))<=k*sd(data))

  • Then we want the ratio of the size of this data subset to the full data set:

length(subset(data, abs(data-mean(data))<=k*sd(data)))/length(data)

Stringing this together, the final form for the function is:

zsf<-function(data, k){length(subset(data, abs(data-mean(data))<=k*sd(data)))/length(data)}

Realizing the function in R

To realize this function in R we first need to import a data set. Here is a data set consisting of birth weights, in ounces, of babies of mothers who did not smoke during pregnancy, taken from the Stat Labs Data Page, and formatted as a .txt file: nosmoke

By right-clicking on the “nosmoke” file link you can get the URL of this file, which is http://www.blog.republicofmath.com/wp-content/uploads/2015/06/nosmoke.txt

We import this data into R, name it “nosmoke” and define the z-score function as above, and use it to  compute what fraction of the data is within 1 standard deviation of the mean:

  • First, the read.table command reads in the nosmoke data, which has no header row, and for which we have blank separators:

> nosmoke<-read.table(“http://www.blog.republicofmath.com/wp-content/uploads/2015/06/nosmoke.txt”, header=FALSE, sep=””)

  • However, when R reads the data into memory in this format, it is a data frame and not a simple column vector. You can see this by entering:

> str(nosmoke)

to get the structure of the data frame “nosmoke”. You will see in the structure of the data frame a header for the birth weights. In our case the header, given by R, is ‘V1’:

‘data.frame’: 742 obs. of 1 variable:
$ V1: int 120 113 123 136 138 132 120 140 114 115 …

We can extract the column headed ‘V1’ as follows:

> nosmokedata=nosmoke[[‘V1’]]

The “nosmokedata” is now a column vector, and we can, for example, plot a histogram of this data:

> hist(nosmokedata)

nosmokedata

  • Now we define the z-score function:

> zsf<-function(data,k){length(subset(data,abs(data-mean(data))<=k*sd(data)))/length(data)}

  • And then we use the function to calculate what fraction of the data is within 1 standard deviation of the mean:

> zsf(nosmokedata,1)
[1] 0.7277628

So, approximately 72.8% of the data is within 1 standard deviation of the mean.

Plotting the function

Given a data set, such as the “nosmokedata” above, we can vary k and plot the fraction of data within k standard deviations of the mean, as simply a function of k (with the data as a given: in other words, with the data argument fixed).

  • First we need to calculate a potential range of values for the variable k.

We calculate:

> (mean(nosmokedata)-min(nosmokedata))/sd(nosmokedata)
[1] 3.911052

and

> (max(nosmokedata)-mean(nosmokedata))/sd(nosmokedata)
[1] 3.043495

and so see that all the data lies within 4 standard deviations of the mean.

  • We then create a vector of possible values for k using the seq function:

> values<-seq(0, 4, by=4/20)

Here we have chosen the range from 0 through 4, and divided that into 20 equally spaced intervals. A readout shows us the computed range of k values:

> values
[1] 0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 1.6 1.8 2.0 2.2 2.4 2.6 2.8 3.0 3.2 3.4 3.6 3.8 4.0

We check that “values” is a vector:

> is.vector(values)
[1] TRUE

  • Now we apply a function of 1 variable, k, to this vector of values.

First we define the function of the variable k:

> func<-function(k){zsf(nosmokedata,k)}

Then we use the function lapply to create a list as a result of applying this function of 1 variable to the vector “values”, and we unlist the result to get a data frame:

> y=unlist(lapply(values, func))

  • We combine these two data frames with the cbind function, to get a data frame of k values paired with zsf(nosmokedata, k) values, and then plot that data frame:

> pairs<-cbind(values,z)

> plot(pairs,type=”b”,main=”zsf(k) as a function of k”, xlab=”k”, ylab=”zsf(k)”)

zsf(k) versus k

The plot allows us to see visually how the proportion of data within k standard deviations of the mean varies with k.

Note: the option type=“b” in the plot function draws both points and lines.

A skilled R programmer could have done this quicker and slicker: I hope I have laid out enough information so that you can see how functions are defined, can be applied to vectors, and plotted to give useful visual information.

Saving the function

Having defined a function such as the zsf function above, we would like to save it for future use.

Customizing R at Quick R byRob Kabacoff has a useful discussion on loading user-defined functions into R on startup.

Further reading

x

The number 3608528850368400786036725

Vitale_number

Ben Vitale (@BenVitale) announced that 3608528850368400786036725 is  a 25 digit number with the property that  each number formed by its first n digits is divisible by n, for n from 1 through 25.

How could we find similar numbers, and could there be a larger number than 3608528850368400786036725 with this property? (which we will call the Vitale property)

The numbers with 2 digits having the Vitale property are just the even numbers between 10 and 98.

We will investigate this issue using Mathematica®, so here’s our list of such 2 digit numbers generated in Mathematica®:

vitaleproperty[2] = Range[10, 99, 2]

{10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,

52,54,56,58,60,62,64,66,68,70,72,74,76,78,80,82,84,86,88,90,92,94,96,98}

How about numbers with 3 digits with the Vitale property?

They have to be obtained from even numbers by adding a digit to the left, and must be divisible by 3.

Here’s how Mathematica® can generate all the numbers obtained by appending a digit from a given number:

adddigits[n_] := Table[FromDigits[Append[IntegerDigits[n], d]], {d, 0, 9}]

For example, here’s what we get by applying the adddigits function to 248:

adddigits[248]

{2480, 2481, 2482, 2483, 2484, 2485, 2486, 2487, 2488, 2489}

To obtain 3 digit numbers with the Vitale property we have to append a digit to even numbers, and check which resulting 3 digit numbers are divisible by 3:

Cases[Flatten[Map[adddigits, vitaleproperty[2]]], x_ /; Mod[x, 3] == 0]

 {102,105,108,120,123,126,129,141,144,147,162,165,168,180,183,186,189,201,

204,207,222,225,228,240,243,246,249,261,264,267,282,285,288,300,303,

306,309,321,324,327,342,345,348,360,363,366,369,381,384,387,402,405,

408,420,423,426,429,441,444,447,462,465,468,480,483,486,489,501,504,

507,522,525,528,540,543,546,549,561,564,567,582,585,588,600,603,606,

609,621,624,627,642,645,648,660,663,666,669,681,684,687,702,705,708,

720,723,726,729,741,744,747,762,765,768,780,783,786,789,801,804,807,

822,825,828,840,843,846,849,861,864,867,882,885,888,900,903,906,909,

921,924,927,942,945,948,960,963,966,969,981,984,987}

 These are the 3 digit numbers divisible by 3, who’s first 2 digits are divisible by 2.

This suggests how we can recursively build n digit numbers with the Vitale property, from n-1 digit numbers with the Vitale property:

vitaleproperty[n_] := vitaleproperty[n] = Cases[Flatten[Map[adddigits, vitaleproperty[n – 1]]], x_ /; Mod[x, n] == 0]

For example:

vitaleproperty[4]

yields:

{1020,1024,1028,1052,1056,1080,1084,1088,1200,1204,1208,1232,1236,1260,1264,

1268,1292,1296,1412,1416,1440,1444,1448,1472,1476,1620,1624,1628,1652,1656,

1680,1684,1688,1800,1804,1808,1832,1836,1860,1864,1868,1892,1896,2012,2016,

2040,2044,2048,2072,2076,2220,2224,2228,2252,2256,2280,2284,2288,2400,

2404,2408,2432,2436,2460,2464,2468,2492,2496,2612,2616,2640,2644,2648,2672,

2676,2820,2824,2828,2852,2856,2880,2884,2888,3000,3004,3008,3032,3036,

3060,3064,3068,3092,3096,3212,3216,3240,3244,3248,3272,3276,3420,3424,3428,

3452,3456,3480,3484,3488,3600,3604,3608,3632,3636,3660,3664,3668,3692,3696,

3812,3816,3840,3844,3848,3872,3876,4020,4024,4028,4052,4056,4080,4084,

4088,4200,4204,4208,4232,4236,4260,4264,4268,4292,4296,4412,4416,4440,

4444,4448,4472,4476,4620,4624,4628,4652,4656,4680,4684,4688,4800,4804,

4808,4832,4836,4860,4864,4868,4892,4896,5012,5016,5040,5044,5048,5072,

5076,5220,5224,5228,5252,5256,5280,5284,5288,5400,5404,5408,5432,5436,

5460,5464,5468,5492,5496,5612,5616,5640,5644,5648,5672,5676,5820,5824,5828,

5852,5856,5880,5884,5888,6000,6004,6008,6032,6036,6060,6064,6068,6092,

6096,6212,6216,6240,6244,6248,6272,6276,6420,6424,6428,6452,6456,6480,6484,

6488,6600,6604,6608,6632,6636,6660,6664,6668,6692,6696,6812,6816,6840,

6844,6848,6872,6876,7020,7024,7028,7052,7056,7080,7084,7088,7200,7204,

7208,7232,7236,7260,7264,7268,7292,7296,7412,7416,7440,7444,7448,7472,7476,

7620,7624,7628,7652,7656,7680,7684,7688,7800,7804,7808,7832,7836,7860,

7864,7868,7892,7896,8012,8016,8040,8044,8048,8072,8076,8220,8224,8228,

8252,8256,8280,8284,8288,8400,8404,8408,8432,8436,8460,8464,8468,8492,

8496,8612,8616,8640,8644,8648,8672,8676,8820,8824,8828,8852,8856,8880,

8884,8888,9000,9004,9008,9032,9036,9060,9064,9068,9092,9096,9212,9216,

9240,9244,9248,9272,9276,9420,9424,9428,9452,9456,9480,9484,9488,9600,

9604,9608,9632,9636,9660,9664,9668,9692,9696,9812,9816,9840,9844,9848,9872,9876}

The size of the collection of n-digit numbers with the Vitale property grows with n for a while, but then begins to decrease:

T = Table[{n, Length[vitaleproperty[n]]}, {n, 2, 15}]

{{2,45},{3,150},{4,375},{5,750},{6,1200},{7,1713},{8,2227},{9,2492},{10,2492},{11,2225},{12,2041},{13,1575},{14,1132},{15,770}}

A plot shows this initials increase and then decrease quite markedly:

ListPlot[T, PlotStyle -> Red, PlotMarkers -> {Automatic, Medium}]

Vitale_numbers

So, lets’s calculate, and plot, the number of n-digit numbers with the Vitale property versus n for n from 2 through 26:

T = Table[{n, Length[vitaleproperty[n]]}, {n, 2, 26}]

{{2,45},{3,150},{4,375},{5,750},{6,1200},{7,1713},{8,2227},{9,2492},{10,2492},{11,2225},{12,2041},{13,1575},{14,1132},{15,770},{16,571},{17,335},{18,180},{19,90},{20,44},{21,18},{22,12},{23,6},{24,3},{25,1},{26,0}}

ListPlot[T, PlotStyle -> Red, PlotMarkers -> {Automatic, Medium}]

Vitale_numbers_2

So we see that there is exactly one 25 digit number with the Vitale property – the venerable 3608528850368400786036725 – and that appending any of the digits 0 through 9 to this number does not result in a 26 digit number that is divisible by 26. This means that 3608528850368400786036725 is the LARGEST number with the Vitale property.

Kudos to Ben Vitale for finding it!

Postscript: Éric Angelini points out that the number 3608528850368400786036725, and its properties discussed above, appears in the Online Encyclopedia of Integer Sequences.