Republic of Mathematics blog

The Rime of the Data Scientist

Posted by: Gary Ernest Davis on: September 13, 2013

The Rime of the Data Scientist (with apologies to Samuel Taylor Coleridge)

Part I

It is a Data Scientist,

And he stoppeth one of three.

`By thy Python code and glittering eye,

Now wherefore stopp’st thou me?

 

The classroom doors are opened wide,

And I am next one in;

The others are met, the test is set:

Mayst hear the noisy din.’

 

He holds him with his skinny hand,

“There was a cluster,” quoth he.

`Hold off! unhand me, open-source loon!’

Eftsoons his hand dropped he.

 

He holds him with his glittering eye –

The student stood quite still,

And listens like a three years’ child:

The Scientist hath his will.

 

The student sat upon a stone:

He cannot sort his list;

And thus spake on the young person,

The Data Scientist.

 

“The code was cleared, the whole team cheered,

Merrily did we drop

Unto the pub, and there to drink,

Without a thought to stop.

 

The variables were writ upon the left,

Transferred from R to C;

The code shone bright, and on the right

The data a, b, c.

 

More and more code every day,

It was a wondrous thing –

The student here did beat his breast,

For he heard the exam bell ring.

 

The examiner hath paced into the hall,

Red of face is he;

Nodding his head from side to side –

A fan of Scotch whisky.

 

The student he did beat his breast,

He forgets to sort his list;

And thus spake on the young man,

The Data Scientist.

 

“And now the data surge came, and it

Was tyrannous and strong:

It struck with massive overload,

And analysis took so long.

 

With high performance really stretched,

As who pursued with yell and blow

Still treads the shadow of his foe,

And forward bends his head,

The cluster was fast, it was a blast,

And onward aye we sped.

 

And now there were missing values and outliers,

And it grew wondrous confused:

And deleted columns, as if floating by,

Their data could not be used.

 

And through the drifts the snowy clifts

Did send a dismal sheen:

Nor shapes of men nor beasts we ken –

The grep was all between.

 

The grep was here, the grep was there,

The grep was all around:

It cracked and growled, and roared and howled,

Regular expressions could not be found!

 

At length did cross a statistician,

Thorough the fog she came;

As she had been a blessed soul,

We hailed her in Tukey’s name.

 

She saw the data we ne’er had seen,

And all around she went.

The data did split with a thunder-fit;

The programmers steered us through!

 

And a good data stream sprung up behind;

The statistician did follow,

And every day, for data or play,

Came to the programmer’s hollo!

 

In hard-drive or cloud, whatever’s allowed,

She analyzed the data mine;

Whiles all the night, with code writ right,

The programmers drank moonshine.”

 

`God save thee, Data Scientist,

From the fiends that plague thee thus! –

Why look’st thou perchance?’ – “With a wicked glance,

I fired the statistician.”

The leading (base 10) digit of an integer

Posted by: Gary Ernest Davis on: June 25, 2013

Surely the leading (= left-most) digit of a positive integer is an obvious thing? Just stare at the integer (e.g. 7823) and observe the left-most digit (7, in this example).

Suppose, however, that you wanted to find the leading digit of each entry in a very large list of positive integers, a list so large that it was hard, if not impossible, to peruse by eye? How could you write an algorithm to compute the leading digits? Even more, suppose you wanted to come up with a mathematical argument that involved determining the leading digit of an otherwise unspecified positive integer?

In a short and lovely mathematical argument, Dave Radcliffe (@daveinstpaul) proves that there are exactly 18266 distinct ordered lists of values

(leading digit of $2^n$, … , leading digit of $9^n$)

as n ranges over the infinite set of positive integers.
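To get a feel for these lists, here is a minimal Python sketch that builds them with exact integer powers and counts how many distinct ones show up for n up to a cutoff. The function names and the cutoff `n_max` are my own choices for illustration, not part of Dave Radcliffe's argument, and a finite search like this only gives a lower bound on the number of distinct lists.

```python
def leading_digit(m: int) -> int:
    """Leading base-10 digit of a positive integer, read off its decimal string."""
    return int(str(m)[0])


def leading_digit_tuple(n: int) -> tuple:
    """(leading digit of 2^n, ..., leading digit of 9^n), using exact integer powers."""
    return tuple(leading_digit(a ** n) for a in range(2, 10))


def count_distinct_tuples(n_max: int) -> int:
    """Number of distinct tuples seen for n = 1, ..., n_max (n_max is an arbitrary cutoff)."""
    return len({leading_digit_tuple(n) for n in range(1, n_max + 1)})


if __name__ == "__main__":
    print(leading_digit_tuple(1))  # (2, 3, 4, 5, 6, 7, 8, 9)
    print(leading_digit_tuple(2))  # (4, 9, 1, 2, 3, 4, 6, 8)
    print(count_distinct_tuples(1000))
```

Reading the digit off the decimal string is the "stare at it" method done by machine; the rest of this post is about replacing it with something a proof can work with.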

A key part of his argument is that the leading digit of $a^n$ is completely determined by the fractional part of $n \times \log_{10}(a)$.

How might we see this?

Let’s make a table of values and see if something pops out:

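| k | fractional part of $\log_{10}(k)$ |
|---|---|
| 1 | 0 |
| 2 | 0.30103 |
| 3 | 0.477121 |
| 4 | 0.60206 |
| 5 | 0.69897 |
| 6 | 0.778151 |
| 7 | 0.845098 |
| 8 | 0.90309 |
| 9 | 0.954243 |
| 10 | 0 |
| 11 | 0.0413927 |
| 12 | 0.0791812 |
| 13 | 0.113943 |
| 14 | 0.146128 |
| 15 | 0.176091 |
| 16 | 0.20412 |
| 17 | 0.230449 |
| 18 | 0.255273 |
| 19 | 0.278754 |
| 20 | 0.30103 |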
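A table like this can be generated with a few lines of Python. This is only a sketch using the standard `math` module; `frac_log10` is a name I made up for the helper, and the printed values are floating-point approximations rounded to six places.

```python
import math


def frac_log10(k: int) -> float:
    """Fractional part of log10(k) for a positive integer k (floating point)."""
    x = math.log10(k)
    return x - math.floor(x)


if __name__ == "__main__":
    for k in range(1, 21):
        print(k, round(frac_log10(k), 6))
```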

Nothing obvious, so what does Dave Radcliffe mean by “the leading digit of $a^n$ is completely determined by the fractional part of $n \times \log_{10}(a)$”?

Let’s think about how we can algorithmically determine the leading digit of an integer written base 10.

Suppose k is a positive integer that is of the form $a_p a_{p-1} \ldots a_1 a_0$ base 10.

That is, the $a_i$ are digits base 10 (i.e. one of 0, 1, 2, 3, 4, 5, 6, 7, 8, 9) and $a_p$ is not 0, because it is the leading digit.

What every school child does not immediately recall is that this means

$k = a_p 10^p + a_{p-1} 10^{p-1} + \ldots + a_1 10 + a_0$

So $10^p$ is no bigger than k, which in turn is less than $10^{p+1}$:

$10^p \le k < 10^{p+1}$

Therefore,

$\log_{10}(10^p) \le \log_{10}(k) < \log_{10}(10^{p+1})$

because the logarithm is an increasing function.

In other words,

p ≤ log10(k) < p+1

 which means that p is the greatest integer less than or equal to log10(k)  – that is the floor of log10(k): p=Floor[log10(k)].

Now if we divide k by $10^p$ we get:

$k/10^p = a_p + 0.a_{p-1} \ldots a_1 a_0$ (base 10)

which means $a_p = \mathrm{Floor}[k/10^p] = \mathrm{Floor}[k/10^{\mathrm{Floor}[\log_{10}(k)]}]$
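The formula for the leading digit as a floor of k divided by a power of 10 translates directly into code. The sketch below sticks to integer arithmetic, so it sidesteps any floating-point worries about the logarithm; `exponent_p` and `leading_digit` are names chosen here for illustration.

```python
def exponent_p(k: int) -> int:
    """Largest p with 10**p <= k, found with integer arithmetic only."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    p = 0
    while 10 ** (p + 1) <= k:
        p += 1
    return p


def leading_digit(k: int) -> int:
    """Leading digit as the exact integer quotient of k by 10**p."""
    return k // 10 ** exponent_p(k)


if __name__ == "__main__":
    print(exponent_p(7823), leading_digit(7823))  # 3 7
```

The loop finds p by the defining inequality (10 to the p is at most k, which is less than 10 to the p+1) rather than by calling a logarithm, which is slower for huge k but exact.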

We can express Floor[log10(k)] in terms of the fractional part { log10(k)} of log10(k) simply as

 Floor[log10(k)] = log10(k) – { log10(k)}
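As a quick numerical check, take k = 7823 from the opening example (values rounded to five decimal places):

$$\log_{10}(7823) \approx 3.89337, \qquad \mathrm{Floor}[\log_{10}(7823)] = 3, \qquad \{\log_{10}(7823)\} \approx 0.89337,$$

and indeed $3 = 3.89337 - 0.89337$.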

So,

$10^{\mathrm{Floor}[\log_{10}(k)]} = 10^{\log_{10}(k) - \{\log_{10}(k)\}} = k/10^{\{\log_{10}(k)\}}$

which, upon substituting into the expression above for $a_p$, gives:

$a_p = \mathrm{Floor}[10^{\{\log_{10}(k)\}}]$

This is the precise sense in which the leading digit, $a_p$, of k is determined by the fractional part $\{\log_{10}(k)\}$ of $\log_{10}(k)$.
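In floating point this formula needs a little care: raising 10 to the fractional part of the logarithm can land a hair below an integer such as 3, and a bare floor would then report 2. The sketch below adds a small tolerance `eps` before taking the floor. Both `eps` and the function name are my own assumptions, and the tolerance is only safe for moderately sized k (say, below about 10^8), where the true value of 10 raised to the fractional part is either a whole number or well separated from one.

```python
import math


def leading_digit_via_log(k: int, eps: float = 1e-9) -> int:
    """Leading digit of k as floor(10 ** frac(log10(k))), with a rounding tolerance."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    x = math.log10(k)
    frac = x - math.floor(x)                # fractional part of log10(k)
    digit = math.floor(10.0 ** frac + eps)  # eps absorbs tiny rounding errors
    # If rounding left frac just below 1, the value above reaches 10;
    # the true leading digit in that case is 1.
    return 1 if digit > 9 else digit


if __name__ == "__main__":
    print(leading_digit_via_log(7823))  # 7
```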

When $k = a^n$, we have $\log_{10}(k) = n \times \log_{10}(a)$, so the relevant quantity is just the fractional part of $n \times \log_{10}(a)$.

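Going back to the table above, and including a third column of $\mathrm{Floor}[10^{\{\log_{10}(k)\}}]$, we get:

| k | fractional part of $\log_{10}(k)$ | $\mathrm{Floor}[10^{\{\log_{10}(k)\}}]$ |
|---|---|---|
| 1 | 0 | 1 |
| 2 | 0.30103 | 2 |
| 3 | 0.477121 | 3 |
| 4 | 0.60206 | 4 |
| 5 | 0.69897 | 5 |
| 6 | 0.778151 | 6 |
| 7 | 0.845098 | 7 |
| 8 | 0.90309 | 8 |
| 9 | 0.954243 | 9 |
| 10 | 0 | 1 |
| 11 | 0.0413927 | 1 |
| 12 | 0.0791812 | 1 |
| 13 | 0.113943 | 1 |
| 14 | 0.146128 | 1 |
| 15 | 0.176091 | 1 |
| 16 | 0.20412 | 1 |
| 17 | 0.230449 | 1 |
| 18 | 0.255273 | 1 |
| 19 | 0.278754 | 1 |
| 20 | 0.30103 | 2 |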

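Or, if we do this for $k = 2^n$ for varying n, we get a table that begins as follows:

| n | $2^n$ | fractional part of $n \times \log_{10}(2)$ | $\mathrm{Floor}[10^{\{n \times \log_{10}(2)\}}]$ |
|---|---|---|---|
| 1 | 2 | 0.30103 | 2 |
| 2 | 4 | 0.60206 | 4 |
| 3 | 8 | 0.90309 | 8 |
| 4 | 16 | 0.20412 | 1 |
| 5 | 32 | 0.50515 | 3 |
| 6 | 64 | 0.80618 | 6 |
| 7 | 128 | 0.10721 | 1 |
| 8 | 256 | 0.40824 | 2 |
| 9 | 512 | 0.70927 | 5 |
| 10 | 1024 | 0.0103 | 1 |
| 11 | 2048 | 0.31133 | 2 |
| 12 | 4096 | 0.61236 | 4 |
| 13 | 8192 | 0.91339 | 8 |
| 14 | 16384 | 0.21442 | 1 |
| 15 | 32768 | 0.51545 | 3 |
| 16 | 65536 | 0.81648 | 6 |
| 17 | 131072 | 0.11751 | 1 |
| 18 | 262144 | 0.41854 | 2 |
| 19 | 524288 | 0.71957 | 5 |
| 20 | 1048576 | 0.0205999 | 1 |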

in precise agreement with Dave Radcliffe’s assertion.
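The same check can be run in Python without ever forming the power itself. This is a hedged sketch: `leading_digit_of_power` and its tolerance `eps` are my own illustrative choices, and because the product of n and the logarithm is a floating-point number, the result should only be trusted for modest n. The loop compares it against the digit read off the exact integer.

```python
import math


def leading_digit_of_power(a: int, n: int, eps: float = 1e-9) -> int:
    """Leading digit of a**n from the fractional part of n * log10(a)."""
    x = n * math.log10(a)
    frac = x - math.floor(x)                # fractional part of n * log10(a)
    digit = math.floor(10.0 ** frac + eps)  # eps absorbs tiny rounding errors
    return 1 if digit > 9 else digit        # guard against frac rounding up to ~1


if __name__ == "__main__":
    for n in range(1, 21):
        exact = int(str(2 ** n)[0])         # digit read off the exact integer
        print(n, 2 ** n, leading_digit_of_power(2, n), exact)
```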