Republic of Mathematics blog

Spreadsheets and big data

Posted by: Gary Ernest Davis on: May 8, 2012

Many people use spreadsheets for calculation and for storing data.

The tabular format of spreadsheets, together with the ability to use formulas, search the cells, plot charts, and change parameters and have the plots redraw, are compelling features that have embedded spreadsheets in popular use.

Spreadsheets are widely used for storing and transmitting data: the tabular layout that allows for sorting by columns is very appealing to anyone in an organization who collects and needs to disseminate data.

The widely used data analysis and statistical software R imports files directly from most spreadsheet formats, so it is very tempting for students of statistics and data analysis to store their data in a spreadsheet. For teaching purposes this does no apparent harm in the short term. In the longer term, however, the habit of using spreadsheets to store and disseminate data can be very problematic.

Despite the many rows and columns, a spreadsheet can effectively manipulate only a limited amount of data.

Excel 2007 has 17,179,869,184 cells. If each of these cells were filled with data, that would seem to be a large amount of data by anyone’s standards. Imagine that in each of these cells there was only a 0 or a 1, and count each such entry as a single byte. A terabyte of data is 1,000,000,000,000 (a trillion) bytes, so an Excel 2007 spreadsheet filled with 0’s and 1’s would hold only about 1.7% of a terabyte of information: it would take about 58 such filled Excel spreadsheets to get a single terabyte of data.

A petabyte of data is 1,000 terabytes. To get this much data from spreadsheets filled with 0’s and 1’s we would need about 58,000 spreadsheets.
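For readers who want to check the arithmetic, here is a minimal sketch in Python. It assumes, as the estimate above implicitly does, that each 0 or 1 counts as one byte; the row and column limits are the published Excel 2007 worksheet limits, and the variable names are mine.

```python
# Excel 2007 worksheet limits: 1,048,576 rows by 16,384 columns.
ROWS = 2**20
COLUMNS = 2**14
CELLS = ROWS * COLUMNS  # 17,179,869,184 cells in one worksheet

BYTES_PER_CELL = 1      # assumption: each 0 or 1 is counted as one byte
TERABYTE = 10**12       # a trillion bytes
PETABYTE = 1000 * TERABYTE

sheet_bytes = CELLS * BYTES_PER_CELL
fraction_of_terabyte = sheet_bytes / TERABYTE
sheets_per_terabyte = TERABYTE / sheet_bytes
sheets_per_petabyte = PETABYTE / sheet_bytes

print(f"cells per sheet:     {CELLS:,}")
print(f"share of a terabyte: {fraction_of_terabyte:.1%}")
print(f"sheets per terabyte: {sheets_per_terabyte:.1f}")
print(f"sheets per petabyte: {sheets_per_petabyte:,.0f}")
```

Changing BYTES_PER_CELL shows how sensitive the estimate is to the storage assumption: a real spreadsheet file spends more than one byte on each filled cell, so the counts here are only a rough guide to scale.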

So if every single person in the town of Great Yarmouth in the UK had an Excel 2007 spreadsheet filled with 0’s and 1’s we would have about a petabyte of data.

Surely no-one could regularly want to deal with that much data?

But that is just what Big Data sets (and extremely large data sets) contain. In fields such as genomics, meteorology, internet searching, and finance informatics, petabytes of data are routine.

In fact exabytes of data are not uncommon: an exabyte is 1,000,000 terabytes, about the equivalent of every single person in a country such as Italy having an Excel 2007 spreadsheet full of data.
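The same back-of-the-envelope calculation can be scaled up to an exabyte. This is only a sketch under the same assumption of one byte per filled cell, with variable names of my own choosing:

```python
CELLS_PER_SHEET = 17_179_869_184  # Excel 2007 cell count quoted in the post
BYTES_PER_CELL = 1                # assumption: one byte per 0 or 1
TERABYTE = 10**12
EXABYTE = 1_000_000 * TERABYTE

# Number of completely filled spreadsheets needed to reach one exabyte.
sheets_per_exabyte = EXABYTE / (CELLS_PER_SHEET * BYTES_PER_CELL)

print(f"sheets per exabyte: {sheets_per_exabyte / 1e6:.1f} million")
```

The result is a count in the tens of millions, which is why the comparison is with the population of a whole country rather than a town.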

But wait, you say: a person in a medium-sized business, producing a list of employees and job descriptions, for example, doesn’t have to worry about exabytes of data. Surely they can keep on using a spreadsheet to store their data?

The answer is: of course they can and of course they will. Spreadsheets are simply too useful in everyday life to abandon.

But then we have a problem when we want to amalgamate, or consolidate, the data from many, many thousands of spreadsheets.

How do we handle such data, how do we ensure its integrity and fidelity, how and where do we store it, and how do we analyze it?

One suggestion is to store spreadsheet data in a large spreadsheet format in the cloud that is scalable to handle big data sets. Another is to develop a spreadsheet search engine that could extract semantic information from large collections of spreadsheets.

Spreadsheets are probably not going away anytime soon, because of their useful features for handling small-scale data. Yet the demands of Big Data steer us toward thinking of effective ways of managing the accumulation and consolidation of manifold spreadsheet data sets.

Reference

Jacek Becla, Daniel Liwei Wang, Kian-Tat Lim, Report from the 5th Workshop on Extremely Large Databases, Data Science Journal, Volume 11, 23 March 2012
