knowledge-kitchen

Spreadsheets - Everything You Need to Know

Database Design

  1. Overview
  2. Typical Example
  3. Imposing Criteria
  4. Further Exploration
  5. Conclusions

Overview

Example files

The examples outlined on these slides is implemented in the following additional documents:

Intro

Microsoft Excel, Apple Numbers, Google Sheets, and their spreadsheet software ilk organize data sets into rows and columns.

Spreadsheet with rows and columns

Programming environment

Behind the scenes in any spreadsheet software is a sophisticated programming environment which can be accessed most simply by calling built-in functions from the text within a data cell.

Spreadsheet formula

Programming environment

Users who are familiar with high-level computer programming concepts can write custom functions using a spreadsheet software’s Integrated Development Environment (IDE).

Spreadsheet integrated development environment

Programming environment

The following spreadsheet softwares have integrated IDEs:

Microsoft Excel has the Visual Basic for Applications Editor (VBA editor) .

Google Sheets has the Script Editor.

Apple Numbers integrates with Apple’s automation language, AppleScript.

LibreOffice Calc provides several scripting options.

Spreadsheets as databases

Spreadsheet software can be considered a type of database and, in fact, have much in common with more sophisticated relational databases.

Data munging

Using a spreadsheet does not absolve one of doing data munging.

Some data munging tasks common to spreadsheets:

Typical Example

Raw data

We will look into data on the number of COVID-19 cases aggregated, updated daily, and published by the New York Times in Comma-Separated Values (CSV) format.

Raw CSV data

Import into spreadsheet

This particular data set contains a few million records (thousands of new records are added each day). Upon importing the raw data file into a spreadsheet, we immediately hit a problem with spreadsheet software…

A sheet for an entity

Once imported into a spreadsheet, the data appears in a typical nicely-aligned row/column structure.

Spreadsheet with rows and columns

Sorting rows

If we simply wanted to sort the rows by the number of cases or the number of deaths, we can do that with the click of a button.

Sorting data in a spreadsheet

Simple statistics

Spreadsheets are great for simple statistical analysis, such as finding the sum, average, minimum, maximum, standard deviation, of a series of values.

Sum of spreadsheet column

Passing the smell test

Always use your common sense to think about whether the raw data and your analysis of that data make sense… the so-called smell test.

The meaning of the data

Upon further inspection, it seem that each row in the data set represents the cumulative data for that county.

Assumptions

In doing such a simple calculation, we have made a series of assumptions that we should be aware of, in case they prove to be false.

Imposing Criteria

Analyzing a subset of records

What if we wanted to count only total COVID-19 cases in New York’s Hudson Valley region, which is composed of 11 of New York State’s 62 counties..

Filtering records

To calculate total COVID-19 case counts in New York’s Hudson Valley region, we will go step-by-step, rather than try to do everything at once.

Extraction column

Filtering records (continued)

Now add a formula into the Case count field corresponding to the first county: =SUMIF( B:B, H2, E:E ). Note the use of SUMIF to add criteria.

Extraction complete

Filtering records (continued again)

We now have total case counts for each county of interest. If we add them all up, we will have total case counts for the entire Hudson Valley.

Extraction summation

Mashing up data sets

We know how many COVID-19 cases there have been in the Hudson Valley… but what percent of the population does this represent?

Data mashup

Results

The completed spreadsheet is available here.

Our data and initial analysis suggest that about 29% of the population in New York’s Hudson Valley have had COVID-19… that’s almost 1 in every 3 people.

Results

We have made a mistake

of course.

Further Exploration

Time series

It might be interesting to see how COVID-19 case counts vary over time.

Time series

In order to easily do a time series analysis, it helps to line up our fixed values nicely.

Set up for time series

=SUMIFS($E:$E, $C:$C, "New York", $A:$A, J$3,  $B:$B, $I4)

Time series

At the end, we can easily click the spreadsheet’s line chart button to produce:

Hudson Valley COVID-19 cumulative cases per month

Time series

We have produced an easy-to-understand chart of cumulative case counts in the Hudson Valley since the start of the pandemic.

Visualizations

Visualizations can help people infer insights that may not be obvious by staring at numbers.

Spreadsheets are remarkably good at generating simple visualizations, such as:

Pivot tables

Pivot tables are a very powerful feature of spreadsheets that allow us to group together records based on one or more shared field values, and then automatically perform aggregate statistics on each group.

For example, using a pivot table, we might easily group all counties in our data set by state, and then perform statistics on each state, potentially showing us:

See an example of this data with a pivot table and a map chart.

Conclusions

Thank you. Bye.