Spreadsheets - Everything You Need to Know
Database Design
Overview
Example files
The examples outlined on these slides is implemented in the following additional documents:
- New York Times’ COVID-19 data by US county by day since the start of the pandemic - raw data in CSV format.
- A subset of that data for only 19 September, 2022 - available in Google Sheets
- A breakdown of that data for only New York’s Hudson Valley region - available in Google Sheets
- A breakdown of that data by state - available in Google Sheets
- A time series breakdown of that data for New York’s Hudson Valley - available in Microsoft Excel’s .xlsx format
Intro
Microsoft Excel, Apple Numbers, Google Sheets, and their spreadsheet software ilk organize data sets into rows and columns.
-
Each row and column has a unique identifier - rows numbers and column letters. These indices are attached to the spreadsheet, not the data.
-
At the intersection of each row and column is a data cell, where data of a variety of types can either be manually input or formulaically calculated.
Programming environment
Behind the scenes in any spreadsheet software is a sophisticated programming environment which can be accessed most simply by calling built-in functions from the text within a data cell.
Programming environment
Users who are familiar with high-level computer programming concepts can write custom functions using a spreadsheet software’s Integrated Development Environment (IDE).
Programming environment
The following spreadsheet softwares have integrated IDEs:
Microsoft Excel has the Visual Basic for Applications Editor (VBA editor) .
- Access from the
Developer
menu. - Programs are written in the Visual Basic programming language.
Google Sheets has the Script Editor.
- Access from the
Tools
->Script Editor
menu. - Programs are written in Javascript.
Apple Numbers integrates with Apple’s automation language, AppleScript.
- Access via Apple’s Script Editor app.
- Programs are written in AppleScript.
LibreOffice Calc provides several scripting options.
- Access in the
Tools
->Customize
menu. - Program can written in Javascript, Python, LibreOffice Basic, and BeanShell.
Spreadsheets as databases
Spreadsheet software can be considered a type of database and, in fact, have much in common with more sophisticated relational databases.
-
each sheet holds data about a single entity - a kind of thing
-
sheets hold records - data about one instance of the entity (i.e. one of the things)
-
each record has the same fixed set of fields (i.e. there is a fixed schema)
-
each field represents one attribute of the record
-
sheets are shown as tables, records are stored in rows, fields are in columns
-
automated functions allow you to filter the data and perform analysis
Data munging
Using a spreadsheet does not absolve one of doing data munging.
Some data munging tasks common to spreadsheets:
-
Excessive empty rows and columns should be removed
-
Make sure there are easy-to-understand headings to each column
-
Break up any fields that contain more than one piece of data into multiple fields so each can be analyzed independently
-
Indicate data types appropriate for each column to harness the programming power of spreadsheets
Typical Example
Raw data
We will look into data on the number of COVID-19 cases aggregated, updated daily, and published by the New York Times in Comma-Separated Values (CSV) format.
-
Access the raw data in CSV form here.
-
In case you were wondering, the
fips
column holds unique identifiers for each county, according to the FIPS system.
Import into spreadsheet
This particular data set contains a few million records (thousands of new records are added each day). Upon importing the raw data file into a spreadsheet, we immediately hit a problem with spreadsheet software…
-
With large data sets, spreadsheets often work slowly and sometimes don’t work at all.
-
Web-based (i.e. ‘cloud’-based) spreadsheet programs, such as Google Sheets, have the hardest time processing large data sets and may crash.
-
Native desktop applications, such as Microsoft Excel and Apple Numbers work significantly better with large data sets.
-
If you must use a web-based spreadsheet for a large data set, use a plain text editor to first chop up the CSV file into a bunch of smaller CSV files and then import each smaller file into its own sheet (not recommended).
A sheet for an entity
Once imported into a spreadsheet, the data appears in a typical nicely-aligned row/column structure.
-
Each row is a record with fields in the columns describing various attributes of one particular instance of the entity.
-
In this example, each row represents COVID-19 stats for one day in one US county during the pandemic.
Sorting rows
If we simply wanted to sort the rows by the number of cases or the number of deaths, we can do that with the click of a button.
- Highlight the entire data set (not including the headings), and the sort menu item (the exact position of this button varies by spreadsheet program).
Simple statistics
Spreadsheets are great for simple statistical analysis, such as finding the sum, average, minimum, maximum, standard deviation, of a series of values.
-
To do this, we will run formulas using the spreadsheet’s built-in functions.
-
To enter a formula into a cell, click into the cell and type the equals sign
=
followed by the formula. -
To calculate total deaths, enter
=SUM( F1 : F852846 )
into columnF
of a new row at the very end of the spreadsheet, whereF1
is the first cell in the deaths column, andF852846
is the last value in that column - your last row number will differ. Hit theenter/return
key when done.
Passing the smell test
Always use your common sense to think about whether the raw data and your analysis of that data make sense… the so-called smell test.
-
Summing the values in the deaths column leads to a result of about 256 million deaths from COVID-19 in the US at the time of this writing.
-
Authoritative reports put that number at about 1,053,840 deaths at the time of this writing.
-
Something does not smell good!
-
Either the raw data is flawed, or we are wrong in our analysis.
-
What went wrong?!
The meaning of the data
Upon further inspection, it seem that each row in the data set represents the cumulative data for that county.
-
So we have misunderstood the meaning of the data
-
a very very common problem!
-
Since each record includes a cumulative count of the deaths, in order to calculate total US deaths, we simply have to sum the death counts for all counties on the most recent day available in the data set:
= SUM( F849589 : F852846 )
(your row row numbers may differ if viewing more recent data) -
This results in a total US death count of 1,050,282 (as of September 19th 2022)… a better-smelling number.
-
PS: this data set has already been munged well for us by The New York Times…. not all data sets will be so good!
Assumptions
In doing such a simple calculation, we have made a series of assumptions that we should be aware of, in case they prove to be false.
-
We have assumed that the raw data is reliable and trustworthy
-
We have assumed that no deaths are counted simultaneously in two or more different US counties
-
We have assumed that the data set includes one record for every US county on each day
-
How can we check these assumptions?
Imposing Criteria
Analyzing a subset of records
What if we wanted to count only total COVID-19 cases in New York’s Hudson Valley region, which is composed of 11 of New York State’s 62 counties..
-
Basic statistical functions, such as
SUM()
,AVERAGE()
,MIN()
,MAX()
,STDEV()
work as we have seen when calculating total deaths. But they cannot filter the records for us. -
Remember, the counts are cumulative. So let’s copy just the records for the most recent date and put them into their own sheet. This will speed up the program massively and is now small enough to be shared in a web-based spreadsheet - see here.
-
We could manually go through and add up the case counts for the 11 counties of interest, but that would be very time consuming.
-
Rather, we should use the spreadsheet’s filtering abilities to extract data from the rows representing the counties of interest to us, and then run our statistics functions on just those filtered results.
Filtering records
To calculate total COVID-19 case counts in New York’s Hudson Valley region, we will go step-by-step, rather than try to do everything at once.
- First create two new columns: one for county names and the other for case counts.
Filtering records (continued)
Now add a formula into the Case count
field corresponding to the first county: =SUMIF( B:B, H2, E:E )
. Note the use of SUMIF
to add criteria.
-
SUMIF
will sum together case counts (columnE
) for any records where the county (columnB
) matches the county name of interest (cellH2
). -
Copy this formula into the
Case counts
field for each county. Be sure to updateH2
to be the correct cell identifier for each county. -
You can use the spreadsheet’s autofill feature to do this copying for you.
Filtering records (continued again)
We now have total case counts for each county of interest. If we add them all up, we will have total case counts for the entire Hudson Valley.
- In our custom
Case counts
column, in a row below the last county of interest, enter a function to calculate theSUM
of all case counts. A bit more than835,749
as of 19 September 2022.
Mashing up data sets
We know how many COVID-19 cases there have been in the Hudson Valley… but what percent of the population does this represent?
- Finding a reputable population estimate for each county is not difficult… (for example).
Cases as % of population
is only a simple formula away.
Results
The completed spreadsheet is available here.
Our data and initial analysis suggest that about 29% of the population in New York’s Hudson Valley have had COVID-19… that’s almost 1 in every 3 people.
-
Does this smell good?
- Look over the numbers and do a sanity check.
- Compare your results to other reputable sources of information and analysis.
-
Where might we have gone wrong?
-
What other assumptions might this result depend upon that we did not originally consider.
Results
We have made a mistake
of course.
-
Our data contains COVID-19 cases for counties in all US states.
-
Our use of the
SUMIF
functions add case counts for only those counties with particular names. -
We didn’t consider that perhaps there might be counties with the same names in different states. There are.
-
We need to apply two criteria - add only those rows where the county names AND the state names match.
-
Whereas the
SUMIF
function can only apply one criterion, theSUMIFS
function can apply multiple criteria. Look up documentation and try it out.
Further Exploration
Time series
It might be interesting to see how COVID-19 case counts vary over time.
-
The original New York Times data set has cumulative case numbers for all dates since the pandemic began.
-
Similarly to how we counted cases within the Hudson Valley, we could calculate cases for each date using
SUMIF
orSUMIFS
, where we filter by only those rows that match a specific date. -
If we wanted to show case counts over time for only counties in the Hudson Valley, we would use
SUMIFS
to apply three criteria:- matching a specified date
- matching a specified county name
- matching the state of New York
Time series
In order to easily do a time series analysis, it helps to line up our fixed values nicely.
- Then fill in the empty data cells with formulae in the following pattern:
=SUMIFS($E:$E, $C:$C, "New York", $A:$A, J$3, $B:$B, $I4)
- Where
E
is the case count column,C
is the state column, andA
is the date column,$J3
is the cell with our desired date,B
is the county name column, and$I4
is the cell with our desired county name.
Time series
At the end, we can easily click the spreadsheet’s line chart button to produce:
Time series
We have produced an easy-to-understand chart of cumulative case counts in the Hudson Valley since the start of the pandemic.
-
Rather than cumulative counts, how would we calculate the change in case counts over time?
-
Hint, you will simply have to subtract the previous month’s cumulative case count from each month.
-
Download an example Excel file to see these explorations for yourself.
Visualizations
Visualizations can help people infer insights that may not be obvious by staring at numbers.
Spreadsheets are remarkably good at generating simple visualizations, such as:
-
Line charts
-
Area charts
-
Column & bar charts
-
Pie charts
-
Scatter plots
-
… and more
Pivot tables
Pivot tables are a very powerful feature of spreadsheets that allow us to group together records based on one or more shared field values, and then automatically perform aggregate statistics on each group.
For example, using a pivot table, we might easily group all counties in our data set by state, and then perform statistics on each state, potentially showing us:
-
total case counts per state
-
total deaths per state
-
the miinimum and maximum case counts per state
See an example of this data with a pivot table and a map chart.
Conclusions
Thank you. Bye.